General

Note

Set up redundancy between server zones to ensure continuity of service without interruption in the event of unexpected server malfunctions or scheduled change operations. See Load Balancer overview to set up redundancy.
NAVER Cloud Platform provides a High Availability (HA) structure to prepare for failures in the physical server, such as memory, CPU, and power supply. HA is a policy for preventing hardware failures from expanding to the Virtual Machine (VM) server. It supports Live Migration, which automatically migrates the VM on the host server to another secure host server when a failure occurs in the aforementioned host server. However, the VM server is rebooted when an error occurs where Live Migration can't be initiated. If the service is being operated with a single VM server, set up multiplexing for VM servers as described above to reduce the frequency of failures that may occur as a result of rebooting the VM server.

Note

Company members can create up to 5 GPU Servers. If you need more GPU Servers or if you are an individual member who needs to create a GPU Server, check the FAQs and submit an inquiry to customer support.

Note

The specifications of a GPU Server can be changed only through a server of the same type.
After a GPU Server is created, it cannot be converted into a regular server by removing the GPU. To change to a regular server, you need to create a server image and use the image for newly creating a regular server.
You can use the server image created in a regular server to create a GPU Server.

Note

You may not be able to receive official support with regard to problems occurring after you install drivers other than the default versions provided and installed on NAVER Cloud Platform.
We do not recommend downgrading the driver to a lower version than what is currently provided on NAVER Cloud Platform.
The default versions provided and installed on NAVER Cloud Platform are as follows: (For Windows Server 2016, the R525 version is provided because of NVIDIA's service support end.)

	Linux	Windows Server 2019	Windows Server 2016
GPU Driver	535.161.08	537.13	527.4
CUDA	12.2	12.2	12.0
cuDNN	8.9.7	8.9.7	8.8.0

Note

When you run the nvidia-smi commands, the following information is output:

Item	Description
Driver Version	Version of the installed driver
CUDA Version	CUDA APIs version supported by the driver
Name	GPU model name
Temp	GPU core temperature
Perf	GPU's performance state Changes flexibly according to the GPU temperature and power usage
Pwr:Usage/Cap	Current level of power being used by the GPU
Memory-Usage	Memory usage by the GPU (current usage/GPU memory capacity)
Volatile GPU-Util	GPU usage rate
Uncorr. ECC	Number of Uncorrectable Error Correction Code (ECC) occurrences
MIG M.	Multi Instance GPU (MIG) Mode status Not applicable to P40, T4, and V100 GPU provided on NAVER Cloud Platform
Processes	Information of the processes currently using the GPU GPU: number of GPUs where the process is running GI ID/CI ID: information of GPU instance and compute instance sliced through Multi Instance GPU (MIG) PID, Process name: ID and name of the process Type: C (Compute) for CUDA/OpenCL processes and G (Graphics) for DirectX/OpenGL processes GPU Memory Usage: GPU memory usage by the process

Note

For more information on NTK, see Ncloud Tool Kit (Linux/Windows).

Note

If you are unable to transfer the log file to NAVER Cloud's technical support center due to a network issue, attach and forward the log file stored in the VM.

Log file storage path: /usr/local/etc/ntk/logs/gpu_get_log

Log file name	Commands used	Role
date.log	date	Outputs the log creation date and time
dmesg-xid.log	dmesg grep -i xid	Outputs the kernel message including xid
dmesg.log	dmesg	Outputs the kernel message
free.log	free -m	Outputs the memory usage in MB
last.log	last	Outputs the login and reboot logs
ps.log	ps auxf	Checks the process status
top.log	top -b -n 1	Outputs top (once in batch mode) and system information
uptime.log	uptime	Outputs the uptime result
nvidia-bug-report.log.gz	cell	Runs the nvidia-bug-report.sh script

General

Check server information

Create server

Manage server

Re-install and upgrade the GPU driver and CUDA

Re-install GPU driver (Linux)

Re-install through scripts

Manual re-installation

Re-install CUDA (Linux)

Re-install GPU driver (Windows)

Automatic re-installation

Manual re-installation

Re-install CUDA (Windows)

Collection/forward of diagnostic data through NTK

1. Run NTK

2. Collect GPU debug logs

Transfer created log

GPU debug log file types

Monitoring GPU resources

View dashboard

Add user dashboard

What's Next