General


Available in VPC

This guide describes how to create and manage a GPU Server from the NAVER Cloud Platform console.
For more information on how to create and manage KVM-based GPU servers, see KVM GPU.

Note
  • Set up redundancy between server zones to ensure continuity of service without interruption in the event of unexpected server malfunctions or scheduled change operations. See Load Balancer overview to set up redundancy.
  • NAVER Cloud Platform provides a High Availability (HA) structure to prepare for physical server failures, such as memory, CPU, and power supply faults. HA is a policy that prevents hardware failures from propagating to the Virtual Machine (VM) server: when a failure occurs on the host server, Live Migration automatically moves the VMs on it to another healthy host server. However, if an error occurs for which Live Migration cannot be initiated, the VM server is rebooted. If you operate a service on a single VM server, such restarts can cause service outages, so it is recommended to implement VM server redundancy as described above.

Check server information

You can view the GPU Server information in the same way as viewing regular server information. For more information, see Check server information.

Caution

In the case of GPU Servers, fees are charged even when the server is stopped.

Create server

You can create a GPU Server in i_menu > Services > Compute > Server on the VPC environment console. For more information, see Create server.

Note
  • Company members can create up to 5 GPU Servers. If you need more GPU Servers or if you are an individual member who needs to create a GPU Server, check the FAQs and submit an inquiry to Customer Support.

Manage server

You can manage a GPU Server and change its settings in the same way as for a regular server. For more information, see Manage server.

Note
  • The specifications of a GPU Server can only be changed to another specification of the same server type.
  • Once a GPU Server is created, it cannot be converted into a regular server by removing the GPU. To change to a regular server, you need to create a server image and use the image to create a new regular server.
  • You can use the server image created in a regular server to create a GPU Server.

Reinstall and upgrade the GPU driver/CUDA

In the following situations during use of a GPU Server, you can reinstall the server's GPU driver and CUDA:

  • The OS kernel version has been changed (updated) and is no longer compatible with the current GPU driver (in this case, reinstall the GPU driver only)
  • You need to upgrade the GPU driver currently in use to the latest driver provided by NAVER Cloud Platform
  • You need to upgrade the driver to a specific version
Note
  • You may not be able to receive official support for problems that occur after you install drivers other than the default versions provided and installed by NAVER Cloud Platform.
  • We do not recommend downgrading the driver to a lower version than what is currently provided by NAVER Cloud Platform.
  • The default versions provided and installed by NAVER Cloud Platform are as follows. (For Windows Server 2016, the R525 driver branch is provided because NVIDIA has ended support for that OS.)
                Linux        Windows Server 2019   Windows Server 2016
    GPU Driver  535.161.08   537.13                527.4
    CUDA        12.2         12.2                  12.0
    cuDNN       8.9.7        8.9.7                 8.8.0
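
Before reinstalling, you may want to compare the versions currently installed on your server with the defaults above. The following is a minimal check for a Linux server, assuming the default installation paths used later in this guide (/usr/local/cuda):

    # Installed GPU driver version and recognized GPU cards
    nvidia-smi --query-gpu=driver_version,name --format=csv
    # Installed CUDA Toolkit version
    /usr/local/cuda/bin/nvcc --version
    # Installed cuDNN version (cuDNN 8.x records it in cudnn_version.h)
    grep -A2 CUDNN_MAJOR /usr/local/cuda/include/cudnn_version.h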

See the following guides depending on the OS you are using:

Reinstall GPU driver (Linux)

The driver versions provided by NAVER Cloud Platform can be reinstalled through scripts.
If you plan to use a different version, follow the manual reinstallation procedure.

Reinstall through scripts

To reinstall the GPU driver through scripts, follow these steps:

  1. Enter the wget http://init.ncloud.com/gpu/ncp_gpu_reinstall.sh command to download the script file.
  2. Enter the ./ncp_gpu_reinstall.sh command to delete the existing GPU driver.
  3. Reboot the server.
  4. Re-enter the ./ncp_gpu_reinstall.sh command to reinstall the GPU driver.
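
Combined into one sequence, the script-based reinstall looks roughly like the following sketch. The chmod step is an assumption (the downloaded file may not have the execute bit set); the URL is the one given in step 1:

    # Download the reinstall script provided by NAVER Cloud Platform
    wget http://init.ncloud.com/gpu/ncp_gpu_reinstall.sh
    chmod +x ncp_gpu_reinstall.sh
    # First run: removes the existing GPU driver
    ./ncp_gpu_reinstall.sh
    # Reboot the server, then run the script again to reinstall the driver
    reboot
    # ... after the server comes back up:
    ./ncp_gpu_reinstall.sh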

Manual reinstallation

If you fail to reinstall the driver through scripts, or if you want to change and install the GPU driver version, follow these steps:

  1. Download the driver file of the version you wish to reinstall or upgrade the driver to.

  2. Enter the following command to delete the existing GPU driver:

    • Example: default version 535.161.08 provided by NAVER Cloud Platform
    # ./NVIDIA-Linux-x86_64-535.161.08.run -s --uninstall
    Verifying archive integrity... OK
    Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64    535.161.08..........................................................................................................
    
  3. Reboot the server.

  4. Enter the following commands to install the new GPU driver:

    # ./NVIDIA-Linux-x86_64-535.161.08.run -a --ui=none --no-questions --accept-license
    Verifying archive integrity... OK
    Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.161.08............................................................................................................................................................
    
    Welcome to the NVIDIA Software Installer for Unix/Linux
    
    (Omitted)
    
    Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 535.161.08) is now complete.
    
  5. Reboot the server.

  6. Enter the nvidia-smi command to check the version of the installed driver and the model and number of recognized GPU cards.
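
Beyond step 6, a few additional checks (not part of the steps above, but common practice) can confirm that the driver's kernel module was built for and loaded into the running kernel:

    # Running kernel version and loaded NVIDIA kernel modules
    uname -r
    lsmod | grep -i nvidia
    # Reports the driver version the loaded module was built with
    cat /proc/driver/nvidia/version
    # Driver version plus the model and number of recognized GPU cards
    nvidia-smi --query-gpu=index,name,driver_version --format=csv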

Reinstall CUDA (Linux)

CUDA operates properly only if cuDNN is reinstalled as well. To reinstall them, follow these steps:

  1. Access the CUDA Toolkit download website.

  2. Select the CUDA Runtime installation file for the version you want to install to bring up the download link.

    • For the installation type, select runfile (local), which does not depend on the OS.
  3. Check the symbolic link of the existing CUDA path, and then delete the directory that the link points to (the existing version).

    • The existing CUDA Toolkit and cuDNN are deleted.
    # ll /usr/local/cuda
    lrwxrwxrwx 1 root root 21 Apr 16 15:42 /usr/local/cuda -> /usr/local/cuda-12.2/
    # rm -rf /usr/local/cuda-12.2
    
  4. Enter the following command to reinstall CUDA Toolkit:

    • Example: default version 12.2 provided by NAVER Cloud Platform
    # ./cuda_12.2.2_535.104.05_linux.run --toolkit --toolkitpath=/usr/local/cuda-12.2 --samples --samplespath=/usr/local/cuda-12.2/samples --silent
    
  5. Check the version of the reinstalled CUDA.

    # nvcc --version
    
  6. Access the cuDNN download website to bring up the download link.

  7. Download cuDNN from the link.

  8. cuDNN does not have a separate installer; installation is complete once the downloaded archive is unzipped into the directory where CUDA is installed. See the following to install it:

    • Example: default version 8.9.7 provided by NAVER Cloud Platform
    # cd /root
    # tar -xvf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz
    # mv cudnn-linux-x86_64-8.9.7.29_cuda12-archive cuda
    # cp cuda/include/cudnn* /usr/local/cuda/include
    # cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
    # chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
    
  9. Check the version of the installed cuDNN.

    • How to check the version (cuDNN 8.x):
    # cat /usr/local/cuda/include/cudnn_version.h | grep -A2 MAJOR
    #define CUDNN_MAJOR 8
    #define CUDNN_MINOR 9
    #define CUDNN_PATCHLEVEL 7
    
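The steps above do not cover it, but depending on your environment you may also need to make sure the CUDA binaries and the copied cuDNN libraries are on the search paths. A minimal sketch, assuming the default /usr/local/cuda symbolic link and a bash login shell:

    # Make nvcc and the CUDA/cuDNN libraries discoverable in the current session
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
    # Persist the settings for future sessions
    echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
    echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc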

Reinstall GPU driver (Windows)

To reinstall the GPU driver, you can simply run the script for auto installation.
If automatic reinstallation fails, you can reinstall the driver manually.

Automatic reinstallation

To download and run a script file for automatic reinstallation of the GPU driver, follow these steps:

  1. Enter the following command to download the script file:
    Start-BitsTransfer -Source "http://init.ncloud.com/win_gpu/install_gpu.exe" -Destination "c:\install_gpu.exe"
    
  2. Run the install_gpu.exe file.
    • The NVIDIA GPU driver installation popup window appears, and installation takes about 10 to 15 minutes.
  3. When the installation complete popup window appears, reboot the server.
  4. Enter devmgmt.msc in the Run dialog to open the Device Manager console.
  5. On the Device Manager console, double-click the NVIDIA graphics card under Display Adapters.
  6. Under the [Driver] tab on the properties popup window, check the driver version.
  7. Open the cmd window, enter cd C:\Program Files\NVIDIA Corporation\NVSMI to move to the driver path, and then enter nvidia-smi.
    • You can see that the graphics card has been recognized.

Manual reinstallation

If you are unable to run automatic reinstallation using the script, to reinstall the GPU driver manually, follow these steps:

  1. Download the driver file of the version you wish to reinstall or upgrade the driver to from the GPU driver download website.
  2. Run the downloaded GPU driver EXE file to install the driver.
    • Follow the instructions on the installer popup window.
    • You must agree to the software's terms of service to use it.
    • For the Installation option, select Express.
  3. Reboot the server.
  4. Enter devmgmt.msc in the Run dialog to open the Device Manager console.
  5. On the Device Manager console, double-click the NVIDIA graphics card under Display Adapters.
  6. Under the [Driver] tab on the properties popup window, check the driver version.

Reinstall CUDA (Windows)

CUDA operates properly only if cuDNN is reinstalled as well. To reinstall them, follow these steps:

  1. Access the CUDA Toolkit download website.

  2. Set the platform and click the link to download the EXE file.

  3. Run the downloaded CUDA EXE file to install CUDA.

    • Follow the instructions on the installer popup window.
    • You must agree to the software's terms of service to use it.
    • For the Installation option, select Express.
  4. Log in to the cuDNN download website and download the cuDNN file of the desired version.

    Note

    Only members can download cuDNN. If you do not have an account, sign up as a member and log in.

  5. Unzip the downloaded ZIP file and replace the bin, include, and lib folders in the CUDA installation path with the folders of the same names from the ZIP file.

  6. Open the cmd window, enter cd C:\Program Files\NVIDIA Corporation\NVSMI to move to the driver path, and then enter nvidia-smi.

    • You can see that the graphics card has been recognized.
Note

When you run the nvidia-smi command, the following information is output:

Item Description
Driver Version Version of the installed driver
CUDA Version CUDA API version supported by the driver
Name GPU model name
Temp GPU core temperature
Perf GPU performance state
  • Changes flexibly according to the GPU temperature and power usage
Pwr:Usage/Cap Current GPU power draw / power limit
Memory-Usage Memory usage by the GPU (current usage/GPU memory capacity)
Volatile GPU-Util GPU usage rate
Uncorr. ECC Number of uncorrectable Error Correction Code (ECC) errors
MIG M. Multi Instance GPU (MIG) mode status
  • Not applicable to the P40, T4, and V100 GPUs provided by NAVER Cloud Platform
Processes Information about the processes currently using the GPU
  • GPU: index of the GPU on which the process is running
  • GI ID/CI ID: IDs of the GPU instance and compute instance partitioned through Multi Instance GPU (MIG)
  • PID, Process name: ID and name of the process
  • Type: C (Compute) for CUDA/OpenCL processes and G (Graphics) for DirectX/OpenGL processes
  • GPU Memory Usage: GPU memory usage by the process
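
If you want the same information in a machine-readable form (for example, for your own monitoring scripts), nvidia-smi also supports field queries; this is not part of the guide's steps, only an illustration:

    # Selected per-GPU fields from the table above, as CSV
    nvidia-smi --query-gpu=index,name,driver_version,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv
    # Per-process GPU memory usage (the Processes section of the default output)
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv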

Collect/forward diagnostic data through NTK

You can collect and forward the NVIDIA debug logs of a GPU VM through Ncloud Tool Kit (NTK).
The process of collecting and forwarding the debug logs is as follows:

1. Run NTK
2. Collect GPU debug logs

Note

For more information on NTK, see Ncloud Tool Kit (Linux/Windows).

1. Run NTK

To run NTK on a Linux server, follow these steps:

  1. Enter the cd /usr/local/etc command.
    • Move to the path where NTK is located.
  2. Enter the tar zxvf ntk.tar.gz command.
    • The NTK file is unzipped.
    • If no ntk.tar.gz file exists or if you wish to replace the existing file with the latest version, enter wget -P /usr/local/etc http://init.ncloud.com/server/ntk/linux/xen/ntk.tar.gz to download the file.
  3. Enter the /usr/local/etc/ntk/ntk command to run NTK.
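
Combined, the sequence on a Linux server looks like the following; the download in the second step is only needed when ntk.tar.gz is missing or you want to replace it with the latest version:

    # Move to the path where NTK is located
    cd /usr/local/etc
    # (Optional) download the latest NTK archive
    wget -P /usr/local/etc http://init.ncloud.com/server/ntk/linux/xen/ntk.tar.gz
    # Unpack and run NTK
    tar zxvf ntk.tar.gz
    /usr/local/etc/ntk/ntk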

2. Collect GPU debug logs

To collect GPU debug logs in NTK, follow these steps:

  1. On the main screen of NTK, select E EXECUTE - << Run System Apps >>.

  2. Select G GPU DEBUG COLLECTING - FOR LOG COLLECT >>.

  3. Select Yes to run the log collection script.

  4. When a log collection success message and the log file storage path are displayed, check the details and select OK.

  5. Select whether to transfer the log file to NAVER Cloud's technical support center.

  • To transfer the log file, select Yes. The file transfer starts right away. Once the file has been transferred, a success message and the ShortURL where you can download the log are displayed.
  • If you do not want to transfer the log file, select No to finish.

Transfer created logs

To transfer the created logs to NAVER Cloud's technical support center, follow these steps:

Note

If you are unable to transfer the log file to NAVER Cloud's technical support center due to a network issue, attach and forward the log file stored in the VM.

  • Log file storage path: /usr/local/etc/ntk/logs/gpu_get_log

  1. On the main screen of NTK, select V VIEW - << View & Upload Logs >>.

  2. Select G - GPU DEBUG FILES.

  3. Check the list of log files created and select the log files to transfer to NAVER Cloud's technical support center.

  4. Select Yes.

    • The file transfer starts right away. Once the file has been transferred, a success message and the ShortURL where you can download the log are displayed.

GPU debug log file types

The following are the types of GPU log files created through NTK:

Log file name             Command used           Role
date.log                  date                   Outputs the log creation date and time
dmesg-xid.log             dmesg | grep -i xid    Outputs kernel messages that include xid
dmesg.log                 dmesg                  Outputs kernel messages
free.log                  free -m                Outputs memory usage in MB
last.log                  last                   Outputs login and reboot logs
ps.log                    ps auxf                Checks the process status
top.log                   top -b -n 1            Outputs top (run once in batch mode) and system information
uptime.log                uptime                 Outputs the uptime result
nvidia-bug-report.log.gz  nvidia-bug-report.sh   Runs the nvidia-bug-report.sh script
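
NTK automates the collection, but as a rough illustration of what the table above amounts to, the same data could be gathered by hand with a sketch like the one below. The output directory is a hypothetical path chosen for this example, and this is not how NTK works internally:

    # Collect GPU-related diagnostics into a timestamped directory (illustrative only)
    OUTDIR=/tmp/gpu_debug_$(date +%Y%m%d_%H%M%S)
    mkdir -p "$OUTDIR"
    date                > "$OUTDIR/date.log"
    dmesg | grep -i xid > "$OUTDIR/dmesg-xid.log"
    dmesg               > "$OUTDIR/dmesg.log"
    free -m             > "$OUTDIR/free.log"
    last                > "$OUTDIR/last.log"
    ps auxf             > "$OUTDIR/ps.log"
    top -b -n 1         > "$OUTDIR/top.log"
    uptime              > "$OUTDIR/uptime.log"
    # nvidia-bug-report.sh writes nvidia-bug-report.log.gz to the current directory
    (cd "$OUTDIR" && nvidia-bug-report.sh)
    # Package everything for transfer
    tar czf "$OUTDIR.tar.gz" -C "$(dirname "$OUTDIR")" "$(basename "$OUTDIR")"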

Monitor GPU resources

You can use Cloud Insight to monitor the GPU resources. For more information on Cloud Insight, see the Cloud Insight user guides.

View dashboard

In the NAVER Cloud Platform console, navigate to i_menu > Services > Management & Governance > Cloud Insight (Monitoring) > Dashboard and select the Service Dashboard/Server dashboard to view the default metrics related to servers at a glance.

  • Click [Change widget data] to filter the data to be displayed on the widget.
  • The metrics you can check in relation to GPU servers are as follows:
    • Current GPU MEM Activity: GPU memory controller usage rate = GPU/vmem_usage (%)
    • Current GPU MEM Usage: GPU memory usage = GPU/vmem_usage (MiB)
    • Current GPU Usage: GPU usage rate = GPU/usage (%)

For more information on how to view the dashboard, see View Cloud Insight dashboard.

Add user dashboard

You can add user dashboards to monitor only the metrics you want.
Click [Create dashboard] to create a new dashboard, and then click [Add widget] to set the types of widgets and metrics information to be displayed.

  • To create a widget related to the GPU Server, you must select Server as Product Type when setting the data.
  • If you are using a GPU-related metric as setting data, you must add as many dimensions (gpu_idx) as the number of GPUs.

For more information on how to create additional dashboards, see Create Cloud Insight dashboard.