KVM GPU


Available in VPC

You can create and manage KVM-based GPU servers in the NAVER Cloud Platform console.

Note
  • To ensure service continuity without interruption in the event of unexpected server failures or scheduled changes, it is recommended to configure servers with inter-zone redundancy by default. To configure redundancy, see Load Balancer overview.
  • NAVER Cloud Platform provides a High Availability (HA) structure to prepare for physical server failures, such as memory, CPU, and power supply failures. HA is a policy that prevents hardware failures from spreading to the virtual machine (VM) servers. It supports live migration, which automatically moves the VMs on a failing host server to another healthy host server. However, if an error occurs where live migration cannot be initiated, the VM server is rebooted. If the service runs on a single VM server, a VM restart can cause a service disruption. To reduce the impact of failures, it is recommended to configure VM server redundancy as described above.

View server information

You can view GPU server information in the same way as general server information. For more information, see View server information.

Caution
  • GPU servers incur full server charges even when stopped.

Create GPU server

You can create a GPU server in i_menu > Services > Compute > Server in the VPC environment of the console. For more information, see Create server.

Note
  • You can create GPU servers using NCP GPU images with pre-installed drivers and related software.
  • See the following table for Regions available per GPU type:
GPU type        Region
NVIDIA A100     KR-1
NVIDIA L4       KR-1, KR-2
NVIDIA L40S     KR-2
  • Corporate members can create up to 5 NVIDIA A100 servers.
    If you need more GPU servers or if you are an individual member who needs to create a GPU server, see the FAQ section and submit an inquiry to Customer Support.
  • See the GPU server software installation guide for installation of NVIDIA drivers and required software.

Manage server

You can manage and change GPU server settings in the same way as general servers. For more information, see Manage server.

Note
  • KVM GPU servers cannot be resized.
  • GPU servers cannot be converted to general servers. To change to a general server, you need to create a server image and use the image to create a new general server.
  • You can create GPU servers using server images created from general servers.

Install GPU driver

Select one of the following two options:

Option 1. Use NCP GPU images with pre-installed drivers

To create an NCP GPU server with NVIDIA drivers and related software pre-installed:

Note
  • For NVIDIA A100 GPU servers, MLNX_OFED is pre-installed to support InfiniBand.
  1. In the VPC environment of the NAVER Cloud Platform console, navigate to i_menu > Services > Compute > Server.
  2. On the server images tab, select the NCP server image tab.
  3. Select KVM GPU as the image type.
  4. Select the Server image name you wish to use.

Option 2. Install GPU driver and CUDA using the NVIDIA guide

Caution
  • When creating a GPU server, set the boot disk size to at least 100 GB.
  • On NVIDIA A100 GPU servers, you must install NVIDIA Fabric Manager.
  • If you use InfiniBand on an NVIDIA A100 GPU server, you must install NVIDIA MLNX_OFED.
Note
  • See the following table for the minimum recommended driver release per GPU type:
GPU type        Minimum recommended driver release
NVIDIA A100     R530 or later
NVIDIA L4       R535 or later
NVIDIA L40S     R535 or later

2.1 Install GPU driver

Check the minimum recommended driver release for your GPU type, and see the NVIDIA driver installation guide to install it.

Note
  • For NVLink support on NVIDIA A100 GPU servers, NVIDIA Fabric Manager must also be installed.
    • See the NVIDIA Fabric Manager documentation and install it.
    • You must install the NVIDIA Fabric Manager version that exactly matches your GPU driver version; a minimal installation sketch follows this note.
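
The following is a minimal installation sketch, assuming an Ubuntu server with the NVIDIA CUDA apt repository already configured and the R570 driver branch; the package names and branch number are illustrative and depend on your OS and the driver release you choose:

# apt-get install -y nvidia-driver-570-server nvidia-fabricmanager-570
# systemctl enable --now nvidia-fabricmanager
# systemctl status nvidia-fabricmanager

If the installed NVIDIA Fabric Manager version does not exactly match the driver version, the nvidia-fabricmanager service may fail to start.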

2.2 Install CUDA Toolkit

To install CUDA:

  1. Access the NVIDIA CUDA Toolkit website.
  2. Select and download the installation file for the CUDA Toolkit version you want to install.
    • Select runfile (local) as the Installer Type.
  3. Enter the following commands to install CUDA Toolkit:
    # chmod +x [name of downloaded installer file]
    # ./[name of downloaded installer file] --toolkit --toolkitpath=/usr/local/cuda-[version] --samples --samplespath=/usr/local/cuda --silent
    
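
After the installation completes, the CUDA binaries are typically not on the default PATH. The following is a minimal sketch, assuming the default installation path used above; add the lines to your shell profile to make them persistent:

# export PATH=/usr/local/cuda-[version]/bin:$PATH
# export LD_LIBRARY_PATH=/usr/local/cuda-[version]/lib64:$LD_LIBRARY_PATH
# nvcc --version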

Check GPU driver and essential software

Check the GPU driver and essential software on the server.

Check driver version

To check the GPU driver version, enter the nvidia-smi command.

  • You can check the installed driver version, the GPU model, and the number of GPUs.
    • The following is an example of an NVIDIA A100 GPU server:
# nvidia-smi 
Mon Jun  9 17:23:12 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:A1:00.0 Off |                  Off |
| N/A   38C    P0             66W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:A2:00.0 Off |                  Off |
| N/A   38C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  |   00000000:B1:00.0 Off |                  Off |
| N/A   35C    P0             62W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  |   00000000:B2:00.0 Off |                  Off |
| N/A   39C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  |   00000000:C1:00.0 Off |                  Off |
| N/A   36C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  |   00000000:C2:00.0 Off |                  Off |
| N/A   38C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  |   00000000:D1:00.0 Off |                  Off |
| N/A   36C    P0             59W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  |   00000000:D2:00.0 Off |                  Off |
| N/A   36C    P0             60W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Check CUDA Toolkit version

To check the CUDA Toolkit version, enter the nvcc --version command.

# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

Disable MIG

If MIG is enabled, disable it by entering the nvidia-smi -mig 0 command, as shown in the example after the following note.

Note
  • You can check MIG status with either nvidia-smi -L or nvidia-smi -q | grep MIG command.
  • You can also disable MIG using the nvidia-smi -mig disable command.
  • After changing the settings, you must reset the GPU or reboot the system.
  • If MIG is enabled, some GPU metrics may not be displayed.
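
The following is a minimal example of disabling MIG on all GPUs; a full reboot is shown here, but a GPU reset may be sufficient depending on the workloads currently using the GPUs:

# nvidia-smi -mig 0
# reboot

After the reboot, check the MIG status again with nvidia-smi -L or nvidia-smi -q | grep MIG as described in the note above.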

Check NVLink status

Note
  • The following applies only to NVIDIA A100 GPU servers:

Enter the nvidia-smi topo -m command to check NVLink status between GPUs.

  • Under normal conditions, the NVLink status between NVIDIA A100 GPUs is displayed as NV12.
  • If the NVLink status between GPUs is displayed as SYS, check whether NVIDIA Fabric Manager is installed and operating properly.
    • You can check NVIDIA Fabric Manager status with the systemctl status nvidia-fabricmanager command.
# nvidia-smi topo -m
	    GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	    NV12	NV12	NV12	NV12	NV12	NV12	NV12	NODE	PHB	    0-27	0		N/A
GPU1	NV12	 X 	    NV12	NV12	NV12	NV12	NV12	NV12	NODE	PHB	    0-27	0		N/A
GPU2	NV12	NV12	 X 	    NV12	NV12	NV12	NV12	NV12	PHB	    NODE	0-27	0		N/A
GPU3	NV12	NV12	NV12	 X 	    NV12	NV12	NV12	NV12	PHB	    NODE	0-27	0		N/A
GPU4	NV12	NV12	NV12	NV12	 X 	    NV12	NV12	NV12	SYS	    SYS	    28-55	1		N/A
GPU5	NV12	NV12	NV12	NV12	NV12	 X 	    NV12	NV12	SYS	    SYS	    28-55	1		N/A
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	 X 	    NV12	SYS	    SYS	    28-55	1		N/A
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	 X 	    SYS	    SYS	    28-55	1		N/A
NIC0	NODE	NODE	PHB 	PHB 	SYS 	SYS 	SYS 	SYS 	 X   	NODE				
NIC1	PHB	    PHB	    NODE	NODE	SYS 	SYS 	SYS 	SYS 	NODE	 X 				


InfiniBand

This section describes the preparation and checks required to use InfiniBand.

Caution
  • InfiniBand is available only on NVIDIA A100 GPU servers.
  • A Fabric Cluster product is required. For more information, see the Fabric Cluster documentation.
Note
  • NVIDIA A100 GPU servers provide two 200 Gbps InfiniBand HDR interfaces, for a total of 400 Gbps of bandwidth.

1. InfiniBand prerequisites

The following describes what you need to prepare for InfiniBand communication between servers:

  • Install the InfiniBand driver (MLNX_OFED), as shown in the sketch after this list.
    • Install the MLNX_OFED version that matches your OS.
  • Ensure all communication target servers are set to Cluster Mode before using InfiniBand.
    • Create a cluster from the Fabric Cluster menu, and add the servers you want to communicate with to the created Fabric Cluster in Cluster Mode.
    • For more information on how to set it up, see the Fabric Cluster documentation.
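
The following is a minimal MLNX_OFED installation sketch, assuming you have downloaded and extracted the MLNX_OFED bundle that matches your OS from the NVIDIA website and are running the commands from the extracted directory:

# ./mlnxofedinstall
# /etc/init.d/openibd restart
# ofed_info -s

The ofed_info -s command prints the installed MLNX_OFED version. The ibstat and ibping tools used in the following steps are typically installed as part of the bundle.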

2. Check InfiniBand connection status

To check the InfiniBand connection status, enter the ibstat command and check the port status of the InfiniBand interfaces connected to the server.

  • The normal status for the Physical state item is LinkUp, and that for the State item is Active.
  • The Physical state item may be displayed as Polling for a short time immediately after server booting; this is normal because the connection is still being established.
  • For NVIDIA A100 GPU servers, the Rate must be displayed as 200.
[root@test01 ~]# ibstat
CA 'mlx5_0'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.41.1000
	Hardware version: 0
	Node GUID: 0x88e9a4ffff667b00
	System image GUID: 0x88e9a4ffff667b00
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 200
		Base lid: 12
		LMC: 0
		SM lid: 14
		Capability mask: 0xa651e848
		Port GUID: 0x88e9a4ffff667b00
		Link layer: InfiniBand
CA 'mlx5_1'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.41.1000
	Hardware version: 0
	Node GUID: 0x88e9a4ffff606426
	System image GUID: 0x88e9a4ffff606426
	Port 1:
		State: Down
		Physical state: Polling
		Rate: 10
		Base lid: 15
		LMC: 0
		SM lid: 16
		Capability mask: 0xa651e848
		Port GUID: 0x88e9a4ffff606426
		Link layer: InfiniBand

3. Communication test

You can test InfiniBand communication between servers with ibping. The test requires two servers: one acting as the Server that receives data and one acting as the Client that sends data.
First, check the LID information of the server acting as the Server, and then proceed with the ibping test.

3-1. Check LID information

On the server with the Server role, enter the ibstat command to check LID information for mlx5_0 and mlx5_1.

[root@test01 ~]# ibstat mlx5_0 | grep 'Base lid'
                Base lid: 210

[root@test01 ~]# ibstat mlx5_1 | grep 'Base lid'
                Base lid: 209

3-2. mlx5_0 test

To run the mlx5_0 test:

  1. Run the ibping -S -C mlx5_0 command in the server working as the Server.
  2. Run the ibping -c5 -C mlx5_0 -L {verified mlx5_0 LID value} command in the server working as the Client.
    • If the status is normal, the responses to all ping packets are received normally as follows:
    root@test02:~# ibping -c5 -C mlx5_0 -L 210
    Pong from test01.(none) (Lid 210): time 0.027 ms
    Pong from test01.(none) (Lid 210): time 0.013 ms
    Pong from test01.(none) (Lid 210): time 0.012 ms
    Pong from test01.(none) (Lid 210): time 0.017 ms
    Pong from test01.(none) (Lid 210): time 0.013 ms
    
    --- test01.(none) (Lid 210) ibping statistics ---
    5 packets transmitted, 5 received, 0% packet loss, time 5000 ms
    rtt min/avg/max = 0.012/0.016/0.027 ms
    

3-3. mlx5_1 test

To run the mlx5_1 test:

  1. Run the ibping -S -C mlx5_1 command in the server working as the Server.
  2. Run the ibping -c5 -C mlx5_1 -L {verified mlx5_1 LID value} command in the server working as the Client.
    • If the status is normal, the responses to all ping packets are received normally as follows:
    root@test02~# ibping -c5 -C mlx5_1 -L 209
    Pong from test01.(none) (Lid 209): time 0.024 ms
    Pong from test01.(none) (Lid 209): time 0.013 ms
    Pong from test01.(none) (Lid 209): time 0.009 ms
    Pong from test01.(none) (Lid 209): time 0.009 ms
    Pong from test01.(none) (Lid 209): time 0.014 ms
    
    --- test01.(none) (Lid 209) ibping statistics ---
    5 packets transmitted, 5 received, 0% packet loss, time 5000 ms
    rtt min/avg/max = 0.009/0.013/0.024 ms
    

3-4. IP over InfiniBand (IPoIB) configuration

To set up IPoIB, see NVIDIA's IPoIB configuration guide.
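
The following is a minimal sketch of a temporary IPoIB setup, assuming the IPoIB interface appears as ib0 and using an arbitrary private subnet; the interface name, addresses, and the method for making the configuration persistent depend on your OS:

# ip link set ib0 up
# ip addr add 192.168.200.1/24 dev ib0
# ping -c 3 192.168.200.2

Assign a different address in the same subnet on each communication target server, then verify connectivity with ping over the IPoIB interface as shown above.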