KVM GPU


Available in VPC

This guide describes how to create and manage a KVM-based GPU Server from the NAVER Cloud Platform console.

Note
  • Set up redundancy across server zones so that your service can continue without interruption in the event of unexpected server failures or scheduled maintenance. See the Load Balancer overview to set up redundancy.
  • NAVER Cloud Platform provides a High Availability (HA) structure to prepare for failures in the physical server, such as memory, CPU, and power supply faults. HA is a policy that prevents hardware failures from spreading to the Virtual Machine (VM) server. It supports Live Migration, which automatically migrates VMs from a failing host server to another healthy host server. However, if an error occurs for which Live Migration cannot be initiated, the VM server is rebooted. If you operate a service on a single VM server, such restarts can cause service outages. To reduce the impact of such failures, it is recommended to implement VM server redundancy as described in the guidance above.

Check server information

You can view the GPU Server information in the same way as viewing regular server information. For more information, see Check server information.

Caution
  • GPU Servers incur charges even while the server is stopped.

Create GPU server

You can create a GPU Server in Services > Compute > Server on the console. For more information on how to create a server, see Create server.

Note
  • For GPU Servers, you can use NCP GPU images with drivers and related software pre-installed.
  • See the following table for Regions available per GPU type:
GPU type      Region
NVIDIA A100   KR-1
NVIDIA L4     KR-2
NVIDIA L40S   KR-2
  • Company members can create up to 5 NVIDIA A100 Servers.
    If you need more GPU Servers or if you are an individual member who needs to create a GPU Server, check the FAQs and submit an inquiry to Customer Support.
  • For more information on how to install the NVIDIA driver and required software, see GPU Server software installation guide.

Manage server

You can manage a GPU Server and change its settings in the same way as for a regular server. For more information, see Manage server.

Note
  • You cannot change the KVM GPU Server specifications.
  • GPU Servers cannot be converted into regular servers. To change to a regular server, you need to create a server image and use the image to create a new regular server.
  • You can use a server image created from a regular server to create a GPU Server.

Install GPU driver

Select one of the following two options:

Option 1. Use NCP GPU image with drivers pre-installed

To create an NCP GPU Server with NVIDIA drivers and related software pre-installed, follow these steps:

Note
  • When creating an NVIDIA A100 GPU Server, MLNX_OFED, which is required to use Infiniband, comes pre-installed.
  1. Access the NAVER Cloud Platform console.
  2. From the Region menu, click the Region you are using.
  3. From the Platform menu, click the platform you are using.
  4. Click Services > Compute > Server in order.
  5. In the tab that displays the server image, select the NCP Server Image tab.
  6. Under the image type, select KVM GPU.
  7. Select the Server image name you wish to use.

Option 2. Install GPU driver and CUDA using NVIDIA guide

Caution
  • When creating a GPU Server, set the boot disk size to at least 100 GB.
  • On NVIDIA A100 GPU Servers, you must install NVIDIA Fabric Manager.
  • If you plan to use Infiniband on an NVIDIA A100 GPU Server, you must install NVIDIA MLNX_OFED.
Note
  • See the following table for the minimum recommended driver release per GPU type:
GPU type      Minimum recommended driver release
NVIDIA A100   R530 or later
NVIDIA L4     R535 or later
NVIDIA L40S   R535 or later

2.1 Install GPU driver

Verify the minimum recommended driver version for your GPU type, and see the NVIDIA driver installation guide to install it.

Note
  • On NVIDIA A100 GPU Servers, NVIDIA Fabric Manager must be installed to support NVLink.
    • See the NVIDIA Fabric Manager documentation to install it.
    • You must install the NVIDIA Fabric Manager version that exactly matches your GPU driver version (see the installation sketch below).
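
The following is a minimal NVIDIA Fabric Manager installation sketch, assuming an Ubuntu-based server with the NVIDIA package repository already configured and driver branch 570 as a hypothetical example; match the package to the driver branch and version you actually installed:

# apt-get install -y nvidia-fabricmanager-570
# systemctl enable --now nvidia-fabricmanager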

2.2 Install CUDA Toolkit

To install CUDA, follow these steps:

  1. Connect to the NVIDIA CUDA Toolkit website.
  2. Select and download the installation file for the CUDA version you want to install.
    • Select runfile (local) as the Installer Type.
  3. Enter the following commands to install CUDA Toolkit:
    # chmod +x [name of downloaded installer file]
    # ./[name of downloaded installer file] --toolkit --toolkitpath=/usr/local/cuda-[version] --samples --samplespath=/usr/local/cuda --silent
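    • After the installation completes, nvcc is typically not on the default PATH. The following is a minimal sketch for registering it and verifying the installation, assuming the toolkit was installed to the hypothetical path /usr/local/cuda-12.8 (substitute your actual version path):
    # echo 'export PATH=/usr/local/cuda-12.8/bin:$PATH' >> /etc/profile.d/cuda.sh
    # echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> /etc/profile.d/cuda.sh
    # source /etc/profile.d/cuda.sh
    # nvcc --version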
    

Check GPU driver and essential software

Check the GPU driver and essential software on the server.

Check driver version

To check the GPU driver's version, enter the nvidia-smi command.

  • You can check the installed driver version, along with the GPU model and the number of GPUs.
    • The following is an example of an NVIDIA A100 GPU server:
# nvidia-smi 
Mon Jun  9 17:23:12 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:A1:00.0 Off |                  Off |
| N/A   38C    P0             66W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:A2:00.0 Off |                  Off |
| N/A   38C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  |   00000000:B1:00.0 Off |                  Off |
| N/A   35C    P0             62W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  |   00000000:B2:00.0 Off |                  Off |
| N/A   39C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  |   00000000:C1:00.0 Off |                  Off |
| N/A   36C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  |   00000000:C2:00.0 Off |                  Off |
| N/A   38C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  |   00000000:D1:00.0 Off |                  Off |
| N/A   36C    P0             59W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  |   00000000:D2:00.0 Off |                  Off |
| N/A   36C    P0             60W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
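
If you only need specific fields, such as the driver version and GPU model, for a quick check or for scripting, you can also use nvidia-smi in query format, as in the following example:

# nvidia-smi --query-gpu=index,name,driver_version --format=csv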

Check CUDA Toolkit version

To check the CUDA Toolkit version, enter the nvcc --version command.

# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

Check NVLink status

Note
  • The following applies only to NVIDIA A100 GPU Servers:

Enter the nvidia-smi topo -m command to check the NVLink status between GPUs.

  • If normal, the NVLink status between NVIDIA A100 GPUs is displayed as NV12.
  • If the NVLink status between GPUs appears as SYS, check that NVIDIA Fabric Manager is installed and running normally.
    • You can check the NVIDIA Fabric Manager status with the systemctl status nvidia-fabricmanager command (see the example after the topology output below).
# nvidia-smi topo -m
	    GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	    NV12	NV12	NV12	NV12	NV12	NV12	NV12	NODE	PHB	    0-27	0		N/A
GPU1	NV12	 X 	    NV12	NV12	NV12	NV12	NV12	NV12	NODE	PHB	    0-27	0		N/A
GPU2	NV12	NV12	 X 	    NV12	NV12	NV12	NV12	NV12	PHB	    NODE	0-27	0		N/A
GPU3	NV12	NV12	NV12	 X 	    NV12	NV12	NV12	NV12	PHB	    NODE	0-27	0		N/A
GPU4	NV12	NV12	NV12	NV12	 X 	    NV12	NV12	NV12	SYS	    SYS	    28-55	1		N/A
GPU5	NV12	NV12	NV12	NV12	NV12	 X 	    NV12	NV12	SYS	    SYS	    28-55	1		N/A
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	 X 	    NV12	SYS	    SYS	    28-55	1		N/A
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	 X 	    SYS	    SYS	    28-55	1		N/A
NIC0	NODE	NODE	PHB 	PHB 	SYS 	SYS 	SYS 	SYS 	 X   	NODE				
NIC1	PHB	    PHB	    NODE	NODE	SYS 	SYS 	SYS 	SYS 	NODE	 X 				
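
If the status between GPUs appears as SYS, you can check the NVIDIA Fabric Manager service and restart it as mentioned above. The following is a minimal sketch, assuming the standard systemd unit installed with the Fabric Manager package:

# systemctl status nvidia-fabricmanager
# systemctl restart nvidia-fabricmanager
# nvidia-smi topo -m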


Infiniband

The following section describes preparation and checks for using Infiniband:

Caution
  • Infiniband is available only on NVIDIA A100 GPU Servers.
  • A Fabric Cluster product is required. See Fabric Cluster.
Note
  • NVIDIA A100 GPU Servers provide two 200 Gbps Infiniband HDR interfaces, for a total of 400 Gbps of bandwidth.

1. Infiniband prerequisites

The following describes what to prepare for Infiniband communication between servers:

  • Install the Infiniband driver (MLNX_OFED); a sample installation sketch follows this list.
  • Ensure all target servers are set to Cluster Mode before using Infiniband.
    • Create a cluster from the Fabric Cluster menu, and add the server you want to communicate with to the created Fabric Cluster in Cluster Mode.
    • For more information about how to set it up, see Fabric Cluster.
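
The following is a minimal MLNX_OFED installation sketch, assuming you have downloaded the MLNX_OFED archive for your OS from NVIDIA; the file name below is a placeholder, so substitute the version you actually downloaded:

# tar -xzf MLNX_OFED_LINUX-[version]-[OS]-x86_64.tgz
# cd MLNX_OFED_LINUX-[version]-[OS]-x86_64
# ./mlnxofedinstall
# /etc/init.d/openibd restart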

2. Check Infiniband connection status

To check the Infiniband connection status, enter the ibstat command and check the port status of the Infiniband connected to the server.

  • In a normal state, the Physical state item shows LinkUp and the State item shows Active.
  • The Physical state item may show Polling for a short time immediately after the server boots; this is normal while the link is being established.
  • On NVIDIA A100 GPU Servers, Rate should appear as 200. (A quick-check example follows the output below.)
[root@test01 ~]# ibstat
CA 'mlx5_0'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.41.1000
	Hardware version: 0
	Node GUID: 0x88e9a4ffff667b00
	System image GUID: 0x88e9a4ffff667b00
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 200
		Base lid: 12
		LMC: 0
		SM lid: 14
		Capability mask: 0xa651e848
		Port GUID: 0x88e9a4ffff667b00
		Link layer: InfiniBand
CA 'mlx5_1'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.41.1000
	Hardware version: 0
	Node GUID: 0x88e9a4ffff606426
	System image GUID: 0x88e9a4ffff606426
	Port 1:
		State: Down
		Physical state: Polling
		Rate: 10
		Base lid: 15
		LMC: 0
		SM lid: 16
		Capability mask: 0xa651e848
		Port GUID: 0x88e9a4ffff606426
		Link layer: InfiniBand
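
To quickly check only the items described above, you can filter the ibstat output, as in the following example:

# ibstat | grep -E 'State|Physical state|Rate'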

3. Test communication

You can test Infiniband communication between servers with ibping. The test requires two servers that exchange data: one acting as the Server and one acting as the Client.
First, check the LID information of the server acting as the Server (which receives the data), and then proceed with the ibping test.

3-1. Check LID information

Enter the ibstat command on the server acting as the Server to check the LID information for mlx5_0 and mlx5_1.

[root@test01 ~]# ibstat mlx5_0 | grep 'Base lid'
                Base lid: 210

[root@test01 ~]# ibstat mlx5_1 | grep 'Base lid'
                Base lid: 209

3-2. mlx5_0 test

To proceed with the mlx5_0 test, follow these steps:

  1. Run the ibping -S -C mlx5_0 command on the server acting as the Server.

  2. Run the ibping -c5 -C mlx5_0 -L {verified mlx5_0 LID value} command on the server acting as the Client.

    • If the status is normal, the responses to all ping packets are received normally as follows:
    root@test02:~# ibping -c5 -C mlx5_0 -L 210
    Pong from test01.(none) (Lid 210): time 0.027 ms
    Pong from test01.(none) (Lid 210): time 0.013 ms
    Pong from test01.(none) (Lid 210): time 0.012 ms
    Pong from test01.(none) (Lid 210): time 0.017 ms
    Pong from test01.(none) (Lid 210): time 0.013 ms
    
    --- test01.(none) (Lid 210) ibping statistics ---
    5 packets transmitted, 5 received, 0% packet loss, time 5000 ms
    rtt min/avg/max = 0.012/0.016/0.027 ms
    

3-3. mlx5_1 test

To proceed with the mlx5_1 test, follow these steps:

  1. Run the ibping -S -C mlx5_1 command on the server acting as the Server.

  2. Run the ibping -c5 -C mlx5_1 -L {verified mlx5_1 LID value} command on the server acting as the Client.

    • If the status is normal, the responses to all ping packets are received normally as follows:
    root@test02~# ibping -c5 -C mlx5_1 -L 209
    Pong from test01.(none) (Lid 209): time 0.024 ms
    Pong from test01.(none) (Lid 209): time 0.013 ms
    Pong from test01.(none) (Lid 209): time 0.009 ms
    Pong from test01.(none) (Lid 209): time 0.009 ms
    Pong from test01.(none) (Lid 209): time 0.014 ms
    
    --- test01.(none) (Lid 209) ibping statistics ---
    5 packets transmitted, 5 received, 0% packet loss, time 5000 ms
    rtt min/avg/max = 0.009/0.013/0.024 ms
    

3-4. Set IP over Infiniband (IPoIB)

To set up IPoIB, see NVIDIA's IPoIB setting guide.
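
As a reference, after the driver is installed you can assign an IP address to an IPoIB interface with standard Linux tools. The following is a minimal sketch; the interface name ib0 and the address 192.168.100.11/24 are hypothetical examples, and the actual addressing should follow your Fabric Cluster network plan:

# ip addr add 192.168.100.11/24 dev ib0
# ip link set ib0 up
# ip addr show ib0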