Available in VPC
You can create and manage KVM-based GPU servers in the NAVER Cloud Platform console.
- To ensure service continuity without interruption in the event of unexpected server failures or scheduled changes, it is recommended to configure servers with inter-zone redundancy by default. To configure redundancy, see Load Balancer overview.
- NAVER Cloud Platform provides a High Availability (HA) structure to prepare for failures in physical server components such as memory, CPU, and power supply. HA is a policy that prevents hardware failures from propagating to the virtual machine (VM) server. It supports live migration, which automatically moves the VMs on a host server to another healthy host server when a failure occurs on that host. However, if an error occurs that prevents live migration from being initiated, the VM server is rebooted. If the service is operated on a single VM server, a VM restart may cause a service disruption. To reduce the impact of such failures, it is recommended to configure VM server redundancy as described above.
View server information
You can view GPU server information in the same way as general server information. For more information, see View server information.
- GPU servers incur full server charges even when stopped.
Create GPU server
You can create a GPU server in Services > Compute > Server in the VPC environment of the NAVER Cloud Platform console. For more information, see Create server.
- You can create GPU servers using NCP GPU images with pre-installed drivers and related software.
- See the following table for Regions available per GPU type:
| GPU type | Region |
|---|---|
| NVIDIA A100 | KR-1 |
| NVIDIA L4 | KR-1, KR-2 |
| NVIDIA L40S | KR-2 |
- Corporate members can create up to 5 NVIDIA A100 servers.
If you need more GPU servers, or if you are an individual member who needs to create a GPU server, see the FAQ section and submit an inquiry to Customer Support.
- See the GPU server software installation guide for installation of NVIDIA drivers and required software.
Manage server
You can manage and change GPU server settings in the same way as general servers. For more information, see Manage server.
- KVM GPU servers cannot be resized.
- GPU servers cannot be converted to general servers. To change to a general server, you need to create a server image and use the image to create a new general server.
- You can create GPU servers using server images created from general servers.
Install GPU driver
Select 1 of the following 2 options:
- Option 1. Use NCP GPU images with pre-installed drivers
- Option 2. Manually install GPU drivers and related software
Option 1. Use NCP GPU images with pre-installed drivers
To create an NCP GPU server with NVIDIA drivers and related software pre-installed:
- For NVIDIA A100 GPU servers, MLNX_OFED is pre-installed to support InfiniBand.
- In the VPC environment of the NAVER Cloud Platform console, navigate to Services > Compute > Server.
- On the server images tab, select the NCP server image tab.
- Select KVM GPU as the image type.
- Select the Server image name you wish to use.
Option 2. Install GPU driver and CUDA using the NVIDIA guide
- When creating a GPU server, set the boot disk size to at least 100 GB.
- On NVIDIA A100 GPU servers, you must install NVIDIA Fabric Manager.
- If you use InfiniBand on an NVIDIA A100 GPU server, you must install NVIDIA MLNX_OFED.
- NVIDIA driver documentation
- The minimum recommended driver versions for each GPU type are listed below:
| GPU type | Minimum recommended driver release |
|---|---|
| NVIDIA A100 | R530 or later |
| NVIDIA L4 | R535 or later |
| NVIDIA L40S | R535 or later |
2.1 Install GPU driver
Verify the minimum recommended driver version for your GPU type, and see the NVIDIA driver installation guide to install it.
- For NVLink support on NVIDIA A100 GPU servers, NVIDIA Fabric Manager must also be installed.
- See the NVIDIA Fabric Manager documentation and install it; a minimal installation sketch follows this list.
- You must install the NVIDIA Fabric Manager that exactly matches your GPU driver version.
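For reference, the following is a minimal installation sketch, assuming an Ubuntu server with NVIDIA's repository configured and an R570 driver installed; the package name and branch number (nvidia-fabricmanager-570 here) vary by OS and driver version, so adjust them to your environment.
# Install NVIDIA Fabric Manager matching the installed driver branch (assumption: R570 on Ubuntu).
sudo apt-get install -y nvidia-fabricmanager-570
# Enable the service at boot, start it now, and verify that it is active.
sudo systemctl enable --now nvidia-fabricmanager
systemctl status nvidia-fabricmanager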
2.2 Install CUDA Toolkit
To install CUDA:
- Access the NVIDIA CUDA Toolkit website.
- Select the CUDA Toolkit installation file for the version you want to install, and download it.
- Select runfile (local) as the Installer Type.
- Enter the following commands to install CUDA Toolkit:
# chmod +x [name of downloaded installer file]
# ./[name of downloaded installer file] --toolkit --toolkitpath=/usr/local/cuda-[version] --samples --samplespath=/usr/local/cuda --silent
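After installation, the CUDA binaries and libraries are not on the default search paths. The following is a minimal sketch that assumes the toolkit was installed to /usr/local/cuda-12.8; replace the version to match your installation.
# Add the CUDA Toolkit to the current shell session (assumed install path: /usr/local/cuda-12.8).
export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
# To make the change persistent, append the same exports to a shell profile such as ~/.bashrc.
echo 'export PATH=/usr/local/cuda-12.8/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc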
Check GPU driver and essential software
Check the GPU driver and essential software on the server.
Check driver version
To check the GPU driver version, enter the nvidia-smi command.
- You can check the installed driver version, the GPU model, and the number of GPUs.
- The following is an example of an NVIDIA A100 GPU server:
# nvidia-smi
Mon Jun 9 17:23:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:A1:00.0 Off | Off |
| N/A 38C P0 66W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:A2:00.0 Off | Off |
| N/A 38C P0 63W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:B1:00.0 Off | Off |
| N/A 35C P0 62W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:B2:00.0 Off | Off |
| N/A 39C P0 63W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:C1:00.0 Off | Off |
| N/A 36C P0 64W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:C2:00.0 Off | Off |
| N/A 38C P0 63W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:D1:00.0 Off | Off |
| N/A 36C P0 59W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:D2:00.0 Off | Off |
| N/A 36C P0 60W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
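For scripted checks, nvidia-smi can also print selected fields in CSV form; the following sketch lists the index, name, driver version, and memory size of each GPU.
# Print one CSV line per GPU with the fields of interest.
nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv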
Check CUDA Toolkit version
To check the CUDA Toolkit version, enter the nvcc --version command.
# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0
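To confirm that the toolkit can actually build and run GPU code, you can compile a trivial kernel. The following is a minimal sketch that assumes nvcc is on the PATH; the file path /tmp/hello.cu is arbitrary.
# Write a minimal CUDA program that prints a message from 4 GPU threads.
cat > /tmp/hello.cu <<'EOF'
#include <cstdio>

__global__ void hello() {
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

int main() {
    hello<<<1, 4>>>();        // launch 1 block of 4 threads
    cudaDeviceSynchronize();  // wait for the kernel and its output to complete
    return 0;
}
EOF
# Compile with nvcc and run the resulting binary.
nvcc /tmp/hello.cu -o /tmp/hello && /tmp/hello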
Disable MIG
If MIG is enabled, enter the nvidia-smi -mig 0 command to disable it.
- You can check the MIG status with either the nvidia-smi -L or nvidia-smi -q | grep MIG command.
- You can also disable MIG using the nvidia-smi -mig disable command.
- After changing the setting, you must reset the GPU or reboot the system.
- If MIG is enabled, some GPU metrics may not be displayed.
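The following sketch shows the full sequence on a single GPU, using GPU index 0 as an example: check the current MIG mode, disable it, and reset the GPU so the change takes effect (reboot instead if the reset is blocked by running workloads).
# Check the current and pending MIG mode.
nvidia-smi -q | grep -A 2 "MIG Mode"
# Disable MIG on GPU 0 (omit -i 0 to apply to all GPUs).
sudo nvidia-smi -i 0 -mig 0
# Reset GPU 0 so the new MIG mode takes effect.
sudo nvidia-smi --gpu-reset -i 0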
Check NVLink status
- The following applies only to NVIDIA A100 GPU servers:
Enter the nvidia-smi topo -m command to check NVLink status between GPUs.
- Under normal conditions, the NVLink status between NVIDIA A100 GPUs is displayed as NV12.
- If the NVLink status between GPUs is displayed as SYS, check whether NVIDIA Fabric Manager is installed and operating properly.
  - You can check the NVIDIA Fabric Manager status with the systemctl status nvidia-fabricmanager command.
# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 NODE PHB 0-27 0 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 NODE PHB 0-27 0 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 PHB NODE 0-27 0 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 PHB NODE 0-27 0 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS 28-55 1 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS 28-55 1 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS 28-55 1 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS 28-55 1 N/A
NIC0 NODE NODE PHB PHB SYS SYS SYS SYS X NODE
NIC1 PHB PHB NODE NODE SYS SYS SYS SYS NODE X
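In addition to the topology matrix, you can inspect each NVLink individually; the following prints the per-link state and speed for GPU 0.
# Show the state and speed of every NVLink on GPU 0.
nvidia-smi nvlink --status -i 0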
InfiniBand
This section describes preparation and checks for using InfiniBand.
- InfiniBand is available only on NVIDIA A100 GPU servers.
- A Fabric Cluster product is required. For more information, see the Fabric Cluster documentation.
- NVIDIA A100 GPU servers provide two 200 Gbps InfiniBand HDR interfaces, for a total of 400 Gbps of bandwidth.
1. InfiniBand prerequisites
The following describes what you need to prepare for InfiniBand communication between servers:
- Install the InfiniBand driver (MLNX_OFED); an installation sketch follows this list.
  - Install the MLNX_OFED version that matches your OS.
- Ensure all communication target servers are set to Cluster Mode before using InfiniBand.
- Create a cluster from the Fabric Cluster menu, and add the server you want to communicate with to the created Fabric Cluster in Cluster Mode.
- For more information on how to set it up, see the Fabric Cluster documentation.
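For reference, a typical MLNX_OFED installation from the downloaded bundle looks like the sketch below; the version and distribution strings in the file name are placeholders, so download the bundle that matches your OS from NVIDIA's site.
# Extract the downloaded MLNX_OFED bundle (file name is a placeholder).
tar -xzf MLNX_OFED_LINUX-<version>-<distro>-x86_64.tgz
cd MLNX_OFED_LINUX-<version>-<distro>-x86_64
# Run the installer, then restart the InfiniBand driver stack.
sudo ./mlnxofedinstall --without-fw-update
sudo /etc/init.d/openibd restart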
2. Check InfiniBand connection status
To check the InfiniBand connection status, enter the ibstat command and check the port status of the InfiniBand interfaces connected to the server.
- The normal status for the Physical state item is LinkUp, and that for the State item is Active.
- The Physical state item may be displayed as Polling for a short time immediately after server booting; this is normal because the connection is still being established.
- For NVIDIA A100 GPU type servers, the Rate must be displayed as 200.
[root@test01 ~]# ibstat
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.41.1000
Hardware version: 0
Node GUID: 0x88e9a4ffff667b00
System image GUID: 0x88e9a4ffff667b00
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 12
LMC: 0
SM lid: 14
Capability mask: 0xa651e848
Port GUID: 0x88e9a4ffff667b00
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4123
Number of ports: 1
Firmware version: 20.41.1000
Hardware version: 0
Node GUID: 0x88e9a4ffff606426
System image GUID: 0x88e9a4ffff606426
Port 1:
State: Down
Physical state: Polling
Rate: 10
Base lid: 15
LMC: 0
SM lid: 16
Capability mask: 0xa651e848
Port GUID: 0x88e9a4ffff606426
Link layer: InfiniBand
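To check both ports at a glance without reading the full output, you can filter the fields of interest; the device names mlx5_0 and mlx5_1 below are taken from the output above.
# Print only the link state, physical state, and rate for each InfiniBand device.
for dev in mlx5_0 mlx5_1; do
    echo "== $dev =="
    ibstat "$dev" | grep -E 'State|Physical state|Rate'
done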
3. Communication test
You can test InfiniBand communication between servers with ibping. The test involves two servers that exchange data: one acting as the Server and one acting as the Client.
First, check the LID information of the server acting as the Server (the side that receives data), and then proceed with the ibping test.
3-1. Check LID information
On the server with the Server role, enter the ibstat command to check LID information for mlx5_0 and mlx5_1.
[root@test01 ~]# ibstat mlx5_0 | grep 'Base lid'
Base lid: 210
[root@test01 ~]# ibstat mlx5_1 | grep 'Base lid'
Base lid: 209
3-2. mlx5_0 test
To run the mlx5_0 test:
- Run the ibping -S -C mlx5_0 command on the server acting as the Server.
- Run the ibping -c5 -C mlx5_0 -L {verified mlx5_0 LID value} command on the server acting as the Client.
  - If the status is normal, responses to all ping packets are received as follows:
root@test02:~# ibping -c5 -C mlx5_0 -L 210
Pong from test01.(none) (Lid 210): time 0.027 ms
Pong from test01.(none) (Lid 210): time 0.013 ms
Pong from test01.(none) (Lid 210): time 0.012 ms
Pong from test01.(none) (Lid 210): time 0.017 ms
Pong from test01.(none) (Lid 210): time 0.013 ms

--- test01.(none) (Lid 210) ibping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 5000 ms
rtt min/avg/max = 0.012/0.016/0.027 ms
3-3. mlx5_1 test
To run the mlx5_1 test:
- Run the ibping -S -C mlx5_1 command on the server acting as the Server.
- Run the ibping -c5 -C mlx5_1 -L {verified mlx5_1 LID value} command on the server acting as the Client.
  - If the status is normal, responses to all ping packets are received as follows:
root@test02:~# ibping -c5 -C mlx5_1 -L 209
Pong from test01.(none) (Lid 209): time 0.024 ms
Pong from test01.(none) (Lid 209): time 0.013 ms
Pong from test01.(none) (Lid 209): time 0.009 ms
Pong from test01.(none) (Lid 209): time 0.009 ms
Pong from test01.(none) (Lid 209): time 0.014 ms

--- test01.(none) (Lid 209) ibping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 5000 ms
rtt min/avg/max = 0.009/0.013/0.024 ms
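Beyond basic connectivity, you can also measure InfiniBand bandwidth with the perftest tools bundled with MLNX_OFED; the sketch below assumes ib_write_bw is installed and uses 10.0.0.11 as a placeholder for an IP address of the Server-side host that the Client can reach.
# On the server side: wait for a bandwidth test on device mlx5_0.
ib_write_bw -d mlx5_0
# On the client side: run the test against the server (placeholder address: 10.0.0.11).
ib_write_bw -d mlx5_0 10.0.0.11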
3-4. IP over InfiniBand (IPoIB) configuration
To set up IPoIB, see NVIDIA's IPoIB configuration guide.
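As a quick manual example, you can assign an address to the IPoIB interface with iproute2. This is a minimal, non-persistent sketch; the interface name ib0 and the 192.168.200.0/24 addresses are placeholders for your environment.
# Bring up the IPoIB interface and assign a test address (placeholders: ib0, 192.168.200.11/24).
sudo ip link set ib0 up
sudo ip addr add 192.168.200.11/24 dev ib0
# Verify the interface and test reachability to a peer (placeholder peer: 192.168.200.12).
ip addr show ib0
ping -c 3 192.168.200.12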