Available in VPC
This guide describes how to create and manage a KVM-based GPU Server from the NAVER Cloud Platform console.
- Set up redundancy between server zones to ensure continuity of service without interruption in the event of unexpected server malfunctions or scheduled change operations. See Load Balancer overview to set up redundancy.
- NAVER Cloud Platform provides a High Availability (HA) structure to prepare for failures in the physical server, such as memory, CPU, and power supply failures. HA is a policy for preventing hardware failures from spreading to the Virtual Machine (VM) server. It supports Live Migration, which automatically migrates the VMs on a host server to another healthy host server when a failure occurs on that host. However, if an error occurs that prevents Live Migration from being initiated, the VM server is rebooted. If you are operating a service with a single VM server, such restarts can cause service outages. To reduce the impact of such failures, it is recommended to implement VM server redundancy as described in the guidance above.
Check server information
You can view the GPU Server information in the same way as viewing regular server information. For more information, see Check server information.
- In the case of GPU Servers, fees are charged even when the server is stopped.
Create GPU server
You can create a GPU Server in Services > Compute > Server on the console. For more information on how to create a server, see Create server.
- For GPU Servers, you can use NCP GPU images with drivers and related software pre-installed.
- See the following table for Regions available per GPU type:
| GPU type | Region |
| --- | --- |
| NVIDIA A100 | KR-1 |
| NVIDIA L4 | KR-2 |
| NVIDIA L40S | KR-2 |
- Company members can create up to 5 NVIDIA A100 Servers.
If you need more GPU Servers, or if you are an individual member who needs to create a GPU Server, check the FAQs and submit an inquiry to Customer Support.
- For more information on how to install the NVIDIA driver and required software, see the GPU Server software installation guide.
Manage server
You can manage a GPU Server and change its settings in the same way as for a regular server. For more information, see Manage server.
- You cannot change the KVM GPU Server specifications.
- GPU Servers cannot be converted into regular servers. To change to a regular server, you need to create a server image and use the image to create a new regular server.
- You can use the server image created in a regular server to create a GPU Server.
Install GPU driver
Select one of the two following options:
- Option 1. Use NCP GPU image with drivers pre-installed
- Option 2. Manually install GPU driver and related software
Option 1. Use NCP GPU image with drivers pre-installed
To create an NCP GPU Server with NVIDIA drivers and related software pre-installed, follow these steps:
- When you create an NVIDIA A100 GPU Server, MLNX_OFED for using Infiniband is pre-installed.
- Access the NAVER Cloud Platform console.
- From the Region menu, click the Region you are using.
- From the Platform menu, click the platform you are using.
- Click Services > Compute > Server in order.
- In the tab that displays the server image, select the NCP Server Image tab.
- Under the image type, select KVM GPU.
- Select the Server image name you wish to use.
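After the server is created from the NCP GPU image, you can verify the pre-installed software from a terminal. The following is a quick check, assuming the image includes the NVIDIA driver and, on NVIDIA A100 servers, MLNX_OFED; the CUDA Toolkit check applies only if the image includes it:
# nvidia-smi          # installed driver version and detected GPUs
# nvcc --version      # CUDA Toolkit version, if included in the image
# ofed_info -s        # MLNX_OFED version (NVIDIA A100 GPU Servers)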
Option 2. Install GPU driver and CUDA using NVIDIA guide
- When creating a GPU Server, set the boot disk size to at least 100 GB.
- On NVIDIA A100 GPU Servers, you must install NVIDIA Fabric Manager.
- If you plan to use Infiniband on an NVIDIA A100 GPU Server, you must install NVIDIA MLNX_OFED.
- NVIDIA driver documentation
- The following table shows the minimum recommended driver version per GPU type:
| GPU type | Minimum recommended driver release |
| --- | --- |
| NVIDIA A100 | R530 or later |
| NVIDIA L4 | R535 or later |
| NVIDIA L40S | R535 or later |
2.1 Install GPU driver
Check the minimum recommended driver version for your GPU type, and install the driver by following the NVIDIA driver installation guide.
- On NVIDIA A100 GPU Servers, NVIDIA Fabric Manager must be installed to support NVLink.
- See the NVIDIA Fabric Manager documentation and install it; an installation sketch is shown after this list.
- You must install the NVIDIA Fabric Manager version that exactly matches your GPU driver version.
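As a reference, the following is a minimal installation sketch for an NVIDIA A100 GPU Server, assuming an Ubuntu-based image with the NVIDIA CUDA apt repository already configured. The branch number 535 is only an example; use a branch that satisfies the minimum recommended driver release in the table above, and keep the driver and Fabric Manager branches identical:
# apt-get update
# apt-get install -y nvidia-driver-535-server    # example driver branch; adjust to the release you need
# apt-get install -y nvidia-fabricmanager-535    # Fabric Manager branch must match the driver branch
# systemctl enable --now nvidia-fabricmanager    # start Fabric Manager now and enable it at boot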
2.2 Install CUDA Toolkit
To install CUDA, follow these steps:
- Connect to the NVIDIA CUDA Toolkit website.
- Select the CUDA version you want to install and download its installation file.
- Select runfile (local) as the Installer Type.
- Enter the following commands to install CUDA Toolkit:
# chmod +x [name of downloaded installer file]
# ./[name of downloaded installer file] --toolkit --toolkitpath=/usr/local/cuda-[version] --samples --samplespath=/usr/local/cuda --silent
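After the runfile installation completes, add the CUDA binaries and libraries to your shell environment so that nvcc can be found. The following is a minimal sketch using the same [version] placeholder as above; for a persistent setup, add the same lines to your shell profile:
# export PATH=/usr/local/cuda-[version]/bin:$PATH
# export LD_LIBRARY_PATH=/usr/local/cuda-[version]/lib64:$LD_LIBRARY_PATH
# nvcc --version    # verify that the CUDA Toolkit is found on the PATH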
Check GPU driver and essential software
Check the GPU driver and essential software on the server.
Check driver version
To check the GPU driver version, enter the nvidia-smi command.
- You can check the version of the installed driver, as well as the model and number of GPUs.
- The following is an example of an NVIDIA A100 GPU server:
# nvidia-smi
Mon Jun 9 17:23:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:A1:00.0 Off | Off |
| N/A 38C P0 66W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:A2:00.0 Off | Off |
| N/A 38C P0 63W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:B1:00.0 Off | Off |
| N/A 35C P0 62W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:B2:00.0 Off | Off |
| N/A 39C P0 63W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:C1:00.0 Off | Off |
| N/A 36C P0 64W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:C2:00.0 Off | Off |
| N/A 38C P0 63W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:D1:00.0 Off | Off |
| N/A 36C P0 59W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:D2:00.0 Off | Off |
| N/A 36C P0 60W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
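If you need the same information in a script-friendly form, nvidia-smi also provides a query mode. The following is a simple example; the available fields may vary by driver release:
# nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv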
Check CUDA Toolkit version
To check the CUDA Toolkit version, enter the nvcc --version command.
# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0
Check NVLink status
- The following applies only to NVIDIA A100 GPU Servers:
Enter the nvidia-smi topo -m command to check the NVLink status between GPUs.
- If the status is normal, the NVLink status between NVIDIA A100 GPUs is displayed as NV12.
- If the NVLink status between GPUs is displayed as SYS, check that NVIDIA Fabric Manager is installed and running normally; a recovery sketch follows the example output below.
  - You can check the NVIDIA Fabric Manager status with the systemctl status nvidia-fabricmanager command.
# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 NODE PHB 0-27 0 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 NODE PHB 0-27 0 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 PHB NODE 0-27 0 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 PHB NODE 0-27 0 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS 28-55 1 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS 28-55 1 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS 28-55 1 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS 28-55 1 N/A
NIC0 NODE NODE PHB PHB SYS SYS SYS SYS X NODE
NIC1 PHB PHB NODE NODE SYS SYS SYS SYS NODE X
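If the NVLink status appears as SYS and NVIDIA Fabric Manager is not active, you can restart the service and check again. The following is a minimal recovery sketch, assuming the installed driver and Fabric Manager versions already match:
# systemctl enable nvidia-fabricmanager     # start the service automatically at boot
# systemctl restart nvidia-fabricmanager    # (re)start the service now
# systemctl status nvidia-fabricmanager     # confirm that the service is active (running)
# nvidia-smi topo -m                        # re-check the NVLink status between GPUs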
Infiniband
The following section describes preparation and checks for using Infiniband:
- Infiniband is available only on NVIDIA A100 GPU Servers.
- A Fabric Cluster product is required. See Fabric Cluster.
- NVIDIA A100 GPU Servers provide two 200 Gbps Infiniband HDR interfaces, for a total of 400 Gbps of bandwidth.
1. Infiniband prerequisites
The following describes what to prepare for Infiniband communication between servers:
- Install the Infiniband driver (MLNX_OFED).
- Install the MLNX_OFED release that matches your OS; an installation sketch is shown after this list.
- Ensure all target servers are set to Cluster Mode before using Infiniband.
- Create a cluster from the Fabric Cluster menu, and add the server you want to communicate with to the created Fabric Cluster in Cluster Mode.
- For more information about how to set it up, see Fabric Cluster.
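As a reference, the following is a minimal MLNX_OFED installation sketch, assuming you have downloaded the MLNX_OFED bundle for your OS from the NVIDIA website; the archive name is illustrative and depends on the release and distribution:
# tar -xzf MLNX_OFED_LINUX-[version]-[distro]-x86_64.tgz    # illustrative archive name
# cd MLNX_OFED_LINUX-[version]-[distro]-x86_64
# ./mlnxofedinstall                                         # installer script bundled with MLNX_OFED
# /etc/init.d/openibd restart                               # reload the Infiniband driver stack
# ofed_info -s                                              # check the installed MLNX_OFED version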
2. Check Infiniband connection status
To check the Infiniband connection status, enter the ibstat command and check the port status of the Infiniband interfaces connected to the server.
- The normal status for the Physical state item is LinkUp, and the normal status for the State item is Active.
- The Physical state item may be displayed as Polling for a short time immediately after the server boots; this is normal because the connection is still being established.
- On NVIDIA A100 GPU Servers, the Rate item should appear as 200.
[root@test01 ~]# ibstat
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.41.1000
Hardware version: 0
Node GUID: 0x88e9a4ffff667b00
System image GUID: 0x88e9a4ffff667b00
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 12
LMC: 0
SM lid: 14
Capability mask: 0xa651e848
Port GUID: 0x88e9a4ffff667b00
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4123
Number of ports: 1
Firmware version: 20.41.1000
Hardware version: 0
Node GUID: 0x88e9a4ffff606426
System image GUID: 0x88e9a4ffff606426
Port 1:
State: Down
Physical state: Polling
Rate: 10
Base lid: 15
LMC: 0
SM lid: 16
Capability mask: 0xa651e848
Port GUID: 0x88e9a4ffff606426
Link layer: InfiniBand
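To view only the fields above at a glance, you can filter the ibstat output; a simple example:
# ibstat | grep -E "State:|Physical state:|Rate:"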
3. Test communication
You can test Infiniband communication between servers with the ibping command. The test requires two servers that exchange data: one working as the Server and one working as the Client.
First, check the LID information on the server working as the Server that receives data, and then proceed with the ibping test.
3-1. Check LID information
Enter the ibstat command in the server working as the Server to check the LID information for mlx5_0 and mlx5_1.
[root@test01 ~]# ibstat mlx5_0 | grep 'Base lid'
Base lid: 210
[root@test01 ~]# ibstat mlx5_1 | grep 'Base lid'
Base lid: 209
3-2. mlx5_0 test
To proceed with the mlx5_0 test, follow these steps:
- Run the ibping -S -C mlx5_0 command in the server working as the Server.
- Run the ibping -c5 -C mlx5_0 -L {verified mlx5_0 LID value} command in the server working as the Client.
  - If the status is normal, the responses to all ping packets are received normally as follows:
root@test02:~# ibping -c5 -C mlx5_0 -L 210
Pong from test01.(none) (Lid 210): time 0.027 ms
Pong from test01.(none) (Lid 210): time 0.013 ms
Pong from test01.(none) (Lid 210): time 0.012 ms
Pong from test01.(none) (Lid 210): time 0.017 ms
Pong from test01.(none) (Lid 210): time 0.013 ms
--- test01.(none) (Lid 210) ibping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 5000 ms
rtt min/avg/max = 0.012/0.016/0.027 ms
3-3. mlx5_1 test
To proceed with the mlx5_1 test, follow these steps:
- Run the ibping -S -C mlx5_1 command in the server working as the Server.
- Run the ibping -c5 -C mlx5_1 -L {verified mlx5_1 LID value} command in the server working as the Client.
  - If the status is normal, the responses to all ping packets are received normally as follows:
root@test02:~# ibping -c5 -C mlx5_1 -L 209
Pong from test01.(none) (Lid 209): time 0.024 ms
Pong from test01.(none) (Lid 209): time 0.013 ms
Pong from test01.(none) (Lid 209): time 0.009 ms
Pong from test01.(none) (Lid 209): time 0.009 ms
Pong from test01.(none) (Lid 209): time 0.014 ms
--- test01.(none) (Lid 209) ibping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 5000 ms
rtt min/avg/max = 0.009/0.013/0.024 ms
3-4. Set IP over Infiniband (IPoIB)
To configure IPoIB, see NVIDIA's IPoIB setting guide.
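As a quick reference, the following is a minimal, non-persistent sketch of assigning an IPoIB address, assuming the IPoIB interface appears as ib0 and using an example private subnet; persistent configuration depends on your OS network manager, so follow NVIDIA's guide above for production use:
# ip link set ib0 up                  # bring up the IPoIB interface (interface name is an assumption)
# ip addr add 10.10.10.1/24 dev ib0   # example address; use a subnet agreed across your Fabric Cluster
# ip addr show ib0                    # confirm the assigned address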