GPU Resources


The latest service changes have not yet been reflected in this content. We will update the content as soon as possible. Please refer to the Korean version for information on the latest updates.

Available in VPC.

This section describes the layout of the GPU Resources interface. In GPU Resources, you can view the status of the GPU resources allocated to your Workspace.

Note

ML expert Platform is currently in CBT (closed beta). GPU resources are assigned separately to pre-approved users.

Viewing GPU Resource List

You can check the list of GPU resources assigned to your Workspace. The GPU Resource list includes the following information:


  • GPU Zone: Identifies the type of GPU resource assigned to your Workspace.
  • Creation date: The time when the GPU resource was allocated.
  • Status: The current status of the GPU resource.
  • Ready Nodes: Number of GPU nodes available for use.
  • UnReady Nodes: Number of GPU nodes unavailable for use.
  • Total Nodes: Total number of GPU nodes allocated to the current Workspace.
Note
  • ML expert Platform currently provides GPU resources in a Private Zone.
  • A Private Zone means your Workspace uses dedicated GPU resources without sharing them with other users.
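
The Ready/UnReady node counts above can also be checked from the CLI, assuming you have kubectl access to the cluster. The zone label key `mlx.navercorp.com/zone` is the one used in the examples on this page; `gpu-private-zone` below is a placeholder zone name, not a value your Workspace necessarily uses.

```shell
# List the GPU nodes in a zone; the STATUS column shows Ready or NotReady.
# Replace "gpu-private-zone" with your actual zone name.
kubectl get nodes -l mlx.navercorp.com/zone=gpu-private-zone
```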

Viewing Available GPU Zones

To view information about available GPU Zones:

  • Checking available GPU Zone information in a Workspace

    kubectl get managementquota default -o yaml -n w-{ workspace name }

    spec:
      type: workspace
      zones:
        privateZones:
          # The Workspace can use GPU resources in the "gpu-private-zone."
          gpu-private-zone: ["*"]
  • Checking available GPU Zone information in a Project

    kubectl get managementquota default -o yaml -n p-{ project name }

    spec:
      type: project
      zones:
        privateZones:
          # The Project can use GPU resources in the "gpu-private-zone."
          gpu-private-zone: ["*"]

Using GPU Zone Information

When using any feature that relies on GPU resources, you must specify the GPU Zone.

  • When creating a Pod using PyTorchJob, add a nodeSelector to specify the GPU Zone where the Pod will run.
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-nccl"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            app: my-app
        spec:
          nodeSelector:
            mlx.navercorp.com/zone: ai-infra
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist:latest
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            app: my-app
        spec:
          nodeSelector:
            mlx.navercorp.com/zone: ai-infra
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist:latest
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
  • Because the Pod is the foundation of all workloads, every workload must likewise specify the GPU Zone in the Pod's nodeSelector.
apiVersion: v1
kind: Pod
metadata:
  name: sleep
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  nodeSelector:
    mlx.navercorp.com/zone: ai-infra
  containers:
    - image: reg.navercorp.com/base/ubuntu:22.04
      command: ["/bin/bash", "-c", "sleep inf"]
      imagePullPolicy: IfNotPresent
      name: sleep
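
To confirm that the Pod above was actually scheduled into the intended zone, one possible check (a sketch, assuming kubectl access; `<node-name>` is a placeholder for the node reported by the first command):

```shell
# Show which node the Pod was scheduled on
kubectl get pod sleep -o wide

# Confirm that node carries the expected zone label
kubectl get node <node-name> --show-labels | grep mlx.navercorp.com/zone
```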

Viewing GPU Instance Details

You can view the status information of GPU instances inside a GPU Zone.


  • GPU instance name: The name of the GPU instance assigned to the GPU Zone
  • Creation date: The time when the GPU instance was created
  • Status: The GPU instance status
    • Ready: The GPU instance is available
    • NotReady: The GPU instance is unavailable due to hardware failure or other issues
    • Unschedulable: The GPU instance is functioning but excluded from the list of schedulable resources
    • Unknown: The GPU instance requires inspection due to an issue
  • Project: The name of the project to which the GPU instance is assigned
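
The statuses above broadly mirror Kubernetes node states, so with kubectl access they can be cross-checked against the cluster (a sketch; exact output columns depend on your kubectl version):

```shell
# The STATUS column shows Ready, NotReady, or Ready,SchedulingDisabled
# (the console's "Unschedulable") for each GPU node.
kubectl get nodes
```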

GPU Instance Management

You can assign GPU instances within a GPU Zone to specific projects.

Note

Only users with the Admin or Writer role in the Workspace can use GPU Instance Management.

  1. In the GPU Instance Management popup, select a project, and then select the GPU instances that the project will use.
  2. Click the Save button to assign the selected GPU instances to the project.
  3. Click the Reset button to reload the GPU instance information currently assigned to the selected project.
Caution
  • Initially, all GPU instances are assigned to every project. This means all GPU instances allocated to the Workspace are available to all projects.
  • GPU instances can be assigned to multiple projects at the same time.
  • GPU Instance Management defines the pool of resources available to each project; it does not reflect the real-time status of running Pods. If a GPU instance with running Pods is reassigned to another project, those Pods are not restarted just because the project assignment has changed.
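
Given the caution above, it can be useful to check whether a GPU instance still has running Pods before reassigning it (a sketch, assuming kubectl access; `<node-name>` is a placeholder for the GPU instance name):

```shell
# List all Pods currently scheduled on a specific GPU instance (node)
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
```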