GPU Resources


The latest service changes have not yet been reflected in this content. We will update the content as soon as possible. Please refer to the Korean version for information on the latest updates.

Available in VPC.

This section describes the layout of the GPU Resources interface. In GPU Resources, you can view the status of the GPU resources allocated to your Workspace.

Note

ML expert Platform is currently in CBT (closed beta). GPU resources are assigned separately to pre-approved users.

Viewing GPU Resource List

You can check the list of GPU resources assigned to your Workspace. The GPU Resource list includes the following information:


  • GPU Zone: Identifies the type of GPU resource assigned to your Workspace.
  • Creation date: The time when the GPU resource was allocated.
  • Status: The current status of the GPU resource.
  • Ready Nodes: Number of GPU nodes available for use.
  • UnReady Nodes: Number of GPU nodes unavailable for use.
  • Total Nodes: Total number of GPU nodes allocated to the current Workspace.
Note
  • ML expert Platform currently provides GPU resources in a Private Zone.
  • A Private Zone means your Workspace uses dedicated GPU resources without sharing them with other users.
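
The Ready/UnReady node counts above can also be checked from the CLI, assuming you have kubectl access to the cluster. The zone label key `mlx.navercorp.com/zone` is the one used in the examples on this page; `gpu-private-zone` below is a placeholder zone name, not a value your Workspace necessarily uses.

```shell
# List the GPU nodes in a zone; the STATUS column shows Ready or NotReady.
# Replace "gpu-private-zone" with your actual zone name.
kubectl get nodes -l mlx.navercorp.com/zone=gpu-private-zone
```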

Viewing Available GPU Zones

To view information about available GPU Zones:

  • Checking available GPU Zone information in a Workspace

    kubectl get managementquota default -o yaml -n w-{ workspace name }

    spec:
      type: workspace
      zones:
        privateZones:
          # The Workspace can use GPU resources in the "gpu-private-zone."
          gpu-private-zone: ["*"]
  • Checking available GPU Zone information in a Project

    kubectl get managementquota default -o yaml -n p-{ project name }

    spec:
      type: project
      zones:
        privateZones:
          # The Project can use GPU resources in the "gpu-private-zone."
          gpu-private-zone: ["*"]

Using GPU Zone Information

When using any feature that relies on GPU resources, you must specify the GPU Zone.

  • When creating a Pod using PyTorchJob, add a nodeSelector to specify the GPU Zone where the Pod will run.
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-nccl"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            app: my-app
        spec:
          nodeSelector:
            mlx.navercorp.com/zone: ai-infra
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist:latest
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            app: my-app
        spec:
          nodeSelector:
            mlx.navercorp.com/zone: ai-infra
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist:latest
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
  • Because the Pod is the foundation of all workloads, every workload must likewise specify the GPU Zone in the Pod's nodeSelector.
apiVersion: v1
kind: Pod
metadata:
  name: sleep
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  nodeSelector:
    mlx.navercorp.com/zone: ai-infra
  containers:
    - image: reg.navercorp.com/base/ubuntu:22.04
      command: ["/bin/bash", "-c", "sleep inf"]
      imagePullPolicy: IfNotPresent
      name: sleep
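
To confirm that the Pod above was actually scheduled into the intended zone, one possible check (a sketch, assuming kubectl access; `<node-name>` is a placeholder for the node reported by the first command):

```shell
# Show which node the Pod was scheduled on
kubectl get pod sleep -o wide

# Confirm that node carries the expected zone label
kubectl get node <node-name> --show-labels | grep mlx.navercorp.com/zone
```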

Viewing GPU Instance Details

You can view the status information of GPU instances inside a GPU Zone.


  • GPU instance name: The name of the GPU instance assigned to the GPU Zone
  • Creation date: The time when the GPU instance was created
  • Status: The GPU instance status
    • Ready: The GPU instance is available
    • NotReady: The GPU instance is unavailable due to hardware failure or other issues
    • Unschedulable: The GPU instance is functioning but excluded from the list of schedulable resources
    • Unknown: The GPU instance requires inspection due to an issue
  • Project: The name of the project to which the GPU instance is assigned
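
The statuses above broadly mirror Kubernetes node states, so with kubectl access they can be cross-checked against the cluster (a sketch; exact output columns depend on your kubectl version):

```shell
# The STATUS column shows Ready, NotReady, or Ready,SchedulingDisabled
# (the console's "Unschedulable") for each GPU node.
kubectl get nodes
```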

GPU Instance Management

You can assign GPU instances within a GPU Zone to specific projects.

Note

Only users with the Admin or Writer role in the Workspace can use GPU Instance Management.

  1. In the GPU Instance Management popup, select a project, and then select the GPU instances that the project will use.
  2. Click the Save button to assign the selected GPU instances to the project.
  3. Click the Reset button to reload the GPU instance information currently assigned to the selected project.
Caution
  • Initially, all GPU instances are assigned to every project. This means all GPU instances allocated to the Workspace are available to all projects.
  • GPU instances can be assigned to multiple projects at the same time.
  • GPU Instance Management defines the pool of resources available to each project; it does not reflect the real-time status of running Pods. If a GPU instance with running Pods is reassigned to another project, those Pods are not restarted just because the project assignment has changed.
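
Given the caution above, it can be useful to check whether a GPU instance still has running Pods before reassigning it (a sketch, assuming kubectl access; `<node-name>` is a placeholder for the GPU instance name):

```shell
# List all Pods currently scheduled on a specific GPU instance (node)
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
```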