Train models using kubectl

Available in VPC

ML expert Platform provides jobs for multiple model training frameworks.
This page shows how to train models with kubectl using the two most common methods: single-node training (Job) and distributed training across nodes (PyTorchJob).

Job vs PyTorchJob

For single-node training, we recommend using Job.
PyTorchJob is designed for distributed training across nodes and manages deployments in a Master and Worker structure, so using it for single-node training can consume resources unnecessarily.

Run single-node training (Job)

You can define a Kubernetes Job specification for training as shown in the following example.

Caution
  • When using mounted high-performance storage, the UID and GID in the training image must be set to 500.
  • Do not set fsGroup in the Pod securityContext. Setting fsGroup causes Kubernetes to recursively change the ownership of all files in the mounted volume. For high-performance storage containing large amounts of data, this can make Pod initialization extremely slow.
# job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist
  namespace: p-{projectName}
spec:
  backoffLimit: 1
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: { training image (e.g. example.com/mnist:latest) } # Image containing the training code, built on the NVIDIA base image
          imagePullPolicy: Always
          resources:
            limits:
              memory: "8Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
          command: ["python"]
          args:
            - /opt/mnist/src/mnist.py
            - --checkpoint_path
            - /opt/mnist/checkpoints/mnist.pt
            - --log_path
            - /opt/mnist/log
            - --data_path
            - /opt/mnist/data
            - --download_data
kubectl apply -f job.yaml
job.batch/mnist created

Use an external container registry

If the container registry requires Secret information, see Create a Container Secret to create one.
Once created, you can use the Secret as follows:

...
spec:  # Pod spec, that is, spec.template.spec of the Job
    imagePullSecrets:
    - name: my-harbor-secret  # Name of the Docker Credential Secret created in advance
...

Use an existing volume

To use a Volume created through Volumes, you can configure it as follows:

...
spec:  # Pod spec, that is, spec.template.spec of the Job
    containers:
    - name: main
      ...
      volumeMounts:
      - mountPath: /data
        name: mnist-data # Name specified in spec.volumes below
    volumes:
    - name: mnist-data
      persistentVolumeClaim:
        claimName: mnist-data # Name of the PVC created through Volumes
...

Job lifecycle

When a Job completes, it remains in the list for a certain period of time to preserve container logs and status.
Since there is a limit on the maximum number of Jobs that can remain in the list, you must manage the lifecycle using TTL. For more information, see Kubernetes Job API.

When creating a Job, you can set a custom time to live (TTL) with ttlSecondsAfterFinished. The TTL countdown starts when the Job finishes, whether it succeeds or fails. Once the TTL expires, the Job and all Pods that belong to it are deleted automatically.

The following example applies a TTL of 3 weeks.

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist
  namespace: p-{projectName}
spec:
  ttlSecondsAfterFinished: 1814400 # Field for setting TTL manually (unit: sec)
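
Because ttlSecondsAfterFinished is expressed in seconds, it can be easier to derive the value from a readable duration than to hard-code it. The following is a minimal Python sketch; the helper name ttl_seconds is hypothetical and is not part of Kubernetes or kubectl.

```python
from datetime import timedelta


def ttl_seconds(weeks: int = 0, days: int = 0, hours: int = 0) -> int:
    """Convert a duration into whole seconds for ttlSecondsAfterFinished."""
    return int(timedelta(weeks=weeks, days=days, hours=hours).total_seconds())


if __name__ == "__main__":
    # 3 weeks = 21 days x 86,400 seconds per day
    print(ttl_seconds(weeks=3))  # 1814400
```

The printed value matches the 3-week TTL used in the example above.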

Run distributed node training (PyTorchJob)

Using PyTorchJob provides the following benefits:

  • It creates Master and Worker Pods as appropriate based on the container specification you define.
  • It automatically sets environment variables commonly required for PyTorch distributed training, such as WORLD_SIZE, RANK, and MASTER_ADDR.
  • It automatically creates Kubernetes Services for communication between training Pods. The Master is reachable as <pytorch-job-name>-master-0 and the Workers as <pytorch-job-name>-worker-<idx>.
  • If needed, you can use Elastic Policy to create environment variables for torchrun arguments, such as --nnodes, --nproc-per-node, and --rdzv-endpoint.
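
As an illustration of how training code might consume the injected variables, the sketch below reads WORLD_SIZE, RANK, and MASTER_ADDR from the process environment. The names DistEnv and read_dist_env are hypothetical, and the fallback values for a local single-process run are an assumption rather than platform behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    world_size: int
    rank: int
    master_addr: str


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the distributed-training variables, with single-process fallbacks."""
    return DistEnv(
        world_size=int(env.get("WORLD_SIZE", "1")),  # assumed default: one process
        rank=int(env.get("RANK", "0")),              # assumed default: rank 0
        master_addr=env.get("MASTER_ADDR", "localhost"),
    )


if __name__ == "__main__":
    dist = read_dist_env()
    print(f"rank {dist.rank} of {dist.world_size}, master at {dist.master_addr}")
```

Passing the environment as a parameter keeps the function easy to exercise outside a cluster, for example with a plain dictionary.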

Define a PyTorchJob

For distributed training with PyTorch, we recommend using torchrun (Elastic Launch). When using torchrun, do not specify a Master Pod explicitly. In Torch Elastic, the RANK=0 Pod that acts as the master node can change during execution.

  • spec.elasticPolicy - Configures torchrun-related settings. Settings specified here are injected as environment variables. For details, see Use Elastic Policy.
  • spec.runPolicy - Specifies parameters for PyTorchJob execution and post-completion handling. For details, see Use Run Policy.
  • spec.pytorchReplicaSpecs.Worker - Configures the Worker Pod that performs distributed training.
Caution
  • The container name for training, such as spec.pytorchReplicaSpecs.Worker.template.spec.containers[*].name, must be set to pytorch.
  • To prevent Istio Sidecar injection and ensure stable distributed training, the sidecar.istio.io/inject: "false" annotation is automatically set in spec.pytorchReplicaSpecs.Worker.template.metadata.annotations. If this annotation is not set, communication errors between nodes, such as RuntimeError: Connection reset by peer, may occur.
  • When using mounted high-performance storage, the UID and GID in the training image must be set to 500.
  • Do not set fsGroup in the Pod securityContext. Setting fsGroup causes Kubernetes to recursively change the ownership of all files in the mounted volume. For high-performance storage containing large amounts of data, this can make Pod initialization extremely slow.

You can define a PyTorchJob specification for training as shown in the following example.

# pytorchjob.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
    name: pytorch-elastic-mnist-nccl
    namespace: p-{projectName} # Kubernetes Namespace name for the project
spec:
    pytorchReplicaSpecs:
        Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
                metadata:
                    annotations:
                        sidecar.istio.io/inject: "false"  # Automatically set. Can be omitted.
                spec:
                    nodeSelector:
                        mlx.navercorp.com/zone: { GPU Zone name } # Zone name as shown in GPU Resources
                    containers:
                    - name: pytorch  # Must set the container name for PyTorchJob to pytorch
                      image: example.com/pytorch-mnist-dist:23.03-py3
                      imagePullPolicy: Always
                      command: ["bash", "-c"]
                      args:
                      - >
                        torchrun --nnodes ${PET_NNODES} --nproc_per_node ${PET_NPROC_PER_NODE} --rdzv_id ${PET_RDZV_ID} --rdzv_backend ${PET_RDZV_BACKEND} --rdzv_endpoint ${PET_RDZV_ENDPOINT}
                        /opt/mnist/src/mnist.py --checkpoint_path /data/checkpoints/mnist.pt --log_path /data/logs --data_path /data/dataset
                      env:
                      - name: NCCL_DEBUG
                        value: INFO
                      - name: GLOO_SOCKET_FAMILY
                        value: AF_INET
                      securityContext:  # securityContext configuration is required for InfiniBand
                        capabilities:
                            add: ["IPC_LOCK"]
                      # shared memory
                      volumeMounts:
                      - mountPath: /dev/shm
                        name: shared-memory
                    volumes:
                    - emptyDir:
                        medium: Memory
                      name: shared-memory
kubectl apply -f pytorchjob.yaml
pytorchjob.kubeflow.org/pytorch-elastic-mnist-nccl created

Use Elastic Policy

Define elasticPolicy to use torchrun.

...
spec:
    ...
    elasticPolicy:
        rdzvId: mnist
        rdzvBackend: c10d
        minReplicas: 2
        maxReplicas: 2
        nProcPerNode: 8
    ...

Environment variables used for PyTorchJob are set based on each field value in elasticPolicy. These environment variables can be used to set torchrun arguments. For more information about the arguments used by torchrun, see the official documentation.

elasticPolicy field        Environment variable   torchrun argument   Description
rdzvId                     PET_RDZV_ID            --rdzv-id           Rendezvous Job ID
rdzvBackend                PET_RDZV_BACKEND       --rdzv-backend      Rendezvous backend, e.g., c10d
minReplicas, maxReplicas   PET_NNODES             --nnodes            Number of nodes
nProcPerNode               PET_NPROC_PER_NODE     --nproc-per-node    Number of GPUs per node
maxRestarts                PET_MAX_RESTARTS       --max-restarts      Maximum number of restarts
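
To make the mapping above concrete, the following Python sketch turns whichever PET_* variables are set into a torchrun argument list. It is only an illustration: PET_TO_ARG and torchrun_args are hypothetical names, values are passed through unchanged, and only the variables listed in the table are covered.

```python
import os
from typing import List, Mapping

# Environment variable -> torchrun argument, as listed in the table above.
PET_TO_ARG = {
    "PET_RDZV_ID": "--rdzv-id",
    "PET_RDZV_BACKEND": "--rdzv-backend",
    "PET_NNODES": "--nnodes",
    "PET_NPROC_PER_NODE": "--nproc-per-node",
    "PET_MAX_RESTARTS": "--max-restarts",
}


def torchrun_args(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return torchrun arguments for every listed PET_* variable that is set."""
    args: List[str] = []
    for var, flag in PET_TO_ARG.items():
        value = env.get(var)
        if value:  # skip variables that are unset or empty
            args += [flag, value]
    return args


if __name__ == "__main__":
    print(" ".join(["torchrun", *torchrun_args()]))
```

With the elasticPolicy shown earlier and assuming PET_NNODES is set to 2, the result would contain --rdzv-id mnist --rdzv-backend c10d --nnodes 2 --nproc-per-node 8, and --max-restarts would be omitted because maxRestarts is not set.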

Use Run Policy

Define runPolicy to control how the PyTorchJob runs and how it is cleaned up after completion.

...
spec:
  runPolicy:
    cleanPodPolicy: None
    ttlSecondsAfterFinished: 1814400 # Field for setting TTL manually (unit: sec)
...

spec.runPolicy lets you specify parameters related to PyTorchJob execution and cleanup. If not specified, the default values are used. The parameters you can specify under spec.runPolicy include the following: (Reference: Kubeflow Trainer API Reference v1.9)

  • cleanPodPolicy - Determines how to clean up Pods after the PyTorchJob completes.
    • Default: None
    • None: Does not delete Pods after the Job completes, which helps you check logs later.
    • All: Deletes all Pods after the Job completes.
    • Running: Deletes only the Pods that are still running after the Job completes. This is rarely used except in special cases.
  • ttlSecondsAfterFinished - Determines how many seconds after Job completion the Job is deleted.
  • activeDeadlineSeconds - Specifies the maximum execution time for the Job. After the specified time passes, the Job is marked as failed. If not set, there is no limit on the Job execution time.
  • backoffLimit - Specifies the maximum number of retries when the Job fails.

Use InfiniBand

Note
  • No separate RDMA resource settings, such as requests or limits, are required.
  • For GPU Zone information, see View available GPU Zones.

You can speed up communication between nodes by running distributed training on nodes connected through an InfiniBand network.
To adapt the example shown in the previous section for an InfiniBand environment, add the following specifications:

  • Specify the GPU Zone name as an annotation, such as mlx.navercorp.com/zone=ai-infra.
  • Enable InfiniBand by adding the IPC_LOCK capability to securityContext.
  • Configure shared memory in volumes to support distributed training.

To use InfiniBand, you can configure it as follows:

...
metadata:
  annotations:
    mlx.navercorp.com/zone: "ai-infra"
...
spec:
    ...
    pytorchReplicaSpecs:
        Worker:
            template:
                spec:
                    containers:
                    - name: pytorch
                      securityContext:  # securityContext configuration is required for InfiniBand
                        capabilities:
                            add: ["IPC_LOCK"]
                      # shared memory
                      volumeMounts:
                      - mountPath: /dev/shm
                        name: shared-memory
                    volumes:
                    - emptyDir:
                        medium: Memory
                      name: shared-memory
...

Debug PyTorchJob

To debug issues in a PyTorchJob, you can set the following environment variables to log the required information:

  • NCCL_DEBUG: Enables NCCL-related debugging.
  • TORCH_DISTRIBUTED_DEBUG, TORCH_CPP_LOG_LEVEL: Enables distributed training debugging. For details, see the official PyTorch documentation.

For debugging, you can configure it as follows:

...
spec:
    ...
    pytorchReplicaSpecs:
        Worker:
            template:
                spec:
                    containers:
                    - name: pytorch
                      ...
                      env:
                      - name: NCCL_DEBUG
                        value: "INFO"
                      - name: TORCH_DISTRIBUTED_DEBUG
                        value: "DETAIL"
                      - name: TORCH_CPP_LOG_LEVEL
                        value: "INFO"
                      ...

Use an external container registry

If the container registry requires Secret information, see Create a Container Secret to create one.
Once created, you can use the Secret as follows:

...
spec:
    ...
    pytorchReplicaSpecs:
        Worker:
            template:
                spec:
                    imagePullSecrets:  # Set at the Pod spec level, not inside a container
                    - name: my-harbor-secret  # Name of the Docker Credential Secret created in advance
                    containers:
                    - name: pytorch
                      ...
...

Use an existing volume

To use a Volume created through Volumes, you can configure it as follows:

...
spec:
    ...
    pytorchReplicaSpecs:
        Worker:
            template:
                spec:
                    containers:
                    - name: pytorch
                      volumeMounts:
                      - mountPath: /data
                        name: mnist-data # Name specified in volumes below
                    volumes:
                    - name: mnist-data
                      persistentVolumeClaim:
                        claimName: mnist-data # Name of the PVC created through Volumes
...

Check PyTorchJob status

You can check the status of a PyTorchJob using kubectl get and kubectl describe.

kubectl get pytorchjob pytorch-elastic-mnist-nccl
NAME                         STATE     AGE
pytorch-elastic-mnist-nccl   Running   12s

kubectl describe pytorchjob pytorch-elastic-mnist-nccl

Status:
  Completion Time:  2024-11-22T09:16:58Z
  Conditions:
    Last Transition Time:  2024-11-22T09:15:43Z
    Last Update Time:      2024-11-22T09:15:43Z
    Message:               PyTorchJob pytorch-elastic-mnist-nccl is created.
    Reason:                PyTorchJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2024-11-22T09:15:48Z
    Last Update Time:      2024-11-22T09:15:48Z
    Message:               PyTorchJob nb12706/pytorch-elastic-mnist-nccl is running.
    Reason:                PyTorchJobRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2024-11-22T09:16:58Z
    Last Update Time:      2024-11-22T09:16:58Z
    Message:               PyTorchJob nb12706/pytorch-elastic-mnist-nccl successfully completed.
    Reason:                PyTorchJobSucceeded
    Status:                True
    Type:                  Succeeded
  Last Reconcile Time:     2024-11-22T09:15:43Z
  Replica Statuses:
    Worker:
      Selector:   training.kubeflow.org/job-name=pytorch-elastic-mnist-nccl,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=worker
      Succeeded:  2
  Start Time:     2024-11-22T09:15:44Z
Events:
  Type    Reason                    Age                  From                   Message
  ----    ------                    ----                 ----                   -------
  Normal  SuccessfulCreatePod       4m21s                pytorchjob-controller  Created pod: pytorch-elastic-mnist-nccl-worker-0
  Normal  SuccessfulCreatePod       4m21s                pytorchjob-controller  Created pod: pytorch-elastic-mnist-nccl-worker-1
  Normal  SuccessfulCreateService   4m21s                pytorchjob-controller  Created service: pytorch-elastic-mnist-nccl-worker-0
  Normal  SuccessfulCreateService   4m21s                pytorchjob-controller  Created service: pytorch-elastic-mnist-nccl-worker-1
  Normal  ExitedWithCode            3m7s (x3 over 3m8s)  pytorchjob-controller  Pod: nb12706.pytorch-elastic-mnist-nccl-worker-1 exited with code 0
  Normal  ExitedWithCode            3m7s (x2 over 3m8s)  pytorchjob-controller  Pod: nb12706.pytorch-elastic-mnist-nccl-worker-0 exited with code 0
  Normal  PyTorchJobSucceeded       3m7s                 pytorchjob-controller  PyTorchJob nb12706/pytorch-elastic-mnist-nccl successfully completed.
  Normal  JobTerminated             3m6s (x4 over 3m7s)  pytorchjob-controller  Job has been terminated. Deleting PodGroup
  Normal  SuccessfulDeletePodGroup  3m6s (x4 over 3m7s)  pytorchjob-controller  Deleted PodGroup: pytorch-elastic-mnist-nccl

If a training Pod is not created properly, you can identify the cause through Events as follows:

kubectl describe pytorchjob pytorch-elastic-mnist-nccl

...
Events:
  Type     Reason           Age                 From                   Message
  ----     ------           ----                ----                   -------
  Warning  FailedCreatePod  47m (x3 over 103m)  pytorchjob-controller  Error creating: Pods "job-worker-1" is forbidden: exceeded quota: normal-quota, requested: requests.nvidia.com/gpu=1, used: requests.nvidia.com/gpu=2, limited: requests.nvidia.com/gpu=2

Guidelines for distributed training

This section provides guidelines for distributed training on ML expert Platform (MLXP).

Use a CUDA version compatible with the NVIDIA driver

The CUDA Runtime version must be compatible with the NVIDIA driver installed on the node. Failing to meet the minimum driver version requirements may result in CUDA initialization or runtime errors, preventing the training job from executing correctly.

Set GLOO_SOCKET_FAMILY=AF_INET when IPv6 is unavailable

Gloo frequently opens TCP connections to check node status and manage ranks. In large-scale training where IPv6 is unavailable, each connection attempt first tries IPv6, fails, and then retries over IPv4. As these failed attempts accumulate, resource problems such as file descriptor exhaustion can occur and stop training.

To prevent this issue, we recommend setting the GLOO_SOCKET_FAMILY=AF_INET environment variable so that Gloo uses only IPv4.

To force Gloo to use IPv4, you can configure it as follows:

...
spec:
    ...
    pytorchReplicaSpecs:
        Worker:
            template:
                spec:
                    containers:
                    - name: pytorch
                      ...
                      env:
                      - name: GLOO_SOCKET_FAMILY
                        value: AF_INET
...
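
To get a rough idea of whether a container has a usable IPv6 stack, a check like the following could be run inside the training image. This is a local heuristic only, under the assumption that binding to the IPv6 loopback address reflects IPv6 support; it does not test Pod-to-Pod connectivity, and the function name ipv6_available is hypothetical.

```python
import socket


def ipv6_available() -> bool:
    """Return True if this host can create and bind an IPv6 socket on loopback."""
    if not socket.has_ipv6:
        return False  # Python was built without IPv6 support
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as sock:
            sock.bind(("::1", 0))  # port 0 lets the OS pick any free port
        return True
    except OSError:
        return False


if __name__ == "__main__":
    if ipv6_available():
        print("IPv6 loopback is usable")
    else:
        print("IPv6 is unavailable; consider GLOO_SOCKET_FAMILY=AF_INET")
```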

Limit the PyTorchJob name length (recommended)

Since the PyTorchJob name is used as a prefix when creating Kubernetes resources, such as Pods and Services, we recommend keeping it to 50 characters or fewer. If the name exceeds 50 characters, derived resource names may also exceed the length limit, causing creation failures or issues with worker index parsing.
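
A name can be checked before the job is submitted. The sketch below applies the 50-character recommendation from this guide and the 63-character Kubernetes limit for DNS label names, using the <pytorch-job-name>-worker-<idx> naming pattern described earlier. The function names are hypothetical, and other derived resources may follow different patterns.

```python
from typing import List

MAX_JOB_NAME_LENGTH = 50   # recommendation from this guide
MAX_DNS_LABEL_LENGTH = 63  # Kubernetes limit for DNS label names


def worker_names(job_name: str, replicas: int) -> List[str]:
    """Derive Worker resource names as <job-name>-worker-<idx>."""
    return [f"{job_name}-worker-{idx}" for idx in range(replicas)]


def check_job_name(job_name: str, replicas: int) -> List[str]:
    """Return a list of problems found; an empty list means the name passes."""
    problems: List[str] = []
    if len(job_name) > MAX_JOB_NAME_LENGTH:
        problems.append(
            f"job name has {len(job_name)} characters (recommended max {MAX_JOB_NAME_LENGTH})"
        )
    for name in worker_names(job_name, replicas):
        if len(name) > MAX_DNS_LABEL_LENGTH:
            problems.append(f"derived name '{name}' exceeds {MAX_DNS_LABEL_LENGTH} characters")
    return problems


if __name__ == "__main__":
    print(check_job_name("pytorch-elastic-mnist-nccl", replicas=2))  # []
```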

Remove the memory limit setting when configuring shared memory

We recommend either not setting a memory limit for shared memory (/dev/shm) or setting a sufficiently large one. If /dev/shm is too small, DataLoader multiprocessing workers can terminate abnormally and NCCL communication performance can drop significantly.

The following example shows a shared memory configuration with the memory limit removed:

...
spec:
    ...
    pytorchReplicaSpecs:
        Worker:
            template:
                spec:
                    containers:
                    - name: pytorch
                      # shared memory
                      volumeMounts:
                      - mountPath: /dev/shm
                        name: shared-memory
                    volumes:
                    - emptyDir:
                        medium: Memory
                      name: shared-memory
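
To confirm how much shared memory a training container actually sees, the mounted size can be inspected from inside the Pod. The following is a small sketch using the Python standard library; the function name shm_usage_gib is hypothetical, and what counts as a sufficient size depends on the workload.

```python
import os
import shutil
from typing import Tuple


def shm_usage_gib(path: str = "/dev/shm") -> Tuple[float, float]:
    """Return (total, free) space in GiB for the filesystem mounted at path."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.free / gib


if __name__ == "__main__":
    if os.path.isdir("/dev/shm"):
        total, free = shm_usage_gib()
        print(f"/dev/shm: {total:.1f} GiB total, {free:.1f} GiB free")
    else:
        print("/dev/shm is not available on this system")
```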