Train model using kubectl

Available in VPC

Job types for several frameworks are provided for model training.
This guide covers the two most frequently used forms of training: single node training with a Kubernetes Job and distributed node training with a PyTorchJob.

Job vs PyTorchJob

For single node training, it is recommended to use a Job.
A PyTorchJob deploys and manages the training process in a master-worker format to support distributed node training, so using it for single node training may waste resources.

Run single node training (Job)

For training, you can write a Kubernetes Job specification as in the following example:

# job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist
  namespace: p-{projectName}
spec:
  backoffLimit: 1
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: { training image (e.g., example.com/mnist:latest ) } # Training code written based on NVIDIA Base images
          imagePullPolicy: Always
          resources:
            limits:
              memory: "8Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
          command: ["python"]
          args:
            - /opt/mnist/src/mnist.py
            - --checkpoint_path
            - /opt/mnist/checkpoints/mnist.pt
            - --log_path
            - /opt/mnist/log
            - --data_path
            - /opt/mnist/data
            - --download_data

kubectl apply -f job.yaml
job.batch/mnist created
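
The args list in the Job specification reaches the training script as command-line flags. The contents of mnist.py are not shown in this guide, so the following is only a minimal sketch of how such a script might parse them, assuming Python's argparse. The flag names come from the example above; the function name parse_args and the help texts are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags passed through the Job's args list (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="MNIST training (sketch)")
    parser.add_argument("--checkpoint_path", required=True,
                        help="File to write the model checkpoint to")
    parser.add_argument("--log_path", required=True,
                        help="Directory for training logs")
    parser.add_argument("--data_path", required=True,
                        help="Directory that holds (or will hold) the dataset")
    # store_true: the flag takes no value, matching "- --download_data" in the spec
    parser.add_argument("--download_data", action="store_true",
                        help="Download the dataset before training")
    return parser.parse_args(argv)


# Same values as the args list in job.yaml
args = parse_args([
    "--checkpoint_path", "/opt/mnist/checkpoints/mnist.pt",
    "--log_path", "/opt/mnist/log",
    "--data_path", "/opt/mnist/data",
    "--download_data",
])
print(args.data_path, args.download_data)
```

Because each flag and its value are separate items in args, no shell quoting is involved; the container runs python with exactly these arguments.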

Use external Container Registry

If your Container Registry requires credentials, see Create Container Secret and create a Secret.
You can use the created Secret as follows:

...
spec:  # Pod spec (spec.template.spec of the Job)
    imagePullSecrets:
    - name: my-harbor-secret  # Name of the previously created Docker credential Secret
...

Use existing volume

You can use the volume created through Volumes as follows:

...
spec:  # Pod spec (spec.template.spec of the Job)
    containers:
    - name: main
      ...
      volumeMounts:
      - mountPath: /data
        name: mnist-data # Name written in spec.volumes at the bottom
    volumes:
    - name: mnist-data
      persistentVolumeClaim:
        claimName: mnist-data # Name of the PVC created through Volumes
...

Job life cycle

When a Job ends, it is not deleted but remains in the list for a certain period of time so that its container logs and status are kept.
The number of Jobs that can remain in the list is limited, so you must manage the Job lifecycle with a TTL. For more information, see the Kubernetes Job API documentation.

When you create a Job, you can set any Time To Live (TTL) value. The TTL starts counting once the Job finishes, whether it succeeded or failed. After the TTL expires, the Job and its Pods are automatically deleted.

The following is an example of applying TTL for 3 weeks:

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist
  namespace: p-{projectName}
spec:
  ttlSecondsAfterFinished: 1814400 # Field where you set TTL manually (unit: seconds)
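
The value of ttlSecondsAfterFinished is a plain number of seconds, which is easy to miscalculate by hand. As a small sketch, the conversion can be done with Python's standard datetime module; the helper name ttl_seconds is hypothetical.

```python
from datetime import timedelta


def ttl_seconds(**duration):
    """Convert a duration such as weeks=3 into whole seconds."""
    return int(timedelta(**duration).total_seconds())


print(ttl_seconds(weeks=3))  # 1814400, the value used in the example above
print(ttl_seconds(days=1))   # 86400
```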

Run distributed node training (PyTorchJob)

Using PyTorchJob has the following advantages:

Creates the master and worker Pods from the container specifications you write.
Automatically sets the environment variables generally required for PyTorch distributed training, such as WORLD_SIZE, RANK, and MASTER_ADDR.
Automatically creates Kubernetes Services so that the Pods used for training can communicate with each other. You can reach the master at <pytorch-job-name>-master-0 and the workers at <pytorch-job-name>-worker-<idx>.
If needed, provides the arguments used by torchrun (such as --nnodes, --nproc-per-node, and --rdzv-endpoint) as environment variables through Use Elastic Policy.
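
Training code can read these environment variables directly. The following is a minimal sketch, not the platform's own code: the helper names read_dist_env and pod_names are hypothetical, and the fallback values (1, 0, localhost) are assumptions chosen only so the function also works when run outside a PyTorchJob.

```python
import os


def read_dist_env(env=os.environ):
    """Collect the distributed-training variables that PyTorchJob sets on a Pod."""
    return {
        # Fallbacks are assumptions for running outside a PyTorchJob
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
        "master_addr": env.get("MASTER_ADDR", "localhost"),
    }


def pod_names(job_name, num_workers, with_master=True):
    """Return names in the <job>-master-0 and <job>-worker-<idx> pattern."""
    names = [f"{job_name}-master-0"] if with_master else []
    names += [f"{job_name}-worker-{idx}" for idx in range(num_workers)]
    return names


print(read_dist_env({"WORLD_SIZE": "2", "RANK": "1", "MASTER_ADDR": "mnist-master-0"}))
print(pod_names("mnist", 2))
```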

Write PyTorchJob

For distributed training with PyTorch, it is recommended to use torchrun (Elastic Launch). When you use torchrun, you do not specify a master Pod separately, because in Torch Elastic the RANK=0 Pod, which serves as the master node, may change while the job is running.

spec.elasticPolicy - Settings related to torchrun. The settings specified here are injected into environment variables. For more information, see Use Elastic Policy.
spec.runPolicy - Parameters related to running and cleaning up a PyTorchJob. For more information, see Use Run Policy.
spec.pytorchReplicaSpecs.Worker - Settings for the Worker Pods that perform distributed training.

The specification must meet the following requirements:

The name of the container that performs training (spec.pytorchReplicaSpecs.Worker.template.spec.containers[*].name) must be pytorch.
For seamless distributed training, you must set sidecar.istio.io/inject: "false" in spec.pytorchReplicaSpecs.Worker.template.metadata.annotations. If the annotation is not set, you may see errors related to communication between nodes, such as RuntimeError: Connection reset by peer.

For training, you can write a PyTorchJob specification as in the following example:

# pytorchjob.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
    name: pytorch-elastic-mnist-nccl
    namespace: p-{ projectName } # Name of Kubernetes Namespace for the project
spec:
    pytorchReplicaSpecs:
        Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
                metadata:
                    annotations:
                        sidecar.istio.io/inject: "false"  # Required to disable Istio sidecar injection
                spec:
                    nodeSelector:
                        mlx.navercorp.com/zone: { name of the GPU Zone provided } # Name of the zone you can view in GPU Resources
                    containers:
                    - name: pytorch  # The training container name must be pytorch
                      image: examples.com/pytorch-mnist-dist:23.03-py3
                      imagePullPolicy: Always
                      command: ["bash", "-c"]
                      args:
                      # The PET_* variables are injected from spec.elasticPolicy (see Use Elastic Policy)
                      - >
                        torchrun --nnodes ${PET_NNODES} --nproc_per_node ${PET_NPROC_PER_NODE} --rdzv_id ${PET_RDZV_ID} --rdzv_backend ${PET_RDZV_BACKEND} --rdzv_endpoint ${PET_RDZV_ENDPOINT}
                        /opt/mnist/src/mnist.py --checkpoint_path /data/checkpoints/mnist.pt --log_path /data/logs --data_path /data/dataset
                      env:
                      - name: NCCL_DEBUG
                        value: INFO

kubectl apply -f pytorchjob.yaml
pytorchjob.kubeflow.org/pytorch-elastic-mnist-nccl created

Use Elastic Policy

Write elasticPolicy to use torchrun.

...
spec:
    ...
    elasticPolicy:
        rdzvId: mnist
        rdzvBackend: c10d
        minReplicas: 2
        maxReplicas: 2
        nProcPerNode: 8
    ...

Each field of elasticPolicy is exposed to the PyTorchJob containers as an environment variable, which you can use to set the corresponding torchrun argument. For more information about the arguments used in torchrun, see the official documentation.

`elasticPolicy` field	Corresponding environment variable	Related `torchrun` argument	Description
`rdzvId`	`PET_RDZV_ID`	`--rdzv-id`	Job ID for rendezvous
`rdzvBackend`	`PET_RDZV_BACKEND`	`--rdzv-backend`	Rendezvous backend (e.g., c10d)
`minReplicas`, `maxReplicas`	`PET_NNODES`	`--nnodes`	Number of nodes
`nProcPerNode`	`PET_NPROC_PER_NODE`	`--nproc-per-node`	Number of processes per node (typically one per GPU)
`maxRestarts`	`PET_MAX_RESTARTS`	`--max-restarts`	Maximum number of restarts
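
The mapping in the table can also be applied programmatically, for example in a Python entrypoint that assembles the torchrun command line. The following is a minimal sketch under that assumption. The names PET_TO_FLAG and torchrun_args are hypothetical, and PET_RDZV_ENDPOINT is taken from the pytorchjob.yaml example rather than from the table.

```python
import os

# Environment variable -> torchrun argument, following the table above
PET_TO_FLAG = {
    "PET_NNODES": "--nnodes",
    "PET_NPROC_PER_NODE": "--nproc-per-node",
    "PET_RDZV_ID": "--rdzv-id",
    "PET_RDZV_BACKEND": "--rdzv-backend",
    "PET_RDZV_ENDPOINT": "--rdzv-endpoint",  # from the pytorchjob.yaml example
    "PET_MAX_RESTARTS": "--max-restarts",
}


def torchrun_args(env=os.environ):
    """Build torchrun arguments from whichever PET_* variables are set."""
    args = []
    for var, flag in PET_TO_FLAG.items():
        value = env.get(var)
        if value:  # skip variables that were not injected
            args += [flag, value]
    return args


print(torchrun_args({"PET_NNODES": "2", "PET_NPROC_PER_NODE": "8", "PET_RDZV_ID": "mnist"}))
```

Variables that are not set are simply skipped, so the same helper works whether or not optional fields such as maxRestarts are specified in elasticPolicy.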

Use Run Policy

Write runPolicy to control how a PyTorchJob runs and how it is cleaned up after completion.

...
spec:
  runPolicy:
    cleanPodPolicy: None
    ttlSecondsAfterFinished: 1814400 # Field where you set TTL manually (unit: seconds)
...

You can specify parameters related to running and cleaning up a PyTorchJob in spec.runPolicy. If you do not specify a parameter, its default value applies. The parameters you can specify under spec.runPolicy are as follows (see Kubeflow Trainer API Reference v1.9):

cleanPodPolicy - Determines which Pods are cleaned up after the PyTorchJob is complete.
- Default value: None.
- None: Pods are not deleted after the Job is complete, so you can view their logs later.
- All: Deletes all Pods after the Job is complete.
- Running: Deletes only the Pods that are still running after the Job is complete. Except in special cases, it is not normally used.
ttlSecondsAfterFinished - Number of seconds after completion before the Job is deleted.
activeDeadlineSeconds - Maximum running time of the Job in seconds. Once the specified time expires, the Job is marked as failed. If it is not set, there is no limit on the running time of the Job.
backoffLimit - Maximum number of retries when the Job fails.
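
If you generate PyTorchJob specifications from code, a small helper can reject typos in these parameters before the specification is applied. The following is a minimal sketch, not part of any official client: build_run_policy and its argument names are hypothetical, and the allowed cleanPodPolicy values are the three listed above.

```python
ALLOWED_CLEAN_POD_POLICIES = {"None", "All", "Running"}


def build_run_policy(clean_pod_policy="None", ttl_seconds=None,
                     active_deadline_seconds=None, backoff_limit=None):
    """Return a dict for spec.runPolicy; unset fields are left out so defaults apply."""
    if clean_pod_policy not in ALLOWED_CLEAN_POD_POLICIES:
        raise ValueError(
            f"cleanPodPolicy must be one of {sorted(ALLOWED_CLEAN_POD_POLICIES)}")
    policy = {"cleanPodPolicy": clean_pod_policy}
    optional = {
        "ttlSecondsAfterFinished": ttl_seconds,
        "activeDeadlineSeconds": active_deadline_seconds,
        "backoffLimit": backoff_limit,
    }
    for key, value in optional.items():
        if value is None:
            continue
        if value < 0:
            raise ValueError(f"{key} must not be negative")
        policy[key] = value
    return policy


print(build_run_policy(ttl_seconds=1814400))
```

Note that cleanPodPolicy: None in YAML is the string "None", not a null value, which is why the sketch treats it as an ordinary string.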

Use InfiniBand

If you run distributed training on nodes connected to the InfiniBand network, you can accelerate communication between nodes.

To adapt the example in the previous section to the InfiniBand environment, add the following specifications:

Specify the name of the zone where the InfiniBand network is configured with the mlx.navercorp.com/zone key (e.g., mlx.navercorp.com/zone=a100-ib), which the previous example sets under nodeSelector.
To use InfiniBand, add the IPC_LOCK capability to securityContext.
In the resource requests and limits, request the InfiniBand resource (rdma/hca_shared_devices_a: 1).
Mount shared memory for distributed training as a volume.

You can use InfiniBand as follows:

...
spec:
    ...
    pytorchReplicaSpecs:
        Worker:
            template:
                spec:
                    containers:
                    - name: pytorch
                      securityContext:  # securityContext is required to use InfiniBand.
                          capabilities:
                              add: ["IPC_LOCK"]
                      resources:
                          limits:
                              ...
                              rdma/hca_shared_devices_a: 1
                              ...
                          requests:
                              ...
                              rdma/hca_shared_devices_a: 1
                              ...
                      # shared memory
                      volumeMounts:
                      - mountPath: /dev/shm
                        name: shared-memory
                    volumes:
                    - emptyDir:
                          medium: Memory
                      name: shared-memory
...

PyTorchJob debugging

When an error occurs while using a PyTorchJob, you can set the following environment variables to log the information you need for debugging:

NCCL_DEBUG: NCCL-related debugging
TORCH_DISTRIBUTED_DEBUG and TORCH_CPP_LOG_LEVEL: Debugging for distributed training. For more information, see the PyTorch official documentation.

You can use them for debugging as follows:

...
spec:
    ...
    pytorchReplicaSpecs:
        Worker:
            template:
                spec:
                    containers:
                    - name: pytorch
                      ...
                      env:
                      - name: NCCL_DEBUG
                        value: "INFO"
                      - name: TORCH_DISTRIBUTED_DEBUG
                        value: "DETAIL"
                      - name: TORCH_CPP_LOG_LEVEL
                        value: "INFO"
                      ...

Use external Container Registry

If your Container Registry requires credentials, see Create Container Secret and create a Secret.
You can use the created Secret as follows:

...
spec:
    ...
    pytorchReplicaSpecs:
        Worker:
            template:
                spec:
                    imagePullSecrets:  # Pod-level field, a sibling of containers
                    - name: my-harbor-secret  # Name of the previously created Docker credential Secret
                    containers:
                    - name: pytorch
                      ...
...

Use existing volume

You can use the volume created through Volumes as follows:

...
spec:
    ...
    pytorchReplicaSpecs:
        Worker:
            template:
                spec:
                    containers:
                    - name: pytorch
                      volumeMounts:
                      - mountPath: /data
                        name: mnist-data # Name written in volumes below
                    volumes:
                    - name: mnist-data
                      persistentVolumeClaim:
                          claimName: mnist-data # Name of the PVC created through Volumes
...

Check PyTorchJob status

You can check the status of PyTorchJob using kubectl get and kubectl describe.

kubectl get pytorchjob pytorch-elastic-mnist-nccl
NAME                         STATE     AGE
pytorch-elastic-mnist-nccl   Running   12s

kubectl describe pytorchjob pytorch-elastic-mnist-nccl

Status:
  Completion Time:  2024-11-22T09:16:58Z
  Conditions:
    Last Transition Time:  2024-11-22T09:15:43Z
    Last Update Time:      2024-11-22T09:15:43Z
    Message:               PyTorchJob pytorch-elastic-mnist-nccl is created.
    Reason:                PyTorchJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2024-11-22T09:15:48Z
    Last Update Time:      2024-11-22T09:15:48Z
    Message:               PyTorchJob nb12706/pytorch-elastic-mnist-nccl is running.
    Reason:                PyTorchJobRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2024-11-22T09:16:58Z
    Last Update Time:      2024-11-22T09:16:58Z
    Message:               PyTorchJob nb12706/pytorch-elastic-mnist-nccl successfully completed.
    Reason:                PyTorchJobSucceeded
    Status:                True
    Type:                  Succeeded
  Last Reconcile Time:     2024-11-22T09:15:43Z
  Replica Statuses:
    Worker:
      Selector:   training.kubeflow.org/job-name=pytorch-elastic-mnist-nccl,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=worker
      Succeeded:  2
  Start Time:     2024-11-22T09:15:44Z
Events:
  Type    Reason                    Age                  From                   Message
  ----    ------                    ----                 ----                   -------
  Normal  SuccessfulCreatePod       4m21s                pytorchjob-controller  Created pod: pytorch-elastic-mnist-nccl-worker-0
  Normal  SuccessfulCreatePod       4m21s                pytorchjob-controller  Created pod: pytorch-elastic-mnist-nccl-worker-1
  Normal  SuccessfulCreateService   4m21s                pytorchjob-controller  Created service: pytorch-elastic-mnist-nccl-worker-0
  Normal  SuccessfulCreateService   4m21s                pytorchjob-controller  Created service: pytorch-elastic-mnist-nccl-worker-1
  Normal  ExitedWithCode            3m7s (x3 over 3m8s)  pytorchjob-controller  Pod: nb12706.pytorch-elastic-mnist-nccl-worker-1 exited with code 0
  Normal  ExitedWithCode            3m7s (x2 over 3m8s)  pytorchjob-controller  Pod: nb12706.pytorch-elastic-mnist-nccl-worker-0 exited with code 0
  Normal  PyTorchJobSucceeded       3m7s                 pytorchjob-controller  PyTorchJob nb12706/pytorch-elastic-mnist-nccl successfully completed.
  Normal  JobTerminated             3m6s (x4 over 3m7s)  pytorchjob-controller  Job has been terminated. Deleting PodGroup
  Normal  SuccessfulDeletePodGroup  3m6s (x4 over 3m7s)  pytorchjob-controller  Deleted PodGroup: pytorch-elastic-mnist-nccl
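
In the Conditions list above, several entries can be present at once, and the current state is the last one whose Status is True. The following is a minimal sketch of that rule for scripts that process the output of kubectl get pytorchjob <name> -o json. It assumes that the JSON uses lower camel case keys (conditions, type, status) and that conditions are listed oldest first, as in the describe output; the function name current_state is hypothetical.

```python
def current_state(status):
    """Return the type of the most recent condition whose status is "True"."""
    conditions = status.get("conditions", [])
    active = [c["type"] for c in conditions if c.get("status") == "True"]
    return active[-1] if active else None


# Conditions copied from the describe output above (other fields omitted)
example_status = {
    "conditions": [
        {"type": "Created", "status": "True"},
        {"type": "Running", "status": "False"},
        {"type": "Succeeded", "status": "True"},
    ]
}
print(current_state(example_status))  # Succeeded
```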

If a training Pod is not created properly, you can identify the causes through Events as follows:

kubectl describe pytorchjob pytorch-elastic-mnist-nccl

...
Events:
  Type     Reason           Age                 From                   Message
  ----     ------           ----                ----                   -------
  Warning  FailedCreatePod  47m (x3 over 103m)  pytorchjob-controller  Error creating: Pods "job-worker-1" is forbidden: exceeded quota: normal-quota, requested: requests.nvidia.com/gpu=1, used: requests.nvidia.com/gpu=2, limited: requests.nvidia.com/gpu=2
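
In this example, the Pod requests 1 GPU while the quota normal-quota already has 2 of its 2 GPUs in use, so the Pod cannot be created until resources are released. When many such events need to be checked, the message can be split into its parts with a regular expression. The following is a minimal sketch that assumes the message format shown above; the names QUOTA_RE and parse_quota_error are hypothetical.

```python
import re

QUOTA_RE = re.compile(
    r"exceeded quota: (?P<quota>[^,]+), "
    r"requested: (?P<requested>.+?), "
    r"used: (?P<used>.+?), "
    r"limited: (?P<limited>.+)$"
)


def parse_quota_error(message):
    """Split an 'exceeded quota' event message into its parts, or return None."""
    match = QUOTA_RE.search(message)
    if match is None:
        return None
    result = {"quota": match.group("quota")}
    for key in ("requested", "used", "limited"):
        # Each part is a comma-separated list of resource=value pairs
        pairs = (item.split("=", 1) for item in match.group(key).split(","))
        result[key] = {name.strip(): value.strip() for name, value in pairs}
    return result


message = (
    'Error creating: Pods "job-worker-1" is forbidden: exceeded quota: normal-quota, '
    "requested: requests.nvidia.com/gpu=1, used: requests.nvidia.com/gpu=2, "
    "limited: requests.nvidia.com/gpu=2"
)
print(parse_quota_error(message))
```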