Available in VPC
Jobs for several frameworks are provided for model training.
This guide covers the two basic forms of training in their most frequently used forms: single node training with Job and distributed node training with PyTorchJob.
For single node training, it is recommended to use Job.
If you use PyTorchJob for single node training, resources may be wasted because the training process is deployed and managed in a master-worker structure designed for distributed node training.
Run single node training (Job)
For training, you can write a Kubernetes Job specification as in the following example:
# job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist
  namespace: p-{projectName}
spec:
  backoffLimit: 1
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: { training image (e.g., example.com/mnist:latest) } # Training code written based on NVIDIA base images
          imagePullPolicy: Always
          resources:
            limits:
              memory: "8Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
          command: ["python"]
          args:
            - /opt/mnist/src/mnist.py
            - --checkpoint_path
            - /opt/mnist/checkpoints/mnist.pt
            - --log_path
            - /opt/mnist/log
            - --data_path
            - /opt/mnist/data
            - --download_data
kubectl apply -f job.yaml
job.batch/mnist created
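After the Job is created, you can check its progress and container logs with standard kubectl commands, for example:
kubectl -n p-{projectName} get job mnist
kubectl -n p-{projectName} logs job/mnist -f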
Use external Container Registry
If you need a Secret for an external Container Registry, see Create Container Secret and create one.
You can use the created Secret as follows:
...
spec:
  imagePullSecrets:
    - name: my-harbor-secret # Name of the previously created Docker Credential Secret
...
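If you manage Secrets with kubectl rather than through the console, a docker-registry type Secret such as the one referenced above can be created as follows; the registry address and credentials are placeholders:
kubectl -n p-{projectName} create secret docker-registry my-harbor-secret \
  --docker-server={registry address} \
  --docker-username={user name} \
  --docker-password={password}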
Use existing volume
You can use the volume created through Volumes as follows:
...
spec:
  containers:
    - name: main
      ...
      volumeMounts:
        - mountPath: /data
          name: mnist-data # Name specified in spec.volumes below
  volumes:
    - name: mnist-data
      persistentVolumeClaim:
        claimName: mnist-data # Name of the PVC created through Volumes
...
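Before running the Job, you can verify that the PVC exists and is bound, for example:
kubectl -n p-{projectName} get pvc mnist-data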
Job life cycle
When a Job ends, it is not deleted immediately; it remains in the list for a period of time so that container logs and status are preserved.
There is a limit on the maximum number of Jobs that can remain in the list, so you must manage their lifecycle through a TTL. For more information, see the Kubernetes Job APIs documentation.
When you create a Job, you can set the Time To Live (TTL) as needed. The TTL takes effect once the Job is complete (succeeded or failed); after it expires, the Job and its Pods are automatically deleted.
The following is an example of applying TTL for 3 weeks:
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist
  namespace: p-{projectName}
spec:
  ttlSecondsAfterFinished: 1814400 # Field where you set the TTL manually (unit: seconds)
Run distributed node training (PyTorchJob)
The following are the advantages of using PyTorchJob:
- Properly creates master and worker Pods based on the written container specifications.
- Automatically sets environment variables generally required for PyTorch distributed training, such as WORLD_SIZE, RANK, and MASTER_ADDR.
- Automatically creates Kubernetes Services so that the Pods used for training can communicate with each other. You can access the master at <pytorch-job-name>-master-0 and each worker at <pytorch-job-name>-worker-<idx>.
- If needed, exposes the arguments used by torchrun (such as --nnodes, --nproc-per-node, and --rdzv-endpoint) as environment variables through the Elastic Policy. For details, see Use Elastic Policy.
Write PyTorchJob
For distributed training using PyTorch, it is recommended to use torchrun (Elastic Launch). Note that when you use torchrun, the master Pod is not specified separately: in Torch Elastic, the Pod with RANK=0, which serves as the master node, may change while the job is running.
- spec.elasticPolicy - Settings related to torchrun. The settings specified here are injected as environment variables. For more information, see Use Elastic Policy.
- spec.runPolicy - Parameters related to running and post-processing a PyTorchJob. For more information, see Use Run Policy.
- spec.pytorchReplicaSpecs.Worker - Settings for the worker Pods that perform distributed training.
For training, you can write a PyTorchJob specification as in the following example:
- The name of the container that performs training (spec.pytorchReplicaSpecs.Worker.template.spec.containers[*].name) must be pytorch.
- For seamless distributed training, you must specify sidecar.istio.io/inject: "false" in spec.pytorchReplicaSpecs.Worker.template.metadata.annotations. If the annotation is not set, you may see errors related to communication between nodes, such as RuntimeError: Connection reset by peer.
# pytorchjob.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-elastic-mnist-nccl
  namespace: p-{projectName} # Name of the Kubernetes namespace for the project
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false" # Required to disable Istio sidecar injection
        spec:
          nodeSelector:
            mlx.navercorp.com/zone: { name of the GPU zone provided } # Name of the zone you can view in GPU Resources
          containers:
            - name: pytorch # The PyTorchJob container name must be pytorch
              image: examples.com/pytorch-mnist-dist:23.03-py3
              imagePullPolicy: Always
              command: ["bash", "-c"]
              args:
                - >
                  torchrun --nnodes ${PET_NNODES} --nproc_per_node ${PET_NPROC_PER_NODE} --rdzv_id ${PET_RDZV_ID} --rdzv_backend ${PET_RDZV_BACKEND} --rdzv_endpoint ${PET_RDZV_ENDPOINT}
                  /opt/mnist/src/mnist.py --checkpoint_path /data/checkpoints/mnist.pt --log_path /data/logs --data_path /data/dataset
              env:
                - name: NCCL_DEBUG
                  value: INFO
kubectl apply -f pytorchjob.yaml
pytorchjob.kubeflow.org/pytorch-elastic-mnist-nccl created
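You can list the worker Pods created by the controller using the training.kubeflow.org/job-name label, which is the same label that appears in the replica selector shown later in this guide:
kubectl -n p-{projectName} get pods -l training.kubeflow.org/job-name=pytorch-elastic-mnist-nccl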
Use Elastic Policy
Write elasticPolicy to use torchrun.
...
spec:
  ...
  elasticPolicy:
    rdzvId: mnist
    rdzvBackend: c10d
    minReplicas: 2
    maxReplicas: 2
    nProcPerNode: 8
...
Based on each field of elasticPolicy, environment variables are set in the PyTorchJob Pods. These environment variables can be used to set the torchrun arguments. For more information about the arguments used by torchrun, see the official documentation.
| elasticPolicy field | Corresponding environment variable | Related torchrun argument | Description |
|---|---|---|---|
| rdzvId | PET_RDZV_ID | --rdzv-id | Job ID for rendezvous |
| rdzvBackend | PET_RDZV_BACKEND | --rdzv-backend | Rendezvous backend (e.g., c10d) |
| minReplicas, maxReplicas | PET_NNODES | --nnodes | Number of nodes |
| nProcPerNode | PET_NPROC_PER_NODE | --nproc-per-node | Number of GPUs per node |
| maxRestarts | PET_MAX_RESTARTS | --max-restarts | Maximum number of retries |
Use Run Policy
Write runPolicy to control how the PyTorchJob runs and how its Pods are handled afterward.
...
spec:
  runPolicy:
    cleanPodPolicy: None
    ttlSecondsAfterFinished: 1814400 # Field where you set the TTL manually (unit: seconds)
...
You can specify parameters related to running and cleaning up a PyTorchJob in spec.runPolicy. If you do not specify a parameter, its default value applies. The parameters you can specify under spec.runPolicy are as follows (see Kubeflow Trainer API Reference v1.9), with a combined sketch after the list:
- cleanPodPolicy - Determines how Pods are handled after the PyTorchJob is complete. Default value: None.
  - None: Pods are not deleted after the Job is complete, which helps to view logs later.
  - All: Deletes all Pods after the Job is complete.
  - Running: Deletes the Pods still running after the Job is complete. Except in special cases, it is not normally used.
- ttlSecondsAfterFinished - Decides how many seconds after completion the Job should be deleted.
- activeDeadlineSeconds - Maximum running time of the Job. Once the specified time expires, the Job is marked as failed. If it is not set, there is no limit on the running time of the Job.
- backoffLimit - Maximum number of retries when the Job fails.
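For example, a runPolicy that combines these parameters might look as follows; the values shown are placeholders for illustration:
...
spec:
  runPolicy:
    cleanPodPolicy: None
    ttlSecondsAfterFinished: 1814400 # Delete the Job 3 weeks after completion
    activeDeadlineSeconds: 86400     # Mark the Job as failed if it runs longer than 24 hours
    backoffLimit: 3                  # Retry up to 3 times on failure
...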
Use InfiniBand
If you run distributed training with nodes connected to the InfiniBand network, you can accelerate communication between nodes.
The following summarizes the specifications you must add to adapt the example from the previous section to the InfiniBand environment:
- Specify the name of a zone where the InfiniBand network is configured in the nodeSelector (e.g., mlx.navercorp.com/zone: a100-ib).
- To use InfiniBand, add the IPC_LOCK capability to securityContext.
- When you write the resource requests, request the InfiniBand resource (rdma/hca_shared_devices_a: 1) so that it is assigned.
- Set up shared memory for distributed training through volumes.
You can use InfiniBand as follows:
...
spec:
  ...
  pytorchReplicaSpecs:
    Worker:
      template:
        spec:
          containers:
            - name: pytorch
              securityContext: # securityContext is required to use InfiniBand
                capabilities:
                  add: ["IPC_LOCK"]
              resources:
                limits:
                  ...
                  rdma/hca_shared_devices_a: 1
                  ...
                requests:
                  ...
                  rdma/hca_shared_devices_a: 1
                  ...
              # shared memory
              volumeMounts:
                - mountPath: /dev/shm
                  name: shared-memory
          volumes:
            - emptyDir:
                medium: Memory
              name: shared-memory
...
PyTorchJob debugging
When an error occurs while using PyTorchJob, you need to debug it. In this case, you can set the following environment variables to log the information you need:
- NCCL_DEBUG: NCCL-related debugging
- TORCH_DISTRIBUTED_DEBUG and TORCH_CPP_LOG_LEVEL: Debugging for distributed training. For more information, see the PyTorch official documentation.
You can use them for debugging as follows:
...
spec:
  ...
  pytorchReplicaSpecs:
    Worker:
      template:
        spec:
          containers:
            - name: pytorch
              ...
              env:
                - name: NCCL_DEBUG
                  value: "INFO"
                - name: TORCH_DISTRIBUTED_DEBUG
                  value: "DETAIL"
                - name: TORCH_CPP_LOG_LEVEL
                  value: "INFO"
...
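The debug output is written to the standard output of each worker, so you can view it with kubectl logs, for example:
kubectl -n p-{projectName} logs pytorch-elastic-mnist-nccl-worker-0 -f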
Use external Container Registry
If you need a Secret for an external Container Registry, see Create Container Secret and create one.
You can use the created Secret as follows:
...
spec:
  ...
  pytorchReplicaSpecs:
    Worker:
      template:
        spec:
          containers:
            - name: pytorch
              ...
          imagePullSecrets:
            - name: my-harbor-secret # Name of the previously created Docker Credential Secret
...
Use existing volume
You can use the volume created through Volumes as follows:
...
spec:
  ...
  pytorchReplicaSpecs:
    Worker:
      template:
        spec:
          containers:
            - name: pytorch
              volumeMounts:
                - mountPath: /data
                  name: mnist-data # Name specified in spec.volumes below
          volumes:
            - name: mnist-data
              persistentVolumeClaim:
                claimName: mnist-data # Name of the PVC created through Volumes
...
Check PyTorchJob status
You can check the status of PyTorchJob using kubectl get and kubectl describe.
kubectl get pytorchjob pytorch-elastic-mnist-nccl
NAME STATE AGE
pytorch-elastic-mnist-nccl Running 12s
kubectl describe pytorchjob pytorch-elastic-mnist-nccl
Status:
  Completion Time:  2024-11-22T09:16:58Z
  Conditions:
    Last Transition Time:  2024-11-22T09:15:43Z
    Last Update Time:      2024-11-22T09:15:43Z
    Message:               PyTorchJob pytorch-elastic-mnist-nccl is created.
    Reason:                PyTorchJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2024-11-22T09:15:48Z
    Last Update Time:      2024-11-22T09:15:48Z
    Message:               PyTorchJob nb12706/pytorch-elastic-mnist-nccl is running.
    Reason:                PyTorchJobRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2024-11-22T09:16:58Z
    Last Update Time:      2024-11-22T09:16:58Z
    Message:               PyTorchJob nb12706/pytorch-elastic-mnist-nccl successfully completed.
    Reason:                PyTorchJobSucceeded
    Status:                True
    Type:                  Succeeded
  Last Reconcile Time:  2024-11-22T09:15:43Z
  Replica Statuses:
    Worker:
      Selector:   training.kubeflow.org/job-name=pytorch-elastic-mnist-nccl,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=worker
      Succeeded:  2
  Start Time:  2024-11-22T09:15:44Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreatePod 4m21s pytorchjob-controller Created pod: pytorch-elastic-mnist-nccl-worker-0
Normal SuccessfulCreatePod 4m21s pytorchjob-controller Created pod: pytorch-elastic-mnist-nccl-worker-1
Normal SuccessfulCreateService 4m21s pytorchjob-controller Created service: pytorch-elastic-mnist-nccl-worker-0
Normal SuccessfulCreateService 4m21s pytorchjob-controller Created service: pytorch-elastic-mnist-nccl-worker-1
Normal ExitedWithCode 3m7s (x3 over 3m8s) pytorchjob-controller Pod: nb12706.pytorch-elastic-mnist-nccl-worker-1 exited with code 0
Normal ExitedWithCode 3m7s (x2 over 3m8s) pytorchjob-controller Pod: nb12706.pytorch-elastic-mnist-nccl-worker-0 exited with code 0
Normal PyTorchJobSucceeded 3m7s pytorchjob-controller PyTorchJob nb12706/pytorch-elastic-mnist-nccl successfully completed.
Normal JobTerminated 3m6s (x4 over 3m7s) pytorchjob-controller Job has been terminated. Deleting PodGroup
Normal SuccessfulDeletePodGroup 3m6s (x4 over 3m7s) pytorchjob-controller Deleted PodGroup: pytorch-elastic-mnist-nccl
If a training Pod is not created properly, you can identify the cause through Events as follows:
kubectl describe pytorchjob pytorch-elastic-mnist-nccl
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePod 47m (x3 over 103m) pytorchjob-controller Error creating: Pods "job-worker-1" is forbidden: exceeded quota: normal-quota, requested: requests.nvidia.com/gpu=1, used: requests.nvidia.com/gpu=2, limited: requests.nvidia.com/gpu=2
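In the example above, Pod creation failed because the GPU request exceeded the project quota. You can inspect the current quota usage in the namespace as follows; the quota name normal-quota is taken from the error message:
kubectl -n p-{projectName} describe resourcequota normal-quota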