Available in VPC
ML expert Platform provides jobs for multiple model training frameworks.
This guide shows how to train models with kubectl using the two most common methods: single-node training (Job) and distributed node training (PyTorchJob).
For single-node training, we recommend using Job.
Using PyTorchJob for single-node training can consume resources unnecessarily, because PyTorchJob is designed for distributed node training and manages deployments in a Master and Worker structure.
Run single-node training (Job)
You can define a Kubernetes Job specification for training as shown in the following example.
- When using mounted high-performance storage, the UID and GID in the training image must be set to 500.
- Do not set fsGroup in the Pod securityContext. Setting fsGroup causes Kubernetes to recursively change the ownership of all files in the mounted volume. For high-performance storage containing large amounts of data, this can make Pod initialization extremely slow.
# job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist
  namespace: p-{projectName}
spec:
  backoffLimit: 1
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: { training image (e.g. example.com/mnist:latest ) } # Training code written using the NVIDIA Base image
          imagePullPolicy: Always
          resources:
            limits:
              memory: "8Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
          command: ["python"]
          args:
            - /opt/mnist/src/mnist.py
            - --checkpoint_path
            - /opt/mnist/checkpoints/mnist.pt
            - --log_path
            - /opt/mnist/log
            - --data_path
            - /opt/mnist/data
            - --download_data
kubectl apply -f job.yaml
job.batch/mnist created
Use an external container registry
If the container registry requires Secret information, see Create a Container Secret to create one.
Once created, you can use the Secret as follows:
...
spec: # Pod spec (spec.template.spec of the Job)
  imagePullSecrets:
    - name: my-harbor-secret # Name of the Docker Credential Secret created in advance
...
Use an existing volume
To use a Volume created through Volumes, you can configure it as follows:
...
spec:
  containers:
    - name: main
      ...
      volumeMounts:
        - mountPath: /data
          name: mnist-data # Name specified in spec.volumes below
  volumes:
    - name: mnist-data
      persistentVolumeClaim:
        claimName: mnist-data # Name of the PVC created through Volumes
...
Job lifecycle
When a Job completes, it remains in the list for a certain period of time to preserve container logs and status.
Since there is a limit on the maximum number of Jobs that can remain in the list, you must manage the lifecycle using TTL. For more information, see Kubernetes Job API.
When creating a Job, you can set a custom time to live (TTL). When a Job finishes, the time to live (TTL) is activated, whether the Job succeeds or fails. After the TTL passes, the Job and all Pods that belong to the Job are automatically deleted.
The following example applies a TTL of 3 weeks.
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist
  namespace: p-{projectName}
spec:
  ttlSecondsAfterFinished: 1814400 # Field for setting TTL manually (unit: sec)
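The value 1814400 is three weeks expressed in seconds. As a minimal sketch, the conversion can be reproduced with the Python standard library. The helper name ttl_seconds is hypothetical and is not part of any platform API.

```python
from datetime import timedelta


def ttl_seconds(weeks: int = 0, days: int = 0, hours: int = 0) -> int:
    """Convert a retention period into a value for ttlSecondsAfterFinished."""
    return int(timedelta(weeks=weeks, days=days, hours=hours).total_seconds())


if __name__ == "__main__":
    # Three weeks, the retention period used in the example manifest.
    print(ttl_seconds(weeks=3))
```

Computing the value this way avoids arithmetic slips when you choose a different retention period, such as ttl_seconds(days=10).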
Run distributed node training (PyTorchJob)
Using PyTorchJob provides the following benefits:
- It creates Master and Worker Pods as appropriate based on the container specification you define.
- It automatically sets environment variables commonly required for PyTorch distributed training, such as WORLD_SIZE, RANK, and MASTER_ADDR.
- It automatically creates a Kubernetes Service to facilitate communication between training Pods. You can access the Master as <pytorch-job-name>-master-0 and the Workers as <pytorch-job-name>-worker-<idx>.
- If needed, you can use Elastic Policy to create environment variables for torchrun arguments, such as --nnodes, --nproc-per-node, and --rdzv-endpoint.
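As an illustration of how a training script might consume these values, the following sketch reads WORLD_SIZE, RANK, and MASTER_ADDR and builds a Worker name from the <pytorch-job-name>-worker-<idx> pattern. The names DistEnv, read_dist_env, and worker_hostname are hypothetical, and the single-process fallback values are an assumption chosen so the script also runs outside a PyTorchJob.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    world_size: int
    rank: int
    master_addr: str


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the distributed-training variables, falling back to
    single-process defaults (assumed) when they are not set."""
    return DistEnv(
        world_size=int(env.get("WORLD_SIZE", "1")),
        rank=int(env.get("RANK", "0")),
        master_addr=env.get("MASTER_ADDR", "localhost"),
    )


def worker_hostname(job_name: str, index: int) -> str:
    """Build a Worker name following <pytorch-job-name>-worker-<idx>."""
    return f"{job_name}-worker-{index}"


if __name__ == "__main__":
    print(read_dist_env())
    print(worker_hostname("pytorch-elastic-mnist-nccl", 0))
```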
Define a PyTorchJob
For distributed training with PyTorch, we recommend using torchrun (Elastic Launch). When using torchrun, do not specify a Master Pod explicitly. In Torch Elastic, the RANK=0 Pod that acts as the master node can change during execution.
- spec.elasticPolicy - Configures torchrun-related settings. Settings specified here are injected as environment variables. For details, see Use Elastic Policy.
- spec.runPolicy - Specifies parameters for PyTorchJob execution and post-completion handling. For details, see Use Run Policy.
- spec.pytorchReplicaSpecs.Worker - Configures the Worker Pod that performs distributed training.

Note the following when defining a PyTorchJob:
- The training container name (spec.pytorchReplicaSpecs.Worker.template.spec.containers[*].name) must be set to pytorch.
- To prevent Istio Sidecar injection and ensure stable distributed training, spec.pytorchReplicaSpecs.Worker.template.metadata.annotations is automatically set to sidecar.istio.io/inject: "false". If this annotation is not set, communication errors between nodes, such as RuntimeError: Connection reset by peer, may occur.
- When using mounted high-performance storage, the UID and GID in the training image must be set to 500.
- Do not set fsGroup in the Pod securityContext. Setting fsGroup causes Kubernetes to recursively change the ownership of all files in the mounted volume. For high-performance storage containing large amounts of data, this can make Pod initialization extremely slow.
You can define a PyTorchJob specification for training as shown in the following example.
# pytorchjob.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-elastic-mnist-nccl
  namespace: p-{ projectName } # Kubernetes Namespace name for the project
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false" # Automatically set. Can be omitted.
        spec:
          nodeSelector:
            mlx.navercorp.com/zone: { GPU Zone name } # Zone name as shown in GPU Resources
          containers:
            - name: pytorch # The container name for PyTorchJob must be pytorch
              image: examples.com/pytorch-mnist-dist:23.03-py3
              imagePullPolicy: Always
              command: ["bash", "-c"]
              args:
                - >
                  torchrun --nnodes ${PET_NNODES} --nproc_per_node ${PET_NPROC_PER_NODE} --rdzv_id ${PET_RDZV_ID} --rdzv_backend ${PET_RDZV_BACKEND} --rdzv_endpoint ${PET_RDZV_ENDPOINT}
                  /opt/mnist/src/mnist.py --checkpoint_path /data/checkpoints/mnist.pt --log_path /data/logs --data_path /data/dataset
              env:
                - name: NCCL_DEBUG
                  value: INFO
                - name: GLOO_SOCKET_FAMILY
                  value: AF_INET
              securityContext: # securityContext configuration is required for InfiniBand
                capabilities:
                  add: ["IPC_LOCK"]
              # shared memory
              volumeMounts:
                - mountPath: /dev/shm
                  name: shared-memory
          volumes:
            - emptyDir:
                medium: Memory
              name: shared-memory
kubectl apply -f pytorchjob.yaml
pytorchjob.kubeflow.org/pytorch-elastic-mnist-nccl created
Use Elastic Policy
Define elasticPolicy to use torchrun.
...
spec:
  ...
  elasticPolicy:
    rdzvId: mnist
    rdzvBackend: c10d
    minReplicas: 2
    maxReplicas: 2
    nProcPerNode: 8
...
Environment variables used for PyTorchJob are set based on each field value in elasticPolicy. These environment variables can be used to set torchrun arguments. For more information about the arguments used by torchrun, see the official documentation.
| elasticPolicy field | Environment variable | torchrun argument | Description |
|---|---|---|---|
| rdzvId | PET_RDZV_ID | --rdzv-id | Rendezvous Job ID |
| rdzvBackend | PET_RDZV_BACKEND | --rdzv-backend | Rendezvous backend, e.g., c10d |
| minReplicas, maxReplicas | PET_NNODES | --nnodes | Number of nodes |
| nProcPerNode | PET_NPROC_PER_NODE | --nproc-per-node | Number of GPUs per node |
| maxRestarts | PET_MAX_RESTARTS | --max-restarts | Maximum number of restarts |
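The mapping in the table can be sketched as a small lookup that turns whichever PET_* variables are present into a torchrun argument list. This is an illustration only. The names PET_TO_ARG and torchrun_args are hypothetical, and the sketch covers only the variables listed in the table.

```python
import os
from typing import List, Mapping

# Environment variable -> torchrun argument, as listed in the table.
PET_TO_ARG = {
    "PET_RDZV_ID": "--rdzv-id",
    "PET_RDZV_BACKEND": "--rdzv-backend",
    "PET_NNODES": "--nnodes",
    "PET_NPROC_PER_NODE": "--nproc-per-node",
    "PET_MAX_RESTARTS": "--max-restarts",
}


def torchrun_args(env: Mapping[str, str] = os.environ) -> List[str]:
    """Build torchrun arguments from the PET_* variables that are set."""
    args: List[str] = []
    for var, flag in PET_TO_ARG.items():
        value = env.get(var)
        if value:  # skip variables that are unset or empty
            args += [flag, value]
    return args


if __name__ == "__main__":
    # Prints an empty list when run outside a PyTorchJob Pod.
    print(torchrun_args())
```

Passing a plain dictionary instead of os.environ makes it easy to check the resulting arguments before submitting a job.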
Use Run Policy
Define runPolicy to control how a PyTorchJob is executed and cleaned up.
...
spec:
  runPolicy:
    cleanPodPolicy: None
    ttlSecondsAfterFinished: 1814400 # Field for setting TTL manually (unit: sec)
...
spec.runPolicy lets you specify parameters related to PyTorchJob execution and cleanup. If not specified, the default values are used. The parameters you can specify under spec.runPolicy include the following: (Reference: Kubeflow Trainer API Reference v1.9)
- cleanPodPolicy - Determines how to clean up Pods after the PyTorchJob completes.
  - Default: None
  - None: Does not delete Pods after the Job completes, which helps you check logs later.
  - All: Deletes all Pods after the Job completes.
  - Running: Deletes running Pods after the Job completes. This is rarely used except in special cases.
- ttlSecondsAfterFinished - Determines how many seconds after Job completion the Job is deleted.
- activeDeadlineSeconds - Specifies the maximum execution time for the Job. After the specified time passes, the Job is marked as failed. If not set, there is no limit on the Job execution time.
- backoffLimit - Specifies the maximum number of retries when the Job fails.
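If you generate manifests programmatically, a small helper can catch invalid values before the manifest is submitted. The following is a sketch under stated assumptions: build_run_policy is a hypothetical helper, the accepted cleanPodPolicy values are the three listed in this section, and rejecting negative numbers is an assumption of this example rather than documented platform behavior.

```python
from typing import Any, Dict, Optional

# Values described in this section.
CLEAN_POD_POLICIES = ("None", "All", "Running")


def build_run_policy(
    clean_pod_policy: str = "None",
    ttl_seconds_after_finished: Optional[int] = None,
    active_deadline_seconds: Optional[int] = None,
    backoff_limit: Optional[int] = None,
) -> Dict[str, Any]:
    """Return a dict suitable for spec.runPolicy, omitting unset fields."""
    if clean_pod_policy not in CLEAN_POD_POLICIES:
        raise ValueError(f"cleanPodPolicy must be one of {CLEAN_POD_POLICIES}")
    policy: Dict[str, Any] = {"cleanPodPolicy": clean_pod_policy}
    optional = {
        "ttlSecondsAfterFinished": ttl_seconds_after_finished,
        "activeDeadlineSeconds": active_deadline_seconds,
        "backoffLimit": backoff_limit,
    }
    for field, value in optional.items():
        if value is None:
            continue  # leave the field out so the default applies
        if value < 0:
            raise ValueError(f"{field} must not be negative")
        policy[field] = value
    return policy


if __name__ == "__main__":
    print(build_run_policy(ttl_seconds_after_finished=1814400))
```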
Use InfiniBand
- No separate RDMA resource settings, such as requests or limits, are required.
- For GPU Zone information, see View available GPU Zones.
You can speed up communication between nodes by running distributed training on nodes connected through an InfiniBand network.
To adapt the example shown in the previous section for an InfiniBand environment, add the following specifications:
- Specify the GPU Zone name as an annotation, such as mlx.navercorp.com/zone=ai-infra.
- Enable InfiniBand by adding the IPC_LOCK capability to securityContext.
- Configure shared memory in volumes to support distributed training.
To use InfiniBand, you can configure it as follows:
...
metadata:
  annotations:
    mlx.navercorp.com/zone: "ai-infra"
...
spec:
  ...
  pytorchReplicaSpecs:
    Worker:
      template:
        spec:
          containers:
            - name: pytorch
              securityContext: # securityContext configuration is required for InfiniBand
                capabilities:
                  add: ["IPC_LOCK"]
              # shared memory
              volumeMounts:
                - mountPath: /dev/shm
                  name: shared-memory
          volumes:
            - emptyDir:
                medium: Memory
              name: shared-memory
...
Debug PyTorchJob
To debug issues in a PyTorchJob, you can set the following environment variables to log the required information:
- NCCL_DEBUG: Enables NCCL-related debugging.
- TORCH_DISTRIBUTED_DEBUG, TORCH_CPP_LOG_LEVEL: Enable distributed training debugging. For details, see the official PyTorch documentation.
For debugging, you can configure it as follows:
...
spec:
  ...
  pytorchReplicaSpecs:
    Worker:
      template:
        spec:
          containers:
            - name: pytorch
              ...
              env:
                - name: NCCL_DEBUG
                  value: "INFO"
                - name: TORCH_DISTRIBUTED_DEBUG
                  value: "DETAIL"
                - name: TORCH_CPP_LOG_LEVEL
                  value: "INFO"
...
Use an external container registry
If the container registry requires Secret information, see Create a Container Secret to create one.
Once created, you can use the Secret as follows:
...
spec:
  ...
  pytorchReplicaSpecs:
    Worker:
      template:
        spec:
          containers:
            - name: pytorch
          imagePullSecrets:
            - name: my-harbor-secret # Name of the Docker Credential Secret created in advance
...
Use an existing volume
To use a Volume created through Volumes, you can configure it as follows:
...
spec:
  ...
  pytorchReplicaSpecs:
    Worker:
      template:
        spec:
          containers:
            - name: pytorch
              volumeMounts:
                - mountPath: /data
                  name: mnist-data # Name specified in spec.volumes below
          volumes:
            - name: mnist-data
              persistentVolumeClaim:
                claimName: mnist-data # Name of the PVC created through Volumes
...
Check PyTorchJob status
You can check the status of a PyTorchJob using kubectl get and kubectl describe.
kubectl get pytorchjob pytorch-elastic-mnist-nccl
NAME STATE AGE
pytorch-elastic-mnist-nccl Running 12s
kubectl describe pytorchjob pytorch-elastic-mnist-nccl
Status:
  Completion Time: 2024-11-22T09:16:58Z
  Conditions:
    Last Transition Time: 2024-11-22T09:15:43Z
    Last Update Time: 2024-11-22T09:15:43Z
    Message: PyTorchJob pytorch-elastic-mnist-nccl is created.
    Reason: PyTorchJobCreated
    Status: True
    Type: Created
    Last Transition Time: 2024-11-22T09:15:48Z
    Last Update Time: 2024-11-22T09:15:48Z
    Message: PyTorchJob nb12706/pytorch-elastic-mnist-nccl is running.
    Reason: PyTorchJobRunning
    Status: False
    Type: Running
    Last Transition Time: 2024-11-22T09:16:58Z
    Last Update Time: 2024-11-22T09:16:58Z
    Message: PyTorchJob nb12706/pytorch-elastic-mnist-nccl successfully completed.
    Reason: PyTorchJobSucceeded
    Status: True
    Type: Succeeded
  Last Reconcile Time: 2024-11-22T09:15:43Z
  Replica Statuses:
    Worker:
      Selector: training.kubeflow.org/job-name=pytorch-elastic-mnist-nccl,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=worker
      Succeeded: 2
  Start Time: 2024-11-22T09:15:44Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreatePod 4m21s pytorchjob-controller Created pod: pytorch-elastic-mnist-nccl-worker-0
Normal SuccessfulCreatePod 4m21s pytorchjob-controller Created pod: pytorch-elastic-mnist-nccl-worker-1
Normal SuccessfulCreateService 4m21s pytorchjob-controller Created service: pytorch-elastic-mnist-nccl-worker-0
Normal SuccessfulCreateService 4m21s pytorchjob-controller Created service: pytorch-elastic-mnist-nccl-worker-1
Normal ExitedWithCode 3m7s (x3 over 3m8s) pytorchjob-controller Pod: nb12706.pytorch-elastic-mnist-nccl-worker-1 exited with code 0
Normal ExitedWithCode 3m7s (x2 over 3m8s) pytorchjob-controller Pod: nb12706.pytorch-elastic-mnist-nccl-worker-0 exited with code 0
Normal PyTorchJobSucceeded 3m7s pytorchjob-controller PyTorchJob nb12706/pytorch-elastic-mnist-nccl successfully completed.
Normal JobTerminated 3m6s (x4 over 3m7s) pytorchjob-controller Job has been terminated. Deleting PodGroup
Normal SuccessfulDeletePodGroup 3m6s (x4 over 3m7s) pytorchjob-controller Deleted PodGroup: pytorch-elastic-mnist-nccl
If a training Pod is not created properly, you can identify the cause through Events as follows:
kubectl describe pytorchjob pytorch-elastic-mnist-nccl
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePod 47m (x3 over 103m) pytorchjob-controller Error creating: Pods "job-worker-1" is forbidden: exceeded quota: normal-quota, requested: requests.nvidia.com/gpu=1, used: requests.nvidia.com/gpu=2, limited: requests.nvidia.com/gpu=2
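In this event, the quota limit is 2 GPUs, 2 are already in use, and the Pod requests 1 more, so creation is rejected. As a rough sketch, the numbers can be extracted from the message and checked as follows. The function names are hypothetical, and the regular expression assumes the message format shown in the event above.

```python
import re
from typing import Dict

# Matches "requested: requests.nvidia.com/gpu=1" and the used/limited pairs.
QUOTA_FIELD = re.compile(r"(requested|used|limited): requests\.nvidia\.com/gpu=(\d+)")


def parse_gpu_quota(message: str) -> Dict[str, int]:
    """Extract the requested, used, and limited GPU counts from an event."""
    return {key: int(value) for key, value in QUOTA_FIELD.findall(message)}


def exceeds_quota(quota: Dict[str, int]) -> bool:
    """A request is rejected when used + requested is above the limit."""
    return quota["used"] + quota["requested"] > quota["limited"]


if __name__ == "__main__":
    event = (
        'Error creating: Pods "job-worker-1" is forbidden: exceeded quota: '
        "normal-quota, requested: requests.nvidia.com/gpu=1, "
        "used: requests.nvidia.com/gpu=2, limited: requests.nvidia.com/gpu=2"
    )
    quota = parse_gpu_quota(event)
    print(quota, exceeds_quota(quota))
```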
Guidelines for distributed training
This section provides guidelines for distributed training in MLXP.
Use a CUDA version compatible with the NVIDIA driver
The CUDA Runtime version must be compatible with the NVIDIA driver installed on the node. Failing to meet the minimum driver version requirements may result in CUDA initialization or runtime errors, preventing the training job from executing correctly.
Set GLOO_SOCKET_FAMILY=AF_INET when IPv6 is unavailable
GLOO frequently establishes TCP connections to check the status between nodes and manage ranks. When running large-scale training in an environment where IPv6 is unavailable, each connection attempt first tries and fails to connect over IPv6 before retrying over IPv4. If these repeated connection attempts continue, resource issues such as file descriptor exhaustion may occur and cause training to stop.
To prevent this issue, we recommend setting the GLOO_SOCKET_FAMILY=AF_INET environment variable to use only IPv4.
To force GLOO to use IPv4, you can configure it as follows:
...
spec:
  ...
  pytorchReplicaSpecs:
    Worker:
      template:
        spec:
          containers:
            - name: pytorch
              ...
              env:
                - name: GLOO_SOCKET_FAMILY
                  value: AF_INET
...
Limit the PyTorchJob name length (recommended)
Since the PyTorchJob name is used as a prefix when creating Kubernetes resources, such as Pods and Services, we recommend keeping it to 50 characters or fewer. If the name exceeds 50 characters, derived resource names may also exceed the length limit, causing creation failures or issues with worker index parsing.
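A pre-submission check along these lines can catch overly long names early. This is a minimal sketch. validate_job_name is a hypothetical helper, the 50-character threshold is the recommendation from this guide, and the 63-character figure is the Kubernetes limit for DNS label names such as Service names.

```python
MAX_JOB_NAME_LENGTH = 50   # recommendation from this guide
MAX_DNS_LABEL_LENGTH = 63  # Kubernetes limit for DNS label names


def validate_job_name(job_name: str, replicas: int = 1) -> str:
    """Return the longest derived Worker name, or raise if a limit is exceeded."""
    if len(job_name) > MAX_JOB_NAME_LENGTH:
        raise ValueError(
            f"job name has {len(job_name)} characters; keep it to "
            f"{MAX_JOB_NAME_LENGTH} or fewer"
        )
    # The highest worker index produces the longest derived name.
    longest = f"{job_name}-worker-{replicas - 1}"
    if len(longest) > MAX_DNS_LABEL_LENGTH:
        raise ValueError(f"derived name {longest!r} exceeds {MAX_DNS_LABEL_LENGTH} characters")
    return longest


if __name__ == "__main__":
    print(validate_job_name("pytorch-elastic-mnist-nccl", replicas=2))
```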
Remove the memory limit setting when configuring shared memory
We recommend not setting a memory limit for shared memory (/dev/shm) or setting a sufficiently large value. Insufficient /dev/shm can cause DataLoader multiprocessing workers to terminate abnormally and lead to a significant decrease in NCCL communication performance.
The following example shows a shared memory configuration with the memory limit removed:
...
spec:
  ...
  pytorchReplicaSpecs:
    Worker:
      template:
        spec:
          containers:
            - name: pytorch
              # shared memory
              volumeMounts:
                - mountPath: /dev/shm
                  name: shared-memory
          volumes:
            - emptyDir:
                medium: Memory
              name: shared-memory
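To confirm that the mounted shared memory is large enough, a training script can check the free space at startup. The following sketch uses shutil.disk_usage from the Python standard library. The function names and the required size you pass in are assumptions for illustration, not values prescribed by the platform.

```python
import shutil


def free_gib(path: str = "/dev/shm") -> float:
    """Return the free space at path in GiB."""
    return shutil.disk_usage(path).free / 2**30


def has_enough_shm(required_gib: float, path: str = "/dev/shm") -> bool:
    """Check whether at least required_gib GiB are free at path."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    # The current directory is used here so the sketch runs anywhere;
    # inside a training Pod, call free_gib() to inspect /dev/shm.
    print(f"free space: {free_gib('.'):.1f} GiB")
```

Running the check before creating DataLoader workers lets the job fail early with a clear message instead of terminating abnormally in the middle of training.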