Available in VPC
ML expert Platform provides efficient and reliable services across the entire AI/ML workflow, from data management and processing to large-scale distributed training, as well as serving and deployment for models ranging from small models to ultra-large AI models. Follow this step-by-step guide to train a model on the FashionMNIST dataset using ML expert Platform. This example performs multi-node distributed training, stores data using a persistent volume claim (PVC), and runs training using PyTorchJob.
1. Create a Workspace and Project
- Create a Workspace
- Register Workspace members
- Create a Project
- Register Project members
- View GPU Zone information and assign GPU instances
- Download kubeconfig
2. Prepare for training
Prepare the FashionMNIST dataset
This guide is based on the FashionMNIST dataset provided by Hugging Face.
We assume that the dataset has already been prepared in one of the following locations:
- Data Manager
- Object Storage, Ncloud Storage
| Data location | Recommended when | Notes |
|---|---|---|
| Data Manager | | |
| Object Storage, Ncloud Storage | | |
To manage the dataset with Data Manager, see Upload dataset.
Use training data
You can access training data by:
- Remotely reading the data using Hugging Face DataLoader.
- Copying the dataset to the selected storage, then reading it from there.
For more information on remotely reading data using Hugging Face DataLoader, see Read datasets. The following sections describe only the second method, which copies the dataset to a volume.
PVC creation
To use the selected storage in your ML expert Platform environment, create and mount a persistent volume claim (PVC). For details, see Volumes. The following examples show how to create PVCs for each storage type supported by ML expert Platform.
High-performance storage (DDN) supports the ReadWriteMany (RWX) access mode. You can configure the storage as shown in the examples below.
- When using mounted high-performance storage, the UID and GID in the training image must be set to 500.
- Do not set fsGroup in the Pod securityContext. Setting fsGroup causes Kubernetes to recursively change the ownership of all files in the mounted volume. For high-performance storage containing large amounts of data, this can make Pod initialization extremely slow.
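Because fsGroup cannot be used to fix ownership at mount time, it can help to confirm the UID and GID from inside the training container before starting a long job. The following is a minimal sketch, not a platform-provided tool: the helper name `runs_as_expected_user` is hypothetical, and the value 500 is taken from the note above.

```python
import os

# UID and GID required for mounted high-performance storage, per the note above.
REQUIRED_ID = 500


def runs_as_expected_user(uid: int, gid: int, required: int = REQUIRED_ID) -> bool:
    """Return True only when both the UID and the GID equal the required value."""
    return uid == required and gid == required


if __name__ == "__main__":
    # os.getuid() and os.getgid() report the identity of the current process (POSIX only).
    uid, gid = os.getuid(), os.getgid()
    if runs_as_expected_user(uid, gid):
        print(f"OK: running as {uid}:{gid}")
    else:
        print(f"Warning: running as {uid}:{gid}, expected {REQUIRED_ID}:{REQUIRED_ID}")
```

Running this script as the first command of a container is one way to surface a wrong USER setting in the image early.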
# exa-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: exa-pvc
  namespace: p-{ projectName } # Kubernetes Namespace name for the project
spec:
  storageClassName: { high-performance storageClassName } # StorageClass name for high-performance storage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi # Capacity for high-performance storage
kubectl -n {namespace} apply -f exa-pvc.yaml
Because local storage (NVMe) is bound to specific GPU servers, you must create a separate PVC for each node. For local storage (NVMe), we recommend using emptyDir with an initContainer or emptyDir with the Data Manager DataLoader.
# local-path-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-path-pvc
  namespace: p-{ projectName } # Kubernetes Namespace name for the project
spec:
  storageClassName: { local storage storageClassName } # StorageClass name for local storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi # Default value; the requested capacity is not applied to local storage
kubectl -n {namespace} apply -f local-path-pvc.yaml
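Since local storage needs one PVC per node, writing each manifest by hand becomes repetitive. The sketch below generates one manifest per node with the same fields as local-path-pvc.yaml. The function names, the `local-path-pvc-{index}` naming scheme, and the placeholder values `demo` and `local-path` are assumptions for illustration. How a PVC is tied to a particular node depends on the platform's StorageClass and is not covered here.

```python
import json


def local_pvc_manifest(project_name: str, storage_class: str, index: int) -> dict:
    """Build one PersistentVolumeClaim manifest mirroring local-path-pvc.yaml."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {
            "name": f"local-path-pvc-{index}",  # hypothetical naming scheme
            "namespace": f"p-{project_name}",
        },
        "spec": {
            "storageClassName": storage_class,
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": "10Gi"}},
        },
    }


def local_pvc_manifests(project_name: str, storage_class: str, node_count: int) -> list:
    """Build one manifest per node."""
    return [
        local_pvc_manifest(project_name, storage_class, index)
        for index in range(node_count)
    ]


if __name__ == "__main__":
    # "demo" and "local-path" are placeholder values, not real platform names.
    for manifest in local_pvc_manifests("demo", "local-path", 2):
        print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests as well as YAML, so each printed manifest can be saved to its own file and applied with kubectl apply -f.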
Download data
ML expert Platform provides a storage-initializer image for downloading datasets. Use a Kubernetes Job to mount the created PVC, and then download the dataset with the provided image.
The following shows an example of downloading data using a Job.
Data Manager usage example
# download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-job
  namespace: p-{ projectName } # Kubernetes Namespace name for the project
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      nodeSelector:
        mlx.navercorp.com/zone: { GPU Zone name } # Zone name as shown in GPU Resources
      containers:
        - name: storage-initializer
          image: mlx-public.kr.ncr.ntruss.com/mlx/mdm-storage-initializer:0.0.5
          env:
            - name: MLX_APIKEY
              value: '{ API Key }' # MLXP API Key
          args:
            - "mlx+data-manager://{ MLX endpoint url }/{workspace}/{dataset}"
            - "/data/dataset" # Path for saving the dataset within the spec.volumeMounts.mountPath
          volumeMounts:
            - mountPath: "/data"
              name: storage-volume
      volumes:
        - name: storage-volume
          persistentVolumeClaim:
            claimName: { PVC name to mount } # Enter the name of the PVC created (e.g. exa-pvc, local-path-pvc)
kubectl -n {namespace} apply -f download-job.yaml
Object Storage / Ncloud Storage usage example
# download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-job
  namespace: p-{ projectName } # Kubernetes Namespace name for the project
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      nodeSelector:
        mlx.navercorp.com/zone: { GPU Zone name } # Zone name as shown in GPU Resources
      containers:
        - name: storage-initializer
          image: mlx-public.kr.ncr.ntruss.com/mlx/kserve/storage-initializer:v0.13.0
          env:
            - name: AWS_ENDPOINT_URL
              value: { S3 Endpoint } # S3 Endpoint
            - name: AWS_ACCESS_KEY_ID
              value: { S3 Access Key } # S3 Access Key
            - name: AWS_SECRET_ACCESS_KEY
              value: { S3 Secret Key } # S3 Secret Key
            - name: AWS_DEFAULT_REGION
              value: { S3 Region } # S3 Region
          args:
            - "s3://{ Object Storage dataset URL }"
            - "/data/dataset" # Path for saving the dataset within the spec.volumeMounts.mountPath
          volumeMounts:
            - mountPath: "/data"
              name: storage-volume
      volumes:
        - name: storage-volume
          persistentVolumeClaim:
            claimName: { PVC name to mount } # Enter the name of the PVC created (e.g. exa-pvc, local-path-pvc)
kubectl -n {namespace} apply -f download-job.yaml
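After the download Job completes, a quick way to confirm that the dataset actually landed on the volume is to count the files under the destination path from any Pod that mounts the same PVC. The following is a minimal sketch using only the Python standard library. The helper name `summarize_directory` is hypothetical, and `/data/dataset` is the destination path used in the Job examples above.

```python
import os


def summarize_directory(root: str) -> tuple:
    """Return (file_count, total_bytes) for all regular files under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                file_count += 1
                total_bytes += os.path.getsize(path)
    return file_count, total_bytes


if __name__ == "__main__":
    # /data/dataset is the download destination from the Job examples above.
    count, size = summarize_directory("/data/dataset")
    print(f"{count} files, {size} bytes under /data/dataset")
```

A result of zero files suggests that the Job failed or wrote to a different path, in which case the Job's Pod logs are the next place to look.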
Prepare training code
To use InfiniBand, the image used in your PyTorchJob must have libibverbs.so installed. We recommend building your custom image on the base image provided by ML expert Platform, because it already includes all the necessary libraries.
To run training on ML expert Platform, write your code against the official NVIDIA PyTorch base image.
The example code below is designed for use with high-performance storage (DDN).
Example training code
The example training code is as follows:
# mnist_distributed.py
from __future__ import print_function
import argparse
import os
import time
from torch.utils.tensorboard import SummaryWriter
from torchvision.transforms import transforms
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from mlx.sdk.data import login, load_dataset
WORLD_SIZE = int(os.environ.get("WORLD_SIZE"))
LOCAL_RANK = int(os.environ.get("LOCAL_RANK", 0))
GLOBAL_RANK = int(os.environ.get("RANK"))


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4 * 4 * 50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4 * 4 * 50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(args, model, device, train_loader, optimizer, epoch, writer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        # Log only from the global rank 0 process to avoid duplicate output
        if GLOBAL_RANK == 0 and batch_idx % args.log_interval == 0:
            print(
                "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                    epoch,
                    batch_idx * len(data),
                    len(train_loader.dataset) // WORLD_SIZE,
                    100.0 * batch_idx / len(train_loader),
                    loss.item(),
                )
            )
            niter = epoch * len(train_loader) + batch_idx
            writer.add_scalar("loss", loss.item(), niter)


def test(args, model, device, test_loader, writer, epoch):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            # Sum up the batch loss
            test_loss += F.nll_loss(output, target, reduction="sum").item()
            # Get the index of the max log-probability
            pred = output.max(1, keepdim=True)[1]
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader.dataset)
    # Each rank evaluates only its own shard of the test set
    accuracy = float(correct) / (len(test_loader.dataset) / WORLD_SIZE)
    if GLOBAL_RANK == 0:
        print("\naccuracy={:.4f}\n".format(accuracy))
        writer.add_scalar("accuracy", accuracy, epoch)


def main():
    parser = argparse.ArgumentParser(description="PyTorch Distributed MNIST Example")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=64,
        metavar="N",
        help="input batch size for training (default: 64)",
    )
    parser.add_argument(
        "--test-batch-size",
        type=int,
        default=1000,
        metavar="N",
        help="input batch size for testing (default: 1000)",
    )
    parser.add_argument(
        "--epochs",
        type=int,
        default=5,
        metavar="N",
        help="number of epochs to train (default: 5)",
    )
    parser.add_argument(
        "--lr",
        type=float,
        default=0.01,
        metavar="LR",
        help="learning rate (default: 0.01)",
    )
    parser.add_argument(
        "--momentum",
        type=float,
        default=0.5,
        metavar="M",
        help="SGD momentum (default: 0.5)",
    )
    parser.add_argument(
        "--seed", type=int, default=1, metavar="S", help="random seed (default: 1)"
    )
    parser.add_argument(
        "--log-interval",
        type=int,
        default=10,
        metavar="N",
        help="how many batches to wait before logging training status",
    )
    parser.add_argument(
        "--checkpoint_path",
        default=f"/data/result/mnist_distributed_{int(time.time())}.pt",
        help="Path to save checkpoint",
    )
    parser.add_argument(
        "--data_path", default="/data/mnist/data", help="Path for training/test data"
    )
    parser.add_argument(
        "--log_path",
        default="/data/log",
        metavar="L",
        help="Directory path where summary logs are stored",
    )
    parser.add_argument(
        "--backend",
        type=str,
        help="Distributed backend",
        choices=[dist.Backend.GLOO, dist.Backend.NCCL, dist.Backend.MPI],
        default=dist.Backend.NCCL,
    )
    args = parser.parse_args()

    login("{ ML expert Platform API Key }")
    writer = SummaryWriter(args.log_path)
    torch.manual_seed(args.seed)

    print("Using distributed PyTorch with {} backend".format(args.backend))
    dist.init_process_group(backend=args.backend)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])

    def preprocess(examples):
        """Correct transform for HuggingFace datasets"""
        if isinstance(examples['image'], list):
            # Batch processing
            examples['image'] = [transform(img) for img in examples['image']]
            examples['label'] = torch.tensor(examples['label'])
        else:
            # Single item
            examples['image'] = transform(examples['image'])
            examples['label'] = torch.tensor(examples['label'])
        return examples['image'], examples['label']

    train_dataset = load_dataset(args.data_path, split="train")
    train_dataset.set_transform(preprocess)
    test_dataset = load_dataset(args.data_path, split="test")
    test_dataset.set_transform(preprocess)

    # shuffle must stay False on the DataLoader because the sampler controls ordering
    train_loader = DataLoader(
        train_dataset,
        batch_size=args.batch_size,
        shuffle=False,
        num_workers=1,
        pin_memory=True,
        sampler=DistributedSampler(train_dataset),
    )
    test_loader = DataLoader(
        test_dataset,
        batch_size=args.test_batch_size,
        shuffle=False,
        num_workers=1,
        pin_memory=True,
        sampler=DistributedSampler(test_dataset),
    )

    model = Net().to(LOCAL_RANK)
    # Wrap the model with DistributedDataParallel if needed.
    model = DDP(model, device_ids=[LOCAL_RANK])
    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    for epoch in range(1, args.epochs + 1):
        train(args, model, LOCAL_RANK, train_loader, optimizer, epoch, writer)
        test(args, model, LOCAL_RANK, test_loader, writer, epoch)

    # Save the checkpoint once, from the global rank 0 process
    if GLOBAL_RANK == 0:
        dir_name = os.path.dirname(args.checkpoint_path)
        if dir_name:
            os.makedirs(dir_name, exist_ok=True)
        torch.save(model.state_dict(), args.checkpoint_path)
        print(f"Checkpoint saved at {args.checkpoint_path}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
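The input size of fc1 in Net, 4 * 4 * 50, follows from the 28 x 28 FashionMNIST image passing through two unpadded 5 x 5 convolutions and two 2 x 2 max-pooling layers. The sketch below reproduces that arithmetic in plain Python. The helper names `conv_out` and `net_feature_map_size` are hypothetical and are not part of the example training code.

```python
def conv_out(size: int, kernel: int, stride: int = 1) -> int:
    """Output width of an unpadded convolution or pooling layer."""
    return (size - kernel) // stride + 1


def net_feature_map_size(input_size: int = 28) -> int:
    """Side length of the feature map that reaches fc1 in Net."""
    size = conv_out(input_size, kernel=5, stride=1)  # conv1: 28 -> 24
    size = conv_out(size, kernel=2, stride=2)        # max_pool2d: 24 -> 12
    size = conv_out(size, kernel=5, stride=1)        # conv2: 12 -> 8
    size = conv_out(size, kernel=2, stride=2)        # max_pool2d: 8 -> 4
    return size


if __name__ == "__main__":
    side = net_feature_map_size()
    # conv2 produces 50 channels, so fc1 receives side * side * 50 features.
    print(side, side * side * 50)
```

If you change the input resolution or the kernel sizes, the first argument of fc1 and the x.view call in forward must be updated to match.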
Example Dockerfile
FROM nvcr.io/nvidia/pytorch:23.03-py3
# Install tensorboardX
USER root
RUN pip install --no-cache-dir tensorboardX==2.6.2
RUN mkdir -p /opt/mnist/src
WORKDIR /opt/mnist/src
# UID 500 and GID 500 are required for high-performance storage access
USER 500:500
COPY mnist_distributed.py /opt/mnist/src/mnist_distributed.py
Configure container registry access information
To pull images from the container registry to ML expert Platform, you need to configure access information.
3. Start training
We recommend using Elastic Launch, which is officially provided by PyTorch. The following example is based on torchrun and uses high-performance storage (DDN).
Create a PyTorchJob
For distributed training, configure the PyTorchJob as follows:
# pytorchjob.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-mnist-dist-nccl
  namespace: p-{ projectName } # Kubernetes Namespace name for the project
spec:
  elasticPolicy:
    rdzvId: mnist
    rdzvBackend: c10d
    minReplicas: 2
    maxReplicas: 2
    nProcPerNode: 8
  runPolicy:
    cleanPodPolicy: None
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false" # Must disable Istio sidecar injection
        spec:
          nodeSelector:
            mlx.navercorp.com/zone: { GPU Zone name } # Zone name as shown in GPU Resources
          containers:
            - name: pytorch # Must set the container name for PyTorchJob to pytorch
              image: examples.com/pytorch-mnist-dist:23.03-py3 # Container image for the example code
              imagePullPolicy: Always
              securityContext: # securityContext configuration is required for InfiniBand
                capabilities:
                  add: ["IPC_LOCK"]
              command: ["bash", "-c"]
              args:
                - >
                  torchrun --nnodes ${PET_NNODES} --nproc_per_node ${PET_NPROC_PER_NODE} --rdzv_id ${PET_RDZV_ID} --rdzv_backend ${PET_RDZV_BACKEND} --rdzv_endpoint ${PET_RDZV_ENDPOINT}
                  /opt/mnist/src/mnist_distributed.py --checkpoint_path /data/checkpoints/mnist.pt --log_path /data/logs --data_path /data/dataset
              env:
                - name: NCCL_DEBUG
                  value: INFO
              resources:
                limits:
                  memory: "1Ti"
                  cpu: 120
                  nvidia.com/gpu: 8
                requests:
                  memory: "8Gi"
                  cpu: 120
                  nvidia.com/gpu: 8
              # shared memory
              volumeMounts:
                - mountPath: /dev/shm
                  name: shared-memory
                - mountPath: "/data"
                  name: storage-volume
          volumes:
            - emptyDir:
                medium: Memory
              name: shared-memory
            - name: storage-volume
              persistentVolumeClaim:
                claimName: exa-pvc # Name of the high-performance storage PVC created previously
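With 2 Worker replicas and 8 processes per node, torchrun starts 16 training processes in total, and DistributedSampler gives each process an equal share of the data. The following sketch works through that arithmetic, assuming the standard FashionMNIST split of 60,000 training and 10,000 test images and the default batch size of 64 from the example code. The helper names are hypothetical.

```python
import math


def world_size(replicas: int, nproc_per_node: int) -> int:
    """Total number of training processes across all nodes."""
    return replicas * nproc_per_node


def samples_per_rank(dataset_size: int, ranks: int) -> int:
    """Samples each rank processes per epoch.

    DistributedSampler rounds up to ceil(N / ranks) by default (drop_last=False).
    """
    return math.ceil(dataset_size / ranks)


def steps_per_epoch(dataset_size: int, ranks: int, batch_size: int) -> int:
    """Number of batches each rank's DataLoader yields per epoch."""
    return math.ceil(samples_per_rank(dataset_size, ranks) / batch_size)


if __name__ == "__main__":
    ranks = world_size(replicas=2, nproc_per_node=8)
    print("world size:", ranks)
    print("train samples per rank:", samples_per_rank(60_000, ranks))
    print("train steps per epoch:", steps_per_epoch(60_000, ranks, batch_size=64))
    print("effective global batch size:", 64 * ranks)
```

The effective global batch size grows with the number of processes, so the learning rate may need retuning when you change replicas or nProcPerNode.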
Run the PyTorchJob
kubectl -n { namespace } apply -f pytorchjob.yaml
4. Check status and results
To check the status and results of your training:
Check using Pod logs
You can check the logs of a training Pod using the kubectl logs command.
kubectl -n { namespace } logs pytorch-mnist-dist-nccl-worker-0 -c pytorch
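With NCCL_DEBUG set to INFO, the Pod logs mix NCCL messages with the training progress lines, so it can be convenient to extract only the loss values. The sketch below parses lines in the format printed by the train function in the example code. The helper name `parse_train_line` is hypothetical, and the sample line is generated from the same format string with illustrative values, not taken from a real run.

```python
import re

# Matches the progress line printed by train() in the example training code.
TRAIN_LINE = re.compile(
    r"Train Epoch: (\d+) \[(\d+)/(\d+) \((\d+)%\)\]\s+loss=(\S+)"
)


def parse_train_line(line: str):
    """Return a dict with epoch, seen, total, and loss, or None for other lines."""
    match = TRAIN_LINE.search(line)
    if match is None:
        return None
    epoch, seen, total, _percent, loss = match.groups()
    return {
        "epoch": int(epoch),
        "seen": int(seen),
        "total": int(total),
        "loss": float(loss),
    }


if __name__ == "__main__":
    # Illustrative line built from the training code's format string.
    sample = "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
        1, 640, 3750, 100.0 * 10 / 59, 0.4567
    )
    print(parse_train_line(sample))
    # Lines that do not match, such as NCCL messages, yield None.
    print(parse_train_line("some unrelated log line"))
```

To use it on real output, save the kubectl logs output to a file and pass each line to the function.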
Check using TensorBoard
If your training code logs data for TensorBoard, you can view the information through TensorBoard provided by ML expert Platform.
In the example PyTorchJob, TensorBoard logs are saved in /data/logs, as set by the --log_path argument.
Save training results
Model Registry lets you store and manage training parameters.
To save training parameters, you can upload them automatically through Model Registry SDKs or upload only the necessary parameters manually through Notebook.
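One detail to keep in mind before uploading: the example training code saves model.state_dict() from a model wrapped in DistributedDataParallel, so every key in the checkpoint carries a module. prefix. To load the checkpoint into a plain Net instance, the prefix has to be removed first. The following is a minimal sketch. The helper name `strip_ddp_prefix` is hypothetical, and the commented usage lines assume the checkpoint path from the PyTorchJob example.

```python
def strip_ddp_prefix(state_dict: dict, prefix: str = "module.") -> dict:
    """Return a copy of state_dict with the DDP wrapper prefix removed from each key."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


if __name__ == "__main__":
    # Stand-in values; a real checkpoint maps these keys to tensors.
    example = {"module.conv1.weight": 1, "module.fc2.bias": 2}
    print(strip_ddp_prefix(example))

    # Usage with PyTorch (requires torch and the Net class from the training code):
    #   state = torch.load("/data/checkpoints/mnist.pt", map_location="cpu")
    #   model = Net()
    #   model.load_state_dict(strip_ddp_prefix(state))
```

An alternative is to save model.module.state_dict() in the training code, which writes the keys without the prefix in the first place.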