Available in VPC
ML expert Platform provides efficient, stable services across the entire AI/ML lifecycle, from data management and processing through large-scale distributed training to serving deployment, covering everything from small models to hyperscale AI models. This tutorial shows, step by step, how to train a model on the FashionMNIST dataset using ML expert Platform. The following example trains on multiple nodes, uses a Persistent Volume Claim (PVC) to store the data, and runs training with a PyTorchJob.
1. Create workspace and project
2. Prepare training
Prepare FashionMNIST dataset
This example is based on the FashionMNIST dataset provided by Hugging Face.
The dataset is assumed to have been prepared in advance in one of the following locations:
- Data Manager
- Object Storage, Ncloud Storage
| Data management location | Recommendations | Note |
|---|---|---|
| Data Manager | | |
| Object Storage, Ncloud Storage | | |
For more information about how to manage the dataset with Data Manager, see Upload dataset.
Use training data
To use training data:
- Reading remotely with the Hugging Face DataLoader
- Copying the dataset into the selected storage and reading it from that storage

For more information about Hugging Face DataLoader-based remote reading, see Read dataset. The following describes only the storage (volume) method.
Create PVC
If you want to use the selected storage in the workspace within ML expert Platform, you must create and mount PersistentVolumeClaim (PVC). For more information, see Volumes. The following are PVC creation examples according to the type of storage provided by ML expert Platform:
The DataDirect Networks (DDN) high-performance storage supports the ReadWriteMany (RWX) access mode, so you can configure it as follows:
```yaml
# exa-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: exa-pvc
  namespace: p-{ projectName } # Name of the Kubernetes namespace for the project
spec:
  storageClassName: { High-performance storage's storageClassName } # StorageClass name of the high-performance storage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi # Size of the high-performance storage to be created
```

```shell
kubectl -n {namespace} apply -f exa-pvc.yaml
```
Local storage (NVMe) is bound to a specific GPU server, so you must create as many PVCs as there are nodes. When using local storage (NVMe), it is recommended to use either EmptyDir with an initContainer, or EmptyDir with the Data Manager DataLoader.
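For reference, the EmptyDir-with-initContainer pattern mentioned above can be sketched as follows. This is only an illustrative sketch: the pod name, training image, and dataset URL are placeholders, and your actual PyTorchJob spec would carry the same volume wiring.

```yaml
# Sketch only: an initContainer downloads the dataset into an emptyDir
# scratch volume before the training container starts.
apiVersion: v1
kind: Pod
metadata:
  name: emptydir-example # placeholder name
spec:
  initContainers:
    - name: download
      image: mlx-public.kr.ncr.ntruss.com/mlx/kserve/storage-initializer:v0.13.0
      args: ["s3://{ Object Storage dataset URL }", "/data/dataset"]
      volumeMounts:
        - mountPath: /data
          name: scratch
  containers:
    - name: train
      image: { training image } # placeholder
      volumeMounts:
        - mountPath: /data
          name: scratch
  volumes:
    - name: scratch
      emptyDir: {} # backed by node-local disk; data is discarded with the pod
```

Because an emptyDir lives and dies with the pod on its node, each training pod downloads its own copy, which avoids the one-PVC-per-node bookkeeping that local-storage PVCs require.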
```yaml
# local-path-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-path-pvc
  namespace: p-{ projectName } # Name of the Kubernetes namespace for the project
spec:
  storageClassName: { Local storage's storageClassName } # StorageClass name of the local storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi # (default value) Not applied for local storage
```

```shell
kubectl -n {namespace} apply -f local-path-pvc.yaml
```
Download data
ML expert Platform provides the storage-initializer image for downloading datasets. Use a Kubernetes Job that mounts the created PVC and downloads the dataset with the provided image.
The following are examples of downloading with a Job:
Data Manager usage example
```yaml
# download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-job
  namespace: p-{ projectName } # Name of the Kubernetes namespace for the project
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      nodeSelector:
        mlx.navercorp.com/zone: { name of the GPU Zone provided } # Name of the zone you can view in GPU Resources
      containers:
        - name: storage-initializer
          image: mlx-public.kr.ncr.ntruss.com/mlx/mdm-storage-initializer:v0.0.1
          env:
            - name: MLX_APIKEY
              value: '{ API Key }' # MLXP API Key
          args:
            - "mlx+data-manager://{ MLX endpoint url }/{workspace}/{dataset}"
            - "/data/dataset" # Path where the dataset is saved, under the mountPath of volumeMounts
          volumeMounts:
            - mountPath: "/data"
              name: storage-volume
      volumes:
        - name: storage-volume
          persistentVolumeClaim:
            claimName: { name of the PVC to mount } # Enter the name of the created PVC (e.g., exa-pvc or local-path-pvc)
```

```shell
kubectl -n {namespace} apply -f download-job.yaml
```
Object Storage/Ncloud Storage usage example
```yaml
# download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-job
  namespace: p-{ projectName } # Name of the Kubernetes namespace for the project
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      nodeSelector:
        mlx.navercorp.com/zone: { name of the GPU Zone provided } # Name of the zone you can view in GPU Resources
      containers:
        - name: storage-initializer
          image: mlx-public.kr.ncr.ntruss.com/mlx/kserve/storage-initializer:v0.13.0
          env:
            - name: AWS_ENDPOINT_URL
              value: { S3 Endpoint } # S3 Endpoint
            - name: AWS_ACCESS_KEY_ID
              value: { S3 Access Key } # S3 Access Key
            - name: AWS_SECRET_ACCESS_KEY
              value: { S3 Secret Key } # S3 Secret Key
            - name: AWS_DEFAULT_REGION
              value: { S3 Region } # S3 Region
          args:
            - "s3://{ Object Storage dataset URL }"
            - "/data/dataset" # Path where the dataset is saved, under the mountPath of volumeMounts
          volumeMounts:
            - mountPath: "/data"
              name: storage-volume
      volumes:
        - name: storage-volume
          persistentVolumeClaim:
            claimName: { name of the PVC to mount } # Enter the name of the created PVC (e.g., exa-pvc or local-path-pvc)
```

```shell
kubectl -n {namespace} apply -f download-job.yaml
```
Prepare training code
To use InfiniBand, libibverbs.so must be installed in the images used by the PyTorchJob. The base images shared by ML expert Platform have all required libraries installed, so it is recommended to build your images on top of them.
To use the training features provided by ML expert Platform, you must write your code based on the official NVIDIA PyTorch base image.
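As a quick sanity check (an illustrative sketch, not a platform-provided tool), you can verify from inside a container whether the ibverbs library is visible to the dynamic loader:

```python
# Check whether libibverbs is discoverable by the dynamic loader.
# On machines without InfiniBand, find_library simply returns None.
from ctypes.util import find_library

lib = find_library("ibverbs")
if lib:
    print(f"libibverbs found: {lib}")
else:
    print("libibverbs not found; InfiniBand support is unavailable")
```

If the library is missing, NCCL will typically fall back to TCP transport, which is much slower for multi-node training.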
The following example code is written based on the use of DataDirect Networks (DDN) high-performance storage.
Training code example
The training code example is as follows:
```python
# mnist_distributed.py
from __future__ import print_function

import argparse
import os
import time

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.utils.tensorboard import SummaryWriter
from torchvision.transforms import transforms

from mlx.sdk.data import login, load_dataset

WORLD_SIZE = int(os.environ.get("WORLD_SIZE"))
LOCAL_RANK = int(os.environ.get("LOCAL_RANK", 0))
GLOBAL_RANK = int(os.environ.get("RANK"))


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4 * 4 * 50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4 * 4 * 50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(args, model, device, train_loader, optimizer, epoch, writer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if GLOBAL_RANK == 0 and batch_idx % args.log_interval == 0:
            print(
                "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                    epoch,
                    batch_idx * len(data),
                    len(train_loader.dataset) // WORLD_SIZE,
                    100.0 * batch_idx / len(train_loader),
                    loss.item(),
                )
            )
            niter = epoch * len(train_loader) + batch_idx
            writer.add_scalar("loss", loss.item(), niter)


def test(args, model, device, test_loader, writer, epoch):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(
                output, target, reduction="sum"
            ).item()  # sum up batch loss
            pred = output.max(1, keepdim=True)[
                1
            ]  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    accuracy = float(correct) / (len(test_loader.dataset) / WORLD_SIZE)
    if GLOBAL_RANK == 0:
        print("\naccuracy={:.4f}\n".format(accuracy))
        writer.add_scalar("accuracy", accuracy, epoch)


def main():
    parser = argparse.ArgumentParser(description="PyTorch Distributed MNIST Example")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=64,
        metavar="N",
        help="input batch size for training (default: 64)",
    )
    parser.add_argument(
        "--test-batch-size",
        type=int,
        default=1000,
        metavar="N",
        help="input batch size for testing (default: 1000)",
    )
    parser.add_argument(
        "--epochs",
        type=int,
        default=5,
        metavar="N",
        help="number of epochs to train (default: 5)",
    )
    parser.add_argument(
        "--lr",
        type=float,
        default=0.01,
        metavar="LR",
        help="learning rate (default: 0.01)",
    )
    parser.add_argument(
        "--momentum",
        type=float,
        default=0.5,
        metavar="M",
        help="SGD momentum (default: 0.5)",
    )
    parser.add_argument(
        "--seed", type=int, default=1, metavar="S", help="random seed (default: 1)"
    )
    parser.add_argument(
        "--log-interval",
        type=int,
        default=10,
        metavar="N",
        help="how many batches to wait before logging training status",
    )
    parser.add_argument(
        "--checkpoint_path",
        default=f"/data/result/mnist_distributed_{int(time.time())}.pt",
        help="Path to save checkpoint",
    )
    parser.add_argument(
        "--data_path", default="/data/mnist/data", help="Path for training/test data"
    )
    parser.add_argument(
        "--log_path",
        default="/data/log",
        metavar="L",
        help="Directory path where summary logs are stored",
    )
    parser.add_argument(
        "--backend",
        type=str,
        help="Distributed backend",
        choices=[dist.Backend.GLOO, dist.Backend.NCCL, dist.Backend.MPI],
        default=dist.Backend.NCCL,
    )
    args = parser.parse_args()

    login("{ ML expert Platform API Key }")

    writer = SummaryWriter(args.log_path)
    torch.manual_seed(args.seed)

    print("Using distributed PyTorch with {} backend".format(args.backend))
    dist.init_process_group(backend=args.backend)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
    ])

    def preprocess(examples):
        """Apply the transform to Hugging Face datasets examples."""
        if isinstance(examples['image'], list):
            # Batch processing
            examples['image'] = [transform(img) for img in examples['image']]
            examples['label'] = torch.tensor(examples['label'])
        else:
            # Single item
            examples['image'] = transform(examples['image'])
            examples['label'] = torch.tensor(examples['label'])
        return examples['image'], examples['label']

    train_dataset = load_dataset(args.data_path, split="train")
    train_dataset.set_transform(preprocess)
    test_dataset = load_dataset(args.data_path, split="test")
    test_dataset.set_transform(preprocess)

    train_loader = DataLoader(
        train_dataset,
        batch_size=args.batch_size,
        shuffle=False,
        num_workers=1,
        pin_memory=True,
        sampler=DistributedSampler(train_dataset),
    )
    test_loader = DataLoader(
        test_dataset,
        batch_size=args.test_batch_size,
        shuffle=False,
        num_workers=1,
        pin_memory=True,
        sampler=DistributedSampler(test_dataset),
    )

    model = Net().to(LOCAL_RANK)
    # Wrap the model with DistributedDataParallel.
    model = DDP(model, device_ids=[LOCAL_RANK])

    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    for epoch in range(1, args.epochs + 1):
        train(args, model, LOCAL_RANK, train_loader, optimizer, epoch, writer)
        test(args, model, LOCAL_RANK, test_loader, writer, epoch)

    if GLOBAL_RANK == 0:
        dir_name = os.path.dirname(args.checkpoint_path)
        if dir_name:
            os.makedirs(dir_name, exist_ok=True)
        torch.save(model.state_dict(), args.checkpoint_path)
        print(f"Checkpoint saved at {args.checkpoint_path}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```
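For reference, the index partitioning that `DistributedSampler` performs in the code above (with shuffling disabled) can be illustrated with a small, dependency-free sketch. `shard_indices` is a hypothetical helper written for this illustration, not part of any SDK:

```python
# Illustration of round-robin sharding: with shuffle disabled, each rank
# processes every world_size-th sample starting at its own rank offset.
def shard_indices(dataset_len, world_size, rank):
    """Return the dataset indices a given rank would process."""
    return list(range(rank, dataset_len, world_size))

# 10 samples split across 2 ranks:
print(shard_indices(10, 2, 0))  # [0, 2, 4, 6, 8]
print(shard_indices(10, 2, 1))  # [1, 3, 5, 7, 9]
```

This is why the training log above divides `len(train_loader.dataset)` by `WORLD_SIZE`: each rank only ever sees its own shard of the data per epoch.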
Dockerfile example
```dockerfile
FROM nvcr.io/nvidia/pytorch:23.03-py3

# Install tensorboardX
USER root
RUN pip install --no-cache-dir tensorboardX==2.6.2

RUN mkdir -p /opt/mnist/src
WORKDIR /opt/mnist/src

# UID 500 and GID 500 permissions are required to use the high-performance storage
USER 500:500

COPY mnist_distributed.py /opt/mnist/src/mnist_distributed.py
```
Set Container Registry access information
To allow ML expert Platform to pull images from Container Registry, you must configure the registry access information.
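As a sketch of the standard Kubernetes pattern (the Secret name `regcred` and the credential values are placeholders; the platform may document its own procedure), registry credentials are provided as a docker-registry Secret and referenced via `imagePullSecrets`:

```yaml
# Sketch only: create the Secret first, for example
#   kubectl -n p-{ projectName } create secret docker-registry regcred \
#     --docker-server={ registry address } \
#     --docker-username={ access key } \
#     --docker-password={ secret key }
# then reference it in the pod template spec:
spec:
  imagePullSecrets:
    - name: regcred # placeholder Secret name
```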
3. Start training
It is recommended to use Elastic Launch (torchrun), officially provided by PyTorch. The following example uses the DataDirect Networks (DDN) high-performance storage with torchrun:
Create PyTorchJob
For distributed training, proceed as follows:
```yaml
# pytorchjob.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-mnist-dist-nccl
  namespace: p-{ projectName } # Name of the Kubernetes namespace for the project
spec:
  elasticPolicy:
    rdzvId: mnist
    rdzvBackend: c10d
    minReplicas: 2
    maxReplicas: 2
    nProcPerNode: 8
  runPolicy:
    cleanPodPolicy: None
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false" # Required to disable Istio sidecar injection
        spec:
          nodeSelector:
            mlx.navercorp.com/zone: { name of the GPU Zone provided } # Name of the zone you can view in GPU Resources
          containers:
            - name: pytorch # The PyTorchJob container name must be pytorch
              image: examples.com/pytorch-mnist-dist:23.03-py3 # Container image of the example code
              imagePullPolicy: Always
              securityContext: # securityContext is required to use InfiniBand
                capabilities:
                  add: ["IPC_LOCK"]
              command: ["bash", "-c"]
              args:
                - >
                  torchrun --nnodes ${PET_NNODES} --nproc_per_node ${PET_NPROC_PER_NODE}
                  --rdzv_id ${PET_RDZV_ID} --rdzv_backend ${PET_RDZV_BACKEND}
                  --rdzv_endpoint ${PET_RDZV_ENDPOINT}
                  /opt/mnist/src/mnist_distributed.py --checkpoint_path /data/checkpoints/mnist.pt
                  --log_path /data/logs --data_path /data/dataset
              env:
                - name: NCCL_DEBUG
                  value: INFO
              resources:
                limits:
                  memory: "1Ti"
                  cpu: 120
                  nvidia.com/gpu: 8
                  rdma/hca_shared_devices_a: 1
                requests:
                  memory: "8Gi"
                  cpu: 120
                  nvidia.com/gpu: 8
                  rdma/hca_shared_devices_a: 1
              volumeMounts:
                # Shared memory for DataLoader workers
                - mountPath: /dev/shm
                  name: shared-memory
                - mountPath: "/data"
                  name: storage-volume
          volumes:
            - emptyDir:
                medium: Memory
              name: shared-memory
            - name: storage-volume
              persistentVolumeClaim:
                claimName: exa-pvc # Name of a previously created high-performance storage PVC
```
Run PyTorchJob

```shell
kubectl -n { namespace } apply -f pytorchjob.yaml
```
4. View status and results
To view the training status and results:
View based on Pod Log
You can view the logs of the training pods with the kubectl logs command:

```shell
kubectl -n { namespace } logs pytorch-mnist-dist-nccl-worker-0 pytorch
```
View based on Tensorboard
If your training code writes TensorBoard logs, you can view them through the TensorBoard provided by ML expert Platform.
In the example code, TensorBoard logs are saved under /data/logs.
Save training results
You can use Model Registry to store and manage trained model parameters.
Model parameters can be uploaded automatically through the Model Registry SDK, or you can use Notebook to upload and manage the parameters you need.