ML expert Platform quickstart

Available in VPC

ML expert Platform provides efficient and stable services across the entire AI/ML process, from data management and processing through large-scale distributed training to serving deployments that range from small models to hyperscale AI models. This quickstart walks you step by step through training a model on the FashionMNIST dataset using ML expert Platform. The example trains on multiple nodes, uses a PersistentVolumeClaim (PVC) to store the data, and runs training with PyTorchJob.

1. Create workspace and project

2. Prepare training

Prepare FashionMNIST dataset

This guide is written based on the FashionMNIST dataset provided by Huggingface.

The dataset is assumed to have been prepared in advance in one of the following locations:

Data Manager
Object Storage, Ncloud Storage

Data management location	Recommendations	Note
Data Manager	If you need dataset versioning at the logical level; if you use training code written with the Huggingface interface
Object Storage, Ncloud Storage	If you don't need to manage the dataset; if you don't need the Huggingface interface

For more information about how to manage the dataset with Data Manager, see Upload dataset.

Use training data

You can use the training data in either of the following ways:

Read the dataset remotely using Huggingface DataLoader
Copy the dataset into the selected storage and read it from there

For more information about Huggingface DataLoader-based remote reading, see Read dataset. This guide describes only the second method, which reads from a storage volume.

Create PVC

To use the selected storage in a workspace within ML expert Platform, you must create and mount a PersistentVolumeClaim (PVC). For more information, see Volumes. The following are PVC creation examples for each type of storage provided by ML expert Platform:

The DataDirect Networks (DDN) high-performance storage supports ReadWriteMany (RWX), so you can configure it as follows:

#exa-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
    name: exa-pvc
    namespace: p-{ projectName } # Name of Kubernetes Namespace for the project
spec:
    storageClassName: { High-performance storage's storageClassName } # StorageClass name of the high-performance storage
    accessModes:
        - ReadWriteMany
    resources:
        requests:
            storage: 10Gi # Size of the high-performance storage to be created

kubectl -n {namespace} apply -f exa-pvc.yaml

The local storage (NVMe) is bound to a GPU server, so you must create as many PVCs as there are nodes. If you use the local storage (NVMe), it is recommended to use either emptyDir with an initContainer or emptyDir with the Data Manager DataLoader.

#local-path-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
    name: local-path-pvc
    namespace: p-{ projectName } # Name of Kubernetes Namespace for the project
spec:
    storageClassName: { Local storage's storageClassName } # StorageClass name of the local storage
    accessModes:
        - ReadWriteOnce
    resources:
        requests:
            storage: 10Gi # (default value) Has no effect for the local storage

kubectl -n {namespace} apply -f local-path-pvc.yaml
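
Because the local storage needs one PVC per node, writing each manifest by hand gets repetitive. The following sketch generates one manifest per node as JSON, which kubectl apply -f also accepts. The local-path-pvc-{index} naming scheme and the helper names are assumptions made for illustration, not platform conventions, and the storage class name is still a placeholder you must fill in.

```python
import json


def build_local_pvcs(project_name, storage_class, node_count, size="10Gi"):
    """Return one ReadWriteOnce PVC manifest per node (hypothetical naming)."""
    manifests = []
    for index in range(node_count):
        manifests.append({
            "apiVersion": "v1",
            "kind": "PersistentVolumeClaim",
            "metadata": {
                "name": f"local-path-pvc-{index}",  # assumed naming scheme
                "namespace": f"p-{project_name}",
            },
            "spec": {
                "storageClassName": storage_class,
                "accessModes": ["ReadWriteOnce"],
                "resources": {"requests": {"storage": size}},
            },
        })
    return manifests


def to_kubectl_list(manifests):
    """Wrap manifests in a v1 List so a single file can be applied at once."""
    return json.dumps(
        {"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2
    )


if __name__ == "__main__":
    # "my-project" and "local-path" are placeholder values.
    print(to_kubectl_list(build_local_pvcs("my-project", "local-path", 2)))
```

Redirect the output to a file and apply it with kubectl as in the command above.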

Download data

ML expert Platform provides the storage-initializer image for downloading the dataset. Use a Kubernetes Job to mount the created PVC and download the dataset using the images provided.

The following are examples of using a job to download:

Data Manager usage example

# download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-job
  namespace: p-{ projectName } # Name of Kubernetes Namespace for the project
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      nodeSelector:
        mlx.navercorp.com/zone: { name of the GPU Zone provided } # Name of the zone you can view in GPU Resources
      containers:
        - name: storage-initializer
          image: mlx-public.kr.ncr.ntruss.com/mlx/mdm-storage-initializer:v0.0.1
          env:
          - name: MLX_APIKEY
            value: '{ API Key }' # MLXP API Key
          args:
          - "mlx+data-manager://{ MLX endpoint url }/{workspace}/{dataset}"
          - "/data/dataset" # Path to save the dataset under the mountPath of volumeMounts
          volumeMounts:
            - mountPath: "/data"
              name: storage-volume
      volumes:
        - name: storage-volume
          persistentVolumeClaim:
            claimName: { name of the PVC to mount } # Enter the name of the created PVC (e.g., exa-pvc or local-path-pvc)

kubectl -n {namespace} apply -f download-job.yaml

Object Storage/Ncloud Storage usage example

# download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-job
  namespace: p-{ projectName } # Name of Kubernetes Namespace for the project
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      nodeSelector:
        mlx.navercorp.com/zone: { name of the GPU Zone provided } # Name of the zone you can view in GPU Resources
      containers:
        - name: storage-initializer
          image: mlx-public.kr.ncr.ntruss.com/mlx/kserve/storage-initializer:v0.13.0
          env:
          - name: AWS_ENDPOINT_URL
            value: { S3 Endpoint } # S3 Endpoint
          - name: AWS_ACCESS_KEY_ID
            value: { S3 Access Key }  # S3 Access Key
          - name: AWS_SECRET_ACCESS_KEY
            value: { S3 Secret Key } # S3 Secret Key
          - name: AWS_DEFAULT_REGION
            value: { S3 Region } # S3 Region
          args:
          - "s3://{ Object Storage dataset URL }"
          - "/data/dataset" # Path to save the dataset under the mountPath of volumeMounts
          volumeMounts:
            - mountPath: "/data"
              name: storage-volume
      volumes:
        - name: storage-volume
          persistentVolumeClaim:
            claimName: { name of the PVC to mount } # Enter the name of the created PVC (e.g., exa-pvc or local-path-pvc)

kubectl -n {namespace} apply -f download-job.yaml
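
After the Job completes, it can be useful to confirm that files actually landed on the volume before starting a multi-node training run. The following is a minimal sketch, assuming it runs in a pod or Notebook that mounts the same PVC at /data. The summarize_directory helper is hypothetical and is not part of the storage-initializer images.

```python
import os


def summarize_directory(root):
    """Count files and total bytes under root; returns (file_count, total_bytes)."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, filename))
    return file_count, total_bytes


if __name__ == "__main__":
    dataset_dir = "/data/dataset"  # path passed as the second Job argument
    if not os.path.isdir(dataset_dir):
        print(f"{dataset_dir} does not exist; the download may not have run")
    else:
        count, size = summarize_directory(dataset_dir)
        print(f"{count} files, {size} bytes under {dataset_dir}")
```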

Prepare training code

Use base images

To use InfiniBand, libibverbs.so must be installed in the images used in PyTorchJob. The base images shared by ML expert Platform have all required libraries installed, so it is recommended to build your images on top of them.
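
A quick way to check whether an image contains libibverbs is to ask the dynamic linker from inside the container. The following is a minimal sketch using Python's standard ctypes.util.find_library. The has_libibverbs helper name is hypothetical, and because the lookup depends on tools such as ldconfig being present, a negative result in a stripped-down image is not conclusive.

```python
import ctypes.util


def has_libibverbs():
    """Return True if the dynamic linker can locate libibverbs in this image."""
    return ctypes.util.find_library("ibverbs") is not None


if __name__ == "__main__":
    if has_libibverbs():
        print("libibverbs found")
    else:
        print("libibverbs not found; InfiniBand is unlikely to work with this image")
```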

To use the training feature provided by ML expert Platform, you must write code based on the NVIDIA official PyTorch base image.
The following example code is written based on the use of DataDirect Networks (DDN) high-performance storage.

Training code example

The training code example is as follows:

# mnist_distributed.py
from __future__ import print_function

import argparse
import os
import time

from torch.utils.tensorboard import SummaryWriter
from torchvision.transforms import transforms
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

from mlx.sdk.data import login, load_dataset

WORLD_SIZE = int(os.environ.get("WORLD_SIZE"))
LOCAL_RANK = int(os.environ.get("LOCAL_RANK", 0))
GLOBAL_RANK = int(os.environ.get("RANK"))


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4 * 4 * 50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4 * 4 * 50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(args, model, device, train_loader, optimizer, epoch, writer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if GLOBAL_RANK == 0 and batch_idx % args.log_interval == 0:
            print(
                "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                    epoch,
                    batch_idx * len(data),
                    len(train_loader.dataset) // WORLD_SIZE,
                    100.0 * batch_idx / len(train_loader),
                    loss.item(),
                )
            )
            niter = epoch * len(train_loader) + batch_idx
            writer.add_scalar("loss", loss.item(), niter)


def test(args, model, device, test_loader, writer, epoch):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(
                output, target, reduction="sum"
            ).item()  # sum up batch loss
            pred = output.max(1, keepdim=True)[
                1
            ]  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    accuracy = float(correct) / (len(test_loader.dataset) / WORLD_SIZE)
    if GLOBAL_RANK == 0:
        print("\naccuracy={:.4f}\n".format(accuracy))
        writer.add_scalar("accuracy", accuracy, epoch)

def main():
    parser = argparse.ArgumentParser(description="PyTorch Distributed MNIST Example")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=64,
        metavar="N",
        help="input batch size for training (default: 64)",
    )
    parser.add_argument(
        "--test-batch-size",
        type=int,
        default=1000,
        metavar="N",
        help="input batch size for testing (default: 1000)",
    )
    parser.add_argument(
        "--epochs",
        type=int,
        default=5,
        metavar="N",
        help="number of epochs to train (default: 5)",
    )
    parser.add_argument(
        "--lr",
        type=float,
        default=0.01,
        metavar="LR",
        help="learning rate (default: 0.01)",
    )
    parser.add_argument(
        "--momentum",
        type=float,
        default=0.5,
        metavar="M",
        help="SGD momentum (default: 0.5)",
    )
    parser.add_argument(
        "--seed", type=int, default=1, metavar="S", help="random seed (default: 1)"
    )
    parser.add_argument(
        "--log-interval",
        type=int,
        default=10,
        metavar="N",
        help="how many batches to wait before logging training status",
    )
    parser.add_argument(
        "--checkpoint_path",
        default=f"/data/result/mnist_distributed_{int(time.time())}.pt",
        help="Path to save checkpoint",
    )
    parser.add_argument(
        "--data_path", default="/data/mnist/data", help="Path for training/test data"
    )
    parser.add_argument(
        "--log_path",
        default="/data/log",
        metavar="L",
        help="Directory path where summary logs are stored",
    )
    parser.add_argument(
        "--backend",
        type=str,
        help="Distributed backend",
        choices=[dist.Backend.GLOO, dist.Backend.NCCL, dist.Backend.MPI],
        default=dist.Backend.NCCL,
    )

    args = parser.parse_args()
    login("{ ML expert Platform API Key }")
    writer = SummaryWriter(args.log_path)
    torch.manual_seed(args.seed)

    print("Using distributed PyTorch with {} backend".format(args.backend))
    dist.init_process_group(backend=args.backend)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    def preprocess(examples):
        """properly transform for HuggingFace datasets"""
        if isinstance(examples['image'], list):
            # Batch processing
            examples['image'] = [transform(img) for img in examples['image']]
            examples['label'] = torch.tensor(examples['label'])
        else:
            # Single item
            examples['image'] = transform(examples['image'])
            examples['label'] = torch.tensor(examples['label'])
        return examples['image'], examples['label']

    train_dataset = load_dataset(args.data_path, split="train")
    train_dataset.set_transform(preprocess)
    test_dataset = load_dataset(args.data_path, split="test")
    test_dataset.set_transform(preprocess)

    train_loader = DataLoader(
        train_dataset,
        batch_size=args.batch_size,
        shuffle=False,
        num_workers=1,
        pin_memory=True,
        sampler=DistributedSampler(train_dataset),
    )
    test_loader = torch.utils.data.DataLoader(
        test_dataset,
        batch_size=args.test_batch_size,
        shuffle=False,
        num_workers=1,
        pin_memory=True,
        sampler=DistributedSampler(test_dataset),
    )
    model = Net().to(LOCAL_RANK)

    # Wrap the model with DistributedDataParallel if needed.
    model = DDP(model, device_ids=[LOCAL_RANK])

    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

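    for epoch in range(1, args.epochs + 1):
        # Reshuffle the training shards each epoch; without this call,
        # DistributedSampler reuses the same ordering every epoch.
        train_loader.sampler.set_epoch(epoch)
        train(args, model, LOCAL_RANK, train_loader, optimizer, epoch, writer)
        test(args, model, LOCAL_RANK, test_loader, writer, epoch)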

    if GLOBAL_RANK == 0:
        dir_name = os.path.dirname(args.checkpoint_path)
        if dir_name:
            os.makedirs(dir_name, exist_ok=True)
        torch.save(model.state_dict(), args.checkpoint_path)
        print(f"Checkpoint saved at {args.checkpoint_path}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
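
The example relies on DistributedSampler to give each process its own slice of the dataset, which is why the training log divides the dataset length by WORLD_SIZE. The pure-Python sketch below illustrates the partitioning idea for the non-shuffled case without drop_last. The shard_indices helper is hypothetical and written for illustration only; it is not PyTorch's implementation.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Illustrate how indices are padded and strided across ranks."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    # Pad by repeating from the start so every rank gets the same count.
    while len(indices) < total:
        indices += indices[: total - len(indices)]
    # Each rank takes every world_size-th index, starting at its own rank.
    return indices[rank:total:world_size]


if __name__ == "__main__":
    for rank in range(4):
        print(rank, shard_indices(10, 4, rank))
```

Because of the padding, a few samples can be seen by more than one rank when the dataset size is not divisible by the number of processes, so per-rank metrics such as the accuracy in the example are approximations.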

Dockerfile example

FROM nvcr.io/nvidia/pytorch:23.03-py3

# Install tensorboardX
USER root
RUN pip install --no-cache-dir tensorboardX==2.6.2
RUN mkdir -p /opt/mnist/src

WORKDIR /opt/mnist/src

# UID 500 and GID 500 permissions are required to use the high-performance storage
USER 500:500
COPY mnist_distributed.py /opt/mnist/src/mnist_distributed.py
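
Since the high-performance storage requires UID 500 and GID 500, a start-up check inside the container can surface a misconfigured image before training begins. The following is a minimal sketch, assuming a Linux container. The check_identity helper is hypothetical and not something the platform provides.

```python
import os

REQUIRED_UID = 500  # matches USER 500:500 in the Dockerfile
REQUIRED_GID = 500


def check_identity(uid, gid, required_uid=REQUIRED_UID, required_gid=REQUIRED_GID):
    """Return a list of human-readable problems; empty means the IDs match."""
    problems = []
    if uid != required_uid:
        problems.append(f"UID is {uid}, expected {required_uid}")
    if gid != required_gid:
        problems.append(f"GID is {gid}, expected {required_gid}")
    return problems


if __name__ == "__main__":
    # Compare the IDs of the current process with the required ones.
    for problem in check_identity(os.getuid(), os.getgid()):
        print(problem)
```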

Set Container Registry access information

To pull images stored in Container Registry from ML expert Platform, you must set the access information.

Create image access information Secret

3. Start training

It is recommended to use Elastic Launch (torchrun), officially provided by PyTorch. The following example uses torchrun with the DataDirect Networks (DDN) high-performance storage.

Create PyTorchJob

For distributed training, define the PyTorchJob as follows:

# pytorchjob.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
    name: pytorch-mnist-dist-nccl
    namespace: p-{ projectName } # Name of Kubernetes Namespace for the project
spec:
    elasticPolicy:
        rdzvId: mnist
        rdzvBackend: c10d
        minReplicas: 2
        maxReplicas: 2
        nProcPerNode: 8
    runPolicy:
        cleanPodPolicy: None
    pytorchReplicaSpecs:
        Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
                metadata:
                    annotations:
                        sidecar.istio.io/inject: "false"  # Required to disable Istio sidecar injection
                spec:
                    nodeSelector:
                        mlx.navercorp.com/zone: { name of the GPU Zone provided } # Name of the zone you can view in GPU Resources
                    containers:
                        - name: pytorch  # The PyTorchJob container must be named pytorch
                          image: examples.com/pytorch-mnist-dist:23.03-py3 # Container image of the example code
                          imagePullPolicy: Always
                          securityContext:  # securityContext is required to use InfiniBand
                              capabilities:
                                  add: ["IPC_LOCK"]
                          command: ["bash", "-c"]
                          args:
                              - >
                                torchrun --nnodes ${PET_NNODES} --nproc_per_node ${PET_NPROC_PER_NODE} --rdzv_id ${PET_RDZV_ID} --rdzv_backend ${PET_RDZV_BACKEND} --rdzv_endpoint ${PET_RDZV_ENDPOINT}
                                /opt/mnist/src/mnist_distributed.py --checkpoint_path /data/checkpoints/mnist.pt --log_path /data/logs --data_path /data/dataset
                          env:
                              - name: NCCL_DEBUG
                                value: INFO
                          resources:
                              limits:
                                  memory: "1Ti"
                                  cpu: 120
                                  nvidia.com/gpu: 8
                                  rdma/hca_shared_devices_a: 1
                              requests:
                                  memory: "8Gi"
                                  cpu: 120
                                  nvidia.com/gpu: 8
                                  rdma/hca_shared_devices_a: 1
                          volumeMounts:
                              # shared memory
                              - mountPath: /dev/shm
                                name: shared-memory
                              - mountPath: "/data"
                                name: storage-volume
                    volumes:
                        - emptyDir:
                              medium: Memory
                          name: shared-memory
                        - name: storage-volume
                          persistentVolumeClaim:
                              claimName: exa-pvc # Name of a previously created high-performance storage PVC
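
With replicas: 2 and nProcPerNode: 8, the job starts 16 training processes in total, and the PET_* variables referenced in args carry the elastic settings to torchrun. The sketch below reproduces that arithmetic and the command assembly. The helper names are hypothetical, and the sample environment values (including the rendezvous endpoint) are made up for illustration, since the actual values are injected into the pod at runtime.

```python
def world_size(replicas, nproc_per_node):
    """Total number of training processes across all worker pods."""
    return replicas * nproc_per_node


def torchrun_args(env, script, script_args):
    """Assemble the torchrun argument list from PET_* style variables."""
    return [
        "torchrun",
        "--nnodes", env["PET_NNODES"],
        "--nproc_per_node", env["PET_NPROC_PER_NODE"],
        "--rdzv_id", env["PET_RDZV_ID"],
        "--rdzv_backend", env["PET_RDZV_BACKEND"],
        "--rdzv_endpoint", env["PET_RDZV_ENDPOINT"],
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Illustrative values only; the real ones are set inside the pod.
    sample_env = {
        "PET_NNODES": "2",
        "PET_NPROC_PER_NODE": "8",
        "PET_RDZV_ID": "mnist",
        "PET_RDZV_BACKEND": "c10d",
        "PET_RDZV_ENDPOINT": "example-worker-0:29400",  # hypothetical endpoint
    }
    print("world size:", world_size(2, 8))
    print(" ".join(torchrun_args(
        sample_env,
        "/opt/mnist/src/mnist_distributed.py",
        ["--data_path", "/data/dataset"],
    )))
```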

Run PyTorchJob

kubectl -n { namespace } apply -f pytorchjob.yaml

4. View status and results

To view the training status and results:

View based on Pod Log

The logs of the training pods can be viewed using the kubectl logs command.

kubectl -n { namespace } logs pytorch-mnist-dist-nccl-worker-0 pytorch
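
With NCCL_DEBUG set to INFO, the pod log mixes NCCL messages with the training output, so filtering for the training lines can help. The following sketch parses the "Train Epoch" line format printed by the example training code. The parse_train_line helper and the idea of piping kubectl logs into a script are assumptions for illustration.

```python
import re
import sys

# Matches lines such as: Train Epoch: 1 [640/3750 (17%)]<TAB>loss=0.4321
LOG_PATTERN = re.compile(
    r"Train Epoch: (\d+) \[(\d+)/(\d+) \((\d+)%\)\]\s+loss=(\S+)"
)


def parse_train_line(line):
    """Return (epoch, samples_seen, loss) for a training log line, else None."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(5))


if __name__ == "__main__":
    # Read log lines from standard input and print only the parsed ones.
    for line in sys.stdin:
        parsed = parse_train_line(line)
        if parsed is not None:
            print(parsed)
```

To use it, save the script under any name and pipe the output of the kubectl logs command above into it.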

View based on Tensorboard

If your training code writes logs for Tensorboard, you can view them through the Tensorboard provided by ML expert Platform.
In the example code, Tensorboard logs are saved in /data/logs.

Create Tensorboards

Save training results

You can use Model Registry to store and manage training parameters.
You can upload training parameters automatically through the Model Registry SDKs, or use Notebook to upload and manage the training parameters you need.

Use Model Registry