ML expert Platform quickstart

Available in VPC

ML expert Platform provides efficient and reliable services across the entire AI/ML workflow, from data management and processing to large-scale distributed training, serving, and deployment, for models ranging from small models to ultra-large AI models. This step-by-step guide trains a model on the FashionMNIST dataset using ML expert Platform. The example runs multi-node distributed training with PyTorchJob and stores data on a persistent volume claim (PVC).

1. Create a Workspace and Project

2. Prepare for training

Prepare the FashionMNIST dataset

This guide is based on the FashionMNIST dataset provided by Huggingface.

We assume that the dataset is prepared in advance as follows:

  • Data Manager
  • Object Storage, Ncloud Storage

Recommended data location by use case:

Data Manager, recommended when:
  • You version datasets by logical unit
  • Your training code is written with the Huggingface interface

Object Storage or Ncloud Storage, recommended when:
  • Dataset management is not required
  • The Huggingface interface is not required

To manage the dataset with Data Manager, see Upload dataset.

Use training data

You can access training data by:

  • Remotely reading the data using Huggingface DataLoader.
  • Copying the dataset to the selected storage, then reading it from there.

For more information on remotely reading data using Huggingface DataLoader, see Read datasets. The following sections describe only the second method: copying the dataset to a volume and reading it from there.

PVC creation

To use the selected storage in your ML expert Platform environment, create and mount a persistent volume claim (PVC). For details, see Volumes. The following examples show how to create PVCs for each storage type supported by ML expert Platform.

High-performance storage (DDN) supports the ReadWriteMany (RWX) access mode. You can configure the storage as shown in the examples below.

Caution
  • When using mounted high-performance storage, the UID and GID in the training image must be set to 500.
  • Do not set fsGroup in the Pod securityContext. Setting fsGroup causes Kubernetes to recursively change the ownership of all files in the mounted volume. For high-performance storage containing large amounts of data, this can make Pod initialization extremely slow.
#exa-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
    name: exa-pvc
    namespace: p-{ projectName } # Kubernetes Namespace name for the project
spec:
    storageClassName: { high-performance storageClassName } # StorageClass name for high-performance storage
    accessModes:
        - ReadWriteMany
    resources:
        requests:
            storage: 10Gi # Capacity for high-performance storage

kubectl -n {namespace} apply -f exa-pvc.yaml
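
The UID/GID requirement in the caution above can be checked from inside a running container before training starts. The following is a minimal sketch, not a platform-provided tool: the helper names ids_match and can_write are hypothetical, and /data is assumed to be the mount path because it is the path used in the examples in this guide.

```python
import os
import tempfile

# UID and GID required for mounted high-performance storage (from the caution above).
REQUIRED_ID = 500


def ids_match(uid: int, gid: int, required: int = REQUIRED_ID) -> bool:
    """Return True when both the UID and the GID equal the required value."""
    return uid == required and gid == required


def can_write(directory: str) -> bool:
    """Return True when a temporary file can be created in the directory."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers a missing directory as well as missing write permission.
        return False


if __name__ == "__main__":
    mount_path = "/data"  # assumed mount path, as used in this guide's examples
    print("uid/gid ok:", ids_match(os.getuid(), os.getgid()))
    print("writable:", can_write(mount_path))
```

Running this as the first step of a container command surfaces permission problems early, instead of at the first checkpoint write.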

Because local storage (NVMe) is bound to specific GPU servers, you must create a separate PVC for each node. For local storage (NVMe), we recommend using EmptyDir with initContainer or EmptyDir with Data Manager DataLoader.

#local-path-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
    name: local-path-pvc
    namespace: p-{ projectName } # Kubernetes Namespace name for the project
spec:
    storageClassName: { local storage storageClassName } # StorageClass name for local storage
    accessModes:
        - ReadWriteOnce
    resources:
        requests:
            storage: 10Gi # Default value; the requested capacity is not applied to local storage

kubectl -n {namespace} apply -f local-path-pvc.yaml
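
Because local storage needs one PVC per node, writing each manifest by hand becomes repetitive. The sketch below generates one manifest per node and prints them as a single JSON List object, which kubectl apply -f accepts in the same way as YAML. The node names, the namespace p-demo, and the storage class name are placeholders, and the helper names are hypothetical. How a PVC ends up bound to a specific node depends on the storage class and is not handled here.

```python
import json
from typing import Dict, List


def local_pvc_manifest(node: str, namespace: str, storage_class: str) -> Dict:
    """Build one PVC manifest, named after the node it is intended for."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"local-path-pvc-{node}", "namespace": namespace},
        "spec": {
            "storageClassName": storage_class,
            "accessModes": ["ReadWriteOnce"],
            # Same default value as the example manifest in this guide.
            "resources": {"requests": {"storage": "10Gi"}},
        },
    }


def build_pvc_list(nodes: List[str], namespace: str, storage_class: str) -> Dict:
    """Wrap one PVC per node in a Kubernetes List object."""
    items = [local_pvc_manifest(n, namespace, storage_class) for n in nodes]
    return {"apiVersion": "v1", "kind": "List", "items": items}


if __name__ == "__main__":
    # Placeholder values; replace them with your own nodes, namespace, and StorageClass.
    manifest = build_pvc_list(["gpu-node-1", "gpu-node-2"], "p-demo", "local-path")
    print(json.dumps(manifest, indent=2))
```

The output can be saved to a file, for example local-path-pvcs.json, and applied with kubectl -n {namespace} apply -f local-path-pvcs.json.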

Download data

ML expert Platform provides a storage-initializer image for downloading datasets. Use a Kubernetes Job to mount the created PVC, and then download the dataset with the provided image.

The following shows an example of downloading data using a Job.

Data Manager usage example

# download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-job
  namespace: p-{ projectName } # Kubernetes Namespace name for the project
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      nodeSelector:
        mlx.navercorp.com/zone: { GPU Zone name } # Zone name as shown in GPU Resources
      containers:
        - name: storage-initializer
          image: mlx-public.kr.ncr.ntruss.com/mlx/mdm-storage-initializer:0.0.5
          env:
          - name: MLX_APIKEY
            value: '{ API Key }' # MLXP API Key
          args:
          - "mlx+data-manager://{ MLX endpoint url }/{workspace}/{dataset}"
          - "/data/dataset" # Path for saving the dataset within the spec.volumeMounts.mountPath
          volumeMounts:
            - mountPath: "/data"
              name: storage-volume
      volumes:
        - name: storage-volume
          persistentVolumeClaim:
            claimName: { PVC name to mount } # Enter the name of the PVC you created (e.g. exa-pvc, local-path-pvc)

kubectl -n {namespace} apply -f download-job.yaml

Object Storage / Ncloud Storage usage example

# download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-job
  namespace: p-{ projectName } # Kubernetes Namespace name for the project
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      nodeSelector:
        mlx.navercorp.com/zone: { GPU Zone name } # Zone name as shown in GPU Resources
      containers:
        - name: storage-initializer
          image: mlx-public.kr.ncr.ntruss.com/mlx/kserve/storage-initializer:v0.13.0
          env:
          - name: AWS_ENDPOINT_URL
            value: { S3 Endpoint } # S3 Endpoint
          - name: AWS_ACCESS_KEY_ID
            value: { S3 Access Key }  # S3 Access Key
          - name: AWS_SECRET_ACCESS_KEY
            value: { S3 Secret Key } # S3 Secret Key
          - name: AWS_DEFAULT_REGION
            value: { S3 Region } # S3 Region
          args:
          - "s3://{ Object Storage dataset URL }"
          - "/data/dataset" # Path for saving the dataset within the spec.volumeMounts.mountPath
          volumeMounts:
            - mountPath: "/data"
              name: storage-volume
      volumes:
        - name: storage-volume
          persistentVolumeClaim:
            claimName: { PVC name to mount } # Enter the name of the PVC you created (e.g. exa-pvc, local-path-pvc)

kubectl -n {namespace} apply -f download-job.yaml
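
After the download Job completes, it can be useful to confirm that the dataset actually landed on the volume. The following sketch counts the files and bytes under a directory. It assumes it runs in a Pod that mounts the same PVC at /data, so that the dataset is at /data/dataset as in the Job arguments above. The function name summarize_directory is hypothetical.

```python
import os
from typing import Tuple


def summarize_directory(path: str) -> Tuple[int, int]:
    """Return (file_count, total_bytes) for all files under path, recursively."""
    file_count = 0
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(root, name))
    return file_count, total_bytes


if __name__ == "__main__":
    dataset_path = "/data/dataset"  # target path used by the download Job in this guide
    count, size = summarize_directory(dataset_path)
    print(f"{dataset_path}: {count} files, {size} bytes")
```

A result of 0 files usually means the Job has not finished, failed, or wrote to a different PVC than the one mounted here.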

Prepare training code

Use the base image

To use InfiniBand, the image used in your PyTorchJob must have libibverbs.so installed. The base image provided by ML expert Platform already includes all necessary libraries, so we recommend building your custom image on top of it.

To run training on ML expert Platform, write your code so that it runs on the official NVIDIA PyTorch base image.
The example code below is designed for use with high-performance storage (DDN).
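
To confirm that a custom image satisfies the libibverbs requirement, you can run a quick check inside the container. The sketch below uses the Python standard library's ctypes.util.find_library, which searches the usual system library locations. It may miss a library installed in a non-standard path, so treat a negative result as a hint rather than proof. The function name find_ibverbs is hypothetical.

```python
import ctypes.util
from typing import Optional


def find_ibverbs() -> Optional[str]:
    """Return the library name resolved for libibverbs, or None if it was not found."""
    return ctypes.util.find_library("ibverbs")


if __name__ == "__main__":
    found = find_ibverbs()
    if found:
        print(f"libibverbs found: {found}")
    else:
        print("libibverbs not found in the standard library search paths")
```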

Example training code

The example training code is as follows:

# mnist_distributed.py
from __future__ import print_function

import argparse
import os
import time

from torch.utils.tensorboard import SummaryWriter
from torchvision.transforms import transforms
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

from mlx.sdk.data import login, load_dataset

WORLD_SIZE = int(os.environ.get("WORLD_SIZE"))
LOCAL_RANK = int(os.environ.get("LOCAL_RANK", 0))
GLOBAL_RANK = int(os.environ.get("RANK"))


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4 * 4 * 50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4 * 4 * 50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(args, model, device, train_loader, optimizer, epoch, writer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if GLOBAL_RANK == 0 and batch_idx % args.log_interval == 0:
            print(
                "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                    epoch,
                    batch_idx * len(data),
                    len(train_loader.dataset) // WORLD_SIZE,
                    100.0 * batch_idx / len(train_loader),
                    loss.item(),
                )
            )
            niter = epoch * len(train_loader) + batch_idx
            writer.add_scalar("loss", loss.item(), niter)


def test(args, model, device, test_loader, writer, epoch):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(
                output, target, reduction="sum"
            ).item()  # sum up batch loss
            pred = output.max(1, keepdim=True)[
                1
            ]  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    accuracy = float(correct) / (len(test_loader.dataset) / WORLD_SIZE)
    if GLOBAL_RANK == 0:
        print("\naccuracy={:.4f}\n".format(accuracy))
        writer.add_scalar("accuracy", accuracy, epoch)

def main():
    parser = argparse.ArgumentParser(description="PyTorch Distributed MNIST Example")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=64,
        metavar="N",
        help="input batch size for training (default: 64)",
    )
    parser.add_argument(
        "--test-batch-size",
        type=int,
        default=1000,
        metavar="N",
        help="input batch size for testing (default: 1000)",
    )
    parser.add_argument(
        "--epochs",
        type=int,
        default=5,
        metavar="N",
        help="number of epochs to train (default: 5)",
    )
    parser.add_argument(
        "--lr",
        type=float,
        default=0.01,
        metavar="LR",
        help="learning rate (default: 0.01)",
    )
    parser.add_argument(
        "--momentum",
        type=float,
        default=0.5,
        metavar="M",
        help="SGD momentum (default: 0.5)",
    )
    parser.add_argument(
        "--seed", type=int, default=1, metavar="S", help="random seed (default: 1)"
    )
    parser.add_argument(
        "--log-interval",
        type=int,
        default=10,
        metavar="N",
        help="how many batches to wait before logging training status",
    )
    parser.add_argument(
        "--checkpoint_path",
        default=f"/data/result/mnist_distributed_{int(time.time())}.pt",
        help="Path to save checkpoint",
    )
    parser.add_argument(
        "--data_path", default="/data/mnist/data", help="Path for training/test data"
    )
    parser.add_argument(
        "--log_path",
        default="/data/log",
        metavar="L",
        help="Directory path where summary logs are stored",
    )
    parser.add_argument(
        "--backend",
        type=str,
        help="Distributed backend",
        choices=[dist.Backend.GLOO, dist.Backend.NCCL, dist.Backend.MPI],
        default=dist.Backend.NCCL,
    )

    args = parser.parse_args()
    login("{ ML expert Platform API Key }")
    writer = SummaryWriter(args.log_path)
    torch.manual_seed(args.seed)

    print("Using distributed PyTorch with {} backend".format(args.backend))
    dist.init_process_group(backend=args.backend)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    def preprocess(examples):
        """Correct transform for HuggingFace datasets"""
        if isinstance(examples['image'], list):
            # Batch processing
            examples['image'] = [transform(img) for img in examples['image']]
            examples['label'] = torch.tensor(examples['label'])
        else:
            # Single item
            examples['image'] = transform(examples['image'])
            examples['label'] = torch.tensor(examples['label'])
        return examples['image'], examples['label']

    train_dataset = load_dataset(args.data_path, split="train")
    train_dataset.set_transform(preprocess)
    test_dataset = load_dataset(args.data_path, split="test")
    test_dataset.set_transform(preprocess)

    train_loader = DataLoader(
        train_dataset,
        batch_size=args.batch_size,
        shuffle=False,
        num_workers=1,
        pin_memory=True,
        sampler=DistributedSampler(train_dataset),
    )
    test_loader = torch.utils.data.DataLoader(
        test_dataset,
        batch_size=args.test_batch_size,
        shuffle=False,
        num_workers=1,
        pin_memory=True,
        sampler=DistributedSampler(test_dataset),
    )
    model = Net().to(LOCAL_RANK)

    # Wrap the model with DistributedDataParallel if needed.
    model = DDP(model, device_ids=[LOCAL_RANK])

    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    for epoch in range(1, args.epochs + 1):
        train(args, model, LOCAL_RANK, train_loader, optimizer, epoch, writer)
        test(args, model, LOCAL_RANK, test_loader, writer, epoch)

    if GLOBAL_RANK == 0:
        dir_name = os.path.dirname(args.checkpoint_path)
        if dir_name:
            os.makedirs(dir_name, exist_ok=True)
        torch.save(model.state_dict(), args.checkpoint_path)
        print(f"Checkpoint saved at {args.checkpoint_path}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
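
The example divides dataset sizes by WORLD_SIZE when it prints progress and accuracy because DistributedSampler gives each process only its own share of the data. The following pure-Python sketch illustrates how indices are split when shuffling is disabled and drop_last is False. It is an illustration of the partitioning rule, not the PyTorch implementation itself, and the function name shard_indices is hypothetical. Note that the example code above uses the sampler's default settings, which also shuffle the indices before splitting.

```python
import math
from typing import List


def shard_indices(dataset_len: int, world_size: int, rank: int) -> List[int]:
    """Return the dataset indices assigned to one rank (no shuffle, no drop_last)."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    pad = total - dataset_len
    if pad > 0:
        # Repeat indices from the start so that every rank gets the same count.
        indices += (indices * math.ceil(pad / dataset_len))[:pad]
    # Each rank takes every world_size-th index, starting at its own rank.
    return indices[rank:total:world_size]


if __name__ == "__main__":
    for rank in range(4):
        print(rank, shard_indices(10, 4, rank))
```

With 10 samples and 4 processes, every process receives 3 indices, and two samples are repeated to fill the last slots. This padding is why per-rank metrics are approximate when the dataset size is not divisible by the number of processes.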

Example Dockerfile

FROM nvcr.io/nvidia/pytorch:23.03-py3

# Install tensorboardX
USER root
RUN pip install --no-cache-dir tensorboardX==2.6.2
RUN mkdir -p /opt/mnist/src

WORKDIR /opt/mnist/src

# UID 500 and GID 500 are required for high-performance storage access
USER 500:500
COPY mnist_distributed.py /opt/mnist/src/mnist_distributed.py

Configure container registry access information

To pull images from the container registry to ML expert Platform, you need to configure access information.

3. Start training

We recommend using Elastic Launch, which is officially provided by PyTorch. The following example is based on torchrun and uses high-performance storage (DDN).

Create a PyTorchJob

For distributed training, configure the PyTorchJob as follows:

# pytorchjob.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
    name: pytorch-mnist-dist-nccl
    namespace: p-{ projectName } # Kubernetes Namespace name for the project
spec:
    elasticPolicy:
        rdzvId: mnist
        rdzvBackend: c10d
        minReplicas: 2
        maxReplicas: 2
        nProcPerNode: 8
    runPolicy:
        cleanPodPolicy: None
    pytorchReplicaSpecs:
        Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
                metadata:
                    annotations:
                        sidecar.istio.io/inject: "false"  # Must disable Istio sidecar injection
                spec:
                    nodeSelector:
                        mlx.navercorp.com/zone: { GPU Zone name } # Zone name as shown in GPU Resources
                    containers:
                        - name: pytorch # The container name for PyTorchJob must be pytorch
                          image: examples.com/pytorch-mnist-dist:23.03-py3 # Container image for the example code
                          imagePullPolicy: Always
                          securityContext: # securityContext configuration is required for InfiniBand
                              capabilities:
                                  add: ["IPC_LOCK"]
                          command: ["bash", "-c"]
                          args:
                              - >
                                torchrun --nnodes ${PET_NNODES} --nproc_per_node ${PET_NPROC_PER_NODE} --rdzv_id ${PET_RDZV_ID} --rdzv_backend ${PET_RDZV_BACKEND} --rdzv_endpoint ${PET_RDZV_ENDPOINT}
                                /opt/mnist/src/mnist_distributed.py --checkpoint_path /data/checkpoints/mnist.pt --log_path /data/logs --data_path /data/dataset
                          env:
                              - name: NCCL_DEBUG
                                value: INFO
                          resources:
                              limits:
                                  memory: "1Ti"
                                  cpu: 120
                                  nvidia.com/gpu: 8
                              requests:
                                  memory: "8Gi"
                                  cpu: 120
                                  nvidia.com/gpu: 8
                          volumeMounts:
                              - mountPath: /dev/shm # shared memory
                                name: shared-memory
                              - mountPath: "/data"
                                name: storage-volume
                    volumes:
                        - name: shared-memory
                          emptyDir:
                              medium: Memory
                        - name: storage-volume
                          persistentVolumeClaim:
                              claimName: exa-pvc # Name of the high-performance storage PVC created previously

Run the PyTorchJob

kubectl -n { namespace } apply -f pytorchjob.yaml
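
The PyTorchJob above starts 2 workers with 8 processes each, so the training script sees a WORLD_SIZE of 16. The sketch below shows the usual arithmetic relating node rank, LOCAL_RANK, and RANK, assuming every node runs the same number of processes. The function names are hypothetical, and which Pod receives which node rank is decided at rendezvous time, so do not rely on a fixed mapping between Pod names and ranks.

```python
def world_size(nnodes: int, nproc_per_node: int) -> int:
    """Total number of training processes across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Global rank of a process, assuming the same process count on every node."""
    return node_rank * nproc_per_node + local_rank


if __name__ == "__main__":
    nnodes, nproc = 2, 8  # replicas and nProcPerNode from the PyTorchJob above
    print("WORLD_SIZE:", world_size(nnodes, nproc))
    print("RANK of local rank 3 on node 1:", global_rank(1, 3, nproc))
```

In the example training code, only the process with RANK 0 prints progress, writes Tensorboard scalars, and saves the checkpoint.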

4. Check status and results

To check the status and results of your training:

Check using Pod logs

You can check the logs of a training Pod using the kubectl logs command.

kubectl -n { namespace } logs pytorch-mnist-dist-nccl-worker-0 -c pytorch
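
Because NCCL_DEBUG is set to INFO, the training lines are mixed with NCCL output in the Pod logs. The following sketch extracts the epoch, sample count, and loss from lines that match the format printed by the example train() function and ignores everything else. The function name parse_loss_lines is hypothetical, and the sample line in the usage section is made up for illustration; it is not real output.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines such as: "Train Epoch: 1 [640/3750 (17%)]\tloss=0.5321"
TRAIN_LINE = re.compile(
    r"Train Epoch: (\d+) \[(\d+)/\d+ \(\d+%\)\]\s+loss=([0-9]*\.?[0-9]+)"
)


def parse_loss_lines(lines: Iterable[str]) -> List[Tuple[int, int, float]]:
    """Return (epoch, samples_seen, loss) for every training progress line."""
    results = []
    for line in lines:
        match = TRAIN_LINE.search(line)
        if match:
            epoch, seen, loss = match.groups()
            results.append((int(epoch), int(seen), float(loss)))
    return results


if __name__ == "__main__":
    # Made-up sample lines; pipe real `kubectl logs` output in instead.
    sample = [
        "Train Epoch: 1 [640/3750 (17%)]\tloss=0.5321",
        "NCCL INFO unrelated line",
    ]
    for row in parse_loss_lines(sample):
        print(row)
```

Only the process with RANK 0 prints these lines, so if a worker's log contains none, check the logs of the other worker Pod.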

Check using Tensorboard

If your training code logs data for Tensorboard, you can view the information through Tensorboard provided by ML expert Platform.
In the example code, Tensorboard logs are saved in /data/logs.

Save training results

Model Registry lets you store and manage training parameters.
To save training parameters, you can upload them automatically through Model Registry SDKs or upload only the necessary parameters manually through Notebook.