Data Manager


Available in VPC

This page describes the Data Manager interface. In Data Manager, you can view the list of datasets in the workspace and their details.

Note
  • A dataset uploaded to the data manager can be referenced from different projects within the workspace.
  • In the data manager interface, you can only view the dataset list and details.
  • Use the ML Expert Platform SDKs for tasks such as uploading or deleting datasets and creating tags and branches in the data manager.

View data manager list

The dataset list displays the following information:


  • Dataset Title: name set when uploading the dataset.
  • Creation date and time: time the dataset was first created.
  • Operation: click [dataset detail] to go to the details view.

View data manager details

You can view the details about the dataset you selected. The details are divided into tabs.

Overview

View the metadata of the dataset you selected.

Files and Versions

You can view the file list of the selected dataset, organized by directory.

Use data manager SDKs

The data manager SDKs support the Hugging Face Datasets interface and are based on Python.
To upload or download a dataset through the SDKs:

Install SDKs

You can install the SDKs with the following command:

pip install "ncloud-mlx[data-manager]" # double quotes are required

Prerequisites

To use the SDKs, you must create an API key, and the MLX endpoint is required. Enter the created API key to complete the prerequisites. You can set the endpoint URL via the MLX_ENDPOINT_URL environment variable.

from mlx.sdk.data import login

login("{ API Key }") # MLXP API Key
login("{ API Key }", "{ MLX endpoint }") # Pass the endpoint URL at login instead of using the environment variable
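As an alternative to passing the endpoint at login, the environment variable mentioned above can be set from Python itself before calling login. A minimal sketch; the URL below is a placeholder, not a real endpoint:

```python
import os

# Set the MLX endpoint via the MLX_ENDPOINT_URL environment variable
# before calling login(). "https://mlx.example.com" is a placeholder.
os.environ["MLX_ENDPOINT_URL"] = "https://mlx.example.com"

# login("{ API Key }") would now pick up the endpoint from the environment.
print(os.environ["MLX_ENDPOINT_URL"])
```

Setting the variable in the shell (for example, `export MLX_ENDPOINT_URL=...`) before starting Python has the same effect.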

Read dataset

To use the dataset in the training logic, you must load it using the dataset class. For more information, see the official Hugging Face Python SDK documentation.

To load a dataset from the local system:

from mlx.sdk.data import load_dataset
ds = load_dataset(
    "{ path of the data in the local system }" #  Local data path e.g., "path/to/folder/*"
)

To load the dataset managed in the data manager:

from mlx.sdk.data import load_dataset
ds = load_dataset(
    "{ workspace name }/{ dataset name }" # Dataset location e.g., "workspaceA/datasetA"
)

Upload dataset

The dataset is uploaded in the same way as with the Hugging Face Datasets interface. For more information, see the official Hugging Face Python SDK documentation.

The typical upload methods are as follows:

push_to_hub

...
ds.push_to_hub(
    repo_id="{ workspace name }/{ dataset name }"
)
...

upload_file

from huggingface_hub import create_repo, upload_file

path = "{ workspace name }/{ dataset name }" # Location of the dataset to upload
create_repo(repo_id=path, repo_type="dataset")
upload_file(
    repo_id=path,
    path_or_fileobj="{ local file path }", # Local file path to upload
    path_in_repo="path/to/folder/foo.csv", # Path of the remote file in the dataset
    repo_type="dataset",
)

upload_folder

from huggingface_hub import create_repo, upload_folder

path = "{ workspace name }/{ dataset name }" # Location of the dataset to upload
create_repo(repo_id=path, repo_type="dataset")
upload_folder(
    repo_id=path,
    folder_path="{ local directory path }", # Local directory path to upload
    path_in_repo="path/to/folder", # Path of the remote folder in the dataset
    repo_type="dataset",
)

Download dataset

To download the dataset to the local disk:

from huggingface_hub import snapshot_download

path = "{ workspace name }/{ dataset name }" # Location of the dataset to download
snapshot_download(
    repo_id=path,
    repo_type="dataset",
    local_dir="path/to/folder", # Directory path to download
    local_dir_use_symlinks="auto" # Whether to use symlinks pointing to cache_dir
)

Create tag and branch

Once you create a dataset, a unique commit ID is assigned. Using this commit ID, you can read a specific revision of the dataset or attach tags carrying additional information.

To create a tag:

from huggingface_hub import create_tag

path = "{ workspace name }/{ dataset name }" 
create_tag(
    repo_id=path,
    repo_type="dataset",
    tag="{ name of the tag to be created }",
    revision="{ revision }",  # Baseline version. The default value is main.
    tag_message="{ tag message }"
)

Metadata such as tag messages is immutable and cannot be edited; however, you can delete the tag and recreate it. To delete a tag:

from huggingface_hub import delete_tag

path = "{ workspace name }/{ dataset name }" 
delete_tag(
    repo_id=path,
    repo_type="dataset",
    tag="{ name of the tag to be deleted }"
)

To create a branch:

from huggingface_hub import create_branch

path = "{ workspace name }/{ dataset name }" 
create_branch(
    repo_id=path,
    repo_type="dataset",
    branch="{ name of the branch to be created }",
    revision="{ revision }"
)
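Once a tag or branch exists, it can be used when reading the dataset. Assuming the SDK's load_dataset accepts a revision argument, mirroring the Hugging Face datasets library convention (the placeholders below are illustrative), reading a specific revision might look like this:

```python
from mlx.sdk.data import load_dataset

# Hypothetical sketch: select a revision by tag, branch, or commit ID.
ds = load_dataset(
    "{ workspace name }/{ dataset name }",
    revision="{ tag, branch, or commit ID }" # Defaults to main when omitted
)
```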

Delete dataset

Caution

Note that once you delete a dataset, it cannot be recovered.

from huggingface_hub import delete_repo

path = "{ workspace name }/{ dataset name }" # Dataset to be deleted
delete_repo(
    repo_id=path,
    repo_type="dataset"
)