Prepare dataset

Available in Classic and VPC

This section describes the dataset specifications, how to create a dataset, and creation examples. A dataset is a collection of data used to train a language model so that it is optimized for tasks tailored to your requirements.

To train with the HyperCLOVA X language model, prepare the dataset according to the instruction dataset guide.

Caution

You bear full responsibility for any issues or outcomes arising from uploading datasets that contain personal information.

Instruction dataset

The instruction dataset, designed to harness the capabilities of HyperCLOVA X, prioritizes quality over quantity. Detailed and lengthy entries, especially ones that specify the desired answer format, can improve tuning performance. The required amount of data varies by task, but at least 400 sets of data (Text and Completion pairs) are recommended to improve tuning performance. If HyperCLOVA X has not yet been trained on the domain, more data may be needed.

Instruction dataset file specifications

The file specifications for the instruction dataset are as follows:

Item	Description
Extensions	CSV or JSONL
File encoding format	UTF-8
Minimum data	Minimum: 100 rows; recommended: 1,000 to 100,000 rows
File size	Up to 1 TB (available for Ncloud Object Storage)

Instruction dataset template

The fields constituting the instruction dataset are as follows:

Field	Description
System_Prompt	Specific instructions to be performed by CLOVA Studio (optional).
C_ID	Conversation ID. This is a number assigned to a conversation scenario composed of the same topic. Starts from 0 and increases by 1.
T_ID	Turn ID. This is a number assigned to each pair of question (text) and answer (completion) within 1 conversation scenario. Starts from 0 and increases by 1.
Text	All utterances expected from the user.
Completion	All responses expected from CLOVA Studio.

Note

Ensure that each line of data (System_Prompt (optional), text, completion pair) is limited to 8,000 characters or less, including spaces. If it exceeds 8,000 characters, only a portion of the dataset is uploaded.
For a document classification task dataset, enter a minimum of 200 rows per classification category.
Example: 30% positive (300 cases), 30% negative (300 cases), and 40% neutral (400 cases)
For multi-classification task datasets, you can enter up to 15 classification labels.
It is recommended to use only 1 word for classification labels. Spacing and special characters are not permitted.
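
The limits in this note can be checked before upload. The following is a minimal sketch, not an official validator: the helper names are hypothetical, the row length is assumed to be the sum of the System_Prompt, Text, and Completion lengths, and the classification label is assumed to be stored in the Completion field.

```python
from collections import Counter

MAX_CHARS = 8000          # per-row character limit from the note above
MIN_ROWS_PER_LABEL = 200  # minimum rows per classification category
MAX_LABELS = 15           # maximum number of classification labels


def row_length(row):
    """Characters in System_Prompt (if present), Text, and Completion combined."""
    return sum(len(row.get(key) or "") for key in ("System_Prompt", "Text", "Completion"))


def check_rows(rows):
    """Report rows that exceed the per-row character limit."""
    return [
        f"row {i}: exceeds {MAX_CHARS} characters"
        for i, row in enumerate(rows)
        if row_length(row) > MAX_CHARS
    ]


def check_labels(rows):
    """Report label problems for a classification dataset (label assumed in Completion)."""
    problems = []
    counts = Counter(row["Completion"] for row in rows)
    if len(counts) > MAX_LABELS:
        problems.append(f"{len(counts)} labels found, at most {MAX_LABELS} allowed")
    for label, count in counts.items():
        # isalnum() rejects spaces, punctuation, and empty strings
        if not label.isalnum():
            problems.append(f"label {label!r}: spacing and special characters are not permitted")
        if count < MIN_ROWS_PER_LABEL:
            problems.append(f"label {label!r}: {count} rows, at least {MIN_ROWS_PER_LABEL} required")
    return problems
```

An empty list from both functions means that no problem was detected under these assumptions.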

Instruction dataset file format

Instruction datasets must be prepared in either CSV or JSONL format. When you use the CSV file format, you must use the Instruction dataset template. Files that do not align with the template cannot be uploaded.

CSV file

If you are configuring your dataset as a CSV file, check the following:

The first row must contain exactly "C_ID", "T_ID", "Text", and "Completion", and it must consist of only these 4 columns.
Be sure to delete blank rows and columns.
When you need to break the line, type "\n".
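
As an illustration of these rules, the following sketch builds CSV text with Python's standard csv module. The function names are hypothetical, and the line break rule is interpreted as replacing real line breaks inside a cell with the two characters \ and n.

```python
import csv
import io

HEADER = ["C_ID", "T_ID", "Text", "Completion"]


def escape_breaks(value):
    """Replace real line breaks with the literal two-character sequence \\n."""
    return value.replace("\r\n", "\\n").replace("\n", "\\n")


def to_csv(rows):
    """Return CSV text with the required header row and one line per data row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    for row in rows:
        writer.writerow([
            row["C_ID"],
            row["T_ID"],
            escape_breaks(row["Text"]),
            escape_breaks(row["Completion"]),
        ])
    return buffer.getvalue()
```

When saving the result, open the file with encoding="utf-8" so that it matches the file specifications.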

JSONL file

If you are configuring your dataset as a JSONL file, check the following:

Each line must consist of {"C_ID": order, "T_ID": order, "Text": "input value", "Completion": "desired result value"}, and "input value" and "desired result value" must contain at least 1 character.
Use '\"' for double quotation marks inside a value.
When you need to break the line, type "\n".
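
A JSON serializer applies the quotation mark and line break escapes automatically. The following is a minimal sketch using Python's standard json module; the function name to_jsonl_line is hypothetical.

```python
import json


def to_jsonl_line(c_id, t_id, text, completion):
    """Serialize one Text/Completion pair as a single JSONL line."""
    if not text or not completion:
        raise ValueError("Text and Completion must contain at least 1 character")
    record = {"C_ID": c_id, "T_ID": t_id, "Text": text, "Completion": completion}
    # json.dumps escapes double quotation marks as \" and line breaks as \n;
    # ensure_ascii=False keeps non-ASCII text readable in the UTF-8 file
    return json.dumps(record, ensure_ascii=False)
```

Write each returned string to the file followed by a single line break, so that every record occupies exactly 1 line.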

Conversational scenario method

Depending on the user's objectives, you can configure your dataset as either a single turn conversational scenario or a multi-turn conversational scenario. Single turn involves obtaining an answer from a single question, whereas multi-turn involves an exchange of conversation to refine and achieve the desired outcome.

Note

For detailed examples, see Single turn example file (.csv) and Multi-turn example file (.csv).

Single turn

Single turn scenarios are configured with only 1 turn (T_ID=0) for a specific C_ID. Each C_ID has only 1 T_ID, so the value of every T_ID is "0".

Multi-turn

Multi-turn scenarios are configured with 2 or more turns (T_ID=0, 1, ...) for a specific C_ID. It is recommended to configure a conversation topic with 3 or more turns.

Prepare dataset fields

This section describes how to fill in each field of the dataset.

System_Prompt

The System_Prompt field contains specific instructions to be performed by CLOVA Studio. The field is optional, but adding it to the training dataset improves performance on the instructed task.
If you include the System_Prompt field, add it to the first row of your dataset. Field values are not case-sensitive. The same instruction (System Prompt) must be entered for every row with the same C_ID, and the same instruction must be applied at inference.

The AI assistant accurately understands the given content and answers the questions.
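
The requirement that rows sharing a C_ID carry the same instruction can be checked with a short script. This is a sketch with a hypothetical function name; it assumes that rows are dictionaries keyed by field name and treats a missing System_Prompt as an empty string.

```python
def conflicting_system_prompts(rows):
    """Return the C_IDs whose rows contain more than one distinct System_Prompt."""
    first_prompt = {}
    conflicts = []
    for row in rows:
        c_id = row["C_ID"]
        prompt = row.get("System_Prompt", "")
        if c_id not in first_prompt:
            first_prompt[c_id] = prompt
        elif first_prompt[c_id] != prompt and c_id not in conflicts:
            conflicts.append(c_id)
    return conflicts
```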

C_ID

C_ID is a number assigned to a conversation scenario composed of the same topic. It starts from 0 and increases by 1.

T_ID

T_ID is a number assigned to each pair of question (text) and answer (completion) within a single conversation scenario. It starts from 0 and increases by 1.
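
The numbering rules for C_ID and T_ID can be verified as follows. This is a minimal sketch with hypothetical function names; it assumes that rows appear in conversation order and that both IDs can be converted to integers.

```python
from collections import defaultdict


def check_ids(rows):
    """Check that C_ID runs 0, 1, 2, ... and that T_ID restarts at 0 in each conversation."""
    turns = defaultdict(list)
    for row in rows:
        turns[int(row["C_ID"])].append(int(row["T_ID"]))
    problems = []
    if sorted(turns) != list(range(len(turns))):
        problems.append("C_ID values must start at 0 and increase by 1")
    for c_id, t_ids in sorted(turns.items()):
        if t_ids != list(range(len(t_ids))):
            problems.append(f"C_ID {c_id}: T_ID values must start at 0 and increase by 1")
    return problems


def is_single_turn(rows):
    """A dataset is single turn when every row has T_ID 0."""
    return all(int(row["T_ID"]) == 0 for row in rows)
```

For example, rows with (C_ID, T_ID) pairs (0, 0), (0, 1), (0, 2) form one valid multi-turn conversation, while (0, 0), (0, 2) is reported because T_ID 1 is missing.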

Text

Text represents the user's request. Write it on a variety of topics and in a variety of tones. Any content is acceptable, even if it contains typographical errors or ungrammatical sentences.

When only using instructions

I'm soooo bored today

What is Vaseline made of?

I'm feeling really sick today. Plzzzzzz draft a text message for my boss saying I might be late for work

When including instructions with additional information

Rewrite the following paragraph to make it more relatable. 
{Formal paragraph}

Read the following document and answer the question accurately.
{Document}
{Question}

What does the following code do?
{Code}

Completion

The Completion field provides the response to the user's request. Write long, detailed responses. The more formatting the answers use, such as numbered lists, bullet points, line breaks, and markdown tables, the better.

There are many different types of necklines for tops. Typically, the following are some common types: 
1. Round neck
2. V-neck
3. Crew neck
4. Square neck
5. Boat neck

There are many different types of necklines for tops. It could be summarized in a table as follows: 
| Type of neckline | Description |
| --- | --- |
| Round neck | Gently wraps around the neck. |
| V-neck | Dives deeper, giving a cool sensation. |
| Crew neck | Slightly broader than the round neck. |
| Square neck | Encompasses the neck in a square shape. |
| Boat neck | Broad on the sides, giving a soft feel. |

When preparing for a trip, ensure you consider the following points:
- Decide on your travel dates and destination.
- Prepare necessary documents such as passport, visa, and relevant vaccinations. 
- Get travel insurance.
- Pack your travel bag. 
- Purchase any items you will need during your trip. 
- Make reservations for flights, hotels, and other transportation. 
- Gather information about your travel destination.
- Prepare for any unforeseen circumstances.

Upload dataset

The following describes how to upload a dataset to Object Storage and utilize it when calling the training generation APIs.

Dataset upload scenario

Use Object Storage to upload the dataset required for training. Create a dedicated bucket in Object Storage and upload the dataset to it. When you call the training generation APIs, you can load the dataset uploaded to Object Storage for training.
For safe data management, create policies that grant permissions to upload datasets and view the list of files within the bucket, and assign them to the sub account.

The following describes a scenario for uploading a dataset and using it to call the training generation APIs:

1. See "Create Bucket" to create a bucket in Object Storage where the data will be uploaded.
2. See "Create sub accounts" to create a sub account that will be used for uploading data.
3. See "Create policies for sub account" and "Apply policy to sub account" to create a policy and apply it to the sub account.
4. See "Upload dataset" to upload the dataset to the bucket.
5. See "Confirm necessary information for calling training generation APIs" to view the following information:
- Name of the bucket where the training dataset is located.
- File path of the training dataset.
- Access key to access the training dataset.
- Secret key to access the training dataset.
6. Use the information obtained in Step 5 to call the training generation API.

Note

Sub Account is a service that provides sub accounts to enable multiple users to use and manage the same resources. For more information on Sub Account, see Sub Account user guides.
Object Storage is a service that offers file storage spaces. For more information on Object Storage, see Object Storage guides.

Create Bucket

To create a bucket for uploading a dataset, follow these steps:

From the NAVER Cloud Platform console, click Services > Storage > Object Storage in order.
Click [Subscribe] to complete the subscription.
Click [Create bucket] in the bucket management menu.
Enter a name for the bucket to create, and then click [Next].
- Calling the training generation APIs requires the name of the bucket.
Once the Manage settings screen appears, click [Next] without making any changes.
When the Permissions management screen appears, set the public access status to private and click [Next].
After checking the setup, click [Create bucket].

Create sub accounts

You can safely upload a dataset by using a sub account with limited access permissions.
To create sub accounts, follow these steps:

From the NAVER Cloud Platform console, click Services > Management & Governance > Sub Account in order.
In the Sub Accounts menu, click [Create sub account].
Enter the sub account information.
- Login ID: enter the ID you want to use for login.
- Username: enter the username of the sub account user.
- Access permission
  - Clear the Console access checkbox.
  - Select the API Gateway access checkbox, and then select Allow access from all sources.
- Two-factor authentication options: select whether to use two-factor authentication.
- Login password: select "Enter manually" and set a password.
- Password reset notification: clear the checkbox.
Click [Create].
When the Creation completion window appears, copy the sub account information and click [OK].
When the Sub account details window appears, go to the [Access key] tab and click [Add].
- Your access key ID and secret key information are created.

Create policies for sub account

You can create a policy that grants permissions to upload a dataset to a specific bucket in Object Storage and to view the bucket's file list, and then assign the policy to your sub account.
To create a policy for your sub account, follow these steps:

From the NAVER Cloud Platform console, click Services > Management & Governance > Sub Account in order.
Click [Create policy] in the Policies menu.
Set the policy information and target.
- Policy name: enter the name of the policy you want to create.
- Platform: select VPC policy.
- Service: select Object Storage.
When the Action name area appears, click [Expand] for Insert permissions and select writeObject.
In the Resources area, activate the option to specify resources for the bucket and click [Select resource].
When the Bucket resource selection window appears, select the bucket where the dataset will be uploaded, and then click the icon.
Once the bucket to upload the dataset to is added to the resource list, click [Apply].
Click [Add target].
Confirm that Object Storage is added to the target and click [Create].

Apply policy to sub account

To apply a policy to your sub account, follow these steps:

From the NAVER Cloud Platform console, click Services > Management & Governance > Sub Account in order.
Click the name of the sub account.
In the sub account details screen, go to the [Policies] tab and click [Add individual permission].
When the Add policy window appears, click the [User-defined policy] tab. Select the checkbox for the created policy and click [Add].

Upload dataset

To upload a dataset to an Object Storage bucket, follow these steps:

Forward the sub account information, access key ID, and secret key to the user who will upload the dataset.
See the Object Storage API guides to upload the dataset to Object Storage.

Confirm necessary information for calling training generation APIs

To confirm the necessary information for calling the training generation APIs using the dataset uploaded to Object Storage, follow these steps:

From the NAVER Cloud Platform console, click Services > Storage > Object Storage in order.
In the Bucket management menu, copy the name of the bucket where the dataset was uploaded and apply it to trainingDatasetBucket.
Click the file uploaded to the bucket and check the link in its details. Copy the path within the bucket (the part of the link after the bucket name) and apply it to trainingDatasetFilePath.
From the NAVER Cloud Platform console, click Services > Management & Governance > Sub Account in order.
Click the sub account that has access to the bucket. Go to the details and click the [Access key] tab.
Copy the Access Key ID and apply it to trainingDatasetAccessKey.
Copy the Secret Key and apply it to trainingDatasetSecretKey.
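
The four values collected in these steps can be assembled as follows. This is a sketch only: the function names are hypothetical, the file link is assumed to have the form https://host/bucket-name/path/to/file, and whether the API expects these values in exactly this structure should be confirmed in the CLOVA Studio APIs guide.

```python
from urllib.parse import unquote, urlparse


def file_path_from_link(link, bucket):
    """Return the object path inside the bucket, with the bucket name removed.

    Assumes a link of the form https://host/<bucket>/<path>.
    """
    path = unquote(urlparse(link).path).lstrip("/")
    prefix = bucket + "/"
    if path.startswith(prefix):
        path = path[len(prefix):]
    return path


def training_dataset_params(bucket, link, access_key, secret_key):
    """Collect the dataset values under the parameter names used in this guide."""
    return {
        "trainingDatasetBucket": bucket,
        "trainingDatasetFilePath": file_path_from_link(link, bucket),
        "trainingDatasetAccessKey": access_key,
        "trainingDatasetSecretKey": secret_key,
    }
```

Avoid hard-coding the access key and secret key in source files. Reading them from environment variables or a secret store keeps them out of version control.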

Note

For more information on training generation APIs, see the CLOVA Studio APIs guide.