Prepare dataset

    Available in Classic and VPC

    This section describes the dataset specifications, how to create a dataset, and creation examples.

    A dataset is a collection of data used to train a language model so that it is optimized for the tasks the user requires.

    The dataset format depends on the model you want to train.

    • To train an existing model, prepare the dataset according to the dataset guide. See Dataset.
    • To train a HyperCLOVA X language model, prepare the dataset according to the instruction dataset guide. See Instruction dataset.
    Caution

    Users bear full responsibility for any issues or outcomes arising from uploading datasets that contain personal information.

    Dataset

    This section describes how to prepare the dataset required to train an existing model.

    Dataset file specifications

    The dataset file specifications are as follows:

    Item                    Description
    Extensions              *.csv or *.jsonl
    File encoding format    UTF-8
    Minimum data            400 rows or more recommended
    File size               Under 50 MB; if training via API, under 100 MB
    File name               Between 2 and 30 characters
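    The specifications above can be checked locally before uploading. The following is a minimal sketch, not an official tool: the function name check_file_spec is hypothetical, 1 MB is assumed to be 1024 * 1024 bytes, and the 2 to 30 character limit is assumed to apply to the file name without its extension. Row counting is left out because a quoted CSV field may span several lines.

```python
from pathlib import Path

MAX_BYTES_UI = 50 * 1024 * 1024    # "under 50 MB" (assumes 1 MB = 1024 * 1024 bytes)
MAX_BYTES_API = 100 * 1024 * 1024  # "under 100 MB" when training via API


def check_file_spec(path, via_api=False):
    """Return a list of problems; an empty list means none were detected."""
    path = Path(path)
    problems = []

    if path.suffix.lower() not in {".csv", ".jsonl"}:
        problems.append("extension must be .csv or .jsonl")

    # Assumption: the limit applies to the name without the extension.
    if not 2 <= len(path.stem) <= 30:
        problems.append("file name must be between 2 and 30 characters")

    data = path.read_bytes()
    limit = MAX_BYTES_API if via_api else MAX_BYTES_UI
    if len(data) >= limit:
        problems.append("file is too large")

    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")

    return problems
```

    For example, check_file_spec("train.jsonl") returns an empty list when the file passes all four checks.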

    Dataset template

    The fields constituting the dataset are as follows:

    Field         Description
    Text          All utterances expected from the user
    Completion    All responses expected from CLOVA Studio
    Note
    • Each row (text and completion pair) can contain at most 4,000 characters, including spaces. If a row exceeds 4,000 characters, only a portion of the dataset is uploaded.
    • For a document classification task dataset, enter at least 200 rows per classification category.
      <example> 30% positive (300 cases), 30% negative (300 cases), 40% neutral (400 cases)
    • For multi-class classification task datasets, you can use up to 15 classification labels.
    • We recommend using a single word for each classification label. Spaces and special characters are not permitted.
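    The limits in this note can be checked with a few lines of code. The sketch below is illustrative only: check_rows is a hypothetical helper, the character limit is assumed to apply to Text and Completion combined, and the Completion value is assumed to be the classification label.

```python
from collections import Counter

MAX_ROW_CHARS = 4000       # per row, spaces included
MIN_ROWS_PER_LABEL = 200   # per classification category
MAX_LABELS = 15


def check_rows(rows, classification=False):
    """rows is a list of (text, completion) tuples. Returns a list of problems."""
    problems = []
    for i, (text, completion) in enumerate(rows):
        # Assumption: the limit covers Text and Completion together.
        if len(text) + len(completion) > MAX_ROW_CHARS:
            problems.append(f"row {i}: longer than {MAX_ROW_CHARS} characters")

    if classification:
        # Assumption: the Completion value is the classification label.
        counts = Counter(completion for _, completion in rows)
        if len(counts) > MAX_LABELS:
            problems.append(f"{len(counts)} labels; at most {MAX_LABELS} allowed")
        for label, n in counts.items():
            # isalnum() is False for spaces, special characters, and "".
            if not label.isalnum():
                problems.append(f"label {label!r}: spaces or special characters")
            if n < MIN_ROWS_PER_LABEL:
                problems.append(f"label {label!r}: {n} rows, need {MIN_ROWS_PER_LABEL}")
    return problems
```

    Python's len counts Unicode code points, which may differ from how the service counts characters.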

    Dataset file format

    Datasets must be prepared in either CSV or JSONL format. When using the CSV file format, you must use the Dataset template. Files that do not match the template cannot be uploaded.

    CSV file

    If you are configuring your dataset as a CSV file, check the following:

    • The first row must contain exactly "Text" and "Completion", and the file must have only 2 columns.
    • Delete all blank rows and columns.
    • To insert a line break, use "\n".
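    A CSV file that follows these rules can be produced with Python's standard csv module. This is a minimal sketch under one assumption: that a line break is written as the two characters "\" and "n" rather than as a real newline inside a quoted field. The function name to_csv is hypothetical.

```python
import csv
import io


def to_csv(rows):
    """Serialize (text, completion) pairs as CSV text with the required header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Text", "Completion"])  # header row, exactly 2 columns
    for text, completion in rows:
        if not text.strip() or not completion.strip():
            continue  # skip blank rows
        # Assumption: a line break becomes the two characters "\" and "n".
        writer.writerow([text.replace("\n", "\\n"), completion.replace("\n", "\\n")])
    return buffer.getvalue()
```

    When saving the result, open the file with encoding="utf-8" and newline="" so that the encoding matches the file specifications and no extra line endings are added.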

    JSONL file

    If you are configuring your dataset as a JSONL file, check the following:

    • Each line must have the form {"Text": "Input value", "Completion": "Desired result value"}, where "Input value" and "Desired result value" each contain at least one character.
    • Escape double quotation marks inside a value as '\"'.
    • To insert a line break, use "\n".
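    Writing each line through a JSON serializer takes care of the quoting and line-break rules automatically. The sketch below uses Python's standard json module; the function name to_jsonl is hypothetical.

```python
import json


def to_jsonl(rows):
    """Serialize (text, completion) pairs as one JSON object per line."""
    lines = []
    for text, completion in rows:
        if not text or not completion:
            raise ValueError("Text and Completion need at least one character")
        # json.dumps escapes double quotes as \" and line breaks as \n.
        record = {"Text": text, "Completion": completion}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

    Passing ensure_ascii=False keeps non-ASCII text readable in the file; save the output with UTF-8 encoding.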

    Prepare dataset fields

    For instructions on how to write the Text and Completion fields of the dataset, see Prepare dataset fields.

    Note

    Datasets intended for conversation tasks should be prepared as follows:

    • Enter 3 or more utterances in the Text column and 1 utterance in the Completion column.
    • Use the same speaker for every entry in the Completion column.
    • We recommend that the conversation flow continuously from the Text column into the Completion column.
    • A conversation is limited to 2 speakers.
    • Clearly indicate the speaker at the beginning of each utterance. <example> "customer:", "seller:"
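    As an illustration of these rules, the following sketch turns one two-speaker conversation into a Text and Completion pair. The helper conversation_to_row is hypothetical, and it assumes that the utterances in Text are separated by line breaks, which are then written to the file as "\n".

```python
def conversation_to_row(utterances):
    """Turn a list of (speaker, utterance) pairs into one (text, completion) row.

    The last utterance becomes the Completion; everything before it is the Text.
    """
    if len(utterances) < 4:
        raise ValueError("need at least 3 utterances for Text and 1 for Completion")
    if len({speaker for speaker, _ in utterances}) > 2:
        raise ValueError("a conversation is limited to 2 speakers")

    # Prefix every utterance with its speaker, for example "customer: ...".
    labelled = [f"{speaker}: {utterance}" for speaker, utterance in utterances]
    # Assumption: utterances in Text are separated by a line break.
    text = "\n".join(labelled[:-1])
    completion = labelled[-1]
    return text, completion
```

    To keep the Completion speaker the same across the dataset, end every conversation on the same speaker, for example "seller".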

    Instruction dataset

    The instruction dataset, designed to make use of the capabilities of HyperCLOVA X, prioritizes quality over quantity. Detailed and lengthy entries, especially ones that specify the desired answer format, can improve tuning performance. The amount of data required varies by task, but optimal tuning performance requires at least 400 turns (text and completion pairs), and more data is needed in areas on which HyperCLOVA X has not been trained.

    Instruction dataset file specifications

    The file specifications for the instruction dataset are as follows:

    Item                    Description
    Extensions              *.csv or *.jsonl
    File encoding format    UTF-8
    Minimum data            400 rows or more recommended
    File size               Under 100 MB

    Instruction dataset template

    The fields constituting the instruction dataset are as follows:

    Field         Description
    C_ID          Conversation ID. A number assigned to a conversation scenario on a single topic. Starts from 0 and increases by 1
    T_ID          Turn ID. A number assigned to each pair of question (text) and answer (completion) within a single conversation scenario. Starts from 0 and increases by 1
    Text          All utterances expected from the user
    Completion    All responses expected from CLOVA Studio
    Note
    • Each row (text and completion pair) can contain at most 8,000 characters, including spaces. If a row exceeds 8,000 characters, only a portion of the dataset is uploaded.
    • For a document classification task dataset, enter at least 200 rows per classification category.
      <example> 30% positive (300 cases), 30% negative (300 cases), 40% neutral (400 cases)
    • For multi-class classification task datasets, you can use up to 15 classification labels.
    • We recommend using a single word for each classification label. Spaces and special characters are not permitted.

    Instruction dataset file format

    Instruction datasets must be prepared in either CSV or JSONL format. When using the CSV file format, you must use the Instruction dataset template. Files that do not match the template cannot be uploaded.

    CSV file

    If you are configuring your dataset as a CSV file, check the following:

    • The first row must contain exactly "C_ID", "T_ID", "Text", and "Completion", and the file must have only 4 columns.
    • Delete all blank rows and columns.

    JSONL file

    If you are configuring your dataset as a JSONL file, check the following:

    • Each line must have the form {"C_ID": order, "T_ID": order, "Text": "Input value", "Completion": "Desired result value"}, where "Input value" and "Desired result value" each contain at least one character.
    • Escape double quotation marks inside a value as '\"'.
    • To insert a line break, use "\n".

    Conversational scenario method

    Depending on your objectives, you can configure the dataset as either a single-turn or a multi-turn conversational scenario. A single-turn scenario obtains an answer from a single question, whereas a multi-turn scenario uses an exchange of several questions and answers to refine the result and reach the desired outcome.

    Single turn

    Single-turn scenarios contain only 1 turn (T_ID=0) per C_ID. Because each C_ID has only one T_ID, every T_ID value is 0.

    Multi-turn

    Multi-turn scenarios contain 2 or more turns (T_ID=0, 1, ...) per C_ID. We recommend 3 or more turns per conversation topic.

    Prepare dataset fields

    This section describes how to fill in each field of the dataset.

    C_ID

    C_ID is a number assigned to a conversation scenario composed of the same topic. It starts from 0 and increases by 1.

    T_ID

    T_ID is a number assigned to each pair of question (text) and answer (completion) within a single conversation scenario. It starts from 0 and increases by 1.
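    The numbering rules for C_ID and T_ID can be verified before uploading. This is a minimal sketch with a hypothetical helper, check_ids; it assumes that "increases by 1" means there are no gaps and that turns appear in the file in order.

```python
from collections import defaultdict


def check_ids(rows):
    """rows is a list of (c_id, t_id) pairs in file order. Returns a list of problems."""
    turns = defaultdict(list)
    for c_id, t_id in rows:
        turns[c_id].append(t_id)

    problems = []
    # C_ID values should be exactly 0, 1, 2, ... across the dataset.
    if sorted(turns) != list(range(len(turns))):
        problems.append("C_ID values must be 0, 1, 2, ... with no gaps")
    # T_ID values should be 0, 1, 2, ... within each conversation.
    for c_id, t_ids in turns.items():
        if t_ids != list(range(len(t_ids))):
            problems.append(f"C_ID {c_id}: T_ID values must be 0, 1, 2, ... in order")
    return problems
```

    For example, check_ids([(0, 0), (0, 1), (1, 0)]) returns an empty list, because the first conversation has turns 0 and 1 and the second has turn 0.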

    Text

    Text represents the user's request. Write it across a variety of topics and tones. Any content is acceptable, including text with typographical errors or ungrammatical sentences.

    • When only using directives

      I'm so bored today
      
      What is Vaseline made of?
      
      I'm feeling really sick today. Plzzzzzz draft a message for my boss saying I might be late for work
      
    • When including directives with additional information

      Rewrite the following paragraph to make it more relatable. 
      {formal paragraph}
      
      Read the following document and answer the question accurately.
      {Document}
      {Question}
      
      What does the following code do?
      {Code}
      

    Completion

    The Completion field provides the response to the user's request. Write long, detailed responses. The more structure the answer has, such as numbered lists, bullet points, line breaks, and markdown formatting, the better.

    There are many different types of necklines for tops. Typically, the following are some common types: 
    1. Round neck
    2. V-neck
    3. Crew neck
    4. Square neck
    5. Boat neck
    
    There are many different types of necklines for tops. It could be summarized in a table as follows: 
    | Type of neckline | Description |
    | --- | --- |
    | Round neck | Gently wraps around the neck |
    | V-neck | Dives deeper, giving a cool sensation |
    | Crew neck | Slightly broader than the round neck |
    | Square neck | Encompasses the neck in a square shape |
    | Boat neck | Broad on the sides, giving a soft feel |
    
    When preparing for a trip, ensure you consider the following points:
    - Decide on your travel dates and destination
    - Prepare necessary documents such as passport, visa, and relevant vaccinations 
    - Get travel insurance
    - Pack your travel bag 
    - Purchase any items you will need during your trip 
    - Make reservations for flights, hotels, and other transportation 
    - Gather information about your travel destination
    - Prepare for any unforeseen circumstances 
    
