Data preparation

Available in VPC

NCLUE creates features from your company’s user data. Before using the NCLUE service, check which types of data are required, then prepare the datasets as described in this guide.

Note

Once the dataset is prepared, upload it to an Object Storage bucket. For details, see the Object Storage user guide.

Data types

The data required to use the NCLUE service is as follows:

Sequence datasets

A description of sequence datasets is as follows:

  • This dataset contains the user behavior used for feature creation. It is a list built by extracting each user's behavior history from the client company's data only. The features created are then used as data to create task models.
  • The format lists each user's behavior history in chronological order (see Sequence dataset format).
  • A sequence dataset must be prepared for every user whose behavior you want to predict.
    • Example: To model various tasks for 3 million users, you must prepare behavior sequence datasets for all 3 million users.

Correct answer datasets

A description of correct answer datasets is as follows:

  • This dataset contains the correct answers about user behavior and is used to create a task.
  • In this context, "correct answer" refers to the user behavior or characteristic that the task model aims to predict.
  • In this format, users from the sequence dataset used for feature creation are labeled 1 (correct answer) or 0 (incorrect answer) for the task (see Correct answer dataset format).
  • Tasks can be created with data from only some users, but accuracy improves as more correct answer data becomes available.

Data preparation

Before preparing data, review the format and composition required for each type of data.

Sequence datasets

A description of the format and composition of sequence datasets is as follows.

Sequence dataset format

The sequence dataset required to create features must use the following format.
Create the dataset as a tab-delimited CSV file.

  • Data format

    # separator tab
    {user_id}\t{sequence}
    
  • {sequence} format

    {behavior}<|s|>{behavior}<|s|>{behavior}<|s|>......<|s|>{behavior}
    
  • Final data format

    # format : {user_id}\t{behavior}<|s|>{behavior}<|s|>......<|s|>{behavior}
    
    • Example:
      u730023 Starbucks Pangyo Avenue France<|s|>Momos<|s|>Pokémon GO<|s|>Hallabong<|s|>AirPods<|s|>Nike Sale
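
The format above can be sketched in Python as follows. This is a minimal illustration, not part of the NCLUE service: the function name `build_sequence_lines` and the `(user_id, timestamp, behavior)` input shape are assumptions made for the example, and the timestamps are used only for ordering and are not written to the output.

```python
from collections import defaultdict

SEP = "<|s|>"  # behavior separator defined by the sequence format

def build_sequence_lines(events):
    """Build '{user_id}\\t{sequence}' lines.

    `events` is an iterable of (user_id, timestamp, behavior) tuples.
    This input shape is hypothetical; adapt it to your own data source.
    """
    by_user = defaultdict(list)
    for user_id, timestamp, behavior in events:
        by_user[user_id].append((timestamp, behavior))

    lines = []
    for user_id, items in by_user.items():
        # Oldest behavior first (leftmost), most recent last (rightmost).
        items.sort(key=lambda pair: pair[0])
        sequence = SEP.join(behavior for _, behavior in items)
        lines.append(f"{user_id}\t{sequence}")
    return lines

events = [
    ("u730023", 3, "Pokémon GO"),
    ("u730023", 1, "Starbucks Pangyo Avenue France"),
    ("u730023", 2, "Momos"),
]
lines = build_sequence_lines(events)
```

Here `lines` holds one tab-delimited row per user, with the time information dropped and only the order preserved.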
      

Sequence dataset composition

A description of the composition of sequence datasets is as follows:

  • User ID ({user_id})

    • User IDs must be unique.
    • You may enter numbers, strings, or a combination of numbers and strings.
    • The maximum permissible length is 100 characters.
    • Identifiers that contain personal information (such as resident registration numbers, passport numbers, driver’s license numbers, credit card numbers, mobile phone numbers, or email addresses) cannot be used as user IDs.
    • Rather than reusing the user IDs from your company’s system, we recommend creating separate user IDs for NCLUE.
  • Behavior ({behavior})

    • "Behavior" refers to distinct actions taken by users when interacting with the client's services and products.
    • Examples of behaviors include keywords searched, services viewed, and products purchased.
    • You may input any string (words, phrases, clauses, or sentences) that represents various behavior history.
    • We recommend entering strings similar to behaviors that might occur in NAVER services, such as NAVER search keywords, shopping product names, and business names.
    • Include spaces when entering {behavior} values to make strings easy to read.
    • {behavior} may include spaces or special characters.
  • Sequence ({sequence})

    • Only the order in which behaviors are listed in the sequence is taken into account. List one person's behavior history in chronological order, excluding time information, and separate each behavior with <|s|>. The leftmost behavior is the oldest, and the rightmost behavior is the most recent.
    • The maximum length of a sequence is limited to 2048 tokens in HyperCLOVA, the software we use internally. A sequence can contain 150-500 behaviors, depending on the lengths of the behavior strings. If the maximum length is exceeded, the part of the sequence beyond the limit is not used as input.

Note

User information entered into sequence datasets is used only to distinguish users within the NCLUE service and is not used as input for feature creation or task model training.
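
Some of the composition rules above can be checked before upload. The sketch below is a hypothetical pre-upload check, not an NCLUE tool: `check_sequence_lines` is an invented name, and treating an empty behavior as a problem is an assumption that this guide does not state. The 2048-token limit is not checked, because it depends on the tokenizer used inside the service.

```python
SEP = "<|s|>"          # behavior separator defined by the sequence format
MAX_USER_ID_LEN = 100  # maximum user ID length stated in this guide

def check_sequence_lines(lines):
    """Return a list of (line_number, problem) tuples; empty if nothing was found."""
    problems = []
    seen = set()
    for n, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            problems.append((n, "expected exactly one tab"))
            continue
        user_id, sequence = parts
        if not user_id or len(user_id) > MAX_USER_ID_LEN:
            problems.append((n, "user ID must be 1-100 characters"))
        if user_id in seen:
            problems.append((n, "duplicate user ID"))
        seen.add(user_id)
        # Assumption: an empty behavior between separators is treated as a mistake.
        if any(not behavior.strip() for behavior in sequence.split(SEP)):
            problems.append((n, "empty behavior"))
    return problems

sample = ["u1\tA<|s|>B", "u1\tC", "u2 no tab", "u3\tA<|s|><|s|>B"]
problems = check_sequence_lines(sample)
```

For the sample above, the check reports a duplicate user ID on line 2, a missing tab on line 3, and an empty behavior on line 4.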

Correct answer datasets

A description of the format and composition of correct answer datasets is as follows.

Correct answer dataset format

The format of correct answer datasets, which are required for task creation, is as follows:
Create the dataset as a tab-delimited CSV file.

  • Data format

    # separator tab
    {user_id}\t{label}
    
  • Final data format

    # format : {user_id}\t{label}
    
    • Example:
      u192873 0
      u730023 1
      u239376 0
      u846712 1
      u558145 1
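
As an illustration of the format above, the following sketch writes `{user_id}\t{label}` rows with Python's standard `csv` module. The helper name `write_labels` is hypothetical, and the choice of `\n` line endings with no header row is an assumption, since this guide does not specify either.

```python
import csv
import io

def write_labels(rows, fileobj):
    """Write (user_id, label) pairs as tab-delimited rows, one user per line."""
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    for user_id, label in rows:
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label!r}")
        writer.writerow([user_id, label])

# An in-memory buffer stands in for a real file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_labels([("u192873", 0), ("u730023", 1)], buffer)
text = buffer.getvalue()
```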
      

Correct answer dataset composition

The guidelines for creating correct answer datasets are as follows:

  • {label} is represented as 1 (correct answer) or 0 (incorrect answer).
  • The correct answer dataset must include at least 200 entries, with a minimum of 100 each for 0 and 1. Larger datasets improve performance.
  • You can create a correct answer dataset with user IDs selected from the sequence dataset used for feature creation.
    • Example: If 2,000 out of 1 million users have correct answer data for a specific Task A, label only those 2,000 users as 1 and some of the remaining users as 0, then use the result as the correct answer dataset.
  • You can create correct answer datasets based on the purpose of the Task Model.
    • Example: For a Task Model that predicts which users are likely to buy product M from your company’s products, you can create the correct answer dataset by labeling customers who bought M as 1 and customers who did not buy M as 0.
  • User IDs ({user_id}) cannot contain personal information (such as resident registration numbers, passport numbers, driver’s license numbers, credit card numbers, mobile phone numbers, or email addresses).
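
The labeling approach described above can be sketched as follows. This is a hypothetical helper, not an NCLUE API: `build_answer_rows`, its parameters, and the use of random sampling to choose the users labeled 0 are all assumptions for illustration. The only rules taken from this guide are the minimum of 100 rows per label and the use of user IDs from the sequence dataset.

```python
import random

MIN_PER_LABEL = 100  # minimum number of rows for each of 0 and 1, per this guide

def build_answer_rows(sequence_user_ids, positive_ids, n_negatives, seed=0):
    """Label known positives as 1 and a random sample of the other users as 0."""
    all_ids = set(sequence_user_ids)
    # Keep only positives that also appear in the sequence dataset.
    positives = sorted(set(positive_ids) & all_ids)
    candidates = sorted(all_ids - set(positive_ids))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    negatives = rng.sample(candidates, n_negatives)
    if len(positives) < MIN_PER_LABEL or len(negatives) < MIN_PER_LABEL:
        raise ValueError("need at least 100 rows for each label")
    return [(u, 1) for u in positives] + [(u, 0) for u in negatives]

# Synthetic IDs for illustration only.
users = [f"u{i:06d}" for i in range(1000)]
buyers = users[:150]
rows = build_answer_rows(users, buyers, n_negatives=200)
```

Whether random sampling is the right way to choose the users labeled 0 depends on the task, so treat it as one possible choice rather than a recommendation from this guide.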