Data preparation

Available in VPC

The NCLUE service builds its features from the client company's user data. Before using the service, check which types of data are required, then prepare your datasets as described in this guide.

Note

Once the dataset is prepared, upload it to an Object Storage bucket. For details, see the Object Storage user guide.

Data types

The data required to use the NCLUE service is as follows:

Sequence datasets

A description of sequence datasets is as follows:

  • This dataset contains the user behavior used for feature creation. It is built as a list by extracting each user's behavior history, using only the client company's data. The resulting features are then used to create task models.
  • The dataset lists each user's behavior history in chronological order (see Sequence dataset format).
  • A sequence dataset must be prepared for every user whose behavior you want to predict.
    • Example: say we want to perform a variety of task modeling on 3 million users. In that case, we need to prepare behavior sequence datasets for all 3 million users.

Correct answer datasets

A description of correct answer datasets is as follows:

  • Data containing the correct answers to the user behaviors used to create a task.
  • In this context, "correct answer" refers to the user's behavior or characteristics that the task model aims to predict.
  • In this format, users from the sequence dataset used for feature creation are labeled 1 (correct answer) or 0 (incorrect answer) for the task (see Correct answer dataset format).
  • Tasks can be created with data from only some of the users, but the more correct answer data is available, the more accurate the task model tends to be.

Data preparation

Before preparing your data, check the required format and composition for each type of data.

Sequence datasets

A description of the format and composition of sequence datasets is as follows.

Sequence dataset format

The format of sequence datasets, which are required for feature creation, is as follows:
Prepare the dataset as a tab-delimited CSV file.

  • Data format

    # separator tab
    {user_id}\t{sequence}
    
  • {sequence} format

    {behavior}->{behavior}->{behavior}->......->{behavior}
    
  • Final data format

    # format : {user_id}\t{behavior}->{behavior}->......->{behavior}
    
    • Example:
      u730023 Starbucks Pangyo Avenue France->Momos->Pokémon GO->Dekopon->AirPods->Nike Sale
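
The format above can be produced from raw event logs. The following is a minimal sketch, assuming the source data is available as (user ID, timestamp, behavior) records. The function name `build_sequence_lines`, the record layout, and the timestamps are hypothetical and not part of the NCLUE service. Timestamps are used only for ordering and are left out of the output.

```python
from collections import defaultdict

BEHAVIOR_SEPARATOR = "->"  # separator between behaviors, as defined in this guide


def build_sequence_lines(events):
    """Turn (user_id, timestamp, behavior) records into '{user_id}\\t{sequence}' lines.

    The record layout is an assumption made for this sketch.
    """
    per_user = defaultdict(list)
    for user_id, timestamp, behavior in events:
        per_user[user_id].append((timestamp, behavior))

    lines = []
    for user_id, items in per_user.items():
        # Oldest behavior first (far left), most recent last (far right).
        items.sort(key=lambda pair: pair[0])
        sequence = BEHAVIOR_SEPARATOR.join(behavior for _, behavior in items)
        lines.append(f"{user_id}\t{sequence}")
    return lines


# Hypothetical sample records; ISO 8601 strings sort chronologically.
events = [
    ("u730023", "2024-03-02T09:10:00", "Momos"),
    ("u730023", "2024-03-01T08:00:00", "Starbucks Pangyo Avenue France"),
    ("u730023", "2024-03-05T21:30:00", "AirPods"),
    ("u192873", "2024-03-03T12:00:00", "Nike Sale"),
]
lines = build_sequence_lines(events)
```

Each element of `lines` can then be written as one row of the tab-delimited file.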
      

Sequence dataset composition

A description of the composition of sequence datasets is as follows:

  • User ID ({user_id})

    • User IDs must be unique.
    • You may enter numbers, strings, or a combination of numbers and strings.
    • The maximum permissible length is 100 characters.
    • Identifiers that contain personal information (such as resident registration number, passport number, driver's license number, credit card number, mobile phone number, or email address) cannot be used as user IDs.
    • Rather than reusing the user IDs registered in the client system, we recommend that you create different user IDs to use the NCLUE service.
  • Behavior ({behavior})

    • "Behavior" refers to distinct actions taken by users when interacting with the client's services and products.
    • Examples of behaviors include searched keywords, viewed service names, purchased product names, and so on.
    • You may enter any string (a word, phrase, clause, or sentence) that represents a behavior.
    • It is recommended to use strings similar to behaviors that might occur within the NAVER service, such as NAVER search keywords, shopping product names, and business names.
    • Include spaces when entering {behavior} values to make strings easy to read.
    • {behavior} may include spaces or special characters.
  • Sequence ({sequence})

    • Only the order of the behaviors in a sequence is taken into account. List each user's behavior history in chronological order, leaving out the time information and separating behaviors with '->'. The leftmost behavior is the oldest, and the rightmost behavior is the most recent.
    • The maximum length of a sequence is 2048 tokens in HyperCLOVA, the software used internally. Depending on the length of the behavior strings, this corresponds to roughly 150 to 500 behaviors per sequence. If the maximum length is exceeded, the part of the sequence beyond the limit is not used as input.

Note

User information entered into sequence datasets is used only to distinguish users within the NCLUE service and is not used as input for feature creation or task model training.
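
Some of the composition rules above can be checked before uploading. The following is a minimal sketch, assuming the dataset has been read into a list of lines. The function `check_sequence_lines` and the email pattern are hypothetical. The pattern is a rough heuristic that only catches IDs shaped like an email address; it does not detect the other kinds of personal information listed above, and the 2048-token limit is not checked because it depends on the service's internal tokenizer.

```python
import re

MAX_USER_ID_LENGTH = 100  # limit stated in this guide
# Heuristic only: flags IDs that look like an email address.
EMAIL_LIKE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_sequence_lines(lines):
    """Return a list of (line_number, problem) tuples for a sequence dataset."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        user_id, tab, sequence = line.rstrip("\n").partition("\t")
        if not tab:
            problems.append((number, "missing tab separator"))
            continue
        if not user_id or len(user_id) > MAX_USER_ID_LENGTH:
            problems.append((number, "user ID empty or longer than 100 characters"))
        if user_id in seen_ids:
            problems.append((number, "duplicate user ID"))
        seen_ids.add(user_id)
        if EMAIL_LIKE.match(user_id):
            problems.append((number, "user ID looks like an email address"))
        if any(not behavior.strip() for behavior in sequence.split("->")):
            problems.append((number, "empty behavior in sequence"))
    return problems


# Hypothetical sample rows, most of them deliberately invalid.
sample = [
    "u730023\tStarbucks->Momos",
    "u730023\tAirPods",
    "a@b.com\tNike Sale",
    "u1 no tab",
    "u2\tA->->B",
]
problems = check_sequence_lines(sample)
```

An empty `problems` list means none of these particular checks failed; it does not guarantee that the service will accept the file.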

Correct answer datasets

A description of the format and composition of correct answer datasets is as follows.

Correct answer dataset format

The format of correct answer datasets, which are required for task creation, is as follows:
Prepare the dataset as a tab-delimited CSV file.

  • Data format

    # separator tab
    {user_id}\t{label}
    
  • Final data format

    # format : {user_id}\t{label}
    
    • Example:
      u192873 0
      u730023 1
      u239376 0
      u846712 1
      u558145 1
      

Correct answer dataset composition

A description of the composition of correct answer datasets is as follows:

  • {label} is represented as 1 (correct answer) or 0 (incorrect answer).
  • A correct answer dataset should contain at least 200 entries in total, with at least 100 entries labeled 0 and at least 100 labeled 1. The more data you provide, the better the performance.
  • Correct answer set data can be composed by selecting some of the user IDs from the sequence dataset used for feature creation.
    • Example: If 2,000 out of 1 million users are the correct answers for a specific Task A, compose the correct answer dataset by labeling those 2,000 users as 1 and the rest as 0.
  • Correct answer datasets can be composed according to the purposes of the task model.
    • Example: If we consider a task model that seeks to predict which users will buy a product called M among a company's products, the correct answer set data can be composed by labeling the users among the company's customers who bought M with a 1 and the users who did not buy M with a 0.
  • User IDs ({user_id}) cannot contain personal information (such as resident registration number, passport number, driver's license number, credit card number, mobile phone number, or email address).
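
The labeling rules above can be sketched in code. The following is a minimal example, assuming you already have the list of user IDs from the sequence dataset and the set of users who match the task (for example, users who bought product M). The helper names `build_label_lines` and `count_labels` and the sample IDs are hypothetical.

```python
MIN_ENTRIES_PER_LABEL = 100  # minimum per label stated in this guide


def build_label_lines(sequence_user_ids, positive_user_ids):
    """Label users in the positive set as 1 and all other users as 0."""
    positives = set(positive_user_ids)
    unknown = positives - set(sequence_user_ids)
    if unknown:
        # Labeled users are expected to come from the sequence dataset.
        raise ValueError(f"{len(unknown)} labeled IDs are not in the sequence dataset")
    return [f"{uid}\t{1 if uid in positives else 0}" for uid in sequence_user_ids]


def count_labels(lines):
    """Count 0 and 1 labels, rejecting any other label value."""
    counts = {"0": 0, "1": 0}
    for line in lines:
        _, _, label = line.rstrip("\n").partition("\t")
        if label not in counts:
            raise ValueError(f"label must be 0 or 1, got {label!r}")
        counts[label] += 1
    return counts


# Hypothetical data: 1,000 users, of whom the first 150 bought product M.
sequence_user_ids = [f"u{n:06d}" for n in range(1000)]
buyers = sequence_user_ids[:150]

label_lines = build_label_lines(sequence_user_ids, buyers)
counts = count_labels(label_lines)
enough_data = all(c >= MIN_ENTRIES_PER_LABEL for c in counts.values())
```

Checking `enough_data` before upload is a quick way to catch a dataset that falls short of 100 entries for either label.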