Detailed description of provided data

    Available in Classic and VPC

    Cloud Data Box provides user behavior data such as NAVER search and shopping data, image and annotation data for AI training, and object name data from news articles for NLP analysis.

    Sample search and shopping data is provided by default when you create a data box. The sample data is uploaded to Cloud Hadoop HDFS and is also available on a NAS mounted read-only on the Ncloud TensorFlow Server. After configuring the analysis environment, request the supply of the full data. With the default option, search, shopping, and AI data (for the first half of the previous year) is provided on a NAS mounted read-only on Cloud Hadoop and the Ncloud TensorFlow Server. If you subscribe to the Insight option, data from the first half of the year before last through last month is provided. Additional data (for the previous month) is provided on the 15th of each month.

    This article describes the data provided by Cloud Data Box in detail.

    Note
    • Search data and shopping data are extracted from the data of logged-in users.
    • To download sample NAVER default data, click sample.xlsx.
    • To download sample NAVER Pro#1 data, click sample_pro#1.xlsx.
    • To download sample NAVER Pro#2 data, click sample_pro#2.xlsx.
    • To download the NAVER Pro#2 data specifications, click Specifications_pro#2.xlsx.
    • ${DATABOX_HOME_DIR}: the home directory. It is /mnt on the Hadoop cluster NAS and /home/ncp/workspace on the Ncloud TensorFlow Server. For sample data in the Hadoop cluster's HDFS storage, it is /user/ncp.
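
    The home directory rules in the note above can be sketched as a small lookup that expands ${DATABOX_HOME_DIR} for each environment. This is only an illustration: the dictionary keys and the function name resolve_path are hypothetical labels chosen for this sketch, not identifiers used by Cloud Data Box.

```python
from posixpath import join

# Home directories as listed in the note. The keys are hypothetical
# labels made up for this sketch, not product identifiers.
DATABOX_HOME_DIRS = {
    "hadoop_nas": "/mnt",
    "tensorflow_server": "/home/ncp/workspace",
    "hadoop_hdfs_sample": "/user/ncp",
}


def resolve_path(environment: str, relative_path: str) -> str:
    """Expand ${DATABOX_HOME_DIR}/<relative_path> for one environment."""
    home = DATABOX_HOME_DIRS[environment]
    # Strip a leading slash so the relative part never replaces the home dir.
    return join(home, relative_path.lstrip("/"))


print(resolve_path("hadoop_nas", "sample/search/click"))
print(resolve_path("tensorflow_server", "ai/nlp"))
```

    The same relative path (for example, sample/search/click) can then be reused unchanged across the environments.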

    Standard data and Insight option data

    Search data (search)

    1. Search click data (search/click)

    Items: Description
    Data introduction: Data showing which keywords NAVER users searched for and which areas they clicked on
    (for search terms searched by more than 100 users per day)
    Data provision period:
    - Standard data: first half of the previous year (January to June), updated at the beginning of every year
    - Insight data: from January of two years ago to the previous month, updated monthly (for example, January 2019 to July 2021 is provided in August 2021)
    (unit: daily)
    Extraction targets: Logged-in NAVER users with a click history during the set period
    Data aggregation criteria:
    - date (base date): search/click date
    - device: device the user searched on (mobile/pc)
    - gender: user's gender (f/m)
    - age (age range): bands of about 5 years (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
    - loc1 (region 1): metropolitan city/province based on address
    - loc2 (region 2): city/county/district based on address
    - keyword (search term): keyword entered by the user in the NAVER integrated search area (converted to lowercase, spaces removed)
    - area (click area): service area that the user clicked on in NAVER's integrated search results
    - count (number of users who clicked): aggregated by base date, device, gender, age group, region, search term, and click area
    Sample data location: ${DATABOX_HOME_DIR}/sample/search/click
    Entire data location: For data from the first half of 2020: ${DATABOX_HOME_DIR}/search20y1h/search/click
    Directory structure: (image) clouddatabox-data_searchclick.png
    Note

    For detailed descriptions and examples of the click areas in the search click data, see the Search click area example file. The file is an example intended to aid understanding and may differ from the actual service.
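
    When joining your own data to the search click data, your values need to follow the same conventions as the keyword and age fields described above. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and "remove spaces" is interpreted here as removing all whitespace, which the source does not spell out.

```python
# Age bands exactly as listed in the aggregation criteria.
AGE_BANDS = [
    (0, 12, "-12"), (13, 18, "13-18"), (19, 24, "19-24"), (25, 29, "25-29"),
    (30, 34, "30-34"), (35, 39, "35-39"), (40, 44, "40-44"), (45, 49, "45-49"),
    (50, 54, "50-54"), (55, 59, "55-59"), (60, 64, "60-64"), (65, 69, "65-69"),
]


def age_band(age: int) -> str:
    """Map an age in years to the band label used in the age field."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return "70-"  # open-ended top band


def normalize_keyword(keyword: str) -> str:
    """Lowercase the keyword and remove whitespace (assumed interpretation)."""
    return "".join(keyword.lower().split())


print(normalize_keyword("Cloud Data Box"), age_band(31))
```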

    2. Search click co-occurrence (search/click_cooccurrence)

    Items: Description
    Data introduction: Data showing which keywords NAVER users searched for and clicked on together within one day
    (for search terms searched by more than 100 users per week)
    Data provision period:
    - Standard data: first half of the previous year (January to June), updated at the beginning of every year
    - Insight data: from January of two years ago to the previous month, updated monthly
    (unit: daily)
    Extraction targets: Logged-in NAVER users with a click history during the set period
    Data aggregation criteria:
    - week (base date): search/click date
    - device: device the user searched on (mobile/pc)
    - gender: user's gender (f/m)
    - age (age range): bands of about 5 years (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
    - loc1 (region 1): metropolitan city/province based on address
    - loc2 (region 2): city/county/district based on address
    - keyword1 (search term 1): keyword entered by the user in the NAVER integrated search area (converted to lowercase, spaces removed)
    - area1 (click area 1): service area that the user clicked on in the NAVER integrated search results for search term 1
    - keyword2 (search term 2): keyword entered by the user in the NAVER integrated search area (converted to lowercase, spaces removed)
    - area2 (click area 2): service area that the user clicked on in the NAVER integrated search results for search term 2
    - count (number of users who clicked both): counted by base date, device, gender, age group, region, search term 1, click area 1, search term 2, and click area 2 (the order of search terms 1 and 2 is not considered)
    Sample data location: ${DATABOX_HOME_DIR}/sample/search/click_cooccurrence
    Entire data location: For data from the first half of 2020: ${DATABOX_HOME_DIR}/search20y1h/search/click_cooccurrence
    Directory structure: (image) clouddatabox-data_searchclickco.png
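
    Because the order of search terms 1 and 2 is not considered, a lookup for the pair (A, B) should also match a row stored as (B, A). One way to handle this in your own analysis is to build an order-independent key, as in the sketch below. The function name pair_key and the sorting rule are assumptions made for this example; the source does not describe how the two items are ordered within a row.

```python
from collections import Counter


def pair_key(keyword1: str, area1: str, keyword2: str, area2: str):
    """Return the same key regardless of which item is listed first."""
    first, second = sorted([(keyword1, area1), (keyword2, area2)])
    return (first, second)


# Hypothetical rows: the same pair appears once in each order.
totals = Counter()
totals[pair_key("umbrella", "shopping", "raincoat", "blog")] += 3
totals[pair_key("raincoat", "blog", "umbrella", "shopping")] += 2

print(totals[pair_key("umbrella", "shopping", "raincoat", "blog")])
```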

    3. Search access location (search/click_location)

    Items: Description
    Data introduction: Data showing in which regions NAVER users searched for keywords and clicked on results
    (the access region is estimated from the IP address, so it may be inaccurate)
    Data provision period:
    - Standard data: first half of the previous year (January to June), updated at the beginning of every year
    - Insight data: from January of two years ago to the previous month, updated monthly
    (unit: daily)
    Extraction targets: Logged-in NAVER users with a click history during the set period
    Data aggregation criteria:
    - date (base date): search/click date
    - time (time band): 3-hour groups (00-02/03-05/06-08/09-11/12-14/15-17/18-20/21-23)
    - device: device the user searched on (mobile/pc)
    - gender: user's gender (f/m)
    - age (age range): bands of about 5 years (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
    - loc1 (region 1): access location (metropolitan city/province), estimated from the access IP
    - loc2 (region 2): access location (city/county/district), estimated from the access IP
    - keyword (search term): keyword entered by the user in the NAVER integrated search area (converted to lowercase, spaces removed). Keywords searched by different users during the day (only searches followed by a click are counted)
    - count (number of users who clicked): aggregated by base date, time band, device, gender, age group, region, and search term
    Sample data location: ${DATABOX_HOME_DIR}/sample/search/click_location
    Entire data location: For data from the first half of 2020: ${DATABOX_HOME_DIR}/search20y1h/search/click_location
    Directory structure: (image) clouddatabox-data_searchclicklocation_ko
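
    The 3-hour grouping used by the time field can be reproduced from an hour of day, which is useful when aligning your own timestamps with this data. A minimal sketch (the function name time_band is hypothetical):

```python
def time_band(hour: int) -> str:
    """Map an hour of day (0-23) to its 3-hour band label, e.g. 10 -> '09-11'."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    start = (hour // 3) * 3  # first hour of the band
    return f"{start:02d}-{start + 2:02d}"


print([time_band(h) for h in (0, 5, 10, 23)])
```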

    Shopping data (shopping)

    1. Product click data (shopping/click)

    Items: Description
    Data introduction: Data showing which keywords NAVER users searched for and which product categories they clicked on
    (for search terms/product categories with more than 100 clicks per day)
    Data provision period:
    - Standard data: first half of the previous year (January to June), updated at the beginning of every year
    - Insight data: from January of two years ago to the previous month, updated monthly
    (unit: daily)
    Extraction targets: Logged-in NAVER users with a product click history during the set period
    Data aggregation criteria:
    - date (base date): search/click date
    - device: device the user searched on (mobile/pc)
    - gender: user's gender (f/m)
    - age (age range): bands of about 5 years (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
    - loc1 (region 1): metropolitan city/province based on address
    - loc2 (region 2): city/county/district based on address
    - keyword (inbound keyword): keyword that the user searched for before clicking on the product
    - cat (product category): large/medium/small/detailed category of the product clicked by the user
    - count (number of visitors): counted by base date, device, gender, age group, region, product category, and search term
    - brand (brand name): estimated brand name of the product clicked by the user
    - item (product name): estimated product name of the product clicked by the user
    Sample data location: ${DATABOX_HOME_DIR}/sample/shopping/click
    Entire data location: For data from the first half of 2020: ${DATABOX_HOME_DIR}/shopping20y1h/shopping/click
    Directory structure: (image) clouddatabox-data_shoppingclick.png
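
    The "Data provision period" rules for standard and Insight data can be worked through for a given date. The sketch below encodes the rules as stated in this article (first half of the previous year; January of two years ago through the previous month). The function names are hypothetical, and the sketch ignores the fact that the previous month's data is delivered on the 15th, so early in a month the last month may not be available yet.

```python
from datetime import date


def standard_period(today: date):
    """First half (January 1 to June 30) of the previous year."""
    year = today.year - 1
    return date(year, 1, 1), date(year, 6, 30)


def insight_period(today: date):
    """(year, month) of the first and last month covered by Insight data."""
    start = (today.year - 2, 1)
    if today.month == 1:
        end = (today.year - 1, 12)  # previous month falls in the prior year
    else:
        end = (today.year, today.month - 1)
    return start, end


# Matches the example given for search click data:
# in August 2021, January 2019 to July 2021 is provided.
print(insight_period(date(2021, 8, 20)))
print(standard_period(date(2021, 8, 20)))
```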

    2. Product purchase data (shopping/purchase)

    Items: Description
    Data introduction: Data showing which products logged-in NAVER users purchased
    (for product categories with 10 or more purchases per day)
    Data provision period:
    - Standard data: first half of the previous year (January to June), updated at the beginning of every year
    - Insight data: from January of two years ago to the previous month, updated monthly
    (unit: daily)
    Extraction targets: NAVER users with a purchase history during the set period
    Data aggregation criteria:
    - date (base date): purchase date
    - device: device the user purchased on (mobile/pc)
    - gender: user's gender (f/m)
    - age (age range): bands of about 5 years (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
    - loc1 (region 1): metropolitan city/province based on address
    - loc2 (region 2): city/county/district based on address
    - cat (product category): large/medium/small/detailed category of the product purchased by the user
    - count (number of purchasers): counted by base date, device, gender, age group, region, and product category
    - brand (brand name): estimated brand name of the product purchased by the user
    - item (product name): estimated product name of the product purchased by the user
    Sample data location: ${DATABOX_HOME_DIR}/sample/shopping/purchase
    Entire data location: For data from the first half of 2020: ${DATABOX_HOME_DIR}/shopping20y1h/shopping/purchase
    Directory structure: (image) clouddatabox-data_shoppingpur_en.png

    3. Product click co-occurrence (shopping/click_cooccurrence)

    Items: Description
    Data introduction: Data showing which products NAVER users searched for and clicked on together during the day
    (for search terms/product categories with more than 100 clicks per day)
    Data provision period:
    - Standard data: first half of the previous year (January to June), updated at the beginning of every year
    - Insight data: from January of two years ago to the previous month, updated monthly
    (unit: daily)
    Extraction targets: Logged-in NAVER users with a product click history during the set period
    Data aggregation criteria:
    - week (base date): the start date of the extraction period
    - device: device the user searched on (mobile/pc)
    - gender: user's gender (f/m)
    - age (age range): bands of about 5 years (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
    - loc1 (region 1): metropolitan city/province based on address
    - loc2 (region 2): city/county/district based on address
    - keyword1 (inbound keyword 1): keyword that the user searched for before clicking on product category 1
    - cat1 (product category 1): large/medium/small/detailed category of the product clicked by the user
    - keyword2 (inbound keyword 2): keyword that the user searched for before clicking on product category 2
    - cat2 (product category 2): large/medium/small/detailed category of the product clicked by the user
    - count (number of users who visited both): counted by base date (weekly), device, gender, age group, region, search term 1, product category 1, search term 2, and product category 2 (click order is not considered)
    Sample data location: ${DATABOX_HOME_DIR}/sample/shopping/click_cooccurrence
    Entire data location: For data from the first half of 2020: ${DATABOX_HOME_DIR}/shopping20y1h/shopping/click_cooccurrence
    Directory structure: (image) clouddatabox-data_shoppingclickco_en.png

    4. Product purchase co-occurrence (shopping/purchase_cooccurrence)

    Items: Description
    Data introduction: Data showing which products NAVER users purchased together during the day
    (for product categories with 10 or more purchases per day)
    Data provision period:
    - Standard data: first half of the previous year (January to June), updated at the beginning of every year
    - Insight data: from January of two years ago to the previous month, updated monthly
    (unit: daily)
    Extraction targets: NAVER users with a purchase history during the set period
    Data aggregation criteria:
    - week (base date): the start date of the extraction period
    - device: device the user purchased on (mobile/pc)
    - gender: user's gender (f/m)
    - age (age range): bands of about 5 years (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
    - loc1 (region 1): metropolitan city/province based on address
    - loc2 (region 2): city/county/district based on address
    - cat1 (product category 1): large/medium/small/detailed category of the product purchased by the user
    - cat2 (product category 2): large/medium/small/detailed category of the product purchased by the user
    - count (number of users who purchased both): counted by base date (weekly), device, gender, age group, region, product category 1, and product category 2
    Sample data location: ${DATABOX_HOME_DIR}/sample/shopping/purchase_cooccurrence
    Entire data location: For data from the first half of 2020: ${DATABOX_HOME_DIR}/shopping20y1h/shopping/purchase_cooccurrence
    Directory structure: (image) clouddatabox-data_shoppingpurco_en.png

    AI data (ai)

    1. Recycling image (ai/clova/iitp_waste_images)

    Items: Description
    Data introduction: Collected and labeled images of recyclable waste
    Data use examples: Developing technology that solves image classification problems using data on industrial products in real-life environments
    Data details:
    - Data format: 3,000 JPEG images (1280×720, 720×1280), iitp_waste_images_3000_result.csv
    - Labeling information
    1: General garbage (others)
    2: Paper
    3: Cans and scrap metal
    4: Glass bottles
    5: Plastics (including PET)
    6: Vinyl
    7: Styrofoam
    8: Food
    Data sample (examples): (image) clouddatabox-data_recycleimage_ko
    Entire data location: ${DATABOX_HOME_DIR}/ai/clova/iitp_waste_images
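
    When training a classifier on these images, the numeric labels usually need to be mapped back to readable class names. The mapping below copies the labeling information listed above. The column layout of iitp_waste_images_3000_result.csv is not described in this article, so this sketch deliberately stops at the label lookup, and the function name label_name is hypothetical.

```python
# Label IDs and names as listed in the labeling information.
WASTE_LABELS = {
    1: "General garbage (others)",
    2: "Paper",
    3: "Cans and scrap metal",
    4: "Glass bottles",
    5: "Plastics (including PET)",
    6: "Vinyl",
    7: "Styrofoam",
    8: "Food",
}


def label_name(label_id) -> str:
    """Return the class name for a label ID given as an int or a numeric string."""
    return WASTE_LABELS[int(label_id)]


print(label_name("5"))
```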

    2. Food image (ai/clova/naver_food_fixed)

    Items: Description
    Data introduction: Data in which food regions in each image are tagged with bounding boxes
    Data use examples: Extracting essential elements from images, and training and developing AI to solve the problem
    Data details: Collected images
    - Number of data items: 2,042 images
    - Data format: JPEG images, each with a JSON file
    Data sample (examples): (image) clouddatabox-data_foodimage_ko
    Entire data location: ${DATABOX_HOME_DIR}/ai/clova/naver_food_fixed

    3. Restaurant image (ai/clova/externalImageOCR)

    Items: Description
    Data introduction: Korean text OCR annotation data (including English and numbers) for images of posts, billboards, menus, and restaurant signs
    Data use examples: Developing technology that extracts text from images and converts it to digital data
    Data details: Korean, English, and numeric annotation data for signage images collected in-house
    - Number of data items: 1,180 (signboard 197, restaurant sign 324, menu 614, standing_signboard 45)
    - Data format: JSON, JPEG (original image, result image)
    Data sample (JSON examples): (image) clouddatabox-data_storeimage_ko
    Entire data location: ${DATABOX_HOME_DIR}/ai/clova/externalImageOCR

    4. News data for NLP experiments (ai/nlp)

    Items: Description
    Data introduction: Data created by identifying object names (named entities) in news article content collected by the NAVER News service and linking them to the related Wikipedia pages
    Data aggregation criteria: Consists of news title, body, and category. The positions of the object names in the body text are marked with BIO tags, and IDs are attached.
    Data use examples: Developing technology that links object names in text to information about the corresponding objects (the Wikipedia page for each name), known as entity linking
    Data sample (examples): (image) clouddatabox-data_newsdata_ko
    Entire data location: ${DATABOX_HOME_DIR}/ai/nlp
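
    BIO tagging marks the first token of each object name with B, the following tokens of the same name with I, and all other tokens with O. The sketch below shows how such tags are typically decoded into spans. It is a generic illustration only: the actual file layout, tokenization, and ID format of the ai/nlp data are not described in this article, and the token list, tag list, and function name here are made up for the example.

```python
def bio_to_spans(tokens, tags):
    """Decode BIO tags into (text, start, end) spans; end is exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        kind = tag.split("-")[0]  # also accepts tags such as "B-ORG"
        if kind == "B":
            if start is not None:
                spans.append((start, i))  # close the previous span
            start = i
        elif kind == "I":
            if start is None:
                start = i  # tolerate an I tag without a preceding B
        else:  # "O"
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return [(" ".join(tokens[s:e]), s, e) for s, e in spans]


# Hypothetical tokens and tags, not taken from the actual data set.
tokens = ["NAVER", "Cloud", "opened", "in", "Seoul"]
tags = ["B", "I", "O", "O", "B"]
print(bio_to_spans(tokens, tags))
```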

    Insight Pro Option data

    NAVER Insight Pro Option data (pro#1: ID by subgroup)

    1. Search click data (search/click)

    Items: Description
    Data introduction:
    - Data showing which keywords NAVER users searched for and which areas they clicked on
    - Data anonymized by grouping users by gender/age group/region, rather than individual-level data
    (for search terms searched by more than 100 users per day)
    Data provision period: From January of two years ago to the previous week, updated weekly (as of February 16, 2022: January 2020 to January 2022)
    (unit: daily)
    Extraction targets: Logged-in NAVER users
    Data aggregation criteria:
    - date (base date): search/click date
    - hour (time band): 3-hour groups (00-02/03-05/06-08/09-11/12-14/15-17/18-20/21-23)
    - device: device the user searched on (mobile/pc)
    - gender: user's gender (f/m)
    - user (user group ID): anonymized user group ID
    - age (age range): bands of about 5 years (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
    - loc1 (region 1): metropolitan city/province based on address
    - loc2 (region 2): city/county/district based on address
    - keyword (search term): keyword entered by the user in the NAVER integrated search area (converted to lowercase, spaces removed)
    - area (click area): service area that the user clicked on in NAVER's integrated search results
    - count (number of clicks): counted by base date, time band, device, gender, user group ID, age group, region, search term, and click area
    Sample data location: ${DATABOX_HOME_DIR}/sample/pro_search/click
    Directory structure: (image) clouddatabox-data_prosearchclick

    2. Product click data (shopping/click)

    Items: Description
    Data introduction:
    - Data showing which keywords NAVER users searched for and which product categories they clicked on
    - Data anonymized by grouping users by gender/age group/region, rather than individual-level data
    (for search terms/product categories with more than 100 clicks per day)
    Data provision period: From January of two years ago to the previous week, updated weekly (as of February 16, 2022: January 2020 to January 2022)
    (unit: daily)
    Extraction targets: Logged-in NAVER users with a product click history during the set period
    Data aggregation criteria:
    - date (base date): search/click date
    - hour (time band): 3-hour groups (00-02/03-05/06-08/09-11/12-14/15-17/18-20/21-23)
    - device: device the user searched on (mobile/pc)
    - gender: user's gender (f/m)
    - user (user group ID): anonymized user group ID
    - age (age range): bands of about 5 years (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
    - loc1 (region 1): metropolitan city/province based on address
    - loc2 (region 2): city/county/district based on address
    - keyword (inbound keyword): keyword that the user searched for before clicking on the product
    - cat (product category): large/medium/small/detailed category of the product clicked by the user
    - count (number of clicks): counted by base date, time band, device, gender, user group ID, age group, region, search term, and product category
    - brand (brand name): estimated brand name of the product clicked by the user
    - item (product name): estimated product name of the product clicked by the user
    Sample data location: ${DATABOX_HOME_DIR}/sample/pro_shopping/click
    Entire data location: For data from the first half of 2020: ${DATABOX_HOME_DIR}/pro20y1h/shopping/click
    Directory structure: (image) clouddatabox-data_proshoppingclick

    3. Product purchase data (shopping/purchase)

    Items: Description
    Data introduction:
    - Data showing which products logged-in NAVER users purchased
    - Data anonymized by grouping users by gender/age group/region, rather than individual-level data
    (for product categories with 10 or more purchases per day)
    Data provision period: From January of two years ago to the previous month, updated weekly (January 2020 to January 2021 provided on February 16, 2022)
    (unit: daily)
    Extraction targets: NAVER users with a purchase history during the set period
    Data aggregation criteria:
    - date (base date): purchase date
    - hour (time band): 3-hour groups (00-02/03-05/06-08/09-11/12-14/15-17/18-20/21-23)
    - device: device the user purchased on (mobile/pc)
    - gender: user's gender (f/m)
    - user (user group ID): anonymized user group ID
    - age (age range): bands of about 5 years (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
    - loc1 (region 1): metropolitan city/province based on address
    - loc2 (region 2): city/county/district based on address
    - cat (product category): large/medium/small/detailed category of the product purchased by the user
    - count (number of purchases): counted by base date, time band, device, gender, user group ID, age group, region, and product category
    - brand (brand name): estimated brand name of the product purchased by the user
    - item (product name): estimated product name of the product purchased by the user
    Sample data location: ${DATABOX_HOME_DIR}/sample/pro_shopping/purchase
    Entire data location: For data from the first half of 2020: ${DATABOX_HOME_DIR}/pro20y1h/shopping/purchase
    Directory structure: (image) clouddatabox-data_proshoppingpur

    NAVER Insight Pro data (pro#2: combination of NAVER and NICE Information Service)

    1. Combination of NAVER and NICE Information Service

    Items: Description
    Data introduction: Anonymized information created by combining the pseudonymized data sets of NAVER and NICE Information Service
    - NAVER data: data on purchase intentions, interests, and so on, based on users' behavioral data on NAVER
    - NICE Information Service data: credit information of subjects aged 19 to 90 with a CB rating
    Data provision period: Updated quarterly with data from the quarter before last (as of August 18, 2022, 2022 Q1 data is provided)
    Extraction targets:
    - NAVER: logged-in users among NAVER members
    - NICE Information Service: subjects aged 19 to 90 with a CB rating
    Data aggregation criteria:
    - Personal information: information such as age, gender, and occupation code
    - Default information: information related to default (long-term overdue)
    - Short-term overdue information: information on short-term overdue (more than 30 days overdue for loan business)
    - Card opening/performance information: information related to card opening and performance
    - Loan opening/performance information: information related to loan opening and performance
    - Repayment capacity information (home): statistical information related to home address
    - Repayment capacity information (workplace): statistical information related to workplace address
    - Real estate ACM: information on property ownership
    - Model information: model information calculated by combining multiple pieces of credit information
    - Purchasing intention: information that identifies the user's purchase intention by using shopping clicks, orders, advertisement searches/clicks, etc.
    - Interests: information that identifies the user's interests through searches/clicks or visits to cafes/blogs
    Sample data location: ${DATABOX_HOME_DIR}/sample/nice
