Available in VPC
Cloud Data Box provides user behavior data such as NAVER search and shopping, image and annotation data for AI training, and object data for news articles and NLP analysis.
Sample search and shopping data is provided by default when you create a data box. The sample data is uploaded to Cloud Hadoop HDFS, and the NAS is mounted on the TensorFlow server as read-only. Request the data supply after configuring the analysis environment. With the default option, search, shopping, and AI data (data from the first half of the previous year) is then provided via Cloud Hadoop and the NAS mounted on the TensorFlow server as read-only. If you subscribe to Insight Option, you can receive data from January of 2 years back up to the previous month. The previous month's data is supplied on the 15th of each month.
This section describes the data provided by Cloud Data Box in detail.
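The provision periods can be made concrete with a little date arithmetic. The following is a minimal sketch and not part of Cloud Data Box: the function names are hypothetical, and the delivery day (the 15th) is ignored, so the functions only return the calendar range that the Standard and Insight rules describe for a given date.

```python
from datetime import date, timedelta
from typing import Tuple


def standard_period(today: date) -> Tuple[date, date]:
    """Standard data: January to June of the previous year."""
    year = today.year - 1
    return date(year, 1, 1), date(year, 6, 30)


def insight_period(today: date) -> Tuple[date, date]:
    """Insight data: January of 2 years back through the previous month."""
    start = date(today.year - 2, 1, 1)
    # Last day of the previous month = day before the 1st of this month.
    end = today.replace(day=1) - timedelta(days=1)
    return start, end


if __name__ == "__main__":
    print(standard_period(date(2021, 8, 10)))
    print(insight_period(date(2021, 8, 10)))
```

For a date in August 2021, `insight_period` covers January 2019 to July 2021, which matches the example given for Insight Data in the tables of this section.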
Note
- Search data and shopping data are extracted only from the data of logged-in users.
- To download sample NAVER default data, click sample.xlsx.
- To download sample NAVER Pro#1 data, click sample_pro#1.xlsx.
- To download sample NAVER Pro#2 data, click sample_pro#2.xlsx.
- To download the NAVER Pro#2 data specifications, click Specifications_pro#2.xlsx.
- ${DATABOX_HOME_DIR}: the home directory is /mnt on the Hadoop cluster NAS and /home/ncp/workspace on the TensorFlow server. For the sample data stored in the Hadoop cluster's HDFS, the home directory is /user/ncp.
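Because ${DATABOX_HOME_DIR} resolves differently in each environment, a small helper can keep the data locations in this section portable. This is a sketch only: the environment keys (`hadoop_nas`, `tensorflow`, `hdfs_sample`) and the function name are hypothetical, while the directory values are the ones stated in the note above.

```python
from pathlib import PurePosixPath

# Home directories as stated in the note; the dictionary keys are made up
# for this example.
DATABOX_HOME_DIRS = {
    "hadoop_nas": "/mnt",
    "tensorflow": "/home/ncp/workspace",
    "hdfs_sample": "/user/ncp",
}


def databox_path(environment: str, relative: str) -> str:
    """Join an environment's home directory with a documented relative path."""
    home = PurePosixPath(DATABOX_HOME_DIRS[environment])
    # Strip a leading slash so the relative part cannot replace the home dir.
    return str(home / relative.lstrip("/"))


if __name__ == "__main__":
    print(databox_path("tensorflow", "sample/search/click"))
    print(databox_path("hadoop_nas", "sample/search/click"))
```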
Standard data and Insight Option data
Search data (search)
1. Search click data (search/click)
| Item |
Description |
| Data introduction |
Data showing the keywords NAVER users searched for and the areas they clicked on (Targets search keywords searched by more than 100 users daily) |
| Data provision period |
- Standard Data: first half of the previous year (January to June), updated at the beginning of each year
- Insight Data: from January of 2 years back to the previous month of the current year, updated monthly (For example, in August 2021, subscribers would receive data from January 2019 to July 2021.)
(unit: day)
|
| Extraction targets |
Logged-in NAVER users with a click history during the set period |
| Data aggregation criteria |
- date (base date): search/click date
- device: device information used by the user for search (mobile/pc)
- gender: user's gender (f/m)
- age (age range): uses 5-year age groups (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
- loc1 (region 1): metropolitan city/province based on the user's address
- loc2 (region 2): city/county/district based on the user's address
- keyword (search keyword): keyword the user entered in the NAVER integrated search area (converted to lowercase, whitespaces removed)
- area (click area): service area that the user clicked on in NAVER's integrated search results
- count (number of clicked users): aggregated by base date, device, gender, age group, region, search keyword, and click area
|
| Sample data location |
${DATABOX_HOME_DIR}/sample/search/click |
| Entire data location |
For data from the first half of 2020: ${DATABOX_HOME_DIR}/search20y1h/search/click |
| Directory structure |
 |
Note
For detailed descriptions and examples of the search click data area, see Search click area example file. This document is an example intended to help improve the user's understanding. It may differ from the actual feature provided.
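When joining your own data to the search click data, your values need to be in the same form as the keyword and age columns. The sketch below mirrors the two rules stated in the aggregation criteria (keywords lowercased with whitespace removed, ages mapped to the listed groups). The function names are hypothetical, and the exact normalization NAVER applies (for example, to Unicode whitespace) is an assumption.

```python
import re

# Upper bound of each age group listed in the aggregation criteria.
_AGE_UPPER_BOUNDS = [
    (12, "-12"), (18, "13-18"), (24, "19-24"), (29, "25-29"),
    (34, "30-34"), (39, "35-39"), (44, "40-44"), (49, "45-49"),
    (54, "50-54"), (59, "55-59"), (64, "60-64"), (69, "65-69"),
]


def normalize_keyword(raw: str) -> str:
    """Lowercase the keyword and remove all whitespace."""
    return re.sub(r"\s+", "", raw).lower()


def age_group(age: int) -> str:
    """Map an age in years to the age-group label used in the data."""
    for upper, label in _AGE_UPPER_BOUNDS:
        if age <= upper:
            return label
    return "70-"


if __name__ == "__main__":
    print(normalize_keyword(" NAVER  Cloud "))
    print(age_group(27))
```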
2. Search click co-occurrence (search/click_cooccurrence)
| Item |
Description |
| Data introduction |
Data showing the information NAVER users searched for and clicked on together within a single day (Targets search keywords searched by more than 100 users weekly) |
| Data provision period |
- Standard Data: first half of the previous year (January to June), updated at the beginning of every year
- Insight Data: from January of 2 years back to the previous month of the current year, updated monthly
(unit: day)
|
| Extraction targets |
Logged-in NAVER users with a click history during the set period |
| Data aggregation criteria |
- week (base date): search/click date
- device: device information used by the user for search (mobile/pc)
- gender: user's gender (f/m)
- age (age range): uses 5-year age groups (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
- loc1 (region 1): metropolitan city/province based on the user's address
- loc2 (region 2): city/county/district based on the user's address
- keyword1 (search keyword1): keyword the user entered in the NAVER integrated search area (converted to lowercase, whitespaces removed)
- area1 (click area1): service area that the user clicked on in NAVER's integrated search results on keyword1
- keyword2 (search keyword2): keyword the user entered in the NAVER integrated search area (converted to lowercase, whitespaces removed)
- area2 (click area2): service area that the user clicked on in NAVER's integrated search results on keyword2
- count (number of users that clicked the data on keywords together): counted by base date, device, gender, age group, region, keyword1, area1, keyword2, and area2 (The search order between keyword1 and keyword2 is not considered.)
|
| Sample data location |
${DATABOX_HOME_DIR}/sample/search/click_cooccurrence |
| Entire data location |
For data from the first half of 2020: ${DATABOX_HOME_DIR}/search20y1h/search/click_cooccurrence |
| Directory structure |
 |
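The co-occurrence count can be pictured as counting, for each unordered pair of (keyword, area) clicks, how many users produced both. The following is an illustrative sketch of that idea, not the actual extraction pipeline: the input structure and user IDs are hypothetical, and details such as how repeated clicks are treated are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Tuple

Click = Tuple[str, str]  # (keyword, area)


def cooccurrence_counts(clicks_by_user: Dict[str, Iterable[Click]]) -> Counter:
    """Count users per unordered pair of distinct (keyword, area) clicks."""
    counts: Counter = Counter()
    for clicks in clicks_by_user.values():
        # De-duplicate and sort so that (a, b) and (b, a) map to one key,
        # reflecting that the search order is not considered.
        unique_clicks = sorted(set(clicks))
        for pair in combinations(unique_clicks, 2):
            counts[pair] += 1  # each user contributes at most once per pair
    return counts


if __name__ == "__main__":
    sample = {
        "u1": [("shoes", "shopping"), ("bag", "blog"), ("shoes", "shopping")],
        "u2": [("bag", "blog"), ("shoes", "shopping")],
    }
    print(cooccurrence_counts(sample))
```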
3. Search access location (search/click_location)
| Item |
Description |
| Data introduction |
Data showing the keywords NAVER users searched for by region (Because the access location is estimated from the IP address, it may be inaccurate.) |
| Data provision period |
- Standard Data: first half of the previous year (January to June), updated at the beginning of every year
- Insight Data: from January of 2 years back to the previous month of the current year (updated monthly)
(unit: day)
|
| Extraction targets |
Logged-in NAVER users with a click history during the set period |
| Data aggregation criteria |
- date (base date): search/click date
- time (time frame): uses 3-hour time windows (00-02/03-05/06-08/09-11/12-14/15-17/18-20/21-23)
- device: device information used by the user for search (mobile/pc)
- gender: user's gender (f/m)
- age (age range): uses 5-year age groups (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
- loc1 (region 1): access location (metropolitan city/province). Area tracked based on access IP
- loc2 (region 2): access location (city/county/district). Area tracked based on access IP
- keyword (search keyword): keyword the user entered in the NAVER integrated search area (converted to lowercase, whitespaces removed). Keywords searched by different users during the day (counted only if they clicked a link after the search)
- count (number of users that clicked links): aggregated by base date, time frame, device, gender, age group, region, and search keyword
|
| Sample data location |
${DATABOX_HOME_DIR}/sample/search/click_location |
| Entire data location |
For data from the first half of 2020: ${DATABOX_HOME_DIR}/search20y1h/search/click_location |
| Directory structure |
 |
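The time column uses fixed 3-hour windows. As a small sketch, an hour of day can be mapped to the corresponding label as follows (the function name is hypothetical; the labels are the ones listed in the aggregation criteria).

```python
def time_frame(hour: int) -> str:
    """Return the 3-hour window label (for example "09-11") for an hour 0-23."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be between 0 and 23, got {hour}")
    start = (hour // 3) * 3
    return f"{start:02d}-{start + 2:02d}"


if __name__ == "__main__":
    for h in (0, 5, 14, 23):
        print(h, time_frame(h))
```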
Shopping data (shopping)
1. Product click data (shopping/click)
| Item |
Description |
| Data introduction |
Data showing the keywords NAVER users searched for and the product categories they clicked on (Targets search keywords/product categories with more than 100 clicks daily) |
| Data provision period |
- Standard Data: first half of the previous year (January to June), updated at the beginning of every year
- Insight Data: from January of 2 years back to the previous month of the current year, updated monthly
(unit: day)
|
| Extraction targets |
Logged-in NAVER users with product click history during the set period |
| Data aggregation criteria |
- date (base date): search/click date
- device: device information used by the user for search (mobile/pc)
- gender: user's gender (f/m)
- age (age range): uses 5-year age groups (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
- loc1 (region 1): metropolitan city/province based on the user's address
- loc2 (region 2): city/county/district based on the user's address
- keyword (entry keyword): keyword that the user searched for before clicking on the product
- cat (product category): section/division/group/class of the product clicked by the user
- count (number of visitors): counted by date, device, gender, age group, region, product category, and search keyword
- brand (brand name): presumed name of the brand that the product the user clicked belongs to
- item (product name): presumed name of the product the user clicked
|
| Sample data location |
${DATABOX_HOME_DIR}/sample/shopping/click |
| Entire data location |
For data from the first half of 2020: ${DATABOX_HOME_DIR}/shopping20y1h/shopping/click |
| Directory structure |
 |
2. Product purchase data (shopping/purchase)
| Item |
Description |
| Data introduction |
Data showing the products logged-in NAVER users purchased (Targets product categories with 10 or more purchases daily) |
| Data provision period |
- Standard Data: first half of the previous year (January to June), updated at the beginning of every year
- Insight Data: from January of 2 years back to the previous month of the current year, updated monthly
(unit: day)
|
| Extraction targets |
NAVER users with a purchase history during the set period |
| Data aggregation criteria |
- date (base date): purchase date
- device: device information used by the user for search (mobile/pc)
- gender: user's gender (f/m)
- age (age range): uses 5-year age groups (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
- loc1 (region 1): metropolitan city/province based on the user's address
- loc2 (region 2): city/county/district based on the user's address
- cat (product category): section/division/group/class of the product purchased by the user
- count (number of buyers): counted by date, device, gender, age group, region, search keyword, and product category
- brand (brand name): presumed name of the brand that the product the user purchased belongs to
- item (product name): presumed name of the product the user purchased
|
| Sample data location |
${DATABOX_HOME_DIR}/sample/shopping/purchase |
| Entire data location |
For data from the first half of 2020: ${DATABOX_HOME_DIR}/shopping20y1h/shopping/purchase |
| Directory structure |
 |
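The click and purchase data sets use different inclusion thresholds, and the boundary differs too: click data targets items with more than 100 clicks daily, while purchase data targets categories with 10 or more purchases daily. The sketch below assumes a literal reading of that wording; the function names are hypothetical.

```python
def meets_click_threshold(daily_clicks: int) -> bool:
    """"More than 100 clicks daily": 100 itself is excluded."""
    return daily_clicks > 100


def meets_purchase_threshold(daily_purchases: int) -> bool:
    """"10 or more purchases daily": 10 itself is included."""
    return daily_purchases >= 10


if __name__ == "__main__":
    print(meets_click_threshold(100), meets_click_threshold(101))
    print(meets_purchase_threshold(9), meets_purchase_threshold(10))
```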
3. Product click co-occurrence (shopping/click_cooccurrence)
| Item |
Description |
| Data introduction |
Data showing the products NAVER users searched for and clicked on together during the day (Targets search keywords/product categories with more than 100 clicks daily) |
| Data provision period |
- Standard Data: first half of the previous year (January to June), updated at the beginning of every year
- Insight Data: from January of 2 years back to the previous month of the current year, updated monthly
(unit: day)
|
| Extraction targets |
Logged-in NAVER users with product click history during the set period |
| Data aggregation criteria |
- week (base date): the starting date of the extraction period
- device: device information used by the user for search (mobile/pc)
- gender: user's gender (f/m)
- age (age range): uses 5-year age groups (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
- loc1 (region 1): metropolitan city/province based on the user's address
- loc2 (region 2): city/county/district based on the user's address
- keyword1 (entry keyword 1): keyword that the user searched for before clicking on product category 1
- cat1 (product category 1): section/division/group/class of the product clicked by the user
- keyword2 (entry keyword 2): keyword that the user searched for before clicking on product category 2
- cat2 (product category 2): section/division/group/class of the product clicked by the user
- count (number of users that visited web pages on keywords together): counted by base date (weekly), device, gender, age group, region, keyword1, cat1, keyword2, and cat2 (The click order is not considered.)
|
| Sample data location |
${DATABOX_HOME_DIR}/sample/shopping/click_cooccurrence |
| Entire data location |
For data from the first half of 2020: ${DATABOX_HOME_DIR}/shopping20y1h/shopping/click_cooccurrence |
| Directory structure |
 |
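The week column holds the starting date of the weekly extraction period. This page does not state which weekday a period starts on, so the sketch below takes it as a parameter and defaults to Monday purely as an assumption. Check the sample data before relying on it.

```python
from datetime import date, timedelta


def week_start(day: date, first_weekday: int = 0) -> date:
    """Return the first day of the week containing `day`.

    first_weekday follows date.weekday(): 0 = Monday ... 6 = Sunday.
    The Monday default is an assumption, not documented behavior.
    """
    offset = (day.weekday() - first_weekday) % 7
    return day - timedelta(days=offset)


if __name__ == "__main__":
    print(week_start(date(2020, 1, 15)))                   # Monday-based week
    print(week_start(date(2020, 1, 15), first_weekday=6))  # Sunday-based week
```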
4. Product purchase co-occurrence (shopping/purchase_cooccurrence)
| Item |
Description |
| Data introduction |
Data showing the products NAVER users purchased together daily (Targets product categories with 10 or more purchases daily) |
| Data provision period |
- Standard Data: first half of the previous year (January to June), updated at the beginning of every year
- Insight Data: from January of 2 years back to the previous month of the current year, updated monthly
(unit: day)
|
| Extraction targets |
NAVER users with a purchase history during the set period |
| Data aggregation criteria |
- week (base date): the starting date of the extraction period
- device: device information used by the user for search (mobile/pc)
- gender: user's gender (f/m)
- age (age range): uses 5-year age groups (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
- loc1 (region 1): metropolitan city/province based on the user's address
- loc2 (region 2): city/county/district based on the user's address
- cat1 (product category 1): section/division/group/class of the product the user purchased
- cat2 (product category 2): section/division/group/class of the product the user purchased
- count (number of users that purchased products together): counted by base date (weekly), device, gender, age group, region, cat1, and cat2
|
| Sample data location |
${DATABOX_HOME_DIR}/sample/shopping/purchase_cooccurrence |
| Entire data location |
For data from the first half of 2020: ${DATABOX_HOME_DIR}/shopping20y1h/shopping/purchase_cooccurrence |
| Directory structure |
 |
AI data (ai)
1. Recyclable waste image (ai/clova/iitp_waste_images)
| Item |
Description |
| Data introduction |
Collected and labeled images of recyclable waste |
| Data use examples |
Developing technology that can be used to solve image classification problems using industrial product data in a real-life environment |
| Data details |
- Data format: 3,000 images in JPEG format (1280×720, 720×1280), iitp_waste_images_3000_result.csv
- Labeling information: 1: General waste (others), 2: Paper, 3: Cans and scrap metal, 4: Glass bottles, 5: Plastics (including PET), 6: Vinyl, 7: Styrofoam, 8: Food
|
| Data sample (examples) |
 |
| Entire data location |
${DATABOX_HOME_DIR}/ai/clova/iitp_waste_images |
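For classification experiments, the numeric labels listed in the data details can be kept in a lookup table. The sketch below encodes only that mapping and a simple class-distribution count. It makes no assumption about the column layout of iitp_waste_images_3000_result.csv, which is not described here, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

# Label IDs and names as listed in the data details.
WASTE_LABELS: Dict[int, str] = {
    1: "General waste (others)",
    2: "Paper",
    3: "Cans and scrap metal",
    4: "Glass bottles",
    5: "Plastics (including PET)",
    6: "Vinyl",
    7: "Styrofoam",
    8: "Food",
}


def label_distribution(label_ids: Iterable[int]) -> Dict[str, int]:
    """Count how many images fall under each label name."""
    counts = Counter(WASTE_LABELS[label_id] for label_id in label_ids)
    return dict(counts)


if __name__ == "__main__":
    print(label_distribution([2, 2, 5, 8]))
```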
2. Food image (ai/clova/naver_food_fixed)
| Item |
Description |
| Data introduction |
Data that tags food areas in the image with a bounding box |
| Data location |
${DATABOX_HOME_DIR}/ai/clova/naver_food_fixed |
| Data use examples |
Train an AI based on the necessary elements extracted from images to solve problems |
| Data details |
Collected images
- Number of images: 2,042
- Data format: JPEG image and its JSON file
|
| Data sample (examples) |
 |
| Entire data location |
${DATABOX_HOME_DIR}/ai/clova/naver_food_fixed |
3. Restaurant image (ai/clova/externalImageOCR)
| Item |
Description |
| Data introduction |
Korean text OCR annotation data (including Roman alphabets and numbers) in images of posts, billboards, menus, and restaurant signs |
| Data use examples |
Develop a technology that can extract text from images and convert it to digital data |
| Data details |
Korean, Roman alphabet, and numeric annotation data in collected images of signs
- Number of images: 1,180 (signboard 197, restaurant sign 324, menu 614, standing_signboard 45)
- Data format: JSON, JPEG (original image, result image)
|
| Data sample (JSON examples) |
 |
| Entire data location |
${DATABOX_HOME_DIR}/ai/clova/externalImageOCR |
4. News data for NLP experiments (ai/nlp)
| Item |
Description |
| Data introduction |
Data in which object names (entity names) in news articles collected from NAVER News Service are linked to the Wikipedia pages related to those objects |
| Data aggregation criteria |
Consists of news article's title, body, and category. The locations of the object names in the body text are indicated by BIO tags, and IDs are attached to the names. |
| Data use examples |
Develop a technology that links object names in text to information related to the objects, that is, the Wikipedia pages corresponding to the names (entity linking) |
| Data sample (examples) |
 |
| Entire data location |
${DATABOX_HOME_DIR}/ai/nlp |
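Since object name locations are marked with BIO tags, a common first step is to turn the tag sequence back into name spans. The sketch below shows the generic BIO decoding technique only. The tokenization, the exact tag strings ("B", "I", "O"), and how IDs are attached in the actual files are assumptions, so adapt it after inspecting the data.

```python
from typing import List, Sequence


def bio_to_spans(tokens: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Collect token runs that start with a B tag and continue with I tags."""
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B"):
            if current:  # a new name begins right after another one
                spans.append(" ".join(current))
            current = [token]
        elif tag.startswith("I") and current:
            current.append(token)
        else:  # an O tag, or a stray I tag with no open span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


if __name__ == "__main__":
    tokens = ["NAVER", "Cloud", "opened", "in", "Seoul"]
    tags = ["B", "I", "O", "O", "B"]
    print(bio_to_spans(tokens, tags))
```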
Insight Pro Option data
NAVER Insight Pro Option data (pro#1: ID by subgroup)
1. Search click data (search/click)
| Item |
Description |
| Data introduction |
- Data showing the keywords NAVER users searched for and the areas they clicked on
- Not individual-level data but anonymized group-level data (Users were grouped by gender/age group/region.)
(Targets search keywords searched by more than 100 users daily)
|
| Data provision period |
January of 2 years back to the previous week of the current year, updated weekly (For example, on February 16, 2022, subscribers would receive data from January 2020 to January 2022.) (unit: day) |
| Extraction targets |
Logged-in NAVER users |
| Data aggregation criteria |
- date (base date): search/click date
- time (time frame): uses 3-hour time windows (00-02/03-05/06-08/09-11/12-14/15-17/18-20/21-23)
- device: device information used by the user for search (mobile/pc)
- gender: user's gender (f/m)
- user (user group ID): anonymized user group ID
- age (age range): uses 5-year age groups (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
- loc1 (region 1): metropolitan city/province based on the user's address
- loc2 (region 2): city/county/district based on the user's address
- keyword (search keyword): keyword the user entered in the NAVER integrated search area (converted to lowercase, whitespaces removed)
- area (click area): service area that the user clicked on in NAVER's integrated search results
- count (number of clicks): aggregated by base date, time frame, device, gender, user group ID, age group, region, search keyword, and click area
|
| Sample data location |
${DATABOX_HOME_DIR}/sample/pro_search/click |
| Directory structure |
 |
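Because the Pro search click data is broken down by 3-hour time frame, a daily view requires summing over the time column. The following is a minimal sketch that assumes each record has been loaded as a dictionary keyed by the column names listed above (date, time, keyword, count). The loading step, the date string format, and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple


def daily_click_totals(rows: Iterable[Mapping]) -> Dict[Tuple[str, str], int]:
    """Sum click counts per (date, keyword) across all other columns."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for row in rows:
        totals[(row["date"], row["keyword"])] += int(row["count"])
    return dict(totals)


if __name__ == "__main__":
    rows = [
        {"date": "2022-01-03", "time": "09-11", "keyword": "weather", "count": 120},
        {"date": "2022-01-03", "time": "12-14", "keyword": "weather", "count": 80},
        {"date": "2022-01-04", "time": "09-11", "keyword": "weather", "count": 50},
    ]
    print(daily_click_totals(rows))
```

Summing is reasonable here because count in the Pro data is a number of clicks. In data sets where count is a number of users, adding rows together can count the same user more than once.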
2. Product click data (shopping/click)
| Item |
Description |
| Data introduction |
- Data showing the keywords NAVER users searched for and the product categories and products they clicked on
- Not individual-level data but anonymized group-level data (Users were grouped by gender/age group/region.)
(Targets search keywords/product categories with more than 100 clicks daily)
|
| Data provision period |
January of 2 years back to the previous week of the current year, updated weekly (For example, on February 16, 2022, subscribers would receive data from January 2020 to January 2022.) (unit: day) |
| Extraction targets |
Logged-in NAVER users with product click history during the set period |
| Data aggregation criteria |
- date (base date): search/click date
- time (time frame): uses 3-hour time windows (00-02/03-05/06-08/09-11/12-14/15-17/18-20/21-23)
- device: device information used by the user for search (mobile/pc)
- gender: user's gender (f/m)
- user (user group ID): anonymized user group ID
- age (age range): uses 5-year age groups (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
- loc1 (region 1): metropolitan city/province based on the user's address
- loc2 (region 2): city/county/district based on the user's address
- keyword (entry keyword): keyword that the user searched for before clicking on the product
- cat (product category): section/division/group/class of the product clicked by the user
- count (number of clicks): aggregated by base date, time frame, device, gender, user group ID, age group, region, search keyword, and product category
- brand (brand name): presumed name of the brand that the product the user clicked belongs to
- item (product name): presumed name of the product the user clicked
|
| Sample data location |
${DATABOX_HOME_DIR}/sample/pro_shopping/click |
| Entire data location |
For data from the first half of 2020: ${DATABOX_HOME_DIR}/pro20y1h/shopping/click |
| Directory structure |
 |
3. Product purchase data (shopping/purchase)
| Item |
Description |
| Data introduction |
- Data showing the products logged-in NAVER users purchased
- Not individual-level data but anonymized group-level data (Users were grouped by gender/age group/region.)
(Targets product categories with 10 or more purchases daily)
|
| Data provision period |
January of 2 years back to the previous month of the current year, updated monthly (For example, on February 16, 2022, subscribers would receive data from January 2020 to January 2022.) (unit: day) |
| Extraction targets |
NAVER users with a purchase history during the set period |
| Data aggregation criteria |
- date (base date): purchase date
- time (time frame): uses 3-hour time windows (00-02/03-05/06-08/09-11/12-14/15-17/18-20/21-23)
- device: device information used by the user to make the purchase (mobile/pc)
- gender: user's gender (f/m)
- user (user group ID): anonymized user group ID
- age (age range): uses 5-year age groups (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-)
- loc1 (region 1): metropolitan city/province based on the user's address
- loc2 (region 2): city/county/district based on the user's address
- cat (product category): section/division/group/class of the product purchased by the user
- count (number of purchases): counted by date, time frame, device, gender, age group, user group ID, region, search keyword, and product category
- brand (brand name): presumed name of the brand that the product the user purchased belongs to
- item (product name): presumed name of the product the user purchased
|
| Sample data location |
${DATABOX_HOME_DIR}/sample/pro_shopping/purchase |
| Entire data location |
For data from the first half of 2020: ${DATABOX_HOME_DIR}/pro20y1h/shopping/purchase |
| Directory structure |
 |
NAVER Insight Pro data (pro#2: combination of NAVER's data and NICE Information Service's data)
1. Combination of NAVER's data and NICE Information Service's data
| Item |
Description |
| Data introduction |
Anonymized information created by utilizing the combination of NAVER and NICE Information Service's information (pseudonymized data)
- NAVER's data: data on purchase intentions or interests based on NAVER users' behavioral data on NAVER
- NICE Information Service's data: credit information data of subjects aged 19 to 90 with CB rating
|
| Data provision period |
Quarterly data from the quarter before last, updated quarterly (For example, on August 18, 2022, subscribers would receive data of 2022 Q1 along with quarterly updates.) |
| Extraction targets |
- NAVER: logged-in users among NAVER members
- NICE Information Service: credit information data of subjects aged 19 to 90 with CB rating
|
| Data aggregation criteria |
- Personal information: information directly related to a person such as age/gender/occupation code
- Default information: information on one's unpaid debts (long-term overdue debts)
- Short-term overdue information: information on one's short-term overdue unpaid debts (including debts from lenders that are overdue for more than 30 days)
- Credit card issue/usage information: information on one's issued credit cards and their usage
- Loan application/usage information: information on loan application and loan usage
- Repayment ability information (home): statistical information related to one's registered home
- Repayment ability information (workplace): statistical information related to one's registered workplace
- Real estate ACM: information on one's property ownership
- Model information: information on a model that utilizes credit information in a complex manner to generate insights
- Purchase intention: one's purchase intention identified by leveraging their data including clicks, orders, advertisement searches, and advertisement clicks
- Interests: one's interest identified by leveraging their data including searches, clicks, visits to NAVER Cafes or NAVER Blogs
|
| Sample data location |
${DATABOX_HOME_DIR}/sample/nice |
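The pro#2 provision period refers to "the quarter before last". As a sketch of that rule, the function below (hypothetical name) returns the year and quarter two quarters before the one containing a given date, handling the year boundary.

```python
from datetime import date
from typing import Tuple


def quarter_before_last(today: date) -> Tuple[int, int]:
    """Return (year, quarter) two quarters before the current quarter."""
    current_quarter = (today.month - 1) // 3 + 1
    # Number quarters consecutively so subtraction crosses year boundaries.
    index = today.year * 4 + (current_quarter - 1) - 2
    return index // 4, index % 4 + 1


if __name__ == "__main__":
    print(quarter_before_last(date(2022, 8, 18)))
    print(quarter_before_last(date(2022, 2, 1)))
```

For August 18, 2022 (third quarter), this gives the first quarter of 2022, which agrees with the example in the data provision period row.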