Detailed description of provided data
Available in Classic and VPC
Cloud Data Box provides user behavior data from NAVER search and shopping, image and annotation data for AI training, and named-entity data from news articles for NLP analysis.
Search and shopping sample data are provided by default when you create a data box. The sample data is uploaded to Cloud Hadoop HDFS and is also available on a NAS mounted read-only on the Ncloud TensorFlow Server. After configuring the analysis environment, request the data to be supplied. With the default option, search, shopping, and AI data (for the first half of the previous year) are provided on a NAS mounted read-only on Cloud Hadoop and the Ncloud TensorFlow Server. If you subscribe to the Insight option, data from the first half of the year before last through last month is provided, and the previous month's data is added on the 15th of each month.
This page describes the data provided by Cloud Data Box in detail.
- Search data and shopping data are extracted from the data of logged-in users.
- To download sample NAVER default data, click sample.xlsx.
- To download sample NAVER Pro#1 data, click sample_pro#1.xlsx.
- To download sample NAVER Pro#2 data, click sample_pro#2.xlsx.
- To download the NAVER Pro#2 data specifications, click Specifications_pro#2.xlsx.
- ${DATABOX_HOME_DIR}: the home directory is /mnt on the Hadoop cluster NAS and /home/ncp/workspace on the Ncloud TensorFlow Server. For sample data stored in the Hadoop cluster HDFS, the home directory is /user/ncp.
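
As a minimal sketch of how the paths on this page expand, the helper below maps each environment to the home directory listed above. The function name and the environment keys (`hadoop_nas`, `tensorflow`, `hadoop_hdfs`) are hypothetical labels chosen for illustration; they are not part of Cloud Data Box.

```python
from posixpath import join

# Home directories as listed above. The dictionary keys are hypothetical labels.
DATABOX_HOME_DIRS = {
    "hadoop_nas": "/mnt",                 # NAS mounted on the Hadoop cluster
    "tensorflow": "/home/ncp/workspace",  # Ncloud TensorFlow Server
    "hadoop_hdfs": "/user/ncp",           # HDFS storage for sample data
}


def resolve_path(environment: str, relative_path: str) -> str:
    """Expand ${DATABOX_HOME_DIR}/<relative_path> for the given environment."""
    home = DATABOX_HOME_DIRS[environment]
    return join(home, relative_path.lstrip("/"))


print(resolve_path("tensorflow", "sample/search/click"))
```

An unknown environment key raises `KeyError`, which is usually preferable to silently building a wrong path.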
Standard data and Insight option data
Search data (search)
1. Search click data (search/click)
Items | Description |
---|---|
Data introduction | Data showing which keywords NAVER users searched, and which areas they clicked on (For search terms searched by more than 100 users per day) |
Data provision period | - Standard data: first half of the previous year (January to June), updated at the beginning of every year - Insight data: from January of 2 years ago to the previous month of the current year, updated monthly (for example, January 2019 to July 2021 is provided in August 2021) (unit: daily) |
Extraction targets | Logged-in NAVER users with a click history during the set period |
Data aggregation criteria | - date (base date): search/click date - device: device used by the user for the search (mobile/pc) - gender: user's gender (f/m) - age (age range): in 5-year units (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-) - loc1 (region 1): metropolitan city/province based on address - loc2 (region 2): city/county/district based on address - keyword (search term): keyword entered by the user in the NAVER integrated search area (converted to lowercase, with spaces removed) - area (click area): service area that the user clicked on in NAVER's integrated search results - count (number of clicked users): aggregated by base date, device, gender, age group, region, search term, and click area |
Sample data location | ${DATABOX_HOME_DIR}/sample/search/click |
Entire data location | For data from the first half of 2020: ${DATABOX_HOME_DIR}/search20y1h/search/click |
Directory structure |
For detailed descriptions and examples of the search click data area, see Search click area example file. The file is an example intended to aid understanding and may differ from the actual functionality provided.
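
The provision periods described above can be sketched as date arithmetic. This is only an illustration of the stated rules (first half of the previous year for Standard data, January of 2 years ago through the previous month for Insight data); the function names are hypothetical, and the sketch does not model the day of the month on which new data is actually released.

```python
from datetime import date


def standard_period(today: date) -> tuple[date, date]:
    """First half (January to June) of the previous year."""
    year = today.year - 1
    return date(year, 1, 1), date(year, 6, 30)


def insight_period(today: date) -> tuple[date, date]:
    """January of 2 years ago through the last day of the previous month."""
    start = date(today.year - 2, 1, 1)
    first_of_this_month = today.replace(day=1)
    # One day before the first of this month is the last day of the previous month.
    end = date.fromordinal(first_of_this_month.toordinal() - 1)
    return start, end


# Matches the example in the table: in August 2021, January 2019 to July 2021.
print(insight_period(date(2021, 8, 20)))
```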
2. Search click co-occurrence (search/click_cooccurrence)
Items | Description |
---|---|
Data introduction | Data showing what the NAVER users searched and clicked together in 1 day (For search terms searched by more than 100 users per week) |
Data provision period | - Standard data: first half of the previous year (January to June), updated at the beginning of every year - Insight data: from January of 2 years ago to the previous month of the current year, updated monthly (unit: daily) |
Extraction targets | Logged-in NAVER users with a click history during the set period |
Data aggregation criteria | - week (base date): search/click date - device: device used by the user for the search (mobile/pc) - gender: user's gender (f/m) - age (age range): in 5-year units (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-) - loc1 (region 1): metropolitan city/province based on address - loc2 (region 2): city/county/district based on address - keyword1 (search term 1): keyword entered by the user in the NAVER integrated search area (converted to lowercase, with spaces removed) - area1 (click area 1): service area that the user clicked on in the NAVER integrated search results for search term 1 - keyword2 (search term 2): keyword entered by the user in the NAVER integrated search area (converted to lowercase, with spaces removed) - area2 (click area 2): service area that the user clicked on in the NAVER integrated search results for search term 2 - count (number of users who clicked both): counted by base date, device, gender, age group, region, search term 1, click area 1, search term 2, and click area 2 (the order of search terms 1 and 2 is not considered) |
Sample data location | ${DATABOX_HOME_DIR}/sample/search/click_cooccurrence |
Entire data location | For data from the first half of 2020: ${DATABOX_HOME_DIR}/search20y1h/search/click_cooccurrence |
Directory structure |
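
To make the co-occurrence criteria above concrete, the following sketch normalizes keywords (lowercase, spaces removed) and counts users per unordered pair of (keyword, area) clicks. It is an illustration under stated assumptions, not NAVER's aggregation code: the input shape, the function names, and the area values `blog` and `web` are made up, and the segmentation by week, device, gender, age group, and region is omitted.

```python
from collections import Counter
from itertools import combinations


def normalize_keyword(keyword: str) -> str:
    """Lowercase the keyword and remove all whitespace."""
    return "".join(keyword.lower().split())


def cooccurrence_counts(user_clicks: dict[str, set[tuple[str, str]]]) -> Counter:
    """Count users per unordered pair of (keyword, area) clicks."""
    counts: Counter = Counter()
    for clicks in user_clicks.values():
        normalized = {(normalize_keyword(k), area) for k, area in clicks}
        # sorted() fixes a canonical order, so (A, B) and (B, A) share one key.
        for pair in combinations(sorted(normalized), 2):
            counts[pair] += 1
    return counts


# Hypothetical input: user ID -> set of (keyword, clicked area).
clicks = {
    "u1": {("Jeju Hotel", "blog"), ("jeju flight", "web")},
    "u2": {("jeju  flight", "web"), ("JejuHotel", "blog")},
}
print(cooccurrence_counts(clicks))
```

Because each user's clicks are held in a set after normalization, a user contributes at most once to any pair, which matches a count of users rather than a count of clicks.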
3. Search access location (search/click_location)
Items | Description |
---|---|
Data introduction | Data showing in which regions NAVER users searched for and clicked keywords (As the access area is measured based on IP, there may be errors in accuracy) |
Data provision period | - Standard data: first half of the previous year (January to June), updated at the beginning of every year - Insight data: from January of 2 years ago to the previous month of the current year, updated monthly (unit: daily) |
Extraction targets | Logged-in NAVER users with a click history during the set period |
Data aggregation criteria | - date (base date): search/click date - time (time slot): grouped in 3-hour slots (00-02/03-05/06-08/09-11/12-14/15-17/18-20/21-23) - device: device used by the user for the search (mobile/pc) - gender: user's gender (f/m) - age (age range): in 5-year units (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-) - loc1 (region 1): access location (metropolitan city/province), measured based on the access IP - loc2 (region 2): access location (city/county/district), measured based on the access IP - keyword (search term): keyword entered by the user in the NAVER integrated search area (converted to lowercase, with spaces removed). Keywords searched by different users during the day (counted only when a click followed the search) - count (number of clicked users): aggregated by base date, time slot, device, gender, age group, region, and search term |
Sample data location | ${DATABOX_HOME_DIR}/sample/search/click_location |
Entire data location | For data from the first half of 2020: ${DATABOX_HOME_DIR}/search20y1h/search/click_location |
Directory structure |
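
The age ranges and 3-hour time slots used in the tables above can be reproduced with a small bucketing sketch. The labels are the ones listed on this page; the function names are hypothetical and the code only illustrates how a raw age or hour would map to those labels.

```python
from bisect import bisect_right

# Lower bound of every age range after the first one.
AGE_BOUNDS = [13, 19, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
AGE_LABELS = ["-12", "13-18", "19-24", "25-29", "30-34", "35-39", "40-44",
              "45-49", "50-54", "55-59", "60-64", "65-69", "70-"]


def age_band(age: int) -> str:
    """Return the age range label that contains the given age."""
    return AGE_LABELS[bisect_right(AGE_BOUNDS, age)]


def time_slot(hour: int) -> str:
    """Return the 3-hour slot label (for example '09-11') for an hour 0-23."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    start = hour // 3 * 3
    return f"{start:02d}-{start + 2:02d}"


print(age_band(27), time_slot(10))
```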
Shopping data (shopping)
1. Product click data (shopping/click)
Items | Description |
---|---|
Data introduction | Data showing which keywords NAVER users searched and which category of products they clicked on (For search terms/product categories with more than 100 clicks per day) |
Data provision period | - Standard data: first half of the previous year (January to June), updated at the beginning of every year - Insight data: from January of 2 years ago to the previous month of the current year, updated monthly (unit: daily) |
Extraction targets | Logged-in NAVER users with a product click history during the set period |
Data aggregation criteria | - date (base date): search/click date - device: device used by the user for the search (mobile/pc) - gender: user's gender (f/m) - age (age range): in 5-year units (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-) - loc1 (region 1): metropolitan city/province based on address - loc2 (region 2): city/county/district based on address - keyword (inbound keyword): keyword that the user searched for before clicking on the product - cat (product category): large/medium/small/detailed category of the product clicked by the user - count (number of visitors): counted by base date, device, gender, age group, region, product category, and search term - brand (brand name): estimated brand name of the product clicked by the user - item (product name): estimated product name of the product clicked by the user |
Sample data location | ${DATABOX_HOME_DIR}/sample/shopping/click |
Entire data location | For data from the first half of 2020: ${DATABOX_HOME_DIR}/shopping20y1h/shopping/click |
Directory structure |
2. Product purchase data (shopping/purchase)
Items | Description |
---|---|
Data introduction | Data showing which products logged-in NAVER users purchased (For product categories with 10 or more purchases per day) |
Data provision period | - Standard data: first half of the previous year (January to June), updated at the beginning of every year - Insight data: from January of 2 years ago to the previous month of the current year, updated monthly (unit: daily) |
Extraction targets | NAVER users with a purchase history during the set period |
Data aggregation criteria | - date (base date): purchase date - device: device used by the user for the purchase (mobile/pc) - gender: user's gender (f/m) - age (age range): in 5-year units (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-) - loc1 (region 1): metropolitan city/province based on address - loc2 (region 2): city/county/district based on address - cat (product category): large/medium/small/detailed category of the product purchased by the user - count (number of purchasers): counted by base date, device, gender, age group, region, search term, and product category - brand (brand name): estimated brand name of the product purchased by the user - item (product name): estimated product name of the product purchased by the user |
Sample data location | ${DATABOX_HOME_DIR}/sample/shopping/purchase |
Entire data location | For data from the first half of 2020: ${DATABOX_HOME_DIR}/shopping20y1h/shopping/purchase |
Directory structure |
3. Product click co-occurrence (shopping/click_cooccurrence)
Items | Description |
---|---|
Data introduction | Data showing which products NAVER users searched for and clicked on during the day (For search terms/product categories with more than 100 clicks per day) |
Data provision period | - Standard data: first half of the previous year (January to June), updated at the beginning of every year - Insight data: from January of 2 years ago to the previous month of the current year, updated monthly (unit: daily) |
Extraction targets | Logged-in NAVER users with a product click history during the set period |
Data aggregation criteria | - week (base date): the start date of the extraction period - device: device used by the user for the search (mobile/pc) - gender: user's gender (f/m) - age (age range): in 5-year units (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-) - loc1 (region 1): metropolitan city/province based on address - loc2 (region 2): city/county/district based on address - keyword1 (inbound keyword 1): keyword that the user searched for before clicking on product category 1 - cat1 (product category 1): large/medium/small/detailed category of the product clicked by the user - keyword2 (inbound keyword 2): keyword that the user searched for before clicking on product category 2 - cat2 (product category 2): large/medium/small/detailed category of the product clicked by the user - count (number of users who visited both): counted by base date (weekly), device, gender, age group, region, search term 1, product category 1, search term 2, and product category 2 (click order is not considered) |
Sample data location | ${DATABOX_HOME_DIR}/sample/shopping/click_cooccurrence |
Entire data location | For data from the first half of 2020: ${DATABOX_HOME_DIR}/shopping20y1h/shopping/click_cooccurrence |
Directory structure |
4. Product purchase co-occurrence (shopping/purchase_cooccurrence)
Items | Description |
---|---|
Data introduction | Data showing what products NAVER users purchased together during the day (For product categories with 10 or more purchases per day) |
Data provision period | - Standard data: first half of the previous year (January to June), updated at the beginning of every year - Insight data: from January of 2 years ago to the previous month of the current year, updated monthly (unit: daily) |
Extraction targets | NAVER users with a purchase history during the set period |
Data aggregation criteria | - week (base date): the start date of the extraction period - device: device used by the user for the purchase (mobile/pc) - gender: user's gender (f/m) - age (age range): in 5-year units (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-) - loc1 (region 1): metropolitan city/province based on address - loc2 (region 2): city/county/district based on address - cat1 (product category 1): large/medium/small/detailed category of the product purchased by the user - cat2 (product category 2): large/medium/small/detailed category of the product purchased by the user - count (number of users who purchased both): counted by base date (weekly), device, gender, age group, region, product category 1, and product category 2 |
Sample data location | ${DATABOX_HOME_DIR}/sample/shopping/purchase_cooccurrence |
Entire data location | For data from the first half of 2020: ${DATABOX_HOME_DIR}/shopping20y1h/shopping/purchase_cooccurrence |
Directory structure |
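
The inclusion thresholds quoted in the shopping tables above differ in whether the boundary value itself is included. The sketch below reads the wording literally ("more than 100" as strictly greater than 100, "10 or more" as at least 10); that literal reading is an assumption, and the dictionary and function names are hypothetical.

```python
# (threshold, inclusive) per data set, taken literally from the tables above.
DAILY_THRESHOLDS = {
    "shopping/click": (100, False),               # more than 100 clicks
    "shopping/click_cooccurrence": (100, False),  # more than 100 clicks
    "shopping/purchase": (10, True),              # 10 or more purchases
    "shopping/purchase_cooccurrence": (10, True), # 10 or more purchases
}


def meets_threshold(dataset: str, daily_count: int) -> bool:
    """Return True if a daily count would pass the stated inclusion threshold."""
    threshold, inclusive = DAILY_THRESHOLDS[dataset]
    return daily_count >= threshold if inclusive else daily_count > threshold


print(meets_threshold("shopping/click", 100), meets_threshold("shopping/purchase", 10))
```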
AI data (ai)
1. Recycling image (ai/clova/iitp_waste_images)
Items | Description |
---|---|
Data introduction | Collected and labeled images of recyclable waste |
Data use examples | Development of technology for solving image classification problems using industrial product data in real-life environment |
Data details | - Data format: 3,000 JPEG images (1280×720, 720×1280), iitp_waste_images_3000_result.csv - Labeling information: 1: General garbage (others), 2: Paper, 3: Cans and scrap metal, 4: Glass bottles, 5: Plastics (including PET), 6: Vinyl, 7: Styrofoam, 8: Food |
Data sample (examples) | |
Entire data location | ${DATABOX_HOME_DIR}/ai/clova/iitp_waste_images |
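
For working with the labeling information above, a lookup table from label ID to class name is often convenient. The sketch below assumes only the eight labels listed in the table; the column layout of iitp_waste_images_3000_result.csv is not described on this page, so the helper takes a plain sequence of label IDs rather than reading the file, and its names are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Label IDs and names as listed in the table above.
WASTE_LABELS = {
    1: "General garbage (others)",
    2: "Paper",
    3: "Cans and scrap metal",
    4: "Glass bottles",
    5: "Plastics (including PET)",
    6: "Vinyl",
    7: "Styrofoam",
    8: "Food",
}


def label_histogram(label_ids: Iterable[int]) -> Counter:
    """Count images per class name; unknown IDs raise KeyError."""
    return Counter(WASTE_LABELS[label_id] for label_id in label_ids)


print(label_histogram([2, 2, 5, 8]))
```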
2. Food image (ai/clova/naver_food_fixed)
Items | Description |
---|---|
Data introduction | Data that tags food areas in the image with a bounding box |
Data location | ${DATABOX_HOME_DIR}/ai/clova/naver_food_fixed |
Data use examples | Training and developing AI models that extract essential elements from images |
Data details | Collected images - Number of items: 2,042 images - Data format: JPEG images, with one JSON file for each image |
Data sample (examples) | |
Entire data location | ${DATABOX_HOME_DIR}/ai/clova/naver_food_fixed |
3. Restaurant image (ai/clova/externalImageOCR)
Items | Description |
---|---|
Data introduction | Korean text OCR annotation data (including English and numbers) in images of posts, billboards, menus, and restaurant signs |
Data use examples | Develop technology for extracting text from images, and converting it to digital data |
Data details | Korean, English, and numeric annotation data in signage images collected in-house - Number of items: 1,180 (signboard 197, restaurant sign 324, menu 614, standing_signboard 45) - Data format: JSON, JPEG (original image, result image) |
Data sample (JSON examples) | |
Entire data location | ${DATABOX_HOME_DIR}/ai/clova/externalImageOCR |
4. News data for NLP experiments (ai/nlp)
Items | Description |
---|---|
Data introduction | Data created by identifying named entities in news article content collected by the NAVER News service and linking each entity to its related Wikipedia page |
Data aggregation criteria | Consists of news title, body, and category. The locations of the named entities in the body text are marked with BIO tags, and IDs are attached. |
Data use examples | Can be used to develop entity linking technology, which links named entities in text to related information (the Wikipedia page corresponding to each entity) |
Data sample (examples) | |
Entire data location | ${DATABOX_HOME_DIR}/ai/nlp |
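
As an illustration of how BIO tags mark entity locations, the sketch below groups tagged tokens into entity spans. The exact tag and ID format of the provided files is not described on this page, so the `B-<id>` / `I-<id>` / `O` convention, the space-joined tokens, and the IDs `Q1` and `Q2` are assumptions made for this example only.

```python
def decode_bio(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Return (entity text, entity ID) pairs from parallel token and tag lists."""
    entities: list[tuple[str, str]] = []
    current = None  # (list of tokens, entity ID) for the entity being built

    for token, tag in zip(tokens, tags):
        prefix, _, entity_id = tag.partition("-")
        # An I tag continues the open entity only if the ID matches.
        if prefix == "I" and current is not None and current[1] == entity_id:
            current[0].append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if current is not None:
            entities.append((" ".join(current[0]), current[1]))
            current = None
        # A B tag opens a new entity; O and stray I tags are skipped.
        if prefix == "B":
            current = ([token], entity_id)

    if current is not None:
        entities.append((" ".join(current[0]), current[1]))
    return entities


tokens = ["NAVER", "opened", "a", "data", "center", "in", "Sejong", "City"]
tags = ["B-Q1", "O", "O", "O", "O", "O", "B-Q2", "I-Q2"]
print(decode_bio(tokens, tags))
```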
Insight Pro Option data
NAVER Insight Pro Option data (pro#1: ID by subgroup)
1. Search click data (search/click)
Items | Description |
---|---|
Data introduction | - Data showing which keywords NAVER users searched, and which areas they clicked on - Anonymized data by grouping users based on gender/age group/region rather than personal-level data (For search terms searched by more than 100 users per day) |
Data provision period | From January of 2 years ago to the previous week of the current year, updated weekly (as of February 16, 2022: January 2020 to January 2022) (unit: daily) |
Extraction targets | Logged-in NAVER users |
Data aggregation criteria | - date (base date): search/click date - hour (time slot): grouped in 3-hour slots (00-02/03-05/06-08/09-11/12-14/15-17/18-20/21-23) - device: device used by the user for the search (mobile/pc) - gender: user's gender (f/m) - user (user group ID): anonymized user group ID - age (age range): in 5-year units (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-) - loc1 (region 1): metropolitan city/province based on address - loc2 (region 2): city/county/district based on address - keyword (search term): keyword entered by the user in the NAVER integrated search area (converted to lowercase, with spaces removed) - area (click area): service area that the user clicked on in NAVER's integrated search results - count (number of clicks): counted by base date, time slot, device, gender, user group ID, age group, region, search term, and click area |
Sample data location | ${DATABOX_HOME_DIR}/sample/pro_search/click |
Directory structure |
2. Product click data (shopping/click)
Items | Description |
---|---|
Data introduction | - Data showing which keywords NAVER users searched and which category of products they clicked on - Anonymized data by grouping users based on gender/age group/region rather than personal-level data (For search terms/product categories with more than 100 clicks per day) |
Data provision period | From January of 2 years ago to the previous week of the current year, updated weekly (as of February 16, 2022: January 2020 to January 2022) (unit: daily) |
Extraction targets | Logged-in NAVER users with a product click history during the set period |
Data aggregation criteria | - date (base date): search/click date - hour (time slot): grouped in 3-hour slots (00-02/03-05/06-08/09-11/12-14/15-17/18-20/21-23) - device: device used by the user for the search (mobile/pc) - gender: user's gender (f/m) - user (user group ID): anonymized user group ID - age (age range): in 5-year units (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-) - loc1 (region 1): metropolitan city/province based on address - loc2 (region 2): city/county/district based on address - keyword (inbound keyword): keyword that the user searched for before clicking on the product - cat (product category): large/medium/small/detailed category of the product clicked by the user - count (number of clicks): counted by base date, time slot, device, gender, user group ID, age group, region, search term, and product category - brand (brand name): estimated brand name of the product clicked by the user - item (product name): estimated product name of the product clicked by the user |
Sample data location | ${DATABOX_HOME_DIR}/sample/pro_shopping/click |
Entire data location | For data from the first half of 2020: ${DATABOX_HOME_DIR}/pro20y1h/shopping/click |
Directory structure |
3. Product purchase data (shopping/purchase)
Items | Description |
---|---|
Data introduction | - Data showing which products logged-in NAVER users purchased - Anonymized data by grouping users based on gender/age group/region rather than personal-level data (For product categories with 10 or more purchases per day) |
Data provision period | From January of 2 years ago to the previous month of the current year, updated weekly (January 2020 to January 2021 provided on February 16, 2022) (unit: daily) |
Extraction targets | NAVER users with a purchase history during the set period |
Data aggregation criteria | - date (base date): purchase date - hour (time slot): grouped in 3-hour slots (00-02/03-05/06-08/09-11/12-14/15-17/18-20/21-23) - device: device used by the user for the purchase (mobile/pc) - gender: user's gender (f/m) - user (user group ID): anonymized user group ID - age (age range): in 5-year units (-12/13-18/19-24/25-29/30-34/35-39/40-44/45-49/50-54/55-59/60-64/65-69/70-) - loc1 (region 1): metropolitan city/province based on address - loc2 (region 2): city/county/district based on address - cat (product category): large/medium/small/detailed category of the product purchased by the user - count (number of purchases): counted by base date, time slot, device, gender, user group ID, age group, region, and product category - brand (brand name): estimated brand name of the product purchased by the user - item (product name): estimated product name of the product purchased by the user |
Sample data location | ${DATABOX_HOME_DIR}/sample/pro_shopping/purchase |
Entire data location | For data from the first half of 2020: ${DATABOX_HOME_DIR}/pro20y1h/shopping/purchase |
Directory structure |
NAVER Insight Pro data (pro#2: combination of NAVER and NICE Information Service)
1. Combination of NAVER and NICE Information Service
Items | Description |
---|---|
Data introduction | Anonymized information created by combining the information sets (pseudonymized data) of NAVER and NICE Information Service - NAVER data: data on purchase intentions, interests, etc. based on NAVER users' behavioral data on NAVER - NICE Information Service data: credit information data of subjects aged 19 to 90 with a CB rating |
Data provision period | Data from the quarter before last, updated quarterly (as of August 18, 2022, data for Q1 2022 is provided) |
Extraction targets | - NAVER: logged-in users among NAVER members - NICE Information Service: credit information data of subjects aged 19 to 90 with a CB rating |
Data aggregation criteria | - Personal information: information related to personal information such as age/gender/occupation code - Default information: information related to default (long-term overdue) - Short-term overdue information: information on short-term overdue (more than 30 days overdue for loan business) - Card opening/performance information: information related to card opening and performance - Loan opening/performance information: information related to loan opening and performance - Repayment capacity information (home): statistical information related to home address - Repayment capacity information (workplace): statistical information related to workplace address - Real estate ACM: information on property ownership - Model information: model information calculated from a combination of credit information - Purchasing intention: information that identifies the user's purchase intention by using shopping clicks, orders, advertisement searches/clicks, etc. - Interests: information that identifies the user's interests through searches/clicks or visits to cafes/blogs |
Sample data location | ${DATABOX_HOME_DIR}/sample/nice |