View sample data


Available in VPC

When you create a data box, you receive sample data from the search and shopping fields. Once a data supply request has been made, all external network connections are blocked, so you can use the sample data to configure your analysis environment before the communication restriction takes effect. This section describes how to view sample data on Cloud Hadoop and the Ncloud TensorFlow Server. For more information on the data, see Detailed description of provided data.

View sample data on Cloud Hadoop

Note

To view Cloud Data Box's data in Big Data & Analytics products, the NAVER data must be uploaded to the Hadoop HDFS path.

1. Check sample data's location

The sample data is uploaded to the following path under Hadoop HDFS.
/user/ncp/sample

2. View file in HDFS

You can view the sample files uploaded to HDFS by accessing Hue on a web browser from Connect Server.

  • Hue access address: https://EdgeNodeIP:8443/hue

3. Create Hive External Table and view data

Access Hue and create Hive external tables in the Hive Query Editor using the sample data files.

  • Replace hadoop-000-000 in hdfs://hadoop-000-000 with your Hadoop cluster's name in the following script, then run the script. To see the Hadoop cluster's name, click the [Details] button on the data box page and go to the [Infrastructure] tab.
  • The sample data's schema and data types match those of the actual search and shopping data. After submitting a data supply request, you can therefore reuse the same script to create tables on the actual data; you only need to update the database name and the data upload path.
  • Running the command "MSCK REPAIR TABLE" right after creating a table may fail with an error saying the table does not exist. In that case, run "MSCK REPAIR TABLE" again later.
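The cluster-name substitution described in the first note can also be scripted before pasting the DDL into Hue. A minimal Python sketch; the cluster name nv0xxx-hadoop is a hypothetical example, and the DDL shown is abbreviated (use the full script below with the name from the [Infrastructure] tab):

```python
# Replace the placeholder cluster name in the DDL script with the actual
# Hadoop cluster's name before running it in the Hive Query Editor.
ddl_template = """
CREATE EXTERNAL TABLE sample.search_click (...)
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/search/click/";
"""

cluster_name = "nv0xxx-hadoop"  # hypothetical; use your own cluster's name
ddl = ddl_template.replace("hadoop-000-000", cluster_name)
print(ddl)
```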
-- Create database for sample data
CREATE DATABASE sample;

-- 1. Create the "search click" table 
CREATE EXTERNAL TABLE sample.search_click (
      `age`  STRING
    , `loc1`  STRING
    , `loc2`  STRING
    , `keyword`  STRING
    , `area`  STRING
    , `count`  BIGINT
) PARTITIONED BY (
     `date`     varchar(10)
    , `device`  varchar(10)
    , `gender`  varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/search/click/";  -- Location the sample data is uploaded to. Update it using the Hadoop cluster's name.

-- 2. Create the "search click co-occurrence" table
CREATE EXTERNAL TABLE sample.search_click_cooccurrence (
     `age`  STRING
    , `loc1`  STRING
    , `loc2`  STRING
    , `keyword1`  STRING
    , `area1`  STRING
    , `keyword2`  STRING
    , `area2`  STRING
    , `count`  BIGINT
) PARTITIONED BY (
     `date`     varchar(10)
    , `device`  varchar(10)
    , `gender`  varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/search/click_cooccurrence/";

-- 3. Create the "search keyword by access location" table
CREATE EXTERNAL TABLE sample.search_click_location (
     `age`  STRING
    , `loc1`  STRING
    , `loc2`  STRING
    , `keyword`  STRING
    , `count`  BIGINT
) PARTITIONED BY (
     `date`     varchar(10)
    , `hour`    varchar(10)
    , `device`  varchar(10)
    , `gender`  varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/search/click_location/";

-- 4. Create the "product click" table
CREATE EXTERNAL TABLE sample.shopping_click (
     `age`  STRING
    , `loc1`  STRING
    , `loc2`  STRING
    , `keyword`  STRING
    , `cat`  STRING
    , `brand` STRING
    , `item` STRING
    , `count`  BIGINT
) PARTITIONED BY (
     `date`     varchar(10)
    , `device`  varchar(10)
    , `gender`  varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/shopping/click/";

-- 5. Create the "product purchase" table
CREATE EXTERNAL TABLE sample.shopping_purchase (
     `age`  STRING
    , `loc1`  STRING
    , `loc2`  STRING
    , `cat`  STRING
    , `brand` STRING
    , `item` STRING
    , `count`  BIGINT
) PARTITIONED BY (
     `date`     varchar(10)
    , `device`  varchar(10)
    , `gender`  varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/shopping/purchase/";

-- 6. Create the "product click co-occurrence" table
CREATE EXTERNAL TABLE sample.shopping_click_cooccurrence (
     `age`  STRING
    , `loc1`  STRING
    , `loc2`  STRING
    , `keyword1`  STRING
    , `cat1`  STRING
    , `keyword2`  STRING
    , `cat2`  STRING
    , `count`  BIGINT
) PARTITIONED BY (
     `date`     varchar(10)
    , `device`  varchar(10)
    , `gender`  varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/shopping/click_cooccurrence/";

-- 7. Create the "product purchase co-occurrence" table
CREATE EXTERNAL TABLE sample.shopping_purchase_cooccurrence (
     `age`  STRING
    , `loc1`  STRING
    , `loc2`  STRING
    , `cat1`  STRING
    , `cat2`  STRING
    , `count`  BIGINT
) PARTITIONED BY (
     `date`     varchar(10)
    , `device`  varchar(10)
    , `gender`  varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/shopping/purchase_cooccurrence/";

-- 8. Create the "Pro Option search click" table
CREATE EXTERNAL TABLE sample.pro_search_click (
     `user`  STRING
    , `age`  STRING
    , `loc1`  STRING
    , `loc2`  STRING
    , `keyword`  STRING
    , `area`  STRING
    , `count`  BIGINT
) PARTITIONED BY (
     `date`   varchar(10)
    , `hour`  varchar(10)
    , `device`  varchar(10)
    , `gender`  varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/pro_search/click/";

-- 9. Create the "Pro Option product click" table
CREATE EXTERNAL TABLE sample.pro_shopping_click (
     `user`  STRING
    , `age`  STRING
    , `loc1`  STRING
    , `loc2`  STRING
    , `keyword`  STRING
    , `brand` STRING
    , `item` STRING
    , `cat`  STRING
    , `count`  BIGINT
) PARTITIONED BY (
     `date`   varchar(10)
    , `hour`  varchar(10)
    , `device`  varchar(10)
    , `gender`  varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/pro_shopping/click/";

-- 10. Create the "Pro Option product purchase" table
CREATE EXTERNAL TABLE sample.pro_shopping_purchase (
     `user`  STRING
    , `age`  STRING
    , `loc1`  STRING
    , `loc2`  STRING
    , `cat`  STRING
    , `brand` STRING
    , `item` STRING 
    , `count`  BIGINT
) PARTITIONED BY (
     `date`   varchar(10)
    , `hour`  varchar(10)
    , `device`  varchar(10)
    , `gender`  varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/pro_shopping/purchase/";

Once you have run the above script, you can query the Hive tables' data on Hue as follows.

SET hive.resultset.use.unique.column.names = false;

SELECT * FROM sample.search_click LIMIT 10;

SELECT *  FROM sample.search_click 
WHERE `date` = '2021-01-01' and device = 'mobile' and gender = 'f' and `count` > 10 LIMIT 10;
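For readers more comfortable in pandas, the partition-filtered query above can be mimicked on a local copy of the data. A sketch using a hypothetical in-memory stand-in for sample.search_click, with the partition columns (date, device, gender) materialized as ordinary columns:

```python
import pandas as pd

# Hypothetical stand-in for sample.search_click; in the real table,
# date/device/gender are partition columns, not data columns.
df = pd.DataFrame({
    "keyword": ["shoes", "bag", "hat"],
    "count": [12, 7, 30],
    "date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "device": ["mobile", "pc", "mobile"],
    "gender": ["f", "f", "m"],
})

# Equivalent of:
#   WHERE `date` = '2021-01-01' AND device = 'mobile'
#     AND gender = 'f' AND `count` > 10 LIMIT 10
result = df.query(
    "date == '2021-01-01' and device == 'mobile' and gender == 'f' and count > 10"
).head(10)
print(result)
```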


4. View sample data on Zeppelin

You can access Zeppelin to view the Parquet file data uploaded to Cloud Hadoop.

  • Zeppelin: use the Ambari service link or https://EdgeNodeIP:9996

Replace nv0xxx-hadoop in hdfs://nv0xxx-hadoop with your Hadoop cluster's name in the following script, then run the script. To see the Hadoop cluster's name, click the [Details] button on the data box page and go to the [Infrastructure] tab.

%spark2
val df = spark.read.parquet("hdfs://nv0xxx-hadoop/user/ncp/sample/shopping/click")
println(df.count())
df.show()


You can also view the data of the Hive tables you created in the previous steps.

%jdbc(hive)
SET hive.resultset.use.unique.column.names = false;
SELECT * FROM sample.search_click LIMIT 10;


View sample data on Ncloud TensorFlow Server

To view sample data on Ncloud TensorFlow Server, follow these steps:

  1. Go to the following file path on the Ncloud TensorFlow Server and check the sample data.

    • /home/ncp/workspace/sample
    • Sample data is provided as read-only.
  2. Access Jupyter Notebook and install the necessary module(s).

    !pip install -U pyarrow
    
  3. Check that the provided sample files can be read as expected with the following code.

    import pandas as pd
    import pyarrow.parquet as pq
    from collections import OrderedDict

    # Path to one partition of the "search click" sample data
    source = "/home/ncp/workspace/sample/search/click/date=2021-01-01/device=pc/gender=m"
    d = pq.read_table(source=source).to_pydict()
    df1 = pd.DataFrame(OrderedDict(d))
    
