Available in VPC
When you create a data box, you receive a sample of the search and shopping data. Once a data supply request has been made, all external network connections are blocked, so use the sample data to configure your analysis environment before the communication restriction takes effect. This section describes how to view the sample data on Cloud Hadoop and Ncloud TensorFlow Server. For more information on the data, see Detailed description of provided data.
View sample data on Cloud Hadoop
To view Cloud Data Box's data in the Big Data & Analytics products, the NAVER data must be uploaded to a Hadoop HDFS path.
1. Check sample data's location
The sample data is uploaded to the following Hadoop HDFS path.
/user/ncp/sample
2. View file in HDFS
You can view the sample files uploaded to HDFS by accessing Hue on a web browser from Connect Server.
- Hue access address
https://EdgeNodeIP:8443/hue

3. Create Hive External Table and view data
Access Hue and create Hive External Table on Hive Query Editor using the sample data files.
- Replace hdfs://hadoop-000-000 with your Hadoop cluster's name in the following script, then run the script. To see the Hadoop cluster's name, click the [Details] button on the data box page and go to the [Infrastructure] tab.
- The data type and schema of the sample data match those of the actual search and shopping data provided. After the data supply request is submitted, you can therefore create tables on the actual data with the same script, updating only the database name and the data upload path.
- Running the MSCK REPAIR TABLE command immediately after a table is created may result in an error indicating that the table does not exist. In that case, run MSCK REPAIR TABLE again later.
-- Create database for sample data
CREATE DATABASE sample;
-- 1. Create the "search click" table
CREATE EXTERNAL TABLE sample.search_click (
`age` STRING
, `loc1` STRING
, `loc2` STRING
, `keyword` STRING
, `area` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/search/click/"; -- Location the sample data is uploaded to. Update it using the Hadoop cluster's name.
-- 2. Create the "search click co-occurrence" table
CREATE EXTERNAL TABLE sample.search_click_cooccurrence (
`age` STRING
, `loc1` STRING
, `loc2` STRING
, `keyword1` STRING
, `area1` STRING
, `keyword2` STRING
, `area2` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/search/click_cooccurrence/";
-- 3. Create the "search keyword by access location" table
CREATE EXTERNAL TABLE sample.search_click_location (
`age` STRING
, `loc1` STRING
, `loc2` STRING
, `keyword` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `hour` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/search/click_location/";
-- 4. Create the "product click" table
CREATE EXTERNAL TABLE sample.shopping_click (
`age` STRING
, `loc1` STRING
, `loc2` STRING
, `keyword` STRING
, `cat` STRING
, `brand` STRING
, `item` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/shopping/click/";
-- 5. Create the "product purchase" table
CREATE EXTERNAL TABLE sample.shopping_purchase (
`age` STRING
, `loc1` STRING
, `loc2` STRING
, `cat` STRING
, `brand` STRING
, `item` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/shopping/purchase/";
-- 6. Create the "product click co-occurrence" table
CREATE EXTERNAL TABLE sample.shopping_click_cooccurrence (
`age` STRING
, `loc1` STRING
, `loc2` STRING
, `keyword1` STRING
, `cat1` STRING
, `keyword2` STRING
, `cat2` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/shopping/click_cooccurrence/";
-- 7. Create the "product purchase co-occurrence" table
CREATE EXTERNAL TABLE sample.shopping_purchase_cooccurrence (
`age` STRING
, `loc1` STRING
, `loc2` STRING
, `cat1` STRING
, `cat2` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/shopping/purchase_cooccurrence/";
-- 8. Create the "Pro Option search click" table
CREATE EXTERNAL TABLE sample.pro_search_click (
`user` STRING
, `age` STRING
, `loc1` STRING
, `loc2` STRING
, `keyword` STRING
, `area` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `hour` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/pro_search/click/";
-- 9. Create the "Pro Option product click" table
CREATE EXTERNAL TABLE sample.pro_shopping_click (
`user` STRING
, `age` STRING
, `loc1` STRING
, `loc2` STRING
, `keyword` STRING
, `brand` STRING
, `item` STRING
, `cat` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `hour` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/pro_shopping/click/";
-- 10. Create the "Pro Option product purchase" table
CREATE EXTERNAL TABLE sample.pro_shopping_purchase (
`user` STRING
, `age` STRING
, `loc1` STRING
, `loc2` STRING
, `cat` STRING
, `brand` STRING
, `item` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `hour` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://hadoop-000-000/user/ncp/sample/pro_shopping/purchase/";
Once you have run the above script, you can view the Hive tables' data on Hue as follows.
SET hive.resultset.use.unique.column.names = false;
SELECT * FROM sample.search_click LIMIT 10;
SELECT * FROM sample.search_click
WHERE `date` = '2021-01-01' and device = 'mobile' and gender = 'f' and `count` > 10 LIMIT 10;

4. View sample data on Zeppelin
You can access Zeppelin to view the Parquet file data uploaded to Cloud Hadoop.
- Zeppelin: use Ambari service link or https://EdgeNodeIP:9996
Replace hdfs://nv0xxx-hadoop with your Hadoop cluster's name in the following script, then run the script. To see the Hadoop cluster's name, click the [Details] button on the data box page and go to the [Infrastructure] tab.
%spark2
val df = spark.read.parquet("hdfs://nv0xxx-hadoop/user/ncp/sample/shopping/click")
println(df.count())
df.show()

You can also view the data of the Hive tables you created in the previous steps.
%jdbc(hive)
SET hive.resultset.use.unique.column.names = false;
SELECT * FROM sample.search_click LIMIT 10;

View sample data on Ncloud TensorFlow Server
To view sample data on Ncloud TensorFlow Server, follow these steps:
- Go to the following file path on Ncloud TensorFlow Server and view the sample data.
/home/ncp/workspace/sample
  - The sample data is supplied to you as read-only.
- Access Jupyter Notebook and install the necessary module(s).
!pip install -U pyarrow
- Check that the provided sample file data matches the following.
import pandas as pd
import pyarrow.parquet as pq
from collections import OrderedDict

source = "/home/ncp/workspace/sample/search/click/date=2021-01-01/device=pc/gender=m"
d = pq.read_table(source=source).to_pydict()
df1 = pd.DataFrame(OrderedDict(d))
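Once a partition has been loaded into a DataFrame this way, ordinary pandas operations apply. A minimal sketch with synthetic rows, using column names from the search click schema above (the values are made up for illustration):

```python
# Aggregate synthetic rows shaped like the search click sample data
# (age, keyword, count columns come from the schema above; rows are invented).
import pandas as pd

df1 = pd.DataFrame({
    "age": ["20", "20", "30"],
    "keyword": ["shoes", "shoes", "bag"],
    "count": [3, 5, 2],
})

# Total click count per keyword, largest first
top = (
    df1.groupby("keyword", as_index=False)["count"]
    .sum()
    .sort_values("count", ascending=False)
)
print(top)
```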