Searching sample data
Available in Classic and VPC
When a databox is created, search and shopping-related sample data is provided with it. Because external communication is blocked once a data supply request is made, you can use the sample data to set up your analysis environment in advance. The following describes how to view the sample data in Cloud Hadoop and on the Ncloud TensorFlow Server. For a detailed description of the data, refer to Detailed description of provided data.
View sample data in Cloud Hadoop
1. Check sample data location
Sample data is uploaded to the following path in Hadoop HDFS.
/user/ncp/sample
2. Check files in HDFS
You can check the sample files uploaded to HDFS by accessing HUE through a web browser on the Connect Server.
- HUE: https://edge node IP:8081
3. View data by creating Hive external table
After connecting to HUE, you can use the sample data file to create a Hive external table in Hive Query Editor.
- In the script below, replace `hdfs://nv0xxx-hadoop` with the name of your Hadoop cluster, and then run the script. The Hadoop cluster name can be found under the [Infrastructure] tab that appears when you click the [Details] button of the databox.
- The type and schema of the sample data are the same as those of the search and shopping data delivered after a data supply request. This means that, once data supply is requested, you can create tables for the actual data simply by changing the database name and data upload paths in the script below.
- When running the MSCK REPAIR TABLE command after a table is created, you may see an error indicating that the table doesn't exist. In this case, wait a few moments and run the MSCK REPAIR TABLE command again.
-- Create database for sample data
CREATE DATABASE sample;
-- 1. Create search click table
CREATE EXTERNAL TABLE sample.search_click (
`age` STRING
, `loc1` STRING
, `loc2` STRING
, `keyword` STRING
, `area` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/search/click/"; -- Path where sample data is uploaded. Replace it with the name of the Hadoop cluster before use.
-- 2. Create keyword click co-occurrence table
CREATE EXTERNAL TABLE sample.search_click_cooccurrence (
`age` STRING
, `loc1` STRING
, `loc2` STRING
, `keyword1` STRING
, `area1` STRING
, `keyword2` STRING
, `area2` STRING
, `count` BIGINT
) PARTITIONED BY (
`week` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/search/click_cooccurrence/";
-- 3. Create keyword table by access location
CREATE EXTERNAL TABLE sample.search_click_location (
`age` STRING
, `loc1` STRING
, `loc2` STRING
, `keyword` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `hour` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/search/click_location/";
-- 4. Create product click table
CREATE EXTERNAL TABLE sample.shopping_click (
`age` STRING
, `loc1` STRING
, `loc2` STRING
, `keyword` STRING
, `cat` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/shopping/click/";
-- 5. Create product purchase table
CREATE EXTERNAL TABLE sample.shopping_purchase (
`age` STRING
, `loc1` STRING
, `loc2` STRING
, `cat` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/shopping/purchase/";
-- 6. Create product click co-occurrence table
CREATE EXTERNAL TABLE sample.shopping_click_cooccurrence (
`age` STRING
, `loc1` STRING
, `loc2` STRING
, `keyword1` STRING
, `cat1` STRING
, `keyword2` STRING
, `cat2` STRING
, `count` BIGINT
) PARTITIONED BY (
`week` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/shopping/click_cooccurrence/";
-- 7. Create product purchase co-occurrence table
CREATE EXTERNAL TABLE sample.shopping_purchase_cooccurrence (
`age` STRING
, `loc1` STRING
, `loc2` STRING
, `keyword1` STRING
, `cat1` STRING
, `cat2` STRING
, `count` BIGINT
) PARTITIONED BY (
`week` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/shopping/purchase_cooccurrence/";
-- 8. Create Pro Option search click table
CREATE EXTERNAL TABLE sample.pro_search_click (
`user` STRING
, `age` STRING
, `loc1` STRING
, `loc2` STRING
, `keyword` STRING
, `area` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `hour` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/pro_search/click/";
-- 9. Create Pro Option product click table
CREATE EXTERNAL TABLE sample.pro_shopping_click (
`user` STRING
, `age` STRING
, `loc1` STRING
, `loc2` STRING
, `keyword` STRING
, `cat` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `hour` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/pro_shopping/click/";
-- 10. Create Pro Option product purchase table
CREATE EXTERNAL TABLE sample.pro_shopping_purchase (
`user` STRING
, `age` STRING
, `loc1` STRING
, `loc2` STRING
, `cat` STRING
, `count` BIGINT
) PARTITIONED BY (
`date` varchar(10)
, `hour` varchar(10)
, `device` varchar(10)
, `gender` varchar(10)
) ROW FORMAT delimited fields TERMINATED BY ','
STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/pro_shopping/purchase/";
After running the script above and confirming that all the tables have been created, refresh the metastore partition information as follows.
MSCK REPAIR TABLE sample.search_click;
MSCK REPAIR TABLE sample.search_click_cooccurrence;
MSCK REPAIR TABLE sample.search_click_location;
MSCK REPAIR TABLE sample.shopping_click;
MSCK REPAIR TABLE sample.shopping_purchase;
MSCK REPAIR TABLE sample.shopping_click_cooccurrence;
MSCK REPAIR TABLE sample.shopping_purchase_cooccurrence;
MSCK REPAIR TABLE sample.pro_search_click;
MSCK REPAIR TABLE sample.pro_shopping_click;
MSCK REPAIR TABLE sample.pro_shopping_purchase;
After running the script above, you can query the Hive table data from HUE as shown below.
SET hive.resultset.use.unique.column.names = false;
SELECT * FROM sample.search_click LIMIT 10;
SELECT * FROM sample.search_click
WHERE `date` = '2021-01-01' and device = 'mobile' and gender = 'f' and `count` > 10 LIMIT 10;
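In the query above, `date`, `device`, and `gender` are partition columns, so filtering on them lets Hive scan only the matching HDFS directories rather than the whole table. As an illustration (this helper is not part of the provided scripts), the `key=value` directory layout that such a predicate maps to can be sketched as follows:

```python
# Hypothetical helper illustrating the Hive-style partition directory layout
# used by the sample data. For illustration only; not part of the provided scripts.

def partition_path(base, **partitions):
    """Build the HDFS directory corresponding to a set of partition key/value pairs."""
    parts = "/".join(f"{key}={value}" for key, value in partitions.items())
    return f"{base.rstrip('/')}/{parts}"

path = partition_path(
    "hdfs://nv0xxx-hadoop/user/ncp/sample/search/click",
    date="2021-01-01", device="mobile", gender="f",
)
print(path)
# → hdfs://nv0xxx-hadoop/user/ncp/sample/search/click/date=2021-01-01/device=mobile/gender=f
```

This is why the `WHERE` clause above touches only one directory per partition combination, while a query without partition predicates would read every directory under the table location.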
4. View sample data in Zeppelin
You can connect to Zeppelin and view the Parquet file data uploaded to Cloud Hadoop.
- Zeppelin: Use the Ambari service link or https://edge node IP:9996
In the script below, replace `hdfs://nv0xxx-hadoop` with the name of your Hadoop cluster, and then run the script. The Hadoop cluster name can be found under the [Infrastructure] tab that appears when you click the [Details] button of the databox.
%spark2
val df = spark.read.parquet("hdfs://nv0xxx-hadoop/user/ncp/sample/shopping/click")
println(df.count())
df.show()
You can check the data in the Hive table created in the previous step.
%jdbc(hive)
SET hive.resultset.use.unique.column.names = false;
SELECT * FROM sample.search_click LIMIT 10;
View sample data in Ncloud TensorFlow Server
The following describes how to view sample data from the Ncloud TensorFlow Server.
The sample data is uploaded to the following path on the Ncloud TensorFlow Server.
/home/ncp/workspace/sample
- Sample data is provided as read-only.
Connect to Jupyter and install the required modules.
!pip install -U pyarrow
You can read one of the provided sample data files as shown below.
import pandas as pd
import pyarrow.parquet as pq
from collections import OrderedDict

source = "/home/ncp/workspace/sample/search/click/date=2021-01-01/device=pc/gender=m"
d = pq.read_table(source=source).to_pydict()
df1 = pd.DataFrame(OrderedDict(d))