Searching sample data

    Available in Classic and VPC

    When a databox is created, search and shopping-related sample data is provided with it. Because external communications are blocked once a data supply request has been made, you can use this sample data to set up your analysis environment in advance. The following describes how to view the sample data in Cloud Hadoop and on the Ncloud TensorFlow Server. For a detailed description of the data, refer to Detailed description of provided data.

    View sample data in Cloud Hadoop

    1. Check sample data location

    Sample data is uploaded to the following path in Hadoop HDFS.
    /user/ncp/sample

    2. Check files in HDFS

    You can check the sample files uploaded to HDFS by accessing HUE through a web browser from the Connect Server.

    • HUE: https://<edge node IP>:8081

    3. View data by creating Hive external table

    After connecting to HUE, you can use the sample data files to create Hive external tables in the Hive Query Editor.

    • In the script below, replace hdfs://nv0xxx-hadoop with the name of your Hadoop cluster, and then run the script. The Hadoop cluster name can be found under the [Infrastructure] tab that appears when you click the [Details] button of the databox.
    • The sample data has the same types and schema as the actual search and shopping data delivered after a data supply request. This means that, once data supply is requested, you can create tables for the actual data simply by changing the database name and data upload paths in the script below.
    • After the tables are created, running the MSCK REPAIR TABLE command may return an error saying that a table doesn't exist. In that case, wait a few moments and run the command again.
    -- Create database for sample data
    CREATE DATABASE sample;
    
    -- 1. Create search click table
    CREATE EXTERNAL TABLE sample.search_click (
         `age`  STRING
        , `loc1`  STRING
        , `loc2`  STRING
        , `keyword`  STRING
        , `area`  STRING
        , `count`  BIGINT
    ) PARTITIONED BY (
         `date`     varchar(10)
        , `device`  varchar(10)
        , `gender`  varchar(10)
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/search/click/";  -- Path where sample data is uploaded. Replace it with the name of the Hadoop cluster before use.
    
    -- 2. Create keyword click co-occurrence table
    CREATE EXTERNAL TABLE sample.search_click_cooccurrence (
         `age`  STRING
        , `loc1`  STRING
        , `loc2`  STRING
        , `keyword1`  STRING
        , `area1`  STRING
        , `keyword2`  STRING
        , `area2`  STRING
        , `count`  BIGINT
    ) PARTITIONED BY (
         `week`     varchar(10)
        , `device`  varchar(10)
        , `gender`  varchar(10)
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/search/click_cooccurrence/";
    
    -- 3. Create keyword table by access location
    CREATE EXTERNAL TABLE sample.search_click_location (
         `age`  STRING
        , `loc1`  STRING
        , `loc2`  STRING
        , `keyword`  STRING
        , `count`  BIGINT
    ) PARTITIONED BY (
         `date`     varchar(10)
        , `hour`    varchar(10)
        , `device`  varchar(10)
        , `gender`  varchar(10)
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/search/click_location/";
    
    -- 4. Create product click table
    CREATE EXTERNAL TABLE sample.shopping_click (
         `age`  STRING
        , `loc1`  STRING
        , `loc2`  STRING
        , `keyword`  STRING
        , `cat`  STRING
        , `count`  BIGINT
    ) PARTITIONED BY (
         `date`     varchar(10)
        , `device`  varchar(10)
        , `gender`  varchar(10)
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/shopping/click/";
    
    -- 5. Create product purchase table
    CREATE EXTERNAL TABLE sample.shopping_purchase (
         `age`  STRING
        , `loc1`  STRING
        , `loc2`  STRING
        , `cat`  STRING
        , `count`  BIGINT
    ) PARTITIONED BY (
         `date`     varchar(10)
        , `device`  varchar(10)
        , `gender`  varchar(10)
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/shopping/purchase/";
    
    -- 6. Create product click co-occurrence table
    CREATE EXTERNAL TABLE sample.shopping_click_cooccurrence (
         `age`  STRING
        , `loc1`  STRING
        , `loc2`  STRING
        , `keyword1`  STRING
        , `cat1`  STRING
        , `keyword2`  STRING
        , `cat2`  STRING
        , `count`  BIGINT
    ) PARTITIONED BY (
         `week`     varchar(10)
        , `device`  varchar(10)
        , `gender`  varchar(10)
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/shopping/click_cooccurrence/";
    
    -- 7. Create product purchase co-occurrence table
    CREATE EXTERNAL TABLE sample.shopping_purchase_cooccurrence (
         `age`  STRING
        , `loc1`  STRING
        , `loc2`  STRING
        , `keyword1`  STRING
        , `cat1`  STRING
        , `cat2`  STRING
        , `count`  BIGINT
    ) PARTITIONED BY (
         `week`     varchar(10)
        , `device`  varchar(10)
        , `gender`  varchar(10)
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/shopping/purchase_cooccurrence/";
    
    -- 8. Create Pro Option search click table
    CREATE EXTERNAL TABLE sample.pro_search_click (
         `user`  STRING
        , `age`  STRING
        , `loc1`  STRING
        , `loc2`  STRING
        , `keyword`  STRING
        , `area`  STRING
        , `count`  BIGINT
    ) PARTITIONED BY (
         `date`   varchar(10)
        , `hour`  varchar(10)
        , `device`  varchar(10)
        , `gender`  varchar(10)
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/pro_search/click/";  
    
    -- 9. Create Pro Option product click table
    CREATE EXTERNAL TABLE sample.pro_shopping_click (
         `user`  STRING
        , `age`  STRING
        , `loc1`  STRING
        , `loc2`  STRING
        , `keyword`  STRING
        , `cat`  STRING
        , `count`  BIGINT
    ) PARTITIONED BY (
         `date`   varchar(10)
        , `hour`  varchar(10)
        , `device`  varchar(10)
        , `gender`  varchar(10)
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/pro_shopping/click/";
    
    -- 10. Create Pro Option product purchase table
    CREATE EXTERNAL TABLE sample.pro_shopping_purchase (
         `user`  STRING
        , `age`  STRING
        , `loc1`  STRING
        , `loc2`  STRING
        , `cat`  STRING
        , `count`  BIGINT
    ) PARTITIONED BY (
         `date`   varchar(10)
        , `hour`  varchar(10)
        , `device`  varchar(10)
        , `gender`  varchar(10)
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS PARQUET LOCATION "hdfs://nv0xxx-hadoop/user/ncp/sample/pro_shopping/purchase/";
    

    After running the script above and checking that all the tables have been created, refresh the Metastore information as follows.

    MSCK REPAIR TABLE sample.search_click;
    MSCK REPAIR TABLE sample.search_click_cooccurrence;
    MSCK REPAIR TABLE sample.search_click_location;
    MSCK REPAIR TABLE sample.shopping_click;
    MSCK REPAIR TABLE sample.shopping_purchase;
    MSCK REPAIR TABLE sample.shopping_click_cooccurrence;
    MSCK REPAIR TABLE sample.shopping_purchase_cooccurrence;
    MSCK REPAIR TABLE sample.pro_search_click;
    MSCK REPAIR TABLE sample.pro_shopping_click;
    MSCK REPAIR TABLE sample.pro_shopping_purchase;
    

    After running the script above, Hive table data can be searched from HUE as shown below.

    SET hive.resultset.use.unique.column.names = false;
    
    SELECT * FROM sample.search_click LIMIT 10;
    
    SELECT * FROM sample.search_click
    WHERE `date` = '2021-01-01' AND device = 'mobile' AND gender = 'f' AND `count` > 10 LIMIT 10;
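
    The WHERE clause above filters only on partition columns (`date`, `device`, `gender`), so Hive can skip whole directories instead of scanning every file: each partition is stored under a path such as date=2021-01-01/device=mobile/gender=f. The mapping between such a path and the partition column values can be sketched in a few lines of Python (the helper below is illustrative, not part of the product):

```python
# Hive-style partition paths encode partition columns as key=value
# directory segments; this helper parses them back into a dict.
def partition_values(path: str) -> dict:
    values = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values

path = "user/ncp/sample/search/click/date=2021-01-01/device=mobile/gender=f"
print(partition_values(path))
# {'date': '2021-01-01', 'device': 'mobile', 'gender': 'f'}
```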
    


    4. View sample data in Zeppelin

    You can connect to Zeppelin and view the Parquet file data uploaded to Cloud Hadoop.

    • Zeppelin: Use the Ambari service link or https://<edge node IP>:9996

    In the script below, replace hdfs://nv0xxx-hadoop with the name of your Hadoop cluster, and then run the script. The Hadoop cluster name can be found under the [Infrastructure] tab that appears when you click the [Details] button of the databox.

    %spark2
    val df = spark.read.parquet("hdfs://nv0xxx-hadoop/user/ncp/sample/shopping/click")
    println(df.count())
    df.show()
    


    You can check the data in the Hive table created in the previous step.

    %jdbc(hive)
    SET hive.resultset.use.unique.column.names = false;
    SELECT * FROM sample.search_click LIMIT 10;
    


    View sample data in Ncloud TensorFlow Server

    The following describes how to view sample data from an Ncloud TensorFlow Server.

    1. Check the sample data uploaded to the following path in the Ncloud TensorFlow Server.

      • /home/ncp/workspace/sample
      • Sample data is provided as read-only.
    2. Connect to Jupyter and install the required modules.

      !pip install -U pyarrow
      
    3. Check the sample data file provided as shown below.

      import pandas as pd
      import pyarrow.parquet as pq

      # Read one partition of the search click sample data into a DataFrame
      source = "/home/ncp/workspace/sample/search/click/date=2021-01-01/device=pc/gender=m"
      df1 = pq.read_table(source=source).to_pandas()
      


