Integrating with Cloud Hadoop and Data Catalog

    Available in VPC

    This guide describes how to integrate NAVER Cloud Platform's Cloud Hadoop with Data Catalog.

    Preparations

    1. Subscribe to Data Catalog.

    2. Create a Cloud Hadoop cluster.

      • When creating a Cloud Hadoop cluster, set the Hive Metastore repository to Data Catalog.
      • For more information on creating Cloud Hadoop, see Getting started with Cloud Hadoop.

    • Data Catalog integration is available for Cloud Hadoop version 2.0 or later.
    • If you choose to configure Kerberos authentication, Data Catalog integration may be limited.

    Utilize Data Catalog from Cloud Hadoop

    The following describes how to use data scanned by Data Catalog in Cloud Hadoop.

    Note

    Cloud DB data scanned by Data Catalog cannot be utilized in Cloud Hadoop.

    Scan Object Storage data

    1. Download the sample data, unzip it, and upload the AllstarFull.csv file to your Object Storage bucket.

      Note

      The sample data is a subset of Lahman's Baseball Database Version 2012. All copyrights to the data belong to Sean Lahman.

    2. Create a connection.

      • For more information on creating a connection, see Connection guide.

    3. Create a scanner to scan the AllstarFull.csv file that you uploaded to the Object Storage bucket.

      • For more information on creating and running a scanner, see Scanner guide.

    4. Run the scanner and verify that the table has been created.

      • Make sure the table name is the Prefix value that you set in the scanner followed by allstarfull_csv. The examples in this guide use the table name hadoop_allstarfull_csv.
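
The naming pattern in step 4 can be sketched as a small helper. This is only an illustration: scannedTableName is a hypothetical function, the prefix "hadoop_" is inferred from the table name used later in this guide, and the normalization rule (lowercase, non-alphanumeric characters replaced with "_") is an assumption based on AllstarFull.csv becoming allstarfull_csv, not a documented scanner rule.

```scala
// Hypothetical helper that mirrors the naming seen in this guide:
// prefix "hadoop_" + file "AllstarFull.csv" -> "hadoop_allstarfull_csv".
// The exact normalization applied by the scanner is an assumption.
def scannedTableName(prefix: String, fileName: String): String = {
  // Lowercase the file name, then replace every non-alphanumeric character with "_".
  val normalized = fileName.toLowerCase.replaceAll("[^a-z0-9]", "_")
  prefix + normalized
}

val expected = scannedTableName("hadoop_", "AllstarFull.csv")
println(expected) // hadoop_allstarfull_csv
```

In spark-shell on the cluster, the resulting name could then be checked with spark.catalog.tableExists("default", expected).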

    Use scanned data in Hive

    Search data

    hive> select * from hadoop_allstarfull_csv limit 5;
    

    The results are as follows:

    aaronha01	1955	0	NLS195507120	ML1	NL	1	NULL
    aaronha01	1956	0	ALS195607100	ML1	NL	1	NULL
    aaronha01	1957	0	NLS195707090	ML1	NL	1	9
    aaronha01	1958	0	ALS195807080	ML1	NL	1	9
    aaronha01	1959	1	NLS195907070	ML1	NL	1	9
    
    • Hive does not support the datetime column type, so a table with a datetime column cannot be queried normally. In that case, change the column type to a compatible type and query again.
    • If the data format is JSON, the data can be retrieved properly only when the file is in ndjson format. Convert the file to ndjson and query again.

    Note

    For more information on the ndjson format, see ndjson.
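
One possible way to convert a file to ndjson is with Spark on the same cluster. The following is a minimal sketch for spark-shell, assuming the source is a JSON document that holds an array of objects. The json_input and ndjson_output directory names are hypothetical, and {bucketName} is a placeholder for your bucket.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() reuses it.
val session = SparkSession.builder().enableHiveSupport().getOrCreate()

// multiLine lets Spark parse a JSON document that spans several lines.
// Each element of a top-level array becomes one row.
val source = session.read
  .option("multiLine", "true")
  .json("s3a://{bucketName}/json_input/")

// Spark's JSON writer emits one JSON object per line, which is the ndjson layout.
source.write
  .mode("overwrite")
  .json("s3a://{bucketName}/ndjson_output/")
```

Spark writes the result as one or more part files inside the output directory rather than as a single file.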

    Use scanned data in Spark

    Search data

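    // Load the scanned table from the Hive Metastore.
    var allstar = spark.table("default.hadoop_allstarfull_csv")
    // The first row holds the CSV header values.
    val header = allstar.first()
    // Drop every row that equals the header row.
    allstar = allstar.filter(row => row != header)
    allstar.show()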
    
    • When Spark reads a table stored in the Hive Metastore, the header row of the CSV file is included as data. You can exclude it by using filter.

    The results are as follows:

    +---------+------+-------+------------+------+----+---+-----------+
    | playerid|yearid|gamenum|      gameid|teamid|lgid| gp|startingpos|
    +---------+------+-------+------------+------+----+---+-----------+
    |aaronha01|  1955|      0|NLS195507120|   ML1|  NL|  1|       null|
    |aaronha01|  1956|      0|ALS195607100|   ML1|  NL|  1|       null|
    |aaronha01|  1957|      0|NLS195707090|   ML1|  NL|  1|          9|
    |aaronha01|  1958|      0|ALS195807080|   ML1|  NL|  1|          9|
    |aaronha01|  1959|      1|NLS195907070|   ML1|  NL|  1|          9|
    |aaronha01|  1959|      2|NLS195908030|   ML1|  NL|  1|          9|
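
As an alternative sketch, the header row can also be excluded with a column expression instead of comparing whole rows, which keeps the filter visible to Spark's query optimizer. This assumes the header row carries the text "playerID" (in any letter case) in the playerid column, which is an assumption about the sample file.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower}

// In spark-shell a session already exists; getOrCreate() reuses it.
val session = SparkSession.builder().enableHiveSupport().getOrCreate()

val raw = session.table("default.hadoop_allstarfull_csv")

// Assumption: the header row has "playerID" in the playerid column.
// <=> is null-safe equality, so rows with a null playerid are kept.
val withoutHeader = raw.filter(!(lower(col("playerid")) <=> "playerid"))

withoutHeader.show(5)
```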
    

    Edit data

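    // Keep playerid and yearid, and add tenyearslater = yearid + 10.
    var transformed_allstar = allstar.select($"playerid", $"yearid").withColumn("tenyearslater", $"yearid" + 10)
    transformed_allstar.show()

    // Save the result as a Hive table at the given path.
    // Replace {bucketName} with the name of your bucket.
    transformed_allstar.write
      .format("hive")
      .mode("overwrite")
      .option("path", "s3a://{bucketName}/transformed_allstar_csv/")
      .option("header", "true")
      .saveAsTable("transformed_allstar_csv")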
    
    • This example creates a new column, tenyearslater, by adding 10 to the values of the existing yearid column, and saves the resulting table to the Hive Metastore.

    The results are as follows:

    +---------+------+-------------+
    | playerid|yearid|tenyearslater|
    +---------+------+-------------+
    |aaronha01|  1955|         1965|
    |aaronha01|  1956|         1966|
    |aaronha01|  1957|         1967|
    |aaronha01|  1958|         1968|
    |aaronha01|  1959|         1969|
    
