Integrate Cloud Hadoop with Data Catalog


Available in VPC

The following describes how to integrate NAVER Cloud Platform's Cloud Hadoop with Data Catalog.

Preparations

  1. Subscribe to Data Catalog.
  2. Create a Cloud Hadoop cluster.
  • When creating the Cloud Hadoop cluster, set the Hive Metastore repository to Data Catalog.
  • For more information on creating a Cloud Hadoop cluster, see Getting started with Cloud Hadoop.


  • Integration is supported for Cloud Hadoop version 2.0 or later.
  • If you configure Kerberos authentication when creating the cluster, Data Catalog integration may be limited.

Utilize Data Catalog from Cloud Hadoop

The following describes how to use data scanned from Data Catalog in Cloud Hadoop.

Note

Cloud DB data scanned by Data Catalog cannot be utilized in Cloud Hadoop.

Scan Object Storage data

  1. Download sample data. Unzip the data and upload the AllstarFull.csv file to your Object Storage bucket.
Note

The provided sample data is a portion of Lahman's Baseball Database Version 2012, and all copyrights of the data belong to Sean Lahman.

  2. Create a connection.
  • For more information on creating a connection, see Connection guide.
  3. Create a scanner to scan the AllstarFull.csv file uploaded to the Object Storage bucket.
  • For more information on creating and running a scanner, see Scanner guide.
  4. Run the scanner, and then verify the created table.
  • Make sure the table is created with the name Prefix value + allstarfull_csv, where Prefix is the value you set in the scanner.
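The naming convention above can be sketched as follows. This is a minimal illustration, assuming the scanner derives the table name by prepending the configured prefix to the lowercased file name and replacing non-alphanumeric characters (such as ".") with underscores; the exact rule applied by the scanner is not specified here.

```python
# Illustrative sketch (assumption): prefix + lowercased file name,
# with non-alphanumeric characters replaced by underscores.
import re

def scanner_table_name(prefix: str, file_name: str) -> str:
    """Derive the table name a scanner would register for a scanned file."""
    return prefix + re.sub(r"[^0-9a-z]", "_", file_name.lower())

print(scanner_table_name("hadoop_", "AllstarFull.csv"))  # hadoop_allstarfull_csv
```

With the prefix set to hadoop_, the scanned file AllstarFull.csv produces the table name hadoop_allstarfull_csv used in the Hive and Spark examples below.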

Use scanned data in Hive

Search data

hive> select * from hadoop_allstarfull_csv limit 5;

The results are as follows:

aaronha01 1955 0 NLS195507120 ML1 NL 1 NULL
aaronha01 1956 0 ALS195607100 ML1 NL 1 NULL
aaronha01 1957 0 NLS195707090 ML1 NL 1 9
aaronha01 1958 0 ALS195807080 ML1 NL 1 9
aaronha01 1959 1 NLS195907070 ML1 NL 1 9
  • Hive does not support the datetime column type, so if a table column has this type, the data cannot be retrieved normally. In that case, change the column to a Hive-compatible type (such as timestamp) and query again.
  • If the data format is JSON, the data cannot be retrieved properly unless it is in ndjson format. Convert the file to ndjson, and then scan and query it again.
Note

For more information on the ndjson format, see ndjson.
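The conversion mentioned above can be sketched in a few lines: ndjson stores one JSON object per line, rather than a single JSON array. The record fields below are illustrative, borrowed from the sample data.

```python
# Minimal sketch: serialize a list of records as ndjson
# (newline-delimited JSON, one object per line).
import json

def to_ndjson(records) -> str:
    """Convert a list of JSON-serializable objects to ndjson text."""
    return "\n".join(json.dumps(rec) for rec in records)

records = [
    {"playerid": "aaronha01", "yearid": 1955},
    {"playerid": "aaronha01", "yearid": 1956},
]
print(to_ndjson(records))
```

Each line of the result is an independent JSON object, which is what allows the file to be scanned and queried row by row.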

Use scanned data in Spark

Search data

var allstar = spark.table("default.hadoop_allstarfull_csv")
var header = allstar.first()
allstar = allstar.filter(row => row != header)
allstar.show()
  • When reading a table stored in the Hive Metastore with Spark, the header row is included in the data. You can exclude it by using filter, as above.

The results are as follows:

+---------+------+-------+------------+------+----+---+-----------+
| playerid|yearid|gamenum| gameid|teamid|lgid| gp|startingpos|
+---------+------+-------+------------+------+----+---+-----------+
|aaronha01| 1955| 0|NLS195507120| ML1| NL| 1| null|
|aaronha01| 1956| 0|ALS195607100| ML1| NL| 1| null|
|aaronha01| 1957| 0|NLS195707090| ML1| NL| 1| 9|
|aaronha01| 1958| 0|ALS195807080| ML1| NL| 1| 9|
|aaronha01| 1959| 1|NLS195907070| ML1| NL| 1| 9|
|aaronha01| 1959| 2|NLS195908030| ML1| NL| 1| 9|
+---------+------+-------+------------+------+----+---+-----------+

Edit data

var transformed_allstar = allstar.select($"playerid", $"yearid").withColumn("tenyearslater", $"yearid" + 10)
transformed_allstar.show()

transformed_allstar.write
.format("hive")
.mode("overwrite")
.option("path", "s3a://{bucketName}/transformed_allstar_csv/")
.option("header", "true")
.saveAsTable("transformed_allstar_csv")
  • This example creates the new column tenyearslater by adding 10 to the values of the existing yearid column, and saves the resulting table to the Hive Metastore.

The results are as follows:

+---------+------+-------------+
| playerid|yearid|tenyearslater|
+---------+------+-------------+
|aaronha01| 1955| 1965|
|aaronha01| 1956| 1966|
|aaronha01| 1957| 1967|
|aaronha01| 1958| 1968|
|aaronha01| 1959| 1969|
+---------+------+-------------+
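The transformation performed by the Spark example above can be sketched locally without a cluster; the sample rows below mirror the output shown.

```python
# Local sketch of the Spark transformation: keep playerid and yearid,
# and derive tenyearslater = yearid + 10 for each row.
rows = [
    {"playerid": "aaronha01", "yearid": 1955},
    {"playerid": "aaronha01", "yearid": 1956},
    {"playerid": "aaronha01", "yearid": 1957},
]

transformed = [
    {"playerid": r["playerid"],
     "yearid": r["yearid"],
     "tenyearslater": r["yearid"] + 10}
    for r in rows
]

for r in transformed:
    print(r["playerid"], r["yearid"], r["tenyearslater"])
```

In the Spark version, the same derived column is produced by withColumn and then persisted with saveAsTable so that it is registered in the Hive Metastore.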