Available in VPC
The following describes how to integrate NAVER Cloud Platform's Cloud Hadoop with Data Catalog.
Preparations
- Subscribe to Data Catalog.
- For more information on using Data Catalog, see the Getting started with Data Catalog guide.
- Create a Cloud Hadoop cluster.
- When creating the Cloud Hadoop cluster, set the Hive Metastore repository to Data Catalog.
- For more information on creating a Cloud Hadoop cluster, see the Getting started with Cloud Hadoop guide.

- Integration is supported with Cloud Hadoop version 2.0 or later.
- If you choose to configure Kerberos authentication, Data Catalog integration may be limited.
Utilize Data Catalog from Cloud Hadoop
The following describes how to use data scanned from Data Catalog in Cloud Hadoop.
Note
Cloud DB data scanned by Data Catalog cannot be utilized in Cloud Hadoop.
Scan Object Storage data
- Download the sample data. Unzip it and upload the AllstarFull.csv file to your Object Storage bucket.
Note
The provided sample data is a portion of Lahman's Baseball Database Version 2012, and all copyrights of the data belong to Sean Lahman.
- Create a connection.
- For more information on creating a connection, see Connection guide.
- Create a scanner to scan the AllstarFull.csv file that you uploaded to the Object Storage bucket.
- For more information on creating and running a scanner, see Scanner guide.
- Run the scanner to verify the table you have created.
- Make sure the table is created with the name formed by the Prefix value you set in the scanner followed by allstarfull_csv.
Use scanned data in Hive
Search data
hive> select * from hadoop_allstarfull_csv limit 5;
The results are as follows:
aaronha01 1955 0 NLS195507120 ML1 NL 1 NULL
aaronha01 1956 0 ALS195607100 ML1 NL 1 NULL
aaronha01 1957 0 NLS195707090 ML1 NL 1 9
aaronha01 1958 0 ALS195807080 ML1 NL 1 9
aaronha01 1959 1 NLS195907070 ML1 NL 1 9
- If a column in the table has the datetime type, which Hive does not support, the data cannot be retrieved normally. In that case, change the column type to a compatible type and then query again.
- If the data format is json, you cannot retrieve the data properly unless it is ndjson. Change the file format to ndjson and then query again.
Note
For more information on the ndjson format, see ndjson.
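For illustration, converting a regular JSON array into ndjson just means writing each record as its own line. This is a minimal Python sketch; the record contents below are hypothetical sample values, not part of the actual dataset:

```python
import json

def to_ndjson(records):
    """Serialize a list of records as ndjson: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)

# Hypothetical input: an ordinary JSON array of records...
records = json.loads(
    '[{"playerid": "aaronha01", "yearid": 1955},'
    ' {"playerid": "aaronha01", "yearid": 1956}]'
)

# ...rewritten so each record is a complete JSON object on its own line.
print(to_ndjson(records))
```

Each line of the output is an independently parseable JSON object, which is what the ndjson format requires.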
Use scanned data in Spark
Search data
var allstar = spark.table("default.hadoop_allstarfull_csv")
var header = allstar.first()
allstar = allstar.filter(row => row != header)
allstar.show()
- When reading a table stored in the Hive Metastore using Spark, the header values are included. You can exclude the header values by using
filter.
The results are as follows:
+---------+------+-------+------------+------+----+---+-----------+
| playerid|yearid|gamenum| gameid|teamid|lgid| gp|startingpos|
+---------+------+-------+------------+------+----+---+-----------+
|aaronha01| 1955| 0|NLS195507120| ML1| NL| 1| null|
|aaronha01| 1956| 0|ALS195607100| ML1| NL| 1| null|
|aaronha01| 1957| 0|NLS195707090| ML1| NL| 1| 9|
|aaronha01| 1958| 0|ALS195807080| ML1| NL| 1| 9|
|aaronha01| 1959| 1|NLS195907070| ML1| NL| 1| 9|
|aaronha01| 1959| 2|NLS195908030| ML1| NL| 1| 9|
Edit data
var transformed_allstar = allstar.select($"playerid", $"yearid").withColumn("tenyearslater", $"yearid" + 10)
transformed_allstar.show()
transformed_allstar.write
.format("hive")
.mode("overwrite")
.option("path", "s3a://{bucketName}/transformed_allstar_csv/")
.option("header", "true")
.saveAsTable("transformed_allstar_csv")
- This example creates a new column, tenyearslater, by adding 10 to the values of the existing yearid column, and saves the resulting table in the Hive Metastore.
The results are as follows:
+---------+------+-------------+
| playerid|yearid|tenyearslater|
+---------+------+-------------+
|aaronha01| 1955| 1965|
|aaronha01| 1956| 1966|
|aaronha01| 1957| 1967|
|aaronha01| 1958| 1968|
|aaronha01| 1959| 1969|