Integrating with Cloud Hadoop and Data Catalog
Available in VPC
The following describes how to integrate NAVER Cloud Platform's Cloud Hadoop with Data Catalog.

Preparations
Subscribe to Data Catalog.
- For more information on using Data Catalog, see the Getting started with Data Catalog guide.
Create a Cloud Hadoop cluster.
- When creating a Cloud Hadoop cluster, set the Hive Metastore repository to Data Catalog.
- For more information on creating Cloud Hadoop, see Getting started with Cloud Hadoop.
- Integration is supported for Cloud Hadoop version 2.0 or higher.
- If you select to configure Kerberos authentication, Data Catalog integration may be limited.
Utilize Data Catalog from Cloud Hadoop
The following describes how to use data scanned from Data Catalog in Cloud Hadoop.
Note: Cloud DB data scanned by Data Catalog cannot be used in Cloud Hadoop.
Scan Object Storage data
Download the sample data, unzip it, and upload the `AllstarFull.csv` file to your Object Storage bucket.

Note: The provided sample data is a portion of Lahman's Baseball Database Version 2012, and all copyrights to the data belong to Sean Lahman.
Create a connection.
- For more information on creating a connection, see Connection guide.
Create a scanner to scan the `AllstarFull.csv` file that you uploaded to the Object Storage bucket.
- For more information on creating and running a scanner, see the Scanner guide.
Run the scanner and verify that the table has been created.
- Make sure the table is created with a name formed by the `Prefix` value you set in the scanner followed by `allstarfull_csv` (for example, `hadoop_allstarfull_csv`).
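As an illustration of the naming convention above, the table name combines the configured prefix with the normalized file name. The helper below is a hypothetical sketch, not part of any NAVER Cloud SDK; the actual name is produced by the scanner itself:

```python
def scanned_table_name(prefix: str, filename: str) -> str:
    """Illustrative only: join the scanner's Prefix value with the
    lowercased file name, replacing the extension dot with '_'."""
    parts = filename.rsplit(".", 1)
    return prefix + "_".join(parts).lower()

# With a Prefix of "hadoop_", AllstarFull.csv yields the table name
# used throughout this guide:
print(scanned_table_name("hadoop_", "AllstarFull.csv"))  # hadoop_allstarfull_csv
```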
Use scanned data in Hive
Search data
hive> select * from hadoop_allstarfull_csv limit 5;
The results are as follows:
aaronha01 1955 0 NLS195507120 ML1 NL 1 NULL
aaronha01 1956 0 ALS195607100 ML1 NL 1 NULL
aaronha01 1957 0 NLS195707090 ML1 NL 1 9
aaronha01 1958 0 ALS195807080 ML1 NL 1 9
aaronha01 1959 1 NLS195907070 ML1 NL 1 9
- If a column type of the table is `datetime`, Hive does not support it, so the data cannot be retrieved normally. In such cases, change the column type to a compatible type and then search again.
- If the data format is `json`, the data cannot be retrieved properly unless it is in `ndjson` format. Change the file format to `ndjson` and then search again. For more information on the `ndjson` format, see ndjson.
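Since `ndjson` simply means one JSON object per line, converting an array-style JSON file is usually a small step. The sketch below is a hypothetical example using only the Python standard library, not a feature of Data Catalog or Cloud Hadoop:

```python
import json

def to_ndjson(json_text: str) -> str:
    """Convert a JSON array of objects into ndjson: one object per line."""
    records = json.loads(json_text)
    return "\n".join(json.dumps(record) for record in records)

array_style = '[{"playerid": "aaronha01", "yearid": 1955}, {"playerid": "aaronha01", "yearid": 1956}]'
print(to_ndjson(array_style))
# {"playerid": "aaronha01", "yearid": 1955}
# {"playerid": "aaronha01", "yearid": 1956}
```

Upload the converted file to the bucket in place of the original before scanning again.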
Use scanned data in Spark
Search data
var allstar = spark.table("default.hadoop_allstarfull_csv")
var header = allstar.first()
allstar = allstar.filter(row => row != header)
allstar.show()
- When reading a table stored in the Hive Metastore with Spark, the header row is included in the data. You can exclude it with `filter`, as shown above.
The results are as follows:
+---------+------+-------+------------+------+----+---+-----------+
| playerid|yearid|gamenum| gameid|teamid|lgid| gp|startingpos|
+---------+------+-------+------------+------+----+---+-----------+
|aaronha01| 1955| 0|NLS195507120| ML1| NL| 1| null|
|aaronha01| 1956| 0|ALS195607100| ML1| NL| 1| null|
|aaronha01| 1957| 0|NLS195707090| ML1| NL| 1| 9|
|aaronha01| 1958| 0|ALS195807080| ML1| NL| 1| 9|
|aaronha01| 1959| 1|NLS195907070| ML1| NL| 1| 9|
|aaronha01| 1959| 2|NLS195908030| ML1| NL| 1| 9|
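The header-exclusion step works because the CSV's header row is stored as an ordinary data row in the table, so filtering drops every row equal to the first one. The same idea in plain Python (an illustration only, not Spark code):

```python
rows = [
    ("playerid", "yearid"),  # header row stored as data
    ("aaronha01", "1955"),
    ("aaronha01", "1956"),
]
header = rows[0]
data_rows = [row for row in rows if row != header]
print(data_rows)  # [('aaronha01', '1955'), ('aaronha01', '1956')]
```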
Edit data
var transformed_allstar = allstar.select($"playerid", $"yearid").withColumn("tenyearslater", $"yearid" + 10)
transformed_allstar.show()
transformed_allstar.write
.format("hive")
.mode("overwrite")
.option("path", "s3a://{bucketName}/transformed_allstar_csv/")
.option("header", "true")
.saveAsTable("transformed_allstar_csv")
- This example creates a new column `tenyearslater` by adding 10 to the values of the existing `yearid` column, and saves the resulting table to the Hive Metastore.
The results are as follows:
+---------+------+-------------+
| playerid|yearid|tenyearslater|
+---------+------+-------------+
|aaronha01| 1955| 1965|
|aaronha01| 1956| 1966|
|aaronha01| 1957| 1967|
|aaronha01| 1958| 1968|
|aaronha01| 1959| 1969|
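The derived-column logic above amounts to adding 10 to each `yearid` value. A plain-Python equivalent of the transformation (illustrative only; the actual job runs in Spark as shown above):

```python
# A few sample rows from the scanned table: (playerid, yearid)
allstar = [("aaronha01", 1955), ("aaronha01", 1956), ("aaronha01", 1957)]

# Select playerid and yearid, and derive tenyearslater = yearid + 10.
transformed = [
    {"playerid": p, "yearid": y, "tenyearslater": y + 10}
    for p, y in allstar
]
print(transformed[0])  # {'playerid': 'aaronha01', 'yearid': 1955, 'tenyearslater': 1965}
```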