Integrate Cloud Hadoop with Data Catalog


Available in VPC

The following describes how to integrate NAVER Cloud Platform's Cloud Hadoop with Data Catalog.

Preparations

  1. Subscribe to Data Catalog.
  2. Create a Cloud Hadoop cluster.
  • When creating the Cloud Hadoop cluster, set the Hive Metastore repository to Data Catalog.
  • For more information on creating a Cloud Hadoop cluster, see Getting started with Cloud Hadoop.


  • Integration is supported for Cloud Hadoop version 2.0 or later.
  • If you configure Kerberos authentication when creating the cluster, Data Catalog integration may be limited.

Utilize Data Catalog from Cloud Hadoop

The following describes how to use data scanned from Data Catalog in Cloud Hadoop.

Note

Cloud DB data scanned by Data Catalog cannot be utilized in Cloud Hadoop.

Scan Object Storage data

  1. Download sample data. Unzip the data and upload the AllstarFull.csv file to your Object Storage bucket.
Note

The provided sample data is a portion of Lahman's Baseball Database Version 2012, and all copyrights of the data belong to Sean Lahman.

  2. Create a connection.
  • For more information on creating a connection, see Connection guide.
  3. Create a scanner to scan the AllstarFull.csv file uploaded to the Object Storage bucket.
  • For more information on creating and running a scanner, see Scanner guide.
  4. Run the scanner, and then verify the created table.
  • Make sure the table is created with the name Prefix value + allstarfull_csv, where Prefix is the value you set in the scanner.
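The naming convention above can be sketched as follows. This is a minimal illustration, assuming the scanner derives the table name by prepending the configured prefix to the lowercased file name and replacing non-alphanumeric characters (such as ".") with underscores; the exact rule applied by the scanner is not specified here.

```python
# Illustrative sketch (assumption): prefix + lowercased file name,
# with non-alphanumeric characters replaced by underscores.
import re

def scanner_table_name(prefix: str, file_name: str) -> str:
    """Derive the table name a scanner would register for a scanned file."""
    return prefix + re.sub(r"[^0-9a-z]", "_", file_name.lower())

print(scanner_table_name("hadoop_", "AllstarFull.csv"))  # hadoop_allstarfull_csv
```

With the prefix set to hadoop_, the scanned file AllstarFull.csv produces the table name hadoop_allstarfull_csv used in the Hive and Spark examples below.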

Use scanned data in Hive

Search data

hive> select * from hadoop_allstarfull_csv limit 5;

The results are as follows:

aaronha01 1955 0 NLS195507120 ML1 NL 1 NULL
aaronha01 1956 0 ALS195607100 ML1 NL 1 NULL
aaronha01 1957 0 NLS195707090 ML1 NL 1 9
aaronha01 1958 0 ALS195807080 ML1 NL 1 9
aaronha01 1959 1 NLS195907070 ML1 NL 1 9
  • Hive does not support the datetime column type, so if a table column has this type, the data cannot be retrieved normally. In that case, change the column to a Hive-compatible type (such as timestamp) and query again.
  • If the data format is JSON, the data cannot be retrieved properly unless it is in ndjson format. Convert the file to ndjson, and then scan and query it again.
Note

For more information on the ndjson format, see ndjson.
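The conversion mentioned above can be sketched in a few lines: ndjson stores one JSON object per line, rather than a single JSON array. The record fields below are illustrative, borrowed from the sample data.

```python
# Minimal sketch: serialize a list of records as ndjson
# (newline-delimited JSON, one object per line).
import json

def to_ndjson(records) -> str:
    """Convert a list of JSON-serializable objects to ndjson text."""
    return "\n".join(json.dumps(rec) for rec in records)

records = [
    {"playerid": "aaronha01", "yearid": 1955},
    {"playerid": "aaronha01", "yearid": 1956},
]
print(to_ndjson(records))
```

Each line of the result is an independent JSON object, which is what allows the file to be scanned and queried row by row.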

Use scanned data in Spark

Search data

var allstar = spark.table("default.hadoop_allstarfull_csv")
var header = allstar.first()
allstar = allstar.filter(row => row != header)
allstar.show()
  • When reading a table stored in the Hive Metastore with Spark, the header row is included in the data. You can exclude it by using filter, as above.

The results are as follows:

+---------+------+-------+------------+------+----+---+-----------+
| playerid|yearid|gamenum| gameid|teamid|lgid| gp|startingpos|
+---------+------+-------+------------+------+----+---+-----------+
|aaronha01| 1955| 0|NLS195507120| ML1| NL| 1| null|
|aaronha01| 1956| 0|ALS195607100| ML1| NL| 1| null|
|aaronha01| 1957| 0|NLS195707090| ML1| NL| 1| 9|
|aaronha01| 1958| 0|ALS195807080| ML1| NL| 1| 9|
|aaronha01| 1959| 1|NLS195907070| ML1| NL| 1| 9|
|aaronha01| 1959| 2|NLS195908030| ML1| NL| 1| 9|
+---------+------+-------+------------+------+----+---+-----------+

Edit data

var transformed_allstar = allstar.select($"playerid", $"yearid").withColumn("tenyearslater", $"yearid" + 10)
transformed_allstar.show()

transformed_allstar.write
.format("hive")
.mode("overwrite")
.option("path", "s3a://{bucketName}/transformed_allstar_csv/")
.option("header", "true")
.saveAsTable("transformed_allstar_csv")
  • This example creates the new column tenyearslater by adding 10 to the values of the existing yearid column, and saves the resulting table to the Hive Metastore.

The results are as follows:

+---------+------+-------------+
| playerid|yearid|tenyearslater|
+---------+------+-------------+
|aaronha01| 1955| 1965|
|aaronha01| 1956| 1966|
|aaronha01| 1957| 1967|
|aaronha01| 1958| 1968|
|aaronha01| 1959| 1969|
+---------+------+-------------+
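The transformation performed by the Spark example above can be sketched locally without a cluster; the sample rows below mirror the output shown.

```python
# Local sketch of the Spark transformation: keep playerid and yearid,
# and derive tenyearslater = yearid + 10 for each row.
rows = [
    {"playerid": "aaronha01", "yearid": 1955},
    {"playerid": "aaronha01", "yearid": 1956},
    {"playerid": "aaronha01", "yearid": 1957},
]

transformed = [
    {"playerid": r["playerid"],
     "yearid": r["yearid"],
     "tenyearslater": r["yearid"] + 10}
    for r in rows
]

for r in transformed:
    print(r["playerid"], r["yearid"], r["tenyearslater"])
```

In the Spark version, the same derived column is produced by withColumn and then persisted with saveAsTable so that it is registered in the Hive Metastore.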