Using Sqoop

Available in VPC

Sqoop (short for "SQL to Hadoop") is open-source software designed for transferring data between Hadoop and relational databases.
With a simple command line interface (CLI), it lets you easily move a specific table, or rows matching specific conditions, from an RDBMS such as Oracle or MySQL into HDFS. The imported data can then be used immediately in Hive, Pig, HBase, and similar tools. Sqoop can also move data stored in HDFS back into an RDBMS.

The first version of Sqoop was released in 2009, and it has been evolving continuously since becoming a top-level Apache project in 2012. Apache Sqoop currently exists in two versions: Sqoop 1 and Sqoop 2. Sqoop 1 uses a client-side model, while Sqoop 2 adds a server-side model on top of what Sqoop 1 provides.

Sqoop is installed by default on all cluster types supported by Cloud Hadoop in NAVER Cloud Platform.
This guide explains how to copy data from MySQL to Cloud Hadoop using Sqoop; a brief sketch of the reverse direction (exporting data from HDFS back to MySQL) appears at the end of this guide.
For other methods of using Sqoop, refer to Apache Sqoop User Guide.

Note

This guide is based on MySQL 5. If you migrate to MySQL 8, you must upgrade the JDBC connector accordingly.

A connector compatible with MySQL 8 will be provided in a future update.
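
For reference, the following is a minimal sketch of manually replacing the connector. The download URL, connector version, and Sqoop library path are assumptions and may differ in your environment; verify them before use.

# Download a MySQL 8-compatible Connector/J (version and URL are assumptions; check the MySQL site for the current release)
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar

# Copy the jar into Sqoop's library directory (path is an assumption; it depends on your Hadoop distribution)
cp mysql-connector-java-8.0.28.jar /usr/hdp/current/sqoop-client/lib/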

Preparations

Using Sqoop in Cloud Hadoop requires the following preparations.

  • ACG and ACL settings: set them so that the Cloud Hadoop cluster servers and the RDBMS server can communicate with each other (see Setting firewall (ACG))
  • RDBMS: the RDBMS where the original data is stored

Proceed with the following steps in order after completing the preparations.

1. Connect to Hadoop cluster

Connect to the master node of the Cloud Hadoop cluster you want to work with.
Refer to How to connect via SSH to cluster node for instructions on connecting to the cluster's master node.

2. Connect to the MySQL server and check the target data to import

  1. Run the command below to check whether the MySQL client needed for remote access to the MySQL server is installed.

    • By default, the MySQL client is installed on the master nodes (two nodes).
    [root@localhost ~]# mysql -V
    mysql  Ver 14.14 Distrib 5.6.41, for Linux (x86_64) using  EditLine wrapper
    
    Note

    To access MySQL remotely from the data node, install the MySQL client using the following command.

    [root@localhost ~]# yum -y install mysql
    
  2. Run the following command to check if the connection to the MySQL server is established.

    mysql -h [MySQL DB server's public IP] -p
    


    Connection example

    [root@objobj001 ~]# mysql -h 10.10.10.10 -p
    Enter password:
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 37
    Server version: 5.6.35-log MySQL Community Server (GPL)
    Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
    
    Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.
    
    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    
    mysql>
    
    Caution

    If the connection cannot be established, check whether the ACG allows access to the MySQL DB port. Also check the MySQL DB's ACL list.

  3. Check the data to be imported into the Hadoop cluster's HDFS from the MySQL DB server.

    • The example below shows the SQOOP_TEST table created in advance in the database called sqoop.
    mysql> show databases;
    +--------------------+
    | Database           |
    +--------------------+
    | information_schema |
    | mysql              |
    | performance_schema |
    | sqoop              |
    +--------------------+
    4 rows in set (0.00 sec)
    
    mysql> use sqoop
    Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A
    
    Database changed
    mysql> show tables;
    +-----------------+
    | Tables_in_sqoop |
    +-----------------+
    | SQOOP_TEST      |
    +-----------------+
    1 row in set (0.00 sec)
    
    mysql> select * from SQOOP_TEST;
    +--------------+--------+-----------+------+---------------------+------+------+--------+
    | SQOOP_TESTNO | ENAME  | JOB       | MGR  | HIREDATE            | SAL  | COMM | DEPTNO |
    +--------------+--------+-----------+------+---------------------+------+------+--------+
    |         7369 | SMITH  | CLERK     | 7902 | 1980-12-17 00:00:00 |  800 | NULL |     20 |
    |         7499 | ALLEN  | SALESMAN  | 7698 | 1981-02-20 00:00:00 | 1600 |  300 |     30 |
    |         7521 | WARD   | SALESMAN  | 7698 | 1981-02-22 00:00:00 | 1250 |  500 |     30 |
    |         7566 | JONES  | MANAGER   | 7839 | 1981-04-02 00:00:00 | 2975 | NULL |     20 |
    |         7654 | MARTIN | SALESMAN  | 7698 | 1981-09-28 00:00:00 | 1250 | 1400 |     30 |
    |         7698 | BLAKE  | MANAGER   | 7839 | 1981-05-01 00:00:00 | 2850 | NULL |     30 |
    |         7782 | CLARK  | MANAGER   | 7839 | 1981-06-09 00:00:00 | 2450 | NULL |     10 |
    |         7788 | SCOTT  | ANALYST   | 7566 | 1987-04-19 00:00:00 | 3000 | NULL |     20 |
    |         7839 | KING   | PRESIDENT | NULL | 1981-11-17 00:00:00 | 5000 | NULL |     10 |
    |         7844 | TURNER | SALESMAN  | 7698 | 1981-09-08 00:00:00 | 1500 |    0 |     30 |
    |         7876 | ADAMS  | CLERK     | 7788 | 1987-05-23 00:00:00 | 1100 | NULL |     20 |
    |         7900 | JAMES  | CLERK     | 7698 | 1981-12-03 00:00:00 |  950 | NULL |     30 |
    |         7902 | FORD   | ANALYST   | 7566 | 1981-12-03 00:00:00 | 3000 | NULL |     20 |
    |         7934 | MILLER | CLERK     | 7782 | 1982-01-23 00:00:00 | 1300 | NULL |     10 |
    +--------------+--------+-----------+------+---------------------+------+------+--------+
    14 rows in set (0.00 sec)
    

3. Import data with the Sqoop command

  1. Run the command below to switch to the user account set during cluster installation.

    [root@objobj001 ~]# su - sqoop_test
    [sqoop_test@objobj001 ~]$
    
  2. Import the data using the Sqoop import command.

    • The import command lets you import specific databases, specific tables, query results, and more. Refer to the option list below for the general options of a Sqoop import; a query-based import sketch follows the list.
    sqoop import --connect jdbc:mysql://[MySQL DB server public IP]/[database name] --username [user name] --password [password] --table [target table]
    
    Argument                                Description
    --connect <jdbc-uri>                    Designates the JDBC connection string
    --connection-manager <class-name>       Designates the connection manager class to use
    --driver <class-name>                   Manually specifies the JDBC driver class to use
    --hadoop-home <dir>                     Designates $HADOOP_HOME
    --help                                  Prints help
    -P                                      Reads the password from the console
    --password <password>                   Sets the authentication password
    --username <username>                   Sets the authentication username
    --verbose                               Prints more information while working
    --connection-param-file <filename>      Designates an optional properties file that provides connection parameters
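
Besides importing a whole table, you can import the result of a free-form query. The following is a minimal sketch: the server IP, account, query, and target HDFS directory are assumptions used for illustration. A query-based import requires the literal $CONDITIONS token in the WHERE clause together with --target-dir.

sqoop import \
--connect jdbc:mysql://10.10.10.10/sqoop \
--username root -P \
--query 'SELECT SQOOP_TESTNO, ENAME, SAL FROM SQOOP_TEST WHERE SAL >= 1000 AND $CONDITIONS' \
--target-dir /user/sqoop_test/sqoop_test_query \
-m 1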

Sqoop imports data through MapReduce operations. Below is an example of the most typical method of importing a specific table from a particular database.

[sqoop_test@objobj001 ~]$ sqoop import --connect jdbc:mysql://10.10.10.10/sqoop \
--username root \
-P --table sqoop_test \
--hive-import --create-hive-table --hive-table sqoop_workspace.sqoop_test -m 1

...

18/08/03 09:06:38 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/08/03 09:06:38 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm1]
18/08/03 09:06:42 INFO db.DBInputFormat: Using read commited transaction isolation
18/08/03 09:06:42 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`SQOOP_TESTNO`), MAX(`SQOOP_TESTNO`) FROM `SQOOP_TEST`
18/08/03 09:06:42 INFO db.IntegerSplitter: Split size: 141; Num splits: 4 from: 7369 to: 7934
18/08/03 09:06:42 INFO mapreduce.JobSubmitter: number of splits:4
18/08/03 09:06:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1532345019075_0002
18/08/03 09:06:43 INFO impl.YarnClientImpl: Submitted application application_1532345019075_0002
18/08/03 09:06:43 INFO mapreduce.Job: The url to track the job: http://m-002-obj001.hd:8088/proxy/application_1532345019075_0002/
18/08/03 09:06:43 INFO mapreduce.Job: Running job: job_1532345019075_0002
18/08/03 09:06:49 INFO mapreduce.Job: Job job_1532345019075_0002 running in uber mode : false
18/08/03 09:06:49 INFO mapreduce.Job:  map 0% reduce 0%
18/08/03 09:06:57 INFO mapreduce.Job:  map 75% reduce 0%
18/08/03 09:07:38 INFO mapreduce.Job:  map 100% reduce 0%
18/08/03 09:07:40 INFO mapreduce.Job: Job job_1532345019075_0002 completed successfully
18/08/03 09:07:40 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=683688
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                S3A: Number of bytes read=497
                S3A: Number of bytes written=853
                S3A: Number of read operations=44
                S3A: Number of large read operations=0
                S3A: Number of write operations=36
        Job Counters
                Launched map tasks=4
                Other local map tasks=4
                Total time spent by all maps in occupied slots (ms)=277580
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=55516
                Total vcore-milliseconds taken by all map tasks=55516
                Total megabyte-milliseconds taken by all map tasks=170545152
        Map-Reduce Framework
                Map input records=14
                Map output records=14
                Input split bytes=497
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=248
                CPU time spent (ms)=5690
                Physical memory (bytes) snapshot=1087877120
                Virtual memory (bytes) snapshot=12255760384
                Total committed heap usage (bytes)=702545920
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=853
18/08/03 09:07:40 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 62.4343 seconds (0 bytes/sec)
18/08/03 09:07:40 INFO mapreduce.ImportJobBase: Retrieved 14 records.
Caution

The table to import must have a primary key. If the table does not have one, the following error is returned. In that case, specify a split column with --split-by or perform a sequential import with -m 1, as in the sketch after the error message.

18/08/03 09:00:25 ERROR tool.ImportTool: Error during import: No primary key could be found for table SQOOP_TEST. Please specify one with --split-by or perform a sequential import with '-m 1'.
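
The following is a minimal sketch of both workarounds, assuming the same connection details as in the example above; SQOOP_TESTNO is chosen as the split column because it is the numeric key-like column of the example table.

# Split the work across mappers on an explicitly chosen column
sqoop import --connect jdbc:mysql://10.10.10.10/sqoop --username root -P \
--table SQOOP_TEST --split-by SQOOP_TESTNO

# Or skip parallel splitting entirely and run a single mapper
sqoop import --connect jdbc:mysql://10.10.10.10/sqoop --username root -P \
--table SQOOP_TEST -m 1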

4. View HDFS data in Hive

  1. Use Hive to check whether the table was stored properly.
    • For more details about how to use Hive, refer to Using Hive.
  2. Write a query that verifies the imported table was stored properly, and then execute it.
SELECT * FROM sqoop_workspace.sqoop_test;
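
As noted in the introduction, Sqoop can also move data from HDFS back into an RDBMS, using the export command. The following is a minimal sketch: the target table name, HDFS directory, and field delimiter are assumptions; the target table must already exist in MySQL, and the delimiter must match the one the HDFS files were written with.

sqoop export \
--connect jdbc:mysql://10.10.10.10/sqoop \
--username root -P \
--table SQOOP_TEST_EXPORT \
--export-dir /user/sqoop_test/sqoop_test_query \
--input-fields-terminated-by ',' \
-m 1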
