Using Sqoop
Available in VPC
SQL to Hadoop (Sqoop) is open-source software designed to transfer data between Hadoop and relational databases.
With a simple CLI, you can easily move tables from RDBMSs such as Oracle and MySQL, or only the data that meets specific conditions, to HDFS, and you can load that data directly into Hive, Pig, HBase, etc. Conversely, you can move data stored in HDFS to an RDBMS.
Since its first release in 2009, Sqoop has evolved continuously, becoming a top-level Apache project in 2012. Apache Sqoop currently exists in two versions: Sqoop 1 and Sqoop 2. Sqoop 1 is a client-side tool, while Sqoop 2 adds a server-side architecture on top of the existing Sqoop 1 approach.
Sqoop is installed by default on the cluster types supported by Cloud Hadoop of NAVER Cloud Platform.
This guide explains how to copy MySQL data to Cloud Hadoop using Sqoop, and how to export it in the other direction.
For other ways to use Sqoop, see the Apache Sqoop documentation.
This guide is based on MySQL Ver. 5. If migrating to MySQL Ver. 8, you need to update the version of the connector.
The connector version for Ver. 8 will be updated soon.
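When a newer connector is required, the usual mechanism is to place the MySQL Connector/J jar where Sqoop can load it. The sketch below only demonstrates that mechanism: both paths are hypothetical (the real Sqoop lib directory varies by cluster image), and the jar is simulated with an empty file for the demo.

```shell
# Sketch: to use a newer MySQL Connector/J, drop its jar into Sqoop's lib directory.
# SQOOP_LIB is a hypothetical path; on a real cluster it might be something like
# /usr/hdp/current/sqoop-client/lib, depending on the image.
SQOOP_LIB="${SQOOP_LIB:-$HOME/sqoop-lib-demo}"
mkdir -p "$SQOOP_LIB"

CONNECTOR_JAR="mysql-connector-java-demo.jar"   # stand-in for the real connector jar
touch "$CONNECTOR_JAR"                          # simulate a downloaded jar for this demo

cp "$CONNECTOR_JAR" "$SQOOP_LIB/"
```

On a real cluster you would download the actual Connector/J jar for your MySQL version instead of creating an empty file.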
Preparations
To use Sqoop in Cloud Hadoop, you need to complete the following preparations:
| Item | Description | Guide |
|---|---|---|
| ACG and ACL settings | Set them to enable communication between the Cloud Hadoop cluster server and the RDBMS server | Firewall settings (ACG) |
| RDBMS | RDBMS with the original data saved | |
Proceed with the following steps in order after completing the preparations.
- 1. Connect to Hadoop cluster
- 2. Connect to MySQL and check the target data to import
- 3. Import data with the Sqoop command
- 4. Check HDFS data on Hive
1. Connect to Hadoop cluster
Connect to the master node of the Cloud Hadoop cluster you want to work with.
For how to access the master node of a cluster, see the Connect cluster nodes through SSH guide.
2. Connect to MySQL and check the target data to import
Run the following command to check whether the MySQL client, needed for remote access to the MySQL server, is installed.
- By default, the MySQL client is installed on the master node (2 nodes).
```
[sshuser@m-001-example-1fx8-hd ~]$ mysql -V
mysql  Ver 14.14 Distrib 5.7.37, for Linux (x86_64) using  EditLine wrapper
```
Run the following command to check if the connection to the MySQL server is established.
```
mysql -h mysql.local -u [Cluster administrator account name] -p
```

Connection example:

```
[sshuser@m-001-example-1fx8-hd ~]$ mysql -u example -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 37
Server version: 5.7.37-log MySQL Community Server (GPL)

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>
```
Caution: If the connection is not established properly, check that the MySQL DB port is open in the ACG, and check the ACL list of the MySQL DB.
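Before digging into ACG/ACL settings, a quick TCP-level check from the master node can tell you whether the MySQL port is reachable at all. This is a minimal sketch; the host name `mysql.local` and port 3306 are placeholders for your environment.

```shell
# Quick TCP reachability check before troubleshooting ACG/ACL settings.
# The host and port arguments are placeholders; adjust them for your MySQL server.
check_port() {
  # bash's /dev/tcp pseudo-device opens a TCP connection;
  # a nonzero exit status means the port is unreachable.
  timeout 3 bash -c "cat < /dev/null > /dev/tcp/$1/$2" 2>/dev/null
}

if check_port mysql.local 3306; then
  echo "port reachable"
else
  echo "port not reachable - check ACG and ACL settings"
fi
```

If the port is unreachable here, fixing the ACG/ACL is the first step; if it is reachable but `mysql` still fails, the problem is more likely credentials or the MySQL user's host restrictions.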
Check the data to be imported to Hadoop cluster HDFS from the MySQL DB server of the user.
- The following example shows the `SQOOP_TEST` table created in advance in a database called `sqoop`.
```
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sqoop              |
+--------------------+
4 rows in set (0.00 sec)

mysql> use sqoop
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+-----------------+
| Tables_in_sqoop |
+-----------------+
| SQOOP_TEST      |
+-----------------+
1 row in set (0.00 sec)

mysql> select * from SQOOP_TEST;
+--------------+--------+-----------+------+---------------------+------+------+--------+
| SQOOP_TESTNO | ENAME  | JOB       | MGR  | HIREDATE            | SAL  | COMM | DEPTNO |
+--------------+--------+-----------+------+---------------------+------+------+--------+
|         7369 | SMITH  | CLERK     | 7902 | 1980-12-17 00:00:00 |  800 | NULL |     20 |
|         7499 | ALLEN  | SALESMAN  | 7698 | 1981-02-20 00:00:00 | 1600 |  300 |     30 |
|         7521 | WARD   | SALESMAN  | 7698 | 1981-02-22 00:00:00 | 1250 |  500 |     30 |
|         7566 | JONES  | MANAGER   | 7839 | 1981-04-02 00:00:00 | 2975 | NULL |     20 |
|         7654 | MARTIN | SALESMAN  | 7698 | 1981-09-28 00:00:00 | 1250 | 1400 |     30 |
|         7698 | BLAKE  | MANAGER   | 7839 | 1981-05-01 00:00:00 | 2850 | NULL |     30 |
|         7782 | CLARK  | MANAGER   | 7839 | 1981-06-09 00:00:00 | 2450 | NULL |     10 |
|         7788 | SCOTT  | ANALYST   | 7566 | 1987-04-19 00:00:00 | 3000 | NULL |     20 |
|         7839 | KING   | PRESIDENT | NULL | 1981-11-17 00:00:00 | 5000 | NULL |     10 |
|         7844 | TURNER | SALESMAN  | 7698 | 1981-09-08 00:00:00 | 1500 |    0 |     30 |
|         7876 | ADAMS  | CLERK     | 7788 | 1987-05-23 00:00:00 | 1100 | NULL |     20 |
|         7900 | JAMES  | CLERK     | 7698 | 1981-12-03 00:00:00 |  950 | NULL |     30 |
|         7902 | FORD   | ANALYST   | 7566 | 1981-12-03 00:00:00 | 3000 | NULL |     20 |
|         7934 | MILLER | CLERK     | 7782 | 1982-01-23 00:00:00 | 1300 | NULL |     10 |
+--------------+--------+-----------+------+---------------------+------+------+--------+
14 rows in set (0.00 sec)
```
3. Import data with the Sqoop command
Run the `sudo su -` command to switch to the root account.

```
[sshuser@m-001-example-1fx8-hd ~]$ sudo su -
[root@m-001-example-1fx8-hd ~]#
```
Use the Sqoop `import` command to import the data.

- The `import` command allows you to import specific databases, specific tables, query results, etc.

```
sqoop import --connect jdbc:mysql://[Master node Private IP]:3306/[Database name] --username [Cluster administrator account name] --password [Cluster administrator password] --table [Target table]
```

For the general options of the Sqoop import command, see the table below.
| Argument | Description |
|---|---|
| `--connect <jdbc-uri>` | Designates the JDBC connection string |
| `--connection-manager <class-name>` | Designates the connection manager class to use |
| `--driver <class-name>` | Used to manually designate the JDBC driver class to use |
| `--hadoop-home <dir>` | Designates `$HADOOP_HOME` |
| `--help` | Displays help |
| `-P` | Reads the password from the console |
| `--password <password>` | Sets the authentication password |
| `--username <username>` | Sets the authentication username |
| `--verbose` | Prints more information while working |
| `--connection-param-file <filename>` | Designates an optional properties file that provides connection parameters |
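Passing `--password` on the command line exposes the password in the process list and shell history. If your Sqoop version supports it (`--password-file` was added in Sqoop 1.4.4), a password file is the safer alternative. A sketch, with a hypothetical file path:

```shell
# Sketch: keep the password out of the command line and shell history.
# The file path and password are placeholders; --password-file requires Sqoop 1.4.4+.
printf '%s' 'examplePassword' > "$HOME/.sqoop.pw"   # no trailing newline
chmod 400 "$HOME/.sqoop.pw"                         # readable by the owner only

# On the cluster you would then replace --password/-P with:
#   sqoop import --connect ... --password-file "file://$HOME/.sqoop.pw" ...
```

Note that the file must contain only the password, with no trailing newline or whitespace, or authentication will fail.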
Sqoop imports data through MapReduce jobs. The following is an example of the most typical method: importing a specific table from a particular database.

```
[root@m-001-example-1fx8-hd ~]# sqoop import --connect jdbc:mysql://[Master node Private IP]:3306/sqoop \
--username [Cluster administrator account name] \
-P --table SQOOP_TEST \
--hive-import --create-hive-table --hive-table sqoop_workspace.sqoop_test -m 1
```
```
...
18/08/03 09:06:38 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/08/03 09:06:38 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm1]
18/08/03 09:06:42 INFO db.DBInputFormat: Using read commited transaction isolation
18/08/03 09:06:42 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`SQOOP_TESTNO`), MAX(`SQOOP_TESTNO`) FROM `SQOOP_TEST`
18/08/03 09:06:42 INFO db.IntegerSplitter: Split size: 141; Num splits: 4 from: 7369 to: 7934
18/08/03 09:06:42 INFO mapreduce.JobSubmitter: number of splits:4
18/08/03 09:06:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1532345019075_0002
18/08/03 09:06:43 INFO impl.YarnClientImpl: Submitted application application_1532345019075_0002
18/08/03 09:06:43 INFO mapreduce.Job: The url to track the job: http://m-002-obj001.hd:8088/proxy/application_1532345019075_0002/
18/08/03 09:06:43 INFO mapreduce.Job: Running job: job_1532345019075_0002
18/08/03 09:06:49 INFO mapreduce.Job: Job job_1532345019075_0002 running in uber mode : false
18/08/03 09:06:49 INFO mapreduce.Job:  map 0% reduce 0%
18/08/03 09:06:57 INFO mapreduce.Job:  map 75% reduce 0%
18/08/03 09:07:38 INFO mapreduce.Job:  map 100% reduce 0%
18/08/03 09:07:40 INFO mapreduce.Job: Job job_1532345019075_0002 completed successfully
18/08/03 09:07:40 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=683688
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                S3A: Number of bytes read=497
                S3A: Number of bytes written=853
                S3A: Number of read operations=44
                S3A: Number of large read operations=0
                S3A: Number of write operations=36
        Job Counters
                Launched map tasks=4
                Other local map tasks=4
                Total time spent by all maps in occupied slots (ms)=277580
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=55516
                Total vcore-milliseconds taken by all map tasks=55516
                Total megabyte-milliseconds taken by all map tasks=170545152
        Map-Reduce Framework
                Map input records=14
                Map output records=14
                Input split bytes=497
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=248
                CPU time spent (ms)=5690
                Physical memory (bytes) snapshot=1087877120
                Virtual memory (bytes) snapshot=12255760384
                Total committed heap usage (bytes)=702545920
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=853
18/08/03 09:07:40 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 62.4343 seconds (0 bytes/sec)
18/08/03 09:07:40 INFO mapreduce.ImportJobBase: Retrieved 14 records.
```
The table to import must have a primary key. If the table has no primary key, the following error occurs:

```
18/08/03 09:00:25 ERROR tool.ImportTool: Error during import: No primary key could be found for table SQOOP_TEST. Please specify one with --split-by or perform a sequential import with '-m 1'.
```
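As the error message itself suggests, there are two ways around a missing primary key: run a single mapper, or name a column Sqoop can split on. The commands below are built as strings for illustration; the bracketed connection details are the same placeholders used earlier, so treat this as a sketch rather than a ready-to-run command.

```shell
# Option 1: a single mapper (-m 1), so no split column is needed.
OPT1='sqoop import --connect jdbc:mysql://[Master node Private IP]:3306/sqoop --username [account] -P --table SQOOP_TEST -m 1'

# Option 2: tell Sqoop which column to split the work on (--split-by).
OPT2='sqoop import --connect jdbc:mysql://[Master node Private IP]:3306/sqoop --username [account] -P --table SQOOP_TEST --split-by SQOOP_TESTNO'

echo "$OPT1"
echo "$OPT2"
```

Option 1 is simplest but loses parallelism; option 2 keeps parallel mappers, provided the chosen column has reasonably evenly distributed values.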
4. Check HDFS data on Hive
- Check if the table is saved properly through Hive.
- For instructions on how to use Hive, see the Using Hive guide.
- Write a query that verifies the imported table is stored properly, and then run it.

```
SELECT * FROM sqoop_workspace.sqoop_test;
```
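If you prefer to check from the shell instead of an interactive Hive session, the same verification can be done with `hive -e`. This is a sketch: the `hive` CLI on the master node and the table created by the import above are assumptions, and `echo` is prepended here so the line is a dry run.

```shell
# Sketch: run the verification query non-interactively.
# Drop the leading 'echo' on the cluster to actually execute the query.
HQL='SELECT COUNT(*) FROM sqoop_workspace.sqoop_test;'
echo hive -e "$HQL"
```

Comparing the count against the 14 rows seen in MySQL earlier is a quick sanity check that the import was complete.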
Check the FAQs first.
Q. I want to use the Sqoop client on the master node of the Cloud Hadoop service. Do I also have to add the JDBC driver to the data nodes?
A. No, you do not need to install the JDBC driver on the data nodes. As long as the JDBC driver is available on the server where you run the Sqoop command, that is sufficient.