Using Sqoop

Available in VPC

Sqoop (short for "SQL to Hadoop") is open-source software designed for transferring data between Hadoop and relational databases.
With a simple command line interface (CLI), it lets you easily move a specific table, or rows matching specific conditions, from an RDBMS such as Oracle or MySQL into HDFS. The imported data can then be used immediately in Hive, Pig, HBase, and similar tools. Sqoop can also move data stored in HDFS back into an RDBMS.

The first version of Sqoop was released in 2009, and it has been evolving continuously since becoming a top-level Apache project in 2012. Apache Sqoop currently exists in two versions: Sqoop 1 and Sqoop 2. Sqoop 1 uses a client-side model, while Sqoop 2 adds a server-side model on top of what Sqoop 1 provides.

Sqoop is installed by default on all cluster types supported by Cloud Hadoop in NAVER Cloud Platform.
This guide explains how to copy data from MySQL to Cloud Hadoop using Sqoop; a brief sketch of the reverse direction (exporting data from HDFS back to MySQL) appears at the end of this guide.
For other methods of using Sqoop, refer to Apache Sqoop User Guide.

Note

This guide is based on MySQL 5. If you migrate to MySQL 8, you must upgrade the JDBC connector accordingly.

A connector compatible with MySQL 8 will be provided in a future update.
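
For reference, the following is a minimal sketch of manually replacing the connector. The download URL, connector version, and Sqoop library path are assumptions and may differ in your environment; verify them before use.

# Download a MySQL 8-compatible Connector/J (version and URL are assumptions; check the MySQL site for the current release)
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar

# Copy the jar into Sqoop's library directory (path is an assumption; it depends on your Hadoop distribution)
cp mysql-connector-java-8.0.28.jar /usr/hdp/current/sqoop-client/lib/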

Preparations

Using Sqoop in Cloud Hadoop requires the following preparations.

  • ACG and ACL settings: set them so that the Cloud Hadoop cluster servers and the RDBMS server can communicate with each other (see Setting firewall (ACG))
  • RDBMS: the RDBMS where the original data is stored

Proceed with the following steps in order after completing the preparations.

1. Connect to Hadoop cluster

Connect to the master node of the Cloud Hadoop cluster you want to work with.
Refer to How to connect via SSH to cluster node for instructions on connecting to the cluster's master node.

2. Connect to the MySQL server and check the target data to import

  1. Run the command below to check whether the MySQL client needed for remote access to the MySQL server is installed.

    • By default, the MySQL client is installed on the master nodes (two nodes).
    [root@localhost ~]# mysql -V
    mysql  Ver 14.14 Distrib 5.6.41, for Linux (x86_64) using  EditLine wrapper
    
    Note

    To access MySQL remotely from the data node, install the MySQL client using the following command.

    [root@localhost ~]# yum -y install mysql
    
  2. Run the following command to check if the connection to the MySQL server is established.

    mysql -h [MySQL DB server's public IP] -p
    


    Connection example

    [root@objobj001 ~]# mysql -h 10.10.10.10 -p
    Enter password:
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 37
    Server version: 5.6.35-log MySQL Community Server (GPL)
    Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
    
    Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.
    
    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    
    mysql>
    
    Caution

    If the connection cannot be established, check whether the ACG allows access to the MySQL DB port. Also check the MySQL DB's ACL list.

  3. Check the data to be imported into the Hadoop cluster's HDFS from the MySQL DB server.

    • The example below shows the SQOOP_TEST table created in advance in the database called sqoop.
    mysql> show databases;
    +--------------------+
    | Database           |
    +--------------------+
    | information_schema |
    | mysql              |
    | performance_schema |
    | sqoop              |
    +--------------------+
    4 rows in set (0.00 sec)
    
    mysql> use sqoop
    Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A
    
    Database changed
    mysql> show tables;
    +-----------------+
    | Tables_in_sqoop |
    +-----------------+
    | SQOOP_TEST      |
    +-----------------+
    1 row in set (0.00 sec)
    
    mysql> select * from SQOOP_TEST;
    +--------------+--------+-----------+------+---------------------+------+------+--------+
    | SQOOP_TESTNO | ENAME  | JOB       | MGR  | HIREDATE            | SAL  | COMM | DEPTNO |
    +--------------+--------+-----------+------+---------------------+------+------+--------+
    |         7369 | SMITH  | CLERK     | 7902 | 1980-12-17 00:00:00 |  800 | NULL |     20 |
    |         7499 | ALLEN  | SALESMAN  | 7698 | 1981-02-20 00:00:00 | 1600 |  300 |     30 |
    |         7521 | WARD   | SALESMAN  | 7698 | 1981-02-22 00:00:00 | 1250 |  500 |     30 |
    |         7566 | JONES  | MANAGER   | 7839 | 1981-04-02 00:00:00 | 2975 | NULL |     20 |
    |         7654 | MARTIN | SALESMAN  | 7698 | 1981-09-28 00:00:00 | 1250 | 1400 |     30 |
    |         7698 | BLAKE  | MANAGER   | 7839 | 1981-05-01 00:00:00 | 2850 | NULL |     30 |
    |         7782 | CLARK  | MANAGER   | 7839 | 1981-06-09 00:00:00 | 2450 | NULL |     10 |
    |         7788 | SCOTT  | ANALYST   | 7566 | 1987-04-19 00:00:00 | 3000 | NULL |     20 |
    |         7839 | KING   | PRESIDENT | NULL | 1981-11-17 00:00:00 | 5000 | NULL |     10 |
    |         7844 | TURNER | SALESMAN  | 7698 | 1981-09-08 00:00:00 | 1500 |    0 |     30 |
    |         7876 | ADAMS  | CLERK     | 7788 | 1987-05-23 00:00:00 | 1100 | NULL |     20 |
    |         7900 | JAMES  | CLERK     | 7698 | 1981-12-03 00:00:00 |  950 | NULL |     30 |
    |         7902 | FORD   | ANALYST   | 7566 | 1981-12-03 00:00:00 | 3000 | NULL |     20 |
    |         7934 | MILLER | CLERK     | 7782 | 1982-01-23 00:00:00 | 1300 | NULL |     10 |
    +--------------+--------+-----------+------+---------------------+------+------+--------+
    14 rows in set (0.00 sec)
    

3. Import data with the Sqoop command

  1. Run the command below to switch to the user account set during cluster installation.

    [root@objobj001 ~]# su - sqoop_test
    [sqoop_test@objobj001 ~]$
    
  2. Import the data using the Sqoop import command.

    • The import command lets you import specific databases, specific tables, query results, and more. Refer to the option list below for the general options of a Sqoop import; a query-based import sketch follows the list.
    sqoop import --connect jdbc:mysql://[MySQL DB server public IP]/[database name] --username [user name] --password [password] --table [target table]
    
    Argument                                Description
    --connect <jdbc-uri>                    Designates the JDBC connection string
    --connection-manager <class-name>       Designates the connection manager class to use
    --driver <class-name>                   Manually specifies the JDBC driver class to use
    --hadoop-home <dir>                     Designates $HADOOP_HOME
    --help                                  Prints help
    -P                                      Reads the password from the console
    --password <password>                   Sets the authentication password
    --username <username>                   Sets the authentication username
    --verbose                               Prints more information while working
    --connection-param-file <filename>      Designates an optional properties file that provides connection parameters
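
Besides importing a whole table, you can import the result of a free-form query. The following is a minimal sketch: the server IP, account, query, and target HDFS directory are assumptions used for illustration. A query-based import requires the literal $CONDITIONS token in the WHERE clause together with --target-dir.

sqoop import \
--connect jdbc:mysql://10.10.10.10/sqoop \
--username root -P \
--query 'SELECT SQOOP_TESTNO, ENAME, SAL FROM SQOOP_TEST WHERE SAL >= 1000 AND $CONDITIONS' \
--target-dir /user/sqoop_test/sqoop_test_query \
-m 1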

Sqoop imports data through MapReduce operations. Below is an example of the most typical method of importing a specific table from a particular database.

[sqoop_test@objobj001 ~]$ sqoop import --connect jdbc:mysql://10.10.10.10/sqoop \
--username root \
-P --table sqoop_test \
--hive-import --create-hive-table --hive-table sqoop_workspace.sqoop_test -m 1

...

18/08/03 09:06:38 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/08/03 09:06:38 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm1]
18/08/03 09:06:42 INFO db.DBInputFormat: Using read commited transaction isolation
18/08/03 09:06:42 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`SQOOP_TESTNO`), MAX(`SQOOP_TESTNO`) FROM `SQOOP_TEST`
18/08/03 09:06:42 INFO db.IntegerSplitter: Split size: 141; Num splits: 4 from: 7369 to: 7934
18/08/03 09:06:42 INFO mapreduce.JobSubmitter: number of splits:4
18/08/03 09:06:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1532345019075_0002
18/08/03 09:06:43 INFO impl.YarnClientImpl: Submitted application application_1532345019075_0002
18/08/03 09:06:43 INFO mapreduce.Job: The url to track the job: http://m-002-obj001.hd:8088/proxy/application_1532345019075_0002/
18/08/03 09:06:43 INFO mapreduce.Job: Running job: job_1532345019075_0002
18/08/03 09:06:49 INFO mapreduce.Job: Job job_1532345019075_0002 running in uber mode : false
18/08/03 09:06:49 INFO mapreduce.Job:  map 0% reduce 0%
18/08/03 09:06:57 INFO mapreduce.Job:  map 75% reduce 0%
18/08/03 09:07:38 INFO mapreduce.Job:  map 100% reduce 0%
18/08/03 09:07:40 INFO mapreduce.Job: Job job_1532345019075_0002 completed successfully
18/08/03 09:07:40 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=683688
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                S3A: Number of bytes read=497
                S3A: Number of bytes written=853
                S3A: Number of read operations=44
                S3A: Number of large read operations=0
                S3A: Number of write operations=36
        Job Counters
                Launched map tasks=4
                Other local map tasks=4
                Total time spent by all maps in occupied slots (ms)=277580
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=55516
                Total vcore-milliseconds taken by all map tasks=55516
                Total megabyte-milliseconds taken by all map tasks=170545152
        Map-Reduce Framework
                Map input records=14
                Map output records=14
                Input split bytes=497
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=248
                CPU time spent (ms)=5690
                Physical memory (bytes) snapshot=1087877120
                Virtual memory (bytes) snapshot=12255760384
                Total committed heap usage (bytes)=702545920
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=853
18/08/03 09:07:40 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 62.4343 seconds (0 bytes/sec)
18/08/03 09:07:40 INFO mapreduce.ImportJobBase: Retrieved 14 records.
Caution

The table to import must have a primary key. If the table does not have one, the following error is returned. In that case, specify a split column with --split-by or perform a sequential import with -m 1, as in the sketch after the error message.

18/08/03 09:00:25 ERROR tool.ImportTool: Error during import: No primary key could be found for table SQOOP_TEST. Please specify one with --split-by or perform a sequential import with '-m 1'.
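
The following is a minimal sketch of both workarounds, assuming the same connection details as in the example above; SQOOP_TESTNO is chosen as the split column because it is the numeric key-like column of the example table.

# Split the work across mappers on an explicitly chosen column
sqoop import --connect jdbc:mysql://10.10.10.10/sqoop --username root -P \
--table SQOOP_TEST --split-by SQOOP_TESTNO

# Or skip parallel splitting entirely and run a single mapper
sqoop import --connect jdbc:mysql://10.10.10.10/sqoop --username root -P \
--table SQOOP_TEST -m 1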

4. View HDFS data in Hive

  1. Use Hive to check whether the table was stored properly.
    • For more details about how to use Hive, refer to Using Hive.
  2. Write a query that verifies the imported table was stored properly, and then execute it.
SELECT * FROM sqoop_workspace.sqoop_test;
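
As noted in the introduction, Sqoop can also move data from HDFS back into an RDBMS, using the export command. The following is a minimal sketch: the target table name, HDFS directory, and field delimiter are assumptions; the target table must already exist in MySQL, and the delimiter must match the one the HDFS files were written with.

sqoop export \
--connect jdbc:mysql://10.10.10.10/sqoop \
--username root -P \
--table SQOOP_TEST_EXPORT \
--export-dir /user/sqoop_test/sqoop_test_query \
--input-fields-terminated-by ',' \
-m 1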
