Using Sqoop

    Available in VPC

    SQL to Hadoop (Sqoop) is open-source software designed to transfer data between Hadoop and relational databases.
    With a simple CLI, you can move tables from an RDBMS such as Oracle or MySQL, or only the data that meets specific conditions, into HDFS, and you can also load the data directly into Hive, Pig, HBase, and so on. Conversely, you can move data stored in HDFS back to an RDBMS.

    First released in 2009, Sqoop became a top-level Apache project in 2012 and has continued to evolve since. Apache Sqoop is currently available in two versions: Sqoop 1 and Sqoop 2. Sqoop 1 is a client-based tool, while Sqoop 2 adds a server-based architecture on top of the existing Sqoop 1 approach.

    Sqoop is installed by default on the cluster types supported by Cloud Hadoop of NAVER Cloud Platform.
    This guide explains how to copy MySQL data to Cloud Hadoop using Sqoop, and how to send data back the other way.
    For other ways to use Sqoop, see Apache Sqoop Documentation.

    Note

    This guide is based on MySQL Ver. 5. If you migrate to MySQL Ver. 8, you need to update the connector to a matching version.

    The connector for Ver. 8 will be updated soon.
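
    If you need to connect to a MySQL Ver. 8 server before the bundled connector is updated, one possible workaround is to place a newer MySQL Connector/J JAR where Sqoop can load it. The following is an assumption-based sketch: the connector version, download URL, and Sqoop library path (/usr/hdp/current/sqoop-client/lib) may differ on your cluster.

    # Download a MySQL Connector/J 8.x JAR (version and URL are examples)
    wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar
    # Copy it into the Sqoop client library directory (path may differ by distribution)
    sudo cp mysql-connector-java-8.0.28.jar /usr/hdp/current/sqoop-client/lib/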

    Preparations

    To use Sqoop in Cloud Hadoop, you need to complete the following preparations:

    Item | Description | Guide
    ACG and ACL settings | Set them to enable communication between the Cloud Hadoop cluster server and the RDBMS server | Firewall settings (ACG)
    RDBMS | RDBMS with the original data saved | -

    Proceed with the following steps in order after completing the preparations.

    1. Connect to Hadoop cluster

    Connect to the master node of the Cloud Hadoop cluster you want to work with.
    For how to access the master node of a cluster, see the Connect cluster nodes through SSH guide.

    2. Connect to MySQL and check the target data to import

    1. Run the following command to check whether the MySQL client, which is used for remote access to the MySQL server, is installed.

      • By default, the MySQL client is installed on the master node (2 nodes).
      [sshuser@m-001-example-1fx8-hd ~]$ mysql -V
      mysql  Ver 14.14 Distrib 5.7.37, for Linux (x86_64) using  EditLine wrapper
      
    2. Run the following command to check if the connection to the MySQL server is established.

      mysql -h mysql.local -u [Cluster administrator account name] -p
      


      Connection example

      [sshuser@m-001-example-1fx8-hd ~]$ mysql -u example -p
      Enter password:
      Welcome to the MySQL monitor.  Commands end with ; or \g.
      Your MySQL connection id is 37
      Server version: 5.7.37-log MySQL Community Server (GPL)
      Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
      
      Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.
      
      Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
      
      mysql>
      
      Caution

      If the connection cannot be established, check that the ACG allows the MySQL DB port and review the ACL list of the MySQL DB. A brief connectivity check sketch is provided after step 3 below.

    3. Check the data to be imported into the Hadoop cluster's HDFS from your MySQL DB server.

      • The following example shows the SQOOP_TEST table created in advance in the database called sqoop.
      mysql> show databases;
      +--------------------+
      | Database           |
      +--------------------+
      | information_schema |
      | mysql              |
      | performance_schema |
      | sqoop              |
      +--------------------+
      4 rows in set (0.00 sec)
      
      mysql> use sqoop
      Reading table information for completion of table and column names
      You can turn off this feature to get a quicker startup with -A
      
      Database changed
      mysql> show tables;
      +-----------------+
      | Tables_in_sqoop |
      +-----------------+
      | SQOOP_TEST      |
      +-----------------+
      1 row in set (0.00 sec)
      
      mysql> select * from SQOOP_TEST;
      +--------------+--------+-----------+------+---------------------+------+------+--------+
      | SQOOP_TESTNO | ENAME  | JOB       | MGR  | HIREDATE            | SAL  | COMM | DEPTNO |
      +--------------+--------+-----------+------+---------------------+------+------+--------+
      |         7369 | SMITH  | CLERK     | 7902 | 1980-12-17 00:00:00 |  800 | NULL |     20 |
      |         7499 | ALLEN  | SALESMAN  | 7698 | 1981-02-20 00:00:00 | 1600 |  300 |     30 |
      |         7521 | WARD   | SALESMAN  | 7698 | 1981-02-22 00:00:00 | 1250 |  500 |     30 |
      |         7566 | JONES  | MANAGER   | 7839 | 1981-04-02 00:00:00 | 2975 | NULL |     20 |
      |         7654 | MARTIN | SALESMAN  | 7698 | 1981-09-28 00:00:00 | 1250 | 1400 |     30 |
      |         7698 | BLAKE  | MANAGER   | 7839 | 1981-05-01 00:00:00 | 2850 | NULL |     30 |
      |         7782 | CLARK  | MANAGER   | 7839 | 1981-06-09 00:00:00 | 2450 | NULL |     10 |
      |         7788 | SCOTT  | ANALYST   | 7566 | 1987-04-19 00:00:00 | 3000 | NULL |     20 |
      |         7839 | KING   | PRESIDENT | NULL | 1981-11-17 00:00:00 | 5000 | NULL |     10 |
      |         7844 | TURNER | SALESMAN  | 7698 | 1981-09-08 00:00:00 | 1500 |    0 |     30 |
      |         7876 | ADAMS  | CLERK     | 7788 | 1987-05-23 00:00:00 | 1100 | NULL |     20 |
      |         7900 | JAMES  | CLERK     | 7698 | 1981-12-03 00:00:00 |  950 | NULL |     30 |
      |         7902 | FORD   | ANALYST   | 7566 | 1981-12-03 00:00:00 | 3000 | NULL |     20 |
      |         7934 | MILLER | CLERK     | 7782 | 1982-01-23 00:00:00 | 1300 | NULL |     10 |
      +--------------+--------+-----------+------+---------------------+------+------+--------+
      14 rows in set (0.00 sec)
      

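    If the MySQL connection in step 2 fails, the checks below may help narrow down the cause. This is a hedged sketch: the host name mysql.local, port 3306, the example account, and the host values in the GRANT statement are assumptions and should be replaced with your environment's values.

    # Check network reachability from the master node to the MySQL port (requires nc)
    nc -zv mysql.local 3306

    # On the MySQL server, allow the cluster account to connect from the master node's IP
    mysql> CREATE USER 'example'@'[Master node Private IP]' IDENTIFIED BY '[password]';
    mysql> GRANT ALL PRIVILEGES ON sqoop.* TO 'example'@'[Master node Private IP]';
    mysql> FLUSH PRIVILEGES;
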
    3. Import data with the Sqoop command

    1. Run the sudo su - command to switch to the root account.

      [sshuser@m-001-example-1fx8-hd ~]$ sudo su -
      [root@m-001-example-1fx8-hd ~]#
      
    2. Import data with the Sqoop import command.

    • The import command allows you to import specific databases, specific tables, query results, etc.
    sqoop import --connect jdbc:mysql://[Master node Private IP]:3306/[Database name] --username [Cluster administrator account name] --password [Cluster administrator password] --table [Target table]
    
    • For general options of the Sqoop import command, see the table below. A short usage sketch follows the table.
    Argument | Description
    --connect <jdbc-uri> | Designates the JDBC connection string
    --connection-manager <class-name> | Designates the connection manager class to use
    --driver <class-name> | Used to manually designate the JDBC driver class to use
    --hadoop-home <dir> | Designates $HADOOP_HOME
    --help | Help
    -P | Reads the password from the console
    --password <password> | Sets the authentication password
    --username <username> | Sets the authentication username
    --verbose | Used to print more information while working
    --connection-param-file <filename> | Designates an optional properties file that provides connection parameters
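
    These connection options are shared by other Sqoop tools as well. As a quick, assumption-based usage sketch (the host and account placeholders are the same ones used elsewhere in this guide), you can verify the connection settings before importing by listing the databases and tables visible to the account.

    # List the databases visible to the account (verifies --connect, --username, and -P)
    sqoop list-databases --connect jdbc:mysql://[Master node Private IP]:3306 --username [Cluster administrator account name] -P

    # List the tables in the sqoop database
    sqoop list-tables --connect jdbc:mysql://[Master node Private IP]:3306/sqoop --username [Cluster administrator account name] -P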

    Sqoop imports data through MapReduce jobs. The following example shows the most typical usage: importing a specific table from a particular database into a Hive table.

    
    [root@m-001-example-1fx8-hd ~]# sqoop import --connect jdbc:mysql://[Master node Private IP]:3306/sqoop \
    --username [Cluster administrator account name] \
    -P --table SQOOP_TEST \
    --hive-import --create-hive-table --hive-table sqoop_workspace.sqoop_test -m 1
    
    ...
    
    18/08/03 09:06:38 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
    18/08/03 09:06:38 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm1]
    18/08/03 09:06:42 INFO db.DBInputFormat: Using read commited transaction isolation
    18/08/03 09:06:42 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`SQOOP_TESTNO`), MAX(`SQOOP_TESTNO`) FROM `SQOOP_TEST`
    18/08/03 09:06:42 INFO db.IntegerSplitter: Split size: 141; Num splits: 4 from: 7369 to: 7934
    18/08/03 09:06:42 INFO mapreduce.JobSubmitter: number of splits:4
    18/08/03 09:06:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1532345019075_0002
    18/08/03 09:06:43 INFO impl.YarnClientImpl: Submitted application application_1532345019075_0002
    18/08/03 09:06:43 INFO mapreduce.Job: The url to track the job: http://m-002-obj001.hd:8088/proxy/application_1532345019075_0002/ 
    18/08/03 09:06:43 INFO mapreduce.Job: Running job: job_1532345019075_0002
    18/08/03 09:06:49 INFO mapreduce.Job: Job job_1532345019075_0002 running in uber mode : false
    18/08/03 09:06:49 INFO mapreduce.Job:  map 0% reduce 0%
    18/08/03 09:06:57 INFO mapreduce.Job:  map 75% reduce 0%
    18/08/03 09:07:38 INFO mapreduce.Job:  map 100% reduce 0%
    18/08/03 09:07:40 INFO mapreduce.Job: Job job_1532345019075_0002 completed successfully
    18/08/03 09:07:40 INFO mapreduce.Job: Counters: 30 
             File System Counters
                      FILE: Number of bytes read=0
                      FILE: Number of bytes written=683688
                      FILE: Number of read operations=0
                      FILE: Number of large read operations=0
                      FILE: Number of write operations=0
                      S3A: Number of bytes read=497
                      S3A: Number of bytes written=853
                      S3A: Number of read operations=44
                      S3A: Number of large read operations=0
                      S3A: Number of write operations=36
             Job Counters
                      Launched map tasks=4
                      Other local map tasks=4
                      Total time spent by all maps in occupied slots (ms)=277580
                      Total time spent by all reduces in occupied slots (ms)=0
                      Total time spent by all map tasks (ms)=55516
                      Total vcore-milliseconds taken by all map tasks=55516
                      Total megabyte-milliseconds taken by all map tasks=170545152
             Map-Reduce Framework
                      Map input records=14
                      Map output records=14
                      Input split bytes=497
                      Spilled Records=0
                      Failed Shuffles=0
                      Merged Map outputs=0
                      GC time elapsed (ms)=248
                      CPU time spent (ms)=5690
                      Physical memory (bytes) snapshot=1087877120
                      Virtual memory (bytes) snapshot=12255760384
                      Total committed heap usage (bytes)=702545920
             File Input Format Counters
                      Bytes Read=0
             File Output Format Counters
                      Bytes Written=853
    18/08/03 09:07:40 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 62.4343 seconds (0 bytes/sec)
    18/08/03 09:07:40 INFO mapreduce.ImportJobBase: Retrieved 14 records.
    
    Caution

    The table to import must have a primary key. If the table does not have one, the following error occurs, and you must either specify --split-by or perform a sequential import with -m 1.

    18/08/03 09:00:25 ERROR tool.ImportTool: Error during import: No primary key could be found for table SQOOP_TEST. Please specify one with --split-by or perform a sequential import with '-m 1'.
    
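    Besides a whole-table import, the import command also accepts --where to filter rows, --target-dir to choose the HDFS output path, and --split-by to pick a split column when the table has no primary key. The following is a hedged sketch reusing the placeholders from this guide; the filter condition, split column, and target directory are illustrative choices only.

    # Import only DEPTNO 20 rows into a plain HDFS directory, splitting on SQOOP_TESTNO
    sqoop import --connect jdbc:mysql://[Master node Private IP]:3306/sqoop \
    --username [Cluster administrator account name] -P \
    --table SQOOP_TEST --where "DEPTNO = 20" \
    --split-by SQOOP_TESTNO \
    --target-dir /user/[account]/sqoop_test_dept20
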

    4. Check the HDFS data through Hive

    1. Check whether the table was imported properly through Hive.
      • For instructions on how to use Hive, see the Using Hive guide.
    2. Write and run a query that verifies that the imported table is stored properly.
    SELECT * FROM sqoop_workspace.sqoop_test;
    
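    To send data the other way, from HDFS back to MySQL, Sqoop provides the export command. The sketch below is illustrative only: SQOOP_TEST_EXPORT is a hypothetical table that must already exist in MySQL with a schema matching the exported files, and the export directory reuses the path from the earlier import sketch, whose files use Sqoop's default comma delimiter.

    # Export comma-delimited HDFS files into an existing MySQL table (table name and path are assumptions)
    sqoop export --connect jdbc:mysql://[Master node Private IP]:3306/sqoop \
    --username [Cluster administrator account name] -P \
    --table SQOOP_TEST_EXPORT \
    --export-dir /user/[account]/sqoop_test_dept20 \
    --input-fields-terminated-by ',' -m 1
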


    Check the FAQs first.

    Q. I want to use the Sqoop client on the master node of the Cloud Hadoop service. Do I also have to add the JDBC driver to the data nodes?
    A. You do not need to install the JDBC driver on the data nodes. It is sufficient for the JDBC driver to be present on the server where you run the Sqoop command.

