Data Migration Using Object Storage

    Available in VPC

    Cloud Hadoop on NAVER Cloud Platform configures the Hadoop Distributed File System (HDFS) on Block Storage and uses it as the default storage, while also supporting integration with Object Storage. Because Object Storage provides a public DNS, data can be uploaded and downloaded from any environment with an Internet connection. This makes it possible to migrate large volumes of data to be analyzed into NAVER Cloud Platform through Object Storage, even from outside NAVER Cloud Platform.

    [Figure: data migration flow from an external source to Object Storage, and from Object Storage to Cloud Hadoop HDFS]

    This guide describes how to migrate data from an external source to NAVER Cloud Platform Object Storage, and then from Object Storage to the Cloud Hadoop HDFS, as illustrated above.

    Preparations

    Create a bucket in Object Storage to store the data.

    Note

    For more information on creating buckets, see the Object Storage Usage Guide.
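
    Because Object Storage is S3-compatible, you can also create the bucket from the AWS CLI after it has been installed and configured as described in the steps below; a minimal sketch, assuming the example bucket name used throughout this guide:

    [sshuser@e-001-hadoop-example-hd ~]$ sudo aws --endpoint-url=https://kr.object.ncloudstorage.com s3 mb s3://example/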

    Migrate data from an external source to Object Storage

    Because NAVER Cloud Platform Object Storage is compatible with AWS S3, you can use the AWS CLI without modification. For more information, see the Object Storage CLI Usage Guide.

    Note

    For more information on the AWS CLI, see the AWS CLI documentation.

    1. Prepare sample data

    Download the sample data to test data migration from an external source to Object Storage (a download sketch follows the note below).

    • In this guide, data migration is performed using the AllstarFull.csv file from the sample data.
    Note

    The provided sample data is a portion of the 2012 version of Lahman's Baseball Database; all copyrights to the data belong to Sean Lahman.
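
    The original page links to the sample data directly; the URL and archive name below are hypothetical placeholders, so substitute the actual download link. A minimal sketch of fetching and extracting the archive on the edge node:

    [sshuser@e-001-hadoop-example-hd ~]$ wget https://<sample-data-url>/lahman2012-csv.zip    # placeholder URL: use the actual sample data link
    [sshuser@e-001-hadoop-example-hd ~]$ unzip lahman2012-csv.zip                             # contains AllstarFull.csv among other files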

    2. Install AWS CLI

    After connecting to an edge node via SSH, install the AWS CLI using the pip install command.

    [sshuser@e-001-hadoop-example-hd ~]$ sudo pip install awscli==1.15.85
    DEPRECATION: Python 3.4 support has been deprecated. pip 19.1 will be the last one supporting it. Please upgrade your Python as Python 3.4 won't be maintained after March 2019 (cf PEP 429).
    Collecting awscli==1.15.85
    Downloading https://files.pythonhosted.org/packages/2a/2a/e5ae9191c388db103bc197a444260c8fd4f4f44a8183eb922cd5ebf183cf/awscli-1.15.85-py2.py3-none-any.whl (1.3MB)
    100% |████████████████████████████████| 1.3MB 12.6MB/s
    Collecting PyYAML<=3.13,>=3.10 (from awscli==1.15.85)
    Downloading https://files.pythonhosted.org/packages/9e/a3/1d13970c3f36777c583f136c136f804d70f500168edc1edea6daa7200769/PyYAML-3.13.tar.gz (270kB)
    100% |████████████████████████████████| 276kB 26.3MB/s
    Collecting s3transfer<0.2.0,>=0.1.12 (from awscli==1.15.85)
    Downloading https://files.pythonhosted.org/packages/d7/14/2a0004d487464d120c9fb85313a75cd3d71a7506955be458eebfe19a6b1d/s3transfer-0.1.13-py2.py3-none-any.whl (59kB)
    100% |████████████████████████████████| 61kB 19.9MB/s
    Collecting rsa<=3.5.0,>=3.1.2 (from awscli==1.15.85)
    Downloading https://files.pythonhosted.org/packages/e1/ae/baedc9cb175552e95f3395c43055a6a5e125ae4d48a1d7a924baca83e92e/rsa-3.4.2-py2.py3-none-any.whl (46kB)
    100% |████████████████████████████████| 51kB 20.2MB/s
    Collecting colorama<=0.3.9,>=0.2.5 (from awscli==1.15.85)
    Downloading https://files.pythonhosted.org/packages/db/c8/7dcf9dbcb22429512708fe3a547f8b6101c0d02137acbd892505aee57adf/colorama-0.3.9-py2.py3-none-any.whl
    Collecting docutils>=0.10 (from awscli==1.15.85)
    Downloading https://files.pythonhosted.org/packages/22/cd/a6aa959dca619918ccb55023b4cb151949c64d4d5d55b3f4ffd7eee0c6e8/docutils-0.15.2-py3-none-any.whl (547kB)
    100% |████████████████████████████████| 552kB 22.5MB/s
    Collecting botocore==1.10.84 (from awscli==1.15.85)
    Downloading https://files.pythonhosted.org/packages/01/b7/cb08cd1af2bb0d0dfb393101a93b6ab6fb80f109ab7b37f2f34386c11351/botocore-1.10.84-py2.py3-none-any.whl (4.5MB)
    100% |████████████████████████████████| 4.5MB 5.7MB/s
    Collecting pyasn1>=0.1.3 (from rsa<=3.5.0,>=3.1.2->awscli==1.15.85)
    Downloading https://files.pythonhosted.org/packages/62/1e/a94a8d635fa3ce4cfc7f506003548d0a2447ae76fd5ca53932970fe3053f/pyasn1-0.4.8-py2.py3-none-any.whl (77kB)
    100% |████████████████████████████████| 81kB 25.1MB/s
    Collecting jmespath<1.0.0,>=0.7.1 (from botocore==1.10.84->awscli==1.15.85)
    Downloading https://files.pythonhosted.org/packages/07/cb/5f001272b6faeb23c1c9e0acc04d48eaaf5c862c17709d20e3469c6e0139/jmespath-0.10.0-py2.py3-none-any.whl
    Collecting python-dateutil<3.0.0,>=2.1; python_version >= "2.7" (from botocore==1.10.84->awscli==1.15.85)
    Downloading https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227kB)
    100% |████████████████████████████████| 235kB 25.9MB/s
    Requirement already satisfied: six>=1.5 in /usr/lib/python3.4/site-packages (from python-dateutil<3.0.0,>=2.1; python_version >= "2.7"->botocore==1.10.84->awscli==1.15.85) (1.15.0)
    Installing collected packages: PyYAML, jmespath, python-dateutil, docutils, botocore, s3transfer, pyasn1, rsa, colorama, awscli
    Running setup.py install for PyYAML ... done
    Successfully installed PyYAML-3.13 awscli-1.15.85 botocore-1.10.84 colorama-0.3.9 docutils-0.15.2 jmespath-0.10.0 pyasn1-0.4.8 python-dateutil-2.8.1 rsa-3.4.2 s3transfer-0.1.13
    

    3. Check authentication key information

    To access the Object Storage bucket you created, you need the API authentication key information for your NAVER Cloud Platform account.
    After creating a new API authentication key under [My Page] > [Authentication Key Management] on the NAVER Cloud Platform portal, check the Access Key ID and the Secret Key.

    4. Configure the environment

    Use the following command to configure the environment, and then list the buckets through the Object Storage endpoint address to verify that the configuration works.

    • Example bucket name: example
    [sshuser@e-001-hadoop-example-hd ~]$ sudo aws configure
    AWS Access Key ID [None]: ACCESS_KEY_ID
    AWS Secret Access Key [None]: SECRET_KEY
    Default region name [None]:
    Default output format [None]:
    
    • Check that the bucket you created (example) is listed
    [sshuser@e-001-hadoop-example-hd ~]$ sudo aws --endpoint-url=https://kr.object.ncloudstorage.com s3 ls 
    2020-11-25 08:53:42 example 
    

    5. Upload data

    Upload the data to Object Storage using the cp command of the AWS CLI, and then check whether it was uploaded successfully.

    [sshuser@e-001-hadoop-example-hd ~]$ sudo aws --endpoint-url=https://kr.object.ncloudstorage.com s3 cp AllstarFull.csv s3://example/
    upload: ./AllstarFull.csv to s3://example/AllstarFull.csv
    
    [sshuser@e-001-hadoop-example-hd ~]$ sudo aws --endpoint-url=https://kr.object.ncloudstorage.com s3 ls s3://example/
    2020-11-25 09:37:50 1708674492 AllstarFull.csv
    
    Note

    Syntax format: s3://[YOUR-BUCKET-NAME]/
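
    To upload multiple files or an entire directory instead of a single file, the AWS CLI --recursive flag works the same way against the Object Storage endpoint; a minimal sketch, assuming a local directory named sampledata:

    [sshuser@e-001-hadoop-example-hd ~]$ sudo aws --endpoint-url=https://kr.object.ncloudstorage.com s3 cp sampledata/ s3://example/ --recursive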

    Migrate data from Object Storage to the Cloud Hadoop HDFS

    You can now move the data migrated to Object Storage into the Cloud Hadoop HDFS.

    1. Access the Cloud Hadoop edge node

    Connect to the edge node of the Cloud Hadoop cluster you want to work with.
    For more information on connecting to the edge node, see the Connection to Cluster Node via SSH guide.
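
    A minimal connection sketch with a placeholder address (the actual address and authentication details are covered in the linked guide):

    $ ssh sshuser@<edge-node-address>    # placeholder; see the SSH connection guide for the real address and credentials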

    2. Check access

    Use the following commands to check that the edge node can access both the cluster HDFS and the Object Storage bucket, and create a directory in HDFS to receive the data.

    [sshuser@e-001-hadoop-example-hd ~]$ hadoop fs -ls hdfs://hadoop-example/
    Found 12 items
    drwxrwxrwx - yarn hadoop 0 2020-11-25 10:17 hdfs://hadoop-example/app-logs
    drwxr-xr-x - hdfs hdfs 0 2020-11-25 10:16 hdfs://hadoop-example/apps
    drwxr-xr-x - yarn hadoop 0 2020-11-25 10:15 hdfs://hadoop-example/ats
    drwxr-xr-x - hdfs hdfs 0 2020-11-25 10:15 hdfs://hadoop-example/hdp
    drwx------ - livy hdfs 0 2020-11-25 10:15 hdfs://hadoop-example/livy-recovery
    drwx------ - livy hdfs 0 2020-11-25 10:16 hdfs://hadoop-example/livy2-recovery
    drwxr-xr-x - mapred hdfs 0 2020-11-25 10:15 hdfs://hadoop-example/mapred
    drwxrwxrwx - mapred hadoop 0 2020-11-25 10:15 hdfs://hadoop-example/mr-history
    drwxrwxrwx - spark hadoop 0 2020-11-25 10:20 hdfs://hadoop-example/spark-history
    drwxrwxrwx - spark hadoop 0 2020-11-25 10:20 hdfs://hadoop-example/spark2-history
    drwxrwxrwx - hdfs hdfs 0 2020-11-25 10:16 hdfs://hadoop-example/tmp
    drwxr-xr-x - hdfs hdfs 0 2020-11-25 10:16 hdfs://hadoop-example/user
    
    [sshuser@e-001-hadoop-example-hd ~]$ hadoop fs -mkdir hdfs://hadoop-example/sampledata/
    
    [sshuser@e-001-hadoop-example-hd ~]$ hadoop fs -ls s3a://example/
    
    Note

    hadoop fs -ls hdfs://[YOUR-CLUSTER-NAME]/
    hadoop fs -ls s3a://[YOUR-BUCKET-NAME]/
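
    If listing the bucket over s3a:// fails with an authentication error, the Hadoop S3A connector may not yet have credentials for Object Storage. A minimal sketch using the standard S3A properties, passed per command here instead of being set in core-site.xml, with the same authentication key used for the AWS CLI (assuming the cluster is not already configured for S3A):

    [sshuser@e-001-hadoop-example-hd ~]$ hadoop fs -D fs.s3a.endpoint=https://kr.object.ncloudstorage.com \
        -D fs.s3a.access.key=ACCESS_KEY_ID \
        -D fs.s3a.secret.key=SECRET_KEY \
        -ls s3a://example/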

    3. Migrate data

    Use distcp, the Hadoop command for copying large volumes of files, to migrate the data, and then check whether the files were migrated successfully.

    [sshuser@e-001-hadoop-example-hd ~]$ hadoop distcp -m 10 -bandwidth 100 s3a://example/* hdfs://hadoop-example/sampledata/
    20/11/25 10:30:14 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, append=false, useDiff=false, fromSnapshot=null, toSnapshot=null, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=10, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[s3a://example/*], targetPath=hdfs://hadoop-example/sampledata, targetPathExists=true, filtersFile='null', verboseLog=false}
    20/11/25 10:30:15 INFO client.AHSProxy: Connecting to Application History server at m-002-hadoop-example-hd/10.41.73.166:10200
    20/11/25 10:30:16 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
    20/11/25 10:30:16 INFO tools.SimpleCopyListing: Build file listing completed.
    20/11/25 10:30:16 INFO tools.DistCp: Number of paths in the copy list: 1
    20/11/25 10:30:16 INFO tools.DistCp: Number of paths in the copy list: 1
    20/11/25 10:30:16 INFO client.AHSProxy: Connecting to Application History server at m-002-hadoop-example-hd/10.41.73.166:10200
    20/11/25 10:30:17 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
    20/11/25 10:30:17 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
    20/11/25 10:30:17 INFO mapreduce.JobSubmitter: number of splits:1
    20/11/25 10:30:17 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1606266944151_0003
    20/11/25 10:30:18 INFO impl.YarnClientImpl: Submitted application application_1606266944151_0003
    20/11/25 10:30:18 INFO mapreduce.Job: The url to track the job: http://m-002-hadoop-example-hd:8088/proxy/application_1606266944151_0003/
    20/11/25 10:30:18 INFO tools.DistCp: DistCp job-id: job_1606266944151_0003
    20/11/25 10:30:18 INFO mapreduce.Job: Running job: job_1606266944151_0003
    20/11/25 10:30:26 INFO mapreduce.Job: Job job_1606266944151_0003 running in uber mode : false
    20/11/25 10:30:26 INFO mapreduce.Job: map 0% reduce 0%
    20/11/25 10:30:39 INFO mapreduce.Job: map 100% reduce 0%
    20/11/25 10:33:13 INFO mapreduce.Job: Job job_1606266944151_0003 completed successfully
    20/11/25 10:33:13 INFO mapreduce.Job: Counters: 38
    File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=158446
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=394
    HDFS: Number of bytes written=1708674492
    HDFS: Number of read operations=13
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=4
    S3A: Number of bytes read=1708674492
    S3A: Number of bytes written=0
    S3A: Number of read operations=3
    S3A: Number of large read operations=0
    S3A: Number of write operations=0
    Job Counters
    Launched map tasks=1
    Other local map tasks=1
    Total time spent by all maps in occupied slots (ms)=328500
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=164250
    Total vcore-milliseconds taken by all map tasks=164250
    Total megabyte-milliseconds taken by all map tasks=168192000
    Map-Reduce Framework
    Map input records=1
    Map output records=0
    Input split bytes=138
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=392
    CPU time spent (ms)=27370
    Physical memory (bytes) snapshot=231428096
    Virtual memory (bytes) snapshot=2531233792
    Total committed heap usage (bytes)=150994944
    File Input Format Counters
    Bytes Read=256
    File Output Format Counters
    Bytes Written=0
    org.apache.hadoop.tools.mapred.CopyMapper$Counter
    BYTESCOPIED=1708674492
    BYTESEXPECTED=1708674492
    COPY=1
    
    [sshuser@e-001-hadoop-example-hd ~]$ hadoop fs -ls hdfs://hadoop-example/sampledata/
    Found 1 items
    -rw-r--r-- 2 sshuser hdfs 1708674492 2020-11-25 10:33 hdfs://hadoop-example/sampledata/AllstarFull.csv
    
    Note

    Syntax format: hadoop distcp -m 10 -bandwidth 100 s3a://[YOUR-BUCKET-NAME]/* hdfs://[YOUR-CLUSTER-NAME]/sampledata/
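
    If the migration needs to be re-run after new files are added to the bucket, distcp's -update option copies only the files that are missing or changed at the destination; a minimal sketch with the same example names:

    hadoop distcp -update -m 10 -bandwidth 100 s3a://example/ hdfs://hadoop-example/sampledata/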

    Note

    If you want to copy only a single file, you can migrate it with the hadoop fs -cp command instead of distcp.
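
    A minimal sketch of such a single-file copy, assuming the same bucket, cluster, and target directory as above:

    [sshuser@e-001-hadoop-example-hd ~]$ hadoop fs -cp s3a://example/AllstarFull.csv hdfs://hadoop-example/sampledata/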

