- Print
- PDF
Transferring data using Object Storage
- Print
- PDF
Available in Classic
Cloud Hadoop configures Hadoop Distributed File System (HDFS) on Block Storage and uses it as the default storage, and it supports the Object Storage linkage. Since the Object Storage provides public DNS, data can easily be stored and downloaded from any environment where the internet connection is available. Its advantage is being able to transfer large-volume data that needs to be analyzed to NAVER Cloud Platform using Object Storage from outside of the NAVER Cloud Platform.
This guide explains the method for transferring the source data to the NAVER Cloud Platform's Object Storage as shown in the image above, as well as the method for transferring it from Object Storage to Cloud Hadoop HDFS.
Preparations
Create a bucket to save data in Object Storage.
Please refer to Object Storage Guide for more information about creating buckets.
Transfer external data to Object Storage
Since NAVER Cloud Platform's Object Storage is a storage compatible with AWS S3, you can use AWS CLI as is.
For more details, please refer to Object Storage CLI Guide.
Please refer to awscli for more information about AWS CLI.
1. Prepare sample data
Use the following command to download sample data for testing data migration from external to Object Storage.
- This guide uses the
AllstarFull.csv
file from the sample data to perform data migration.
The sample data provided is part of the version Lahman's Baseball Database 2012, all copyrights of the data belong to Sean Lahman.
2. Install AWS CLI
After connecting to the edge node with SSH, use the pip install
command to install the AWS CLI.
[sshuser@e-001-hadoop-example-hd ~]$ sudo pip install awscli==1.15.85
DEPRECATION: Python 3.4 support has been deprecated. pip 19.1 will be the last one supporting it. Please upgrade your Python as Python 3.4 won't be maintained after March 2019 (cf PEP 429).
Collecting awscli==1.15.85
Downloading https://files.pythonhosted.org/packages/2a/2a/e5ae9191c388db103bc197a444260c8fd4f4f44a8183eb922cd5ebf183cf/awscli-1.15.85-py2.py3-none-any.whl (1.3MB)
100% |████████████████████████████████| 1.3MB 12.6MB/s
Collecting PyYAML<=3.13,>=3.10 (from awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/9e/a3/1d13970c3f36777c583f136c136f804d70f500168edc1edea6daa7200769/PyYAML-3.13.tar.gz (270kB)
100% |████████████████████████████████| 276kB 26.3MB/s
Collecting s3transfer<0.2.0,>=0.1.12 (from awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/d7/14/2a0004d487464d120c9fb85313a75cd3d71a7506955be458eebfe19a6b1d/s3transfer-0.1.13-py2.py3-none-any.whl (59kB)
100% |████████████████████████████████| 61kB 19.9MB/s
Collecting rsa<=3.5.0,>=3.1.2 (from awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/e1/ae/baedc9cb175552e95f3395c43055a6a5e125ae4d48a1d7a924baca83e92e/rsa-3.4.2-py2.py3-none-any.whl (46kB)
100% |████████████████████████████████| 51kB 20.2MB/s
Collecting colorama<=0.3.9,>=0.2.5 (from awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/db/c8/7dcf9dbcb22429512708fe3a547f8b6101c0d02137acbd892505aee57adf/colorama-0.3.9-py2.py3-none-any.whl
Collecting docutils>=0.10 (from awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/22/cd/a6aa959dca619918ccb55023b4cb151949c64d4d5d55b3f4ffd7eee0c6e8/docutils-0.15.2-py3-none-any.whl (547kB)
100% |████████████████████████████████| 552kB 22.5MB/s
Collecting botocore==1.10.84 (from awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/01/b7/cb08cd1af2bb0d0dfb393101a93b6ab6fb80f109ab7b37f2f34386c11351/botocore-1.10.84-py2.py3-none-any.whl (4.5MB)
100% |████████████████████████████████| 4.5MB 5.7MB/s
Collecting pyasn1>=0.1.3 (from rsa<=3.5.0,>=3.1.2->awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/62/1e/a94a8d635fa3ce4cfc7f506003548d0a2447ae76fd5ca53932970fe3053f/pyasn1-0.4.8-py2.py3-none-any.whl (77kB)
100% |████████████████████████████████| 81kB 25.1MB/s
Collecting jmespath<1.0.0,>=0.7.1 (from botocore==1.10.84->awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/07/cb/5f001272b6faeb23c1c9e0acc04d48eaaf5c862c17709d20e3469c6e0139/jmespath-0.10.0-py2.py3-none-any.whl
Collecting python-dateutil<3.0.0,>=2.1; python_version >= "2.7" (from botocore==1.10.84->awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227kB)
100% |████████████████████████████████| 235kB 25.9MB/s
Requirement already satisfied: six>=1.5 in /usr/lib/python3.4/site-packages (from python-dateutil<3.0.0,>=2.1; python_version >= "2.7"->botocore==1.10.84->awscli==1.15.85) (1.15.0)
Installing collected packages: PyYAML, jmespath, python-dateutil, docutils, botocore, s3transfer, pyasn1, rsa, colorama, awscli
Running setup.py install for PyYAML ... done
Successfully installed PyYAML-3.13 awscli-1.15.85 botocore-1.10.84 colorama-0.3.9 docutils-0.15.2 jmespath-0.10.0 pyasn1-0.4.8 python-dateutil-2.8.1 rsa-3.4.2 s3transfer-0.1.13
3. Check authentication key information
To access the Object Storage bucket created, you need the authentication key information for NAVER Cloud Platform.
Check the access key ID and secret key from [Manage authentication key] of [My Page] in the NAVER Cloud Platform portal.
- Please refer to the Manage authentication key for more information about authentication keys.
4. Settings
Use the following command to configure the use environment, and enter the Object Storage endpoint address to check.
- Example bucket name:
example
[sshuser@e-001-hadoop-example-hd ~]$ sudo aws configure
AWS Access Key ID [None]: ACCESS_KEY_ID
AWS Secret Access Key [None]: SECRET_KEY
Default region name [None]:
Default output format [None]:
[sshuser@e-001-hadoop-example-hd ~]$ sudo aws --endpoint-url=https://kr.object.ncloudstorage.com s3 ls
2020-11-25 08:53:42 example
5. Upload data
Use AWS CLI's cp
command to upload data to Object Storage, and then check if it was uploaded successfully.
[sshuser@e-001-hadoop-example-hd ~]$ sudo aws --endpoint-url=https://kr.object.ncloudstorage.com s3 cp AllstarFull.csv s3://example/
upload: ./AllstarFull.csv to s3://example/AllstarFull.csv
[sshuser@e-001-hadoop-example-hd ~]$ sudo aws --endpoint-url=https://kr.object.ncloudstorage.com s3 ls s3://example/
2020-11-25 09:37:50 1708674492 AllstarFull.csv
s3://[YOUR-BUCKET-NAME]/
Transfer data from Object Storage to Cloud Hadoop HDFS
The data transferred to Object Storage can be again moved to Cloud Hadoop HDFS.
1. Access Cloud Hadoop edge node
Connect to the edge node of the Cloud Hadoop cluster you want to work with.
Please refer to How to connect via SSH to cluster node for information about how to connect to cluster edge node.
2. Check access
Use the following command to check if the edge node can access the Object Storage bucket.
[sshuser@e-001-hadoop-example-hd ~]# hadoop fs -ls hdfs://hadoop-example/
Found 12 items
drwxrwxrwx - yarn hadoop 0 2020-11-25 10:17 hdfs://hadoop-example/app-logs
drwxr-xr-x - hdfs hdfs 0 2020-11-25 10:16 hdfs://hadoop-example/apps
drwxr-xr-x - yarn hadoop 0 2020-11-25 10:15 hdfs://hadoop-example/ats
drwxr-xr-x - hdfs hdfs 0 2020-11-25 10:15 hdfs://hadoop-example/hdp
drwx------ - livy hdfs 0 2020-11-25 10:15 hdfs://hadoop-example/livy-recovery
drwx------ - livy hdfs 0 2020-11-25 10:16 hdfs://hadoop-example/livy2-recovery
drwxr-xr-x - mapred hdfs 0 2020-11-25 10:15 hdfs://hadoop-example/mapred
drwxrwxrwx - mapred hadoop 0 2020-11-25 10:15 hdfs://hadoop-example/mr-history
drwxrwxrwx - spark hadoop 0 2020-11-25 10:20 hdfs://hadoop-example/spark-history
drwxrwxrwx - spark hadoop 0 2020-11-25 10:20 hdfs://hadoop-example/spark2-history
drwxrwxrwx - hdfs hdfs 0 2020-11-25 10:16 hdfs://hadoop-example/tmp
drwxr-xr-x - hdfs hdfs 0 2020-11-25 10:16 hdfs://hadoop-example/user
[sshuser@e-001-hadoop-example-hd ~]# hadoop fs -mkdir hdfs://hadoop-example/sampledata/
[sshuser@e-001-hadoop-example-hd ~]# hadoop fs -ls s3a://example/
hadoop fs -ls hdfs://[YOUR-CLUSTER-NAME]/
hadoop fs -ls s3a://[YOUR-BUCKET-NAME]/
3. Transfer data
Use distcp
, the command for copying large-volume files in Hadoop, to transfer data, and then check if the files were transferred successfully.
[sshuser@e-001-hadoop-example-hd ~]$ hadoop distcp -m 10 -bandwidth 100 s3a://example/* hdfs://hadoop-example/sampledata/
20/11/25 10:30:14 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, append=false, useDiff=false, fromSnapshot=null, toSnapshot=null, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=10, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[s3a://example/*], targetPath=hdfs://hadoop-example/sampledata, targetPathExists=true, filtersFile='null', verboseLog=false}
20/11/25 10:30:15 INFO client.AHSProxy: Connecting to Application History server at m-002-hadoop-example-hd/10.41.73.166:10200
20/11/25 10:30:16 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
20/11/25 10:30:16 INFO tools.SimpleCopyListing: Build file listing completed.
20/11/25 10:30:16 INFO tools.DistCp: Number of paths in the copy list: 1
20/11/25 10:30:16 INFO tools.DistCp: Number of paths in the copy list: 1
20/11/25 10:30:16 INFO client.AHSProxy: Connecting to Application History server at m-002-hadoop-example-hd/10.41.73.166:10200
20/11/25 10:30:17 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
20/11/25 10:30:17 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
20/11/25 10:30:17 INFO mapreduce.JobSubmitter: number of splits:1
20/11/25 10:30:17 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1606266944151_0003
20/11/25 10:30:18 INFO impl.YarnClientImpl: Submitted application application_1606266944151_0003
20/11/25 10:30:18 INFO mapreduce.Job: The url to track the job: http://m-002-hadoop-example-hd:8088/proxy/application_1606266944151_0003/
20/11/25 10:30:18 INFO tools.DistCp: DistCp job-id: job_1606266944151_0003
20/11/25 10:30:18 INFO mapreduce.Job: Running job: job_1606266944151_0003
20/11/25 10:30:26 INFO mapreduce.Job: Job job_1606266944151_0003 running in uber mode : false
20/11/25 10:30:26 INFO mapreduce.Job: map 0% reduce 0%
20/11/25 10:30:39 INFO mapreduce.Job: map 100% reduce 0%
20/11/25 10:33:13 INFO mapreduce.Job: Job job_1606266944151_0003 completed successfully
20/11/25 10:33:13 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=158446
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=394
HDFS: Number of bytes written=1708674492
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
S3A: Number of bytes read=1708674492
S3A: Number of bytes written=0
S3A: Number of read operations=3
S3A: Number of large read operations=0
S3A: Number of write operations=0
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=328500
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=164250
Total vcore-milliseconds taken by all map tasks=164250
Total megabyte-milliseconds taken by all map tasks=168192000
Map-Reduce Framework
Map input records=1
Map output records=0
Input split bytes=138
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=392
CPU time spent (ms)=27370
Physical memory (bytes) snapshot=231428096
Virtual memory (bytes) snapshot=2531233792
Total committed heap usage (bytes)=150994944
File Input Format Counters
Bytes Read=256
File Output Format Counters
Bytes Written=0
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESCOPIED=1708674492
BYTESEXPECTED=1708674492
COPY=1
[sshuser@e-001-hadoop-example-hd ~]$ hadoop fs -ls hdfs://hadoop-example/sampledata/
Found 1 items
-rw-r--r-- 2 sshuser hdfs 1708674492 2020-11-25 10:33 hdfs://hadoop-example/sampledata/AllstarFull.csv
Statement format: hadoop distcp -m 10 -bandwidth 100 s3a://[YOUR-BUCKET-NAME]/* hdfs://[YOUR-CLUSTER-NAME]/sampledata/
If you want to copy a single file, then you can transfer data using the hadoop put
command.