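Data migration using Object Storage

Available in a VPC environment.

Cloud Hadoop on NAVER Cloud Platform builds HDFS (Hadoop Distributed File System) on Block Storage and uses it as the default repository, and it also supports integration with Object Storage. Because Object Storage provides a public DNS, you can easily store and download data from any environment with internet access. This also means you can use Object Storage to move large volumes of data that need analysis into NAVER Cloud Platform from outside the platform.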

[Figure: hadoop-chadoop-use-ex6_0-0]

This guide explains how to migrate data from an external source to Object Storage on NAVER Cloud Platform, as shown in the figure above, and how to migrate it from Object Storage to Cloud Hadoop HDFS.

Prerequisites

Create a bucket in Object Storage to store the data.

Note

For details on creating a bucket, see the Object Storage user guide.

Migrating data from an external source to Object Storage

Object Storage on NAVER Cloud Platform is compatible with AWS S3, so you can use the AWS CLI as is.
For details, see the Object Storage CLI user guide.

Note

For information about the AWS CLI, see the awscli documentation.

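1. Prepare sample data

Download sample data to test data migration from an external source to Object Storage.

This guide uses the sample data file AllstarFull.csv for the migration task.

Note

The sample data provided is part of the Lahman's Baseball Database 2012 version, and all copyrights to the data belong to Sean Lahman.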

2. Install the AWS CLI

After accessing the edge node via SSH, install the AWS CLI with the pip install command.

[sshuser@e-001-hadoop-example-hd ~]$ sudo pip install awscli==1.15.85
DEPRECATION: Python 3.4 support has been deprecated. pip 19.1 will be the last one supporting it. Please upgrade your Python as Python 3.4 won't be maintained after March 2019 (cf PEP 429).
Collecting awscli==1.15.85
Downloading https://files.pythonhosted.org/packages/2a/2a/e5ae9191c388db103bc197a444260c8fd4f4f44a8183eb922cd5ebf183cf/awscli-1.15.85-py2.py3-none-any.whl (1.3MB)
100% |████████████████████████████████| 1.3MB 12.6MB/s
Collecting PyYAML<=3.13,>=3.10 (from awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/9e/a3/1d13970c3f36777c583f136c136f804d70f500168edc1edea6daa7200769/PyYAML-3.13.tar.gz (270kB)
100% |████████████████████████████████| 276kB 26.3MB/s
Collecting s3transfer<0.2.0,>=0.1.12 (from awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/d7/14/2a0004d487464d120c9fb85313a75cd3d71a7506955be458eebfe19a6b1d/s3transfer-0.1.13-py2.py3-none-any.whl (59kB)
100% |████████████████████████████████| 61kB 19.9MB/s
Collecting rsa<=3.5.0,>=3.1.2 (from awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/e1/ae/baedc9cb175552e95f3395c43055a6a5e125ae4d48a1d7a924baca83e92e/rsa-3.4.2-py2.py3-none-any.whl (46kB)
100% |████████████████████████████████| 51kB 20.2MB/s
Collecting colorama<=0.3.9,>=0.2.5 (from awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/db/c8/7dcf9dbcb22429512708fe3a547f8b6101c0d02137acbd892505aee57adf/colorama-0.3.9-py2.py3-none-any.whl
Collecting docutils>=0.10 (from awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/22/cd/a6aa959dca619918ccb55023b4cb151949c64d4d5d55b3f4ffd7eee0c6e8/docutils-0.15.2-py3-none-any.whl (547kB)
100% |████████████████████████████████| 552kB 22.5MB/s
Collecting botocore==1.10.84 (from awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/01/b7/cb08cd1af2bb0d0dfb393101a93b6ab6fb80f109ab7b37f2f34386c11351/botocore-1.10.84-py2.py3-none-any.whl (4.5MB)
100% |████████████████████████████████| 4.5MB 5.7MB/s
Collecting pyasn1>=0.1.3 (from rsa<=3.5.0,>=3.1.2->awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/62/1e/a94a8d635fa3ce4cfc7f506003548d0a2447ae76fd5ca53932970fe3053f/pyasn1-0.4.8-py2.py3-none-any.whl (77kB)
100% |████████████████████████████████| 81kB 25.1MB/s
Collecting jmespath<1.0.0,>=0.7.1 (from botocore==1.10.84->awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/07/cb/5f001272b6faeb23c1c9e0acc04d48eaaf5c862c17709d20e3469c6e0139/jmespath-0.10.0-py2.py3-none-any.whl
Collecting python-dateutil<3.0.0,>=2.1; python_version >= "2.7" (from botocore==1.10.84->awscli==1.15.85)
Downloading https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227kB)
100% |████████████████████████████████| 235kB 25.9MB/s
Requirement already satisfied: six>=1.5 in /usr/lib/python3.4/site-packages (from python-dateutil<3.0.0,>=2.1; python_version >= "2.7"->botocore==1.10.84->awscli==1.15.85) (1.15.0)
Installing collected packages: PyYAML, jmespath, python-dateutil, docutils, botocore, s3transfer, pyasn1, rsa, colorama, awscli
Running setup.py install for PyYAML ... done
Successfully installed PyYAML-3.13 awscli-1.15.85 botocore-1.10.84 colorama-0.3.9 docutils-0.15.2 jmespath-0.10.0 pyasn1-0.4.8 python-dateutil-2.8.1 rsa-3.4.2 s3transfer-0.1.13
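
After installation, it can help to confirm that the CLI is actually on the PATH before continuing. The following is a minimal sketch; the helper name require_cmd is hypothetical and is not part of the AWS CLI.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the named command is on the PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: not found" >&2
    return 1
  fi
}

# Usage on the edge node (not run here):
#   require_cmd aws && aws --version
```

If the check passes, aws --version prints the installed CLI version, which should correspond to the version pinned in the pip install command above.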

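3. Check authentication key information

To access the Object Storage bucket you created, you need to issue an API authentication key for NAVER Cloud Platform.

For details on how to issue and check an API authentication key, see the API guide.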

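4. Configure the environment

Configure the working environment with the following command, then verify it by running a command that specifies the Object Storage endpoint address.

Example bucket name: example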

[sshuser@e-001-hadoop-example-hd ~]$ sudo aws configure
AWS Access Key ID [None]: ACCESS_KEY_ID
AWS Secret Access Key [None]: SECRET_KEY
Default region name [None]:
Default output format [None]:
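
Every AWS CLI command in this guide repeats the same --endpoint-url option, so a small wrapper can reduce typing errors. This is a sketch only: the function name ncp_s3 and the variable NCP_S3_ENDPOINT are hypothetical, and it assumes credentials are configured for the user who runs it (this guide configures them with sudo, so the same prefix may be needed).

```shell
#!/usr/bin/env bash
# Endpoint address taken from the commands in this guide.
NCP_S3_ENDPOINT="https://kr.object.ncloudstorage.com"

# Hypothetical wrapper: forwards its arguments to "aws s3" with the endpoint set.
ncp_s3() {
  aws --endpoint-url="$NCP_S3_ENDPOINT" s3 "$@"
}

# Usage (not run here):
#   ncp_s3 ls
#   ncp_s3 cp AllstarFull.csv s3://example/
```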

Check the bucket created with the name example

[sshuser@e-001-hadoop-example-hd ~]$ sudo aws --endpoint-url=https://kr.object.ncloudstorage.com s3 ls 
2020-11-25 08:53:42 example

5. Upload data

Use the AWS CLI cp command to upload the data to Object Storage, and then check that it was uploaded successfully.

[sshuser@e-001-hadoop-example-hd ~]$ sudo aws --endpoint-url=https://kr.object.ncloudstorage.com s3 cp AllstarFull.csv s3://example/
upload: ./AllstarFull.csv to s3://example/AllstarFull.csv

[sshuser@e-001-hadoop-example-hd ~]$ sudo aws --endpoint-url=https://kr.object.ncloudstorage.com s3 ls s3://example/
2020-11-25 09:37:50 1708674492 AllstarFull.csv
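
One way to check the upload beyond reading the listing is to compare the object size in the listing with the local file size. The sketch below assumes the listing line has the date, time, size, and name layout shown above; the helper names s3_ls_size and local_size are hypothetical.

```shell
#!/usr/bin/env bash
# Print the size column (third field) of an "aws s3 ls" object line read from stdin.
s3_ls_size() {
  awk '{ print $3 }'
}

# Print the size of a local file in bytes.
local_size() {
  wc -c < "$1" | tr -d ' '
}

# Example with the listing line shown above:
echo "2020-11-25 09:37:50 1708674492 AllstarFull.csv" | s3_ls_size
```

On the edge node, the printed value can be compared with the result of local_size AllstarFull.csv; matching sizes suggest the upload is complete, although they do not prove the contents are identical.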

Note

s3://[YOUR-BUCKET-NAME]/

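Migrating data from Object Storage to Cloud Hadoop HDFS

Data that has been migrated to Object Storage can then be migrated to Cloud Hadoop HDFS.

1. Access the Cloud Hadoop edge node

Access the edge node of the Cloud Hadoop cluster where you want to run the task.
For details on accessing an edge node, see the Access cluster nodes via SSH guide.

2. Check access

Use the following commands on the edge node to check whether you can access HDFS and the Object Storage bucket.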

[sshuser@e-001-hadoop-example-hd ~]$ hadoop fs -ls hdfs://hadoop-example/
Found 12 items
drwxrwxrwx - yarn hadoop 0 2020-11-25 10:17 hdfs://hadoop-example/app-logs
drwxr-xr-x - hdfs hdfs 0 2020-11-25 10:16 hdfs://hadoop-example/apps
drwxr-xr-x - yarn hadoop 0 2020-11-25 10:15 hdfs://hadoop-example/ats
drwxr-xr-x - hdfs hdfs 0 2020-11-25 10:15 hdfs://hadoop-example/hdp
drwx------ - livy hdfs 0 2020-11-25 10:15 hdfs://hadoop-example/livy-recovery
drwx------ - livy hdfs 0 2020-11-25 10:16 hdfs://hadoop-example/livy2-recovery
drwxr-xr-x - mapred hdfs 0 2020-11-25 10:15 hdfs://hadoop-example/mapred
drwxrwxrwx - mapred hadoop 0 2020-11-25 10:15 hdfs://hadoop-example/mr-history
drwxrwxrwx - spark hadoop 0 2020-11-25 10:20 hdfs://hadoop-example/spark-history
drwxrwxrwx - spark hadoop 0 2020-11-25 10:20 hdfs://hadoop-example/spark2-history
drwxrwxrwx - hdfs hdfs 0 2020-11-25 10:16 hdfs://hadoop-example/tmp
drwxr-xr-x - hdfs hdfs 0 2020-11-25 10:16 hdfs://hadoop-example/user

[sshuser@e-001-hadoop-example-hd ~]$ hadoop fs -mkdir hdfs://hadoop-example/sampledata/

[sshuser@e-001-hadoop-example-hd ~]$ hadoop fs -ls s3a://example/

Note

hadoop fs -ls hdfs://[YOUR-CLUSTER-NAME]/
hadoop fs -ls s3a://[YOUR-BUCKET-NAME]/
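
The hadoop fs -mkdir command used above reports an error if the directory already exists, which matters when the steps are repeated. A minimal sketch of an idempotent variant follows; it relies on the hadoop fs -test -d and hadoop fs -mkdir -p options, and the helper name ensure_hdfs_dir is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create an HDFS directory only if it does not exist yet.
ensure_hdfs_dir() {
  if hadoop fs -test -d "$1"; then
    echo "exists: $1"
  else
    hadoop fs -mkdir -p "$1" && echo "created: $1"
  fi
}

# Usage on the edge node (not run here):
#   ensure_hdfs_dir hdfs://hadoop-example/sampledata/
```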

3. Migrate data

Use distcp, the Hadoop command for copying large numbers of files, as shown below to migrate the data, and then check that the migration completed successfully.

[sshuser@e-001-hadoop-example-hd ~]$ sudo -u {account-name} hadoop distcp -m 10 -bandwidth 100 s3a://example/* hdfs://hadoop-example/sampledata/
20/11/25 10:30:14 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, append=false, useDiff=false, fromSnapshot=null, toSnapshot=null, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=10, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[s3a://example/*], targetPath=hdfs://hadoop-example/sampledata, targetPathExists=true, filtersFile='null', verboseLog=false}
20/11/25 10:30:15 INFO client.AHSProxy: Connecting to Application History server at m-002-hadoop-example-hd/10.41.73.166:10200
20/11/25 10:30:16 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
20/11/25 10:30:16 INFO tools.SimpleCopyListing: Build file listing completed.
20/11/25 10:30:16 INFO tools.DistCp: Number of paths in the copy list: 1
20/11/25 10:30:16 INFO tools.DistCp: Number of paths in the copy list: 1
20/11/25 10:30:16 INFO client.AHSProxy: Connecting to Application History server at m-002-hadoop-example-hd/10.41.73.166:10200
20/11/25 10:30:17 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
20/11/25 10:30:17 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
20/11/25 10:30:17 INFO mapreduce.JobSubmitter: number of splits:1
20/11/25 10:30:17 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1606266944151_0003
20/11/25 10:30:18 INFO impl.YarnClientImpl: Submitted application application_1606266944151_0003
20/11/25 10:30:18 INFO mapreduce.Job: The url to track the job: http://m-002-hadoop-example-hd:8088/proxy/application_1606266944151_0003/
20/11/25 10:30:18 INFO tools.DistCp: DistCp job-id: job_1606266944151_0003
20/11/25 10:30:18 INFO mapreduce.Job: Running job: job_1606266944151_0003
20/11/25 10:30:26 INFO mapreduce.Job: Job job_1606266944151_0003 running in uber mode : false
20/11/25 10:30:26 INFO mapreduce.Job: map 0% reduce 0%
20/11/25 10:30:39 INFO mapreduce.Job: map 100% reduce 0%
20/11/25 10:33:13 INFO mapreduce.Job: Job job_1606266944151_0003 completed successfully
20/11/25 10:33:13 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=158446
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=394
HDFS: Number of bytes written=1708674492
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
S3A: Number of bytes read=1708674492
S3A: Number of bytes written=0
S3A: Number of read operations=3
S3A: Number of large read operations=0
S3A: Number of write operations=0
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=328500
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=164250
Total vcore-milliseconds taken by all map tasks=164250
Total megabyte-milliseconds taken by all map tasks=168192000
Map-Reduce Framework
Map input records=1
Map output records=0
Input split bytes=138
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=392
CPU time spent (ms)=27370
Physical memory (bytes) snapshot=231428096
Virtual memory (bytes) snapshot=2531233792
Total committed heap usage (bytes)=150994944
File Input Format Counters
Bytes Read=256
File Output Format Counters
Bytes Written=0
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESCOPIED=1708674492
BYTESEXPECTED=1708674492
COPY=1

[sshuser@e-001-hadoop-example-hd ~]$ hadoop fs -ls hdfs://hadoop-example/sampledata/
Found 1 items
-rw-r--r-- 2 sshuser hdfs 1708674492 2020-11-25 10:33 hdfs://hadoop-example/sampledata/AllstarFull.csv
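
Besides listing the target directory, the BYTESCOPIED and BYTESEXPECTED counters at the end of the distcp output can be compared. The sketch below assumes the distcp output was saved to a file first; the helper name distcp_bytes_match and the file name distcp.log are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a saved distcp log on stdin and succeed only if
# both counters are present and BYTESCOPIED equals BYTESEXPECTED.
distcp_bytes_match() {
  awk -F= '
    /BYTESCOPIED=/   { copied = $2 }
    /BYTESEXPECTED=/ { expected = $2 }
    END { exit !(copied != "" && copied == expected) }
  '
}

# Usage (not run here), assuming the output was saved to distcp.log:
#   distcp_bytes_match < distcp.log && echo "byte counts match"
```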

Note

Syntax: hadoop distcp -m 10 -bandwidth 100 s3a://[YOUR-BUCKET-NAME]/* hdfs://[YOUR-CLUSTER-NAME]/sampledata/
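
In this syntax, -m sets the maximum number of simultaneous map tasks used for copying, and -bandwidth sets the bandwidth per map in MB per second. The sketch below only prints the command for a given bucket and cluster so it can be reviewed before running; the helper name distcp_cmd is hypothetical, and its defaults are the values used in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the distcp command for a bucket and a cluster.
# $1 = bucket name, $2 = cluster name,
# $3 = maximum maps (-m, default 10), $4 = MB/s per map (-bandwidth, default 100)
distcp_cmd() {
  echo "hadoop distcp -m ${3:-10} -bandwidth ${4:-100} s3a://$1/* hdfs://$2/sampledata/"
}

# Prints the command used in this guide:
distcp_cmd example hadoop-example
```

Lower values for the third and fourth arguments reduce the load the copy places on the cluster and the network, at the cost of a longer run.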

Note

To copy a single file, you can migrate the data with the hadoop fs -put command.
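
For the single-file case, a minimal sketch might look like the following. It assumes the file is on the edge node's local disk; the helper name put_file is hypothetical, and the target path is the example directory used in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy one local file into an HDFS directory,
# checking first that the local file exists.
put_file() {
  if [ ! -f "$1" ]; then
    echo "no such local file: $1" >&2
    return 1
  fi
  hadoop fs -put "$1" "$2"
}

# Usage on the edge node (not run here):
#   put_file AllstarFull.csv hdfs://hadoop-example/sampledata/
```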