Getting started with Data Forest
Available in VPC
If you've checked the application specifications provided by Data Forest and reviewed the usage scenarios, you're ready to start using Data Forest. This guide describes the process of creating a notebook and configuring the client environment to access Data Forest and Data Forest apps.
Create notebook
The following describes how to create a notebook.
Preparations
Create a VPC and a subnet to establish effective network access control.
- Click the Services > Big Data & Analytics > Data Forest menus, in that order.
- Click the [Create notebook] button from Notebooks.
- Enter the notebook settings information, and then click the [Next] button.
- Account name: enter "df123"
- Notebook name: enter "my-notebook"
- VPC/Subnet: select the VPC and subnet you created during the preparations
- If user settings are required, enter the relevant information.
- Select an authentication key that you have from Set authentication key or create a new one, and click the [Next] button.
- After the final check, click the [Create] button.
Set up development environments from notebook nodes
Once the notebook has been created, you can configure a development environment that makes it easy to access the Data Forest cluster and its apps through a Docker container in the VPC environment.
This scenario assumes that the host is CentOS 7.3.
Step 1. Connect to notebook node and Docker
There are two ways to access the Docker container running on the notebook node.
- Accessing the container through the notebook web UI
- Connecting to the running container after accessing the notebook node via SSH (see the sketch below)
- For more information about how to connect to the notebook, see Create and manage notebook.
- The notebook Docker container provided by Data Forest has a separate overlay network configuration so that it can access the network where the Data Forest apps are running.
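If you use the SSH method, a minimal sketch of attaching to the running container looks like the following. The key file, node IP, and container ID are placeholders; use the actual values from your notebook's access information.
# On your client: connect to the notebook node (placeholder key file and IP)
$ ssh -i <authentication-key>.pem <user>@<notebook-node-ip>
# On the notebook node: find the notebook container and open a shell in it
$ docker ps
$ docker exec -it <container-id> /bin/bash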
Step 2. Confirm and authenticate keytab
To access Data Forest components, you must complete Kerberos authentication. Use the keytab file downloaded from the access information after creating the account.
When a notebook node is created, the keytab file of the Data Forest account is downloaded into the Docker container. The file is located in the following path.
- User keytab download path: /home/forest/keytabs (the keytabs directory under the forest account's home directory)
Run commands as follows to authenticate.
[forest@0242f09990ad ~][df]$ cd keytabs/
[forest@0242f09990ad keytabs][df]$ ll
total 4
-rw-r--r-- 1 forest forest 218 Dec 21 15:19 df.example.keytab
[forest@0242f09990ad keytabs][df]$ kinit example -kt df.example.keytab
[forest@0242f09990ad keytabs][df]$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: example@KR.DF.NAVERNCP.COM
Valid starting Expires Service principal
12/21/2020 17:07:42 12/22/2020 17:07:42 krbtgt/KR.DF.NAVERNCP.COM@KR.DF.NAVERNCP.COM
renew until 12/28/2020 17:07:42
Run the kdestroy command to delete the authentication history.
[forest@0242f09990ad keytabs][df]$ kdestroy
[forest@0242f09990ad keytabs][df]$ klist
klist: No credentials cache found (filename: /tmp/krb5cc_500)
Without a keytab file, user authentication can't be performed, and permission errors may occur in all operations.
Step 3. Use development environment
1. Confirm environment variables
The environment variables required for using commands such as hadoop, yarn, and spark-submit have already been set.
[forest@0242f09990ad keytabs][df]$ cat /etc/profile.d/zz-df-env.sh
# DO NOT EDIT THIS LINE
# FOR CLUSTER df ENVIRONMENTS
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf
export SPARK_CONF_DIR=/etc/spark2/conf
...
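You can confirm the variables are applied to your shell session by printing one of them:
$ echo $HADOOP_CONF_DIR
/etc/hadoop/conf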
2. Run various commands
[forest@0242f09990ad keytabs][df]$ hadoop fs -touch /user/example/test.txt
[forest@0242f09990ad keytabs][df]$ hadoop fs -ls
Found 4 items
drwxr-xr-x - example services 0 2020-12-21 16:33 .sparkStaging
drwxr-x--- - example services 0 2020-12-21 15:21 .yarn
-rw------- 3 example services 0 2020-12-21 17:10 test.txt
- You can't access files located in paths other than the user's HDFS home (/user/${USER}).
- If user authentication hasn't been completed, the message xxxxx appears when running commands. For authentication, see Authenticate and delete authentication history.
You can view the applications users created and change their status as follows:
[forest@0242f09990ad keytabs][df]$ yarn app -list
20/12/21 17:11:43 INFO client.AHSProxy: Connecting to Application History server at rm1.kr.df.naverncp.com/10.213.198.24:10200
Total number of applications (application-types: [], states: [SUBMITTED, ACCEPTED, RUNNING] and tags: []):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1608526482493_0002 dev yarn-service example dev RUNNING UNDEFINED 100% N/A
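Besides listing applications, the standard Hadoop 3 YARN service CLI can change an application's status; for example, using the application name from the listing above:
# Stop the running service application named "dev", then start it again
$ yarn app -stop dev
$ yarn app -start dev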
You can view Oozie jobs as follows:
[forest@0242f09990ad keytabs][df]$ oozie jobs
Job ID App Name Status User Group Started Ended
------------------------------------------------------------------------------------------------------------------------------------
0000000-201125175300661-oz-df-W no-op-wf SUCCEEDED example - 2020-11-25 08:56 GMT 2020-11-25 08:56 GMT
------------------------------------------------------------------------------------------------------------------------------------
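For details on a single workflow, the standard Oozie CLI provides the job -info subcommand; for example, with the job ID from the list above:
$ oozie job -info 0000000-201125175300661-oz-df-W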
You can run commands using spark-shell as follows:
[forest@f095a749f891 ~][df]$ spark-shell --master local
Warning: Ignoring non-spark config property: history.server.spnego.keytab.file=/etc/security/keytabs/spnego.service.keytab
Warning: Ignoring non-spark config property: history.server.spnego.kerberos.principal=HTTP/_HOST@KR.DF.NAVERNCP.COM
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://f095a749f891:4040
Spark context available as 'sc' (master = local, app id = local-1608542188370).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2.3.1.0.0-78
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@b90c5a5
You can use spark-submit to submit JAR files. In the example below, the build was done under the name "example.jar," and Spark2's README.md file was used as the input text file. You can check the word count result in the stdout log of the application.
[forest@090aea7192a2 ~][df]$ spark-submit --class com.naverncp.example.SparkWordCount \
--master yarn --deploy-mode cluster --executor-memory 1g --name wordcount --conf "spark.app.id=wordcount" \
example.jar file:///usr/hdp/current/spark2-client/README.md
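Once the application finishes, one standard way to check the stdout log is the yarn logs command; the application ID below is a placeholder for the ID printed by spark-submit:
# Fetch only the stdout log files of the application
$ yarn logs -applicationId <application-id> -log_files stdout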
The SparkWordCount.scala code is as follows:
package com.naverncp.example

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount Example"))
    val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))
    val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)
    println(wordCounts.collect().mkString(", "))
  }
}
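The build itself is not covered here; as one hypothetical setup, an sbt project with this file under src/main/scala could be packaged as follows (the project layout and jar name are assumptions, not part of Data Forest):
# Package the project; the jar name under target/ depends on your sbt settings
$ sbt package
$ cp target/scala-2.11/<project-name>_2.11-<version>.jar example.jar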
3. Configure client for app
Steps 1 to 3 covered how to check the client configuration for the multi-tenant cluster. This section explains how to configure the client for the HBASE-2.0.0, HBASE-2.2.3, and KAFKA-2.4.0 apps. Additional environment variables must be set before configuring the client.
Run get-app-env.sh to automatically set up the client environment variables for a Data Forest app.
$ pwd
/home/forest
$ mkdir ${DIR}
$ sh /home/forest/get-app-env.sh ${APP_NAME} ~/${DIR}
HBASE-2.0.0
The following describes how to configure the client for the HBASE-2.0.0 app (app name: secure-hbase).
[forest@0242f09990ad ~][df]$ mkdir secure-hbase
[forest@0242f09990ad ~][df]$ sh /home/forest/get-app-env.sh secure-hbase ~/secure-hbase
[/home/forest/get-app-env.sh] Apptype: HBASE-2.0.0
[/home/forest/get-app-env.sh] Download install-client script for HBASE-2.0.0
[/home/forest/get-app-env.sh] Install client on /home/forest/secure-hbase
current secure-hbase: .yarn/services/secure-hbase/components/v1
HBase-2.0.0 Client has been installed on /home/forest/secure-hbase
==============================================================================================
kinit <user>
export HBASE_CONF_DIR=/home/forest/secure-hbase
hbase shell
==============================================================================================
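Following the instructions printed above, you can verify the client by authenticating, exporting HBASE_CONF_DIR, and running a simple HBase shell command such as list:
$ kinit example -kt ~/keytabs/df.example.keytab
$ export HBASE_CONF_DIR=/home/forest/secure-hbase
$ hbase shell
hbase(main):001:0> list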
HBASE-2.2.3
The following describes how to configure the client for the HBASE-2.2.3 app (app name: unsecure-hbase).
$ mkdir unsecure-hbase
$ sh /home/forest/get-app-env.sh unsecure-hbase ~/unsecure-hbase
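The script prints usage instructions analogous to the HBASE-2.0.0 case; assuming the same pattern, the client would be verified as follows:
$ export HBASE_CONF_DIR=/home/forest/unsecure-hbase
$ hbase shell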
KAFKA-2.4.0
The following describes how to configure the client for the KAFKA-2.4.0 app (app name: kafka).
$ mkdir kafka
$ sh /home/forest/get-app-env.sh kafka ~/kafka
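With the client environment variables set, the standard Kafka console tools can verify connectivity. This assumes the Kafka client scripts are on your PATH; the broker address is a placeholder that depends on your app's access information:
# List topics on the cluster (placeholder broker address)
$ kafka-topics.sh --bootstrap-server <broker-host>:<port> --list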