Getting started with Data Forest
Available in VPC
If you've checked the application specifications provided by Data Forest and reviewed the usage scenarios, you're ready to start using Data Forest. This guide describes the process of creating a notebook and configuring the client environment to access Data Forest and Data Forest apps.
Create notebook
The following describes how to create a notebook.
Preparations
Create a VPC and a subnet to establish effective network access control.
- Click the Services > Big Data & Analytics > Data Forest menus, in that order.
- Click the [Create notebook] button from Notebooks.
- Enter the notebook settings information, and then click the [Next] button.
- Account name: enter "df123"
- Notebook name: enter "my-notebook"
- VPC/Subnet: select the VPC and subnet you created during the preparations
- If user settings are required, enter the relevant information.
- Select an authentication key that you have from Set authentication key or create a new one, and click the [Next] button.
- After the final check, click the [Create] button.
Set up development environments from notebook nodes
Once the notebook has been created, you can configure a development environment that makes it easy to access the Data Forest cluster and its apps through a Docker container in the VPC environment.
This scenario assumes that the host is CentOS 7.3.
Step 1. Connect to notebook node and Docker
There are two ways to access the Docker container running on the notebook node.
- Accessing the container through the notebook web UI
- Connecting to the running container after accessing the notebook node via SSH (see the sketch below)
- For more information about how to connect to the notebook, see Create and manage notebook.
- The notebook Docker container provided by Data Forest has a separate overlay network configuration so that it can access the network where the Data Forest apps are running.
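If you use the SSH method, a minimal sketch of attaching to the running container looks like the following. The key file, node IP, and container ID are placeholders; use the actual values from your notebook's access information.
# On your client: connect to the notebook node (placeholder key file and IP)
$ ssh -i <authentication-key>.pem <user>@<notebook-node-ip>
# On the notebook node: find the notebook container and open a shell in it
$ docker ps
$ docker exec -it <container-id> /bin/bash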
Step 2. Confirm and authenticate keytab
To access Data Forest components, you must complete Kerberos authentication. Use the keytab file downloaded from the access information after creating the account.
When a notebook node is created, the keytab file of the Data Forest account is downloaded into the Docker container. The file is located in the following path.
- User keytab download path: /home/forest/keytabs (the keytabs directory under the forest account's home directory)
Run commands as follows to authenticate.
[forest@0242f09990ad ~][df]$ cd keytabs/
[forest@0242f09990ad keytabs][df]$ ll
total 4
-rw-r--r-- 1 forest forest 218 Dec 21 15:19 df.example.keytab
[forest@0242f09990ad keytabs][df]$ kinit example -kt df.example.keytab
[forest@0242f09990ad keytabs][df]$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: example@KR.DF.NAVERNCP.COM
Valid starting Expires Service principal
12/21/2020 17:07:42 12/22/2020 17:07:42 krbtgt/KR.DF.NAVERNCP.COM@KR.DF.NAVERNCP.COM
renew until 12/28/2020 17:07:42
Run the kdestroy command to delete the authentication history.
[forest@0242f09990ad keytabs][df]$ kdestroy
[forest@0242f09990ad keytabs][df]$ klist
klist: No credentials cache found (filename: /tmp/krb5cc_500)
Without a keytab file, user authentication can't be performed, and permission errors may occur in all operations.
Step 3. Use development environment
1. Confirm environment variables
The environment variables required for using commands such as hadoop, yarn, and spark-submit have already been set.
[forest@0242f09990ad keytabs][df]$ cat /etc/profile.d/zz-df-env.sh
# DO NOT EDIT THIS LINE
# FOR CLUSTER df ENVIRONMENTS
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf
export SPARK_CONF_DIR=/etc/spark2/conf
...
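You can confirm the variables are applied to your shell session by printing one of them:
$ echo $HADOOP_CONF_DIR
/etc/hadoop/conf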
2. Run various commands
[forest@0242f09990ad keytabs][df]$ hadoop fs -touch /user/example/test.txt
[forest@0242f09990ad keytabs][df]$ hadoop fs -ls
Found 4 items
drwxr-xr-x - example services 0 2020-12-21 16:33 .sparkStaging
drwxr-x--- - example services 0 2020-12-21 15:21 .yarn
-rw------- 3 example services 0 2020-12-21 17:10 test.txt
- You can't access files located in paths other than the user's HDFS home (/user/${USER}).
- If user authentication hasn't been completed, the message xxxxx appears when running commands. For authentication, see Authenticate and delete authentication history.
You can view the applications users created and change their status as follows:
[forest@0242f09990ad keytabs][df]$ yarn app -list
20/12/21 17:11:43 INFO client.AHSProxy: Connecting to Application History server at rm1.kr.df.naverncp.com/10.213.198.24:10200
Total number of applications (application-types: [], states: [SUBMITTED, ACCEPTED, RUNNING] and tags: []):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1608526482493_0002 dev yarn-service example dev RUNNING UNDEFINED 100% N/A
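Besides listing applications, the standard Hadoop 3 YARN service CLI can change an application's status; for example, using the application name from the listing above:
# Stop the running service application named "dev", then start it again
$ yarn app -stop dev
$ yarn app -start dev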
You can view Oozie jobs as follows:
[forest@0242f09990ad keytabs][df]$ oozie jobs
Job ID App Name Status User Group Started Ended
------------------------------------------------------------------------------------------------------------------------------------
0000000-201125175300661-oz-df-W no-op-wf SUCCEEDED example - 2020-11-25 08:56 GMT 2020-11-25 08:56 GMT
------------------------------------------------------------------------------------------------------------------------------------
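For details on a single workflow, the standard Oozie CLI provides the job -info subcommand; for example, with the job ID from the list above:
$ oozie job -info 0000000-201125175300661-oz-df-W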
You can run commands using spark-shell as follows:
[forest@f095a749f891 ~][df]$ spark-shell --master local
Warning: Ignoring non-spark config property: history.server.spnego.keytab.file=/etc/security/keytabs/spnego.service.keytab
Warning: Ignoring non-spark config property: history.server.spnego.kerberos.principal=HTTP/_HOST@KR.DF.NAVERNCP.COM
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://f095a749f891:4040
Spark context available as 'sc' (master = local, app id = local-1608542188370).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2.3.1.0.0-78
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@b90c5a5
You can use spark-submit to submit JAR files. In the example below, the build was done under the name "example.jar," and Spark2's README.md file was used as the input text file. You can check the word count result in the stdout log of the application.
[forest@090aea7192a2 ~][df]$ spark-submit --class com.naverncp.example.SparkWordCount \
--master yarn --deploy-mode cluster --executor-memory 1g --name wordcount --conf "spark.app.id=wordcount" \
example.jar file:///usr/hdp/current/spark2-client/README.md
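Once the application finishes, one standard way to check the stdout log is the yarn logs command; the application ID below is a placeholder for the ID printed by spark-submit:
# Fetch only the stdout log files of the application
$ yarn logs -applicationId <application-id> -log_files stdout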
The SparkWordCount.scala code is as follows:
package com.naverncp.example

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount Example"))
    val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))
    val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)
    println(wordCounts.collect().mkString(", "))
  }
}
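The build itself is not covered here; as one hypothetical setup, an sbt project with this file under src/main/scala could be packaged as follows (the project layout and jar name are assumptions, not part of Data Forest):
# Package the project; the jar name under target/ depends on your sbt settings
$ sbt package
$ cp target/scala-2.11/<project-name>_2.11-<version>.jar example.jar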
3. Configure client for app
Steps 1 to 3 covered how to check the client configuration for the multi-tenant cluster. This section explains how to configure the client for the HBASE-2.0.0, HBASE-2.2.3, and KAFKA-2.4.0 apps. Additional environment variables must be set before configuring the client.
Run get-app-env.sh to automatically set up the client environment variables for a Data Forest app.
$ pwd
/home/forest
$ mkdir ${DIR}
$ sh /home/forest/get-app-env.sh ${APP_NAME} ~/${DIR}
HBASE-2.0.0
The following describes how to configure the client for the HBASE-2.0.0 app (app name: secure-hbase).
[forest@0242f09990ad ~][df]$ mkdir secure-hbase
[forest@0242f09990ad ~][df]$ sh /home/forest/get-app-env.sh secure-hbase ~/secure-hbase
[/home/forest/get-app-env.sh] Apptype: HBASE-2.0.0
[/home/forest/get-app-env.sh] Download install-client script for HBASE-2.0.0
[/home/forest/get-app-env.sh] Install client on /home/forest/secure-hbase
current secure-hbase: .yarn/services/secure-hbase/components/v1
HBase-2.0.0 Client has been installed on /home/forest/secure-hbase
==============================================================================================
kinit <user>
export HBASE_CONF_DIR=/home/forest/secure-hbase
hbase shell
==============================================================================================
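Following the instructions printed above, you can verify the client by authenticating, exporting HBASE_CONF_DIR, and running a simple HBase shell command such as list:
$ kinit example -kt ~/keytabs/df.example.keytab
$ export HBASE_CONF_DIR=/home/forest/secure-hbase
$ hbase shell
hbase(main):001:0> list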
HBASE-2.2.3
The following describes how to configure the client for the HBASE-2.2.3 app (app name: unsecure-hbase).
$ mkdir unsecure-hbase
$ sh /home/forest/get-app-env.sh unsecure-hbase ~/unsecure-hbase
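The script prints usage instructions analogous to the HBASE-2.0.0 case; assuming the same pattern, the client would be verified as follows:
$ export HBASE_CONF_DIR=/home/forest/unsecure-hbase
$ hbase shell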
KAFKA-2.4.0
The following describes how to configure the client for the KAFKA-2.4.0 app (app name: kafka).
$ mkdir kafka
$ sh /home/forest/get-app-env.sh kafka ~/kafka
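With the client environment variables set, the standard Kafka console tools can verify connectivity. This assumes the Kafka client scripts are on your PATH; the broker address is a placeholder that depends on your app's access information:
# List topics on the cluster (placeholder broker address)
$ kafka-topics.sh --bootstrap-server <broker-host>:<port> --list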