Registering Spark batch jobs to Oozie scheduler

    Available in VPC

    This guide explains how to register Spark batch jobs to the Apache Oozie scheduler in Data Forest. The source data for this batch application is stored in Elasticsearch.


    Step 1. Preparations

    1. Create Data Forest account

    • Create an account called example.
    • Download the Kerberos keytab in advance because it is required to submit jobs. (You can download it from Cluster access information > Download Kerberos Keytab.) A quick check of the keytab is shown below.
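    To confirm the keytab is valid, you can authenticate with it and check the ticket from any node that has Kerberos tools installed (for example, the Dev app created below). The placeholders follow the same format used later in this guide.

      # Authenticate with the downloaded keytab, then confirm that a ticket was issued
      $ kinit -kt {enter your keytab} {user name}@KR.DF.NAVERNCP.COM
      $ klist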

    2. Create Data Forest app

    • Elasticsearch: It's the storage where raw data (logs, etc.) is accumulated.
    • Kibana: It's used to search and visualize indices. The Elasticsearch app must be created before the Kibana app; select the name of the Elasticsearch app created earlier in App linkage information.
    • Dev: It's used as a client to submit workflows to an Oozie server.

    3. Inject sample data

    • This example uses the Sample eCommerce orders sample data provided by the Kibana app as raw data, rather than having a separate data collection module.
    • Create the sample index by clicking [Load a data set and a Kibana dashboard] > [Sample eCommerce orders] in the Kibana UI. You can check that the index was created as shown below.
    • Kibana's address can be viewed in the Kibana app's Quick links.
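    As an optional check (authentication settings for your Elasticsearch app may require additional curl options), you can confirm the index from the Dev app with the Elasticsearch _cat API, using the elasticsearch.hosts.inside-of-cluster address from the Elasticsearch app's Quick links.

      # List the sample index loaded by Kibana
      $ curl "{enter your elasticsearch.hosts.inside-of-cluster address}/_cat/indices/kibana_sample_data_ecommerce?v"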

    Step 2. Build Spark app

    You have to build a Spark app to run. The Spark app used in this example refines the kibana_sample_data_ecommerce index saved in Elasticsearch so that it can be used as an index again. The job imports the data from the 24 hours before the point of execution, processes it, and then saves the result to Elasticsearch as a new index. You can also download the already built JAR file and use it.
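    How you build the JAR depends on your project. As a hypothetical sketch only (the build tool is not specified in this guide), a Scala project using the sbt-assembly plugin could produce the fat JAR as follows.

    # Hypothetical build, assuming an sbt project with the sbt-assembly plugin
    $ sbt clean assembly
    # The resulting fat JAR (e.g., sample-analyzer-1-assembly-0.1.jar) is the file
    # deployed to HDFS in Step 3.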

    Step 3. Run workflow with Oozie coordinator

    All jobs are executed in the Dev app. All client configurations required to use Data Forest are built into the Dev app.

    Oozie is a scheduler system that manages workflows of Apache Hadoop jobs such as MapReduce. A workflow is a Directed Acyclic Graph (DAG) consisting of flow nodes (start, end, decision, etc.) and action nodes (mr, pig, shell, etc.). By adding multiple batch applications, together with the scripts required before and after them, to a workflow, you can see the flow at a glance and monitor it.

    As an example, we'll create a very simple workflow and run it with a coordinator so that it runs once a day.


    1. Write shell action script

    This is an example of running spark-submit by adding a shell action to the workflow.
    Write a shell script to run. The SPARK_JAR built earlier takes the Elasticsearch node address as an argument, so find elasticsearch.hosts.inside-of-cluster in Quick links and use it.

    Caution

    You should use the address specified in elasticsearch.hosts when accessing the Elasticsearch app from outside of the cluster.

    spark_submit_action.sh

    #!/bin/sh 
    CLASS=$1
    SPARK_JAR=$2
    QUEUE=$3
    
    export HDP_VERSION=3.1.0.0-78
    
    # Authenticate Kerberos with keytab
    kinit -kt ./{enter your keytab} {user name}@KR.DF.NAVERNCP.COM
    
    # Spark configuration path
    export SPARK_HOME=./spark2.tar.gz/spark2
    export SPARK_CONF_DIR=./spark_conf.tar.gz/conf
    
    # Hadoop configuration path
    export HADOOP_CONF_DIR=./hadoop_conf.tar.gz/conf
    
    # Submit Spark job
    ${SPARK_HOME}/bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --queue $QUEUE \
    --num-executors 2 \
    --executor-memory 1G \
    --driver-memory 1G \
    --executor-cores 1 \
    --name $CLASS \
    --class $CLASS \
    $SPARK_JAR "{enter your elasticsearch.hosts.inside-of-cluster address}"
    

    Check the paths of SPARK_HOME, SPARK_CONF_DIR, and HADOOP_CONF_DIR in the script above. All files used in the script are referenced by the paths at which they are uploaded to the distributed cache. Data Forest does not provide a Spark package, so you have to archive the package and its configuration with tar and upload them to the distributed cache. Refer to Build and deploy directory for how to upload them.

    2. Write job.properties, workflow.xml, and coordinator.xml

    Define the parameters to be used in the XML files in job.properties. This file does not need to be deployed separately; it only needs to exist on the machine where you run the Oozie CLI.

    # cluster configuration 
    nameNode=hdfs://koya
    jobTracker=rm1
    
    # job directory information
    homeDir=${nameNode}/user/{user name}/myproject/oozie
    workflowDir=${homeDir}/workflow
    
    # oozie configuration
    # this is where you deploy coordinator.xml into
    oozie.coord.application.path=${homeDir}/coordinator
    
    # user information
    user.name={user name}
    
    # job configuration
    class=ecomm.BatchProducerRunnerSpark
    queueName=longlived
    shellActionScript=spark_submit_action.sh
    sparkJar=sample-analyzer-1-assembly-0.1.jar
    

    workflow.xml can define a shell action as shown below. You can see that the parameters defined in job.properties are used.

    <workflow-app name="exampleWorkflowJob" xmlns="uri:oozie:workflow:0.5">
    	<start to="exampleShellAction"/>
    
    	<action name="exampleShellAction">
    		<shell xmlns="uri:oozie:shell-action:0.1">
    			<job-tracker>${jobTracker}</job-tracker>
    			<name-node>${nameNode}</name-node>
    
    			<exec>./action.sh</exec>
          <argument>${class}</argument>
    			<argument>${sparkJar}</argument>
    			<argument>${queueName}</argument>
    
    			<file>${workflowDir}/${shellActionScript}#action.sh</file>
    			<file>${workflowDir}/${sparkJar}</file>
          		<file>${homeDir}/{enter your keytab}</file>
    			<archive>${homeDir}/hadoop_conf.tar.gz</archive>
    			<archive>${homeDir}/spark_conf.tar.gz</archive>
    			<archive>${homeDir}/spark2.tar.gz</archive>
    
    			<capture-output/>
    		</shell>
    		<ok to="end"/>
    		<error to="fail"/>
    	</action>
    
    	<kill name="fail">
    		<message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    	</kill>
    	<end name="end"/>
    </workflow-app>
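
    Optionally, you can check workflow.xml with the Oozie CLI before deploying it; depending on the Oozie version, the file is validated locally against the bundled schemas or by the Oozie server.

    $ oozie validate workflow.xml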
    

    coordinator.xml contains the details of the coordinator that schedules the workflow; here it is set up to run once a day. It uses the same XML format as the workflow. Set the start time as an option when running the Oozie CLI.

    <coordinator-app name="exampleCoordinatorJob" frequency="${coord:days(1)}" start="${start}" end="9999-12-31T23:59+0900" timezone="Asia/Seoul"
    		xmlns="uri:oozie:coordinator:0.4">
    	<controls>
    		<timeout>10</timeout>
    		<concurrency>1</concurrency>
    	</controls>
    	<action>
    		<workflow>
    			<app-path>${workflowDir}</app-path>
    		</workflow>
    	</action>
    </coordinator-app>
    

    3. Build and deploy directory

    1. Build the directory as follows (example).

      hdfs://koya/user/example/myproject/oozie # User's Oozie job directory
      ├── workflow
      │   ├── spark_submit_action.sh # Bash script to be used as the workflow shell action
      │   ├── sample-analyzer-1-assembly-0.1.jar # JAR to run in spark-submit
      │   └── workflow.xml
      ├── coordinator
      │   └── coordinator.xml 
      ├── spark2.tar.gz # JAR archive to run Spark
      ├── hadoop_conf.tar.gz # Hadoop configuration archive
      └── spark_conf.tar.gz # Spark configuration archive
      
    2. Create a directory to be used as the Oozie project home.

      $ hadoop fs -mkdir -p /user/example/myproject/oozie/workflow
      $ hadoop fs -mkdir -p /user/example/myproject/oozie/coordinator
      
    3. Archive the Spark package.

      $ tar cvfz spark2.tar.gz -C /usr/hdp/3.1.0.0-78/ spark2
      
    4. Archive the Spark and Hadoop configurations. The default configurations must be modified before archiving.

      # Users cannot modify the files under /etc/hadoop/conf and /etc/spark2/conf directly.
      # Make a copy of the files to edit.
      $ mkdir hadoop_conf
      $ cp -R /etc/hadoop/conf ./hadoop_conf
      $ mkdir spark_conf
      $ cp -R /etc/spark2/conf ./spark_conf
      
    5. Comment out the following line in ./spark_conf/conf/spark-env.sh.
      Since HADOOP_CONF_DIR must point to a path in the distributed cache, the path referenced below may not be found at runtime.

      # export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/3.1.0.0-78/hadoop/conf}
      
    6. Archive the configuration files once the editing is completed.

      $ tar cvfz hadoop_conf.tar.gz -C ./hadoop_conf conf
      $ tar cvfz spark_conf.tar.gz -C ./spark_conf conf
      
    7. Upload the archived packages and all files required to run the workflow to HDFS. You can verify the upload as shown below.

      $ hadoop fs -copyFromLocal -f spark2.tar.gz hadoop_conf.tar.gz \
      spark_conf.tar.gz df.test01.keytab /user/example/myproject/oozie
      
      $ hadoop fs -copyFromLocal -f workflow.xml spark_submit_action.sh sample-analyzer-1-assembly-0.1.jar \
       /user/example/myproject/oozie/workflow
      
       $ hadoop fs -copyFromLocal -f  coordinator.xml  /user/example/myproject/oozie/coordinator
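
      You can check the result of the upload by listing everything under the Oozie job directory used above.

      $ hadoop fs -ls -R /user/example/myproject/oozie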
      

    4. Run workflow

    The Dev app has the OOZIE_URL environment variable already set up.

    $ oozie job -oozie $OOZIE_URL -config job.properties -Dstart=`TZ="Asia/Seoul" date "+%Y-%m-%dT%H:00"`+"0900" -run
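
    You can also check the submitted jobs from the Dev app with the Oozie CLI. The job ID below is a placeholder; use the ID printed when the coordinator was submitted.

    # List recently submitted coordinator jobs
    $ oozie jobs -oozie $OOZIE_URL -jobtype coordinator -len 10
    # Show the status of a specific coordinator or workflow job
    $ oozie job -oozie $OOZIE_URL -info {job ID}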
    

    Access OOZIE_URL in a browser to view the workflow and coordinator you've just run. You can also access the Resource Manager to see the Spark application launched by the spark-submit shell action defined in the workflow.
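    If you prefer the command line, you can also list the running application from the Dev app with the YARN CLI; it appears under the name passed to spark-submit via --name.

    # List running YARN applications
    $ yarn application -list -appStates RUNNING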


    When the Spark application has ended, you can see the application's execution history in the [History Server].

    Step 4. Check data in Kibana app

    You will be able to find indices of the form ecomm_data.order.1d.${date} in the Kibana app.
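    As an optional check outside Kibana, you can also list the result indices with the Elasticsearch _cat API from the Dev app, as in Step 1; use the elasticsearch.hosts.inside-of-cluster address from Quick links.

    # List the indices produced by the batch job
    $ curl "{enter your elasticsearch.hosts.inside-of-cluster address}/_cat/indices/ecomm_data.order.1d.*?v"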

