Registering Spark batch jobs to Oozie scheduler

    Available in VPC

    This guide explains how to register Spark batch jobs to the Apache Oozie scheduler in Data Forest. The source data for this batch application is stored in Elasticsearch.


    Step 1. Preparations

    1. Create Data Forest account

    • Create an account called example.
    • Download the Kerberos keytab in advance because it is required to submit jobs. (You can download it from Cluster access information > Download Kerberos Keytab.) A quick check of the keytab is shown below.
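    To confirm the keytab is valid, you can authenticate with it and check the ticket from any node that has Kerberos tools installed (for example, the Dev app created below). The placeholders follow the same format used later in this guide.

      # Authenticate with the downloaded keytab, then confirm that a ticket was issued
      $ kinit -kt {enter your keytab} {user name}@KR.DF.NAVERNCP.COM
      $ klist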

    2. Create Data Forest app

    • Elasticsearch: It's the storage where raw data (logs, etc.) is accumulated.
    • Kibana: It's used to search and visualize indices. The Elasticsearch app must be created before the Kibana app; select the name of the Elasticsearch app created earlier in App linkage information.
    • Dev: It's used as a client to submit workflows to an Oozie server.

    3. Inject sample data

    • This example uses the Sample eCommerce orders sample data provided by the Kibana app as raw data, rather than having a separate data collection module.
    • Create the sample index by clicking [Load a data set and a Kibana dashboard] > [Sample eCommerce orders] in the Kibana UI. You can check that the index was created as shown below.
    • Kibana's address can be viewed in the Kibana app's Quick links.
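    As an optional check (authentication settings for your Elasticsearch app may require additional curl options), you can confirm the index from the Dev app with the Elasticsearch _cat API, using the elasticsearch.hosts.inside-of-cluster address from the Elasticsearch app's Quick links.

      # List the sample index loaded by Kibana
      $ curl "{enter your elasticsearch.hosts.inside-of-cluster address}/_cat/indices/kibana_sample_data_ecommerce?v"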

    Step 2. Build Spark app

    You have to build a Spark app to run. The Spark app used in this example refines the kibana_sample_data_ecommerce index saved in Elasticsearch so that it can be used as an index again. The job imports the data from the 24 hours before the point of execution, processes it, and then saves the result to Elasticsearch as a new index. You can also download the already built JAR file and use it.
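    How you build the JAR depends on your project. As a hypothetical sketch only (the build tool is not specified in this guide), a Scala project using the sbt-assembly plugin could produce the fat JAR as follows.

    # Hypothetical build, assuming an sbt project with the sbt-assembly plugin
    $ sbt clean assembly
    # The resulting fat JAR (e.g., sample-analyzer-1-assembly-0.1.jar) is the file
    # deployed to HDFS in Step 3.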

    Step 3. Run workflow with Oozie coordinator

    All jobs are executed in the Dev app. All client configurations required to use Data Forest are built into the Dev app.

    Oozie is a scheduler system that manages workflows of Apache Hadoop jobs such as MapReduce. A workflow is a Directed Acyclic Graph (DAG) consisting of flow nodes (start, end, decision, etc.) and action nodes (mr, pig, shell, etc.). By adding multiple batch applications, together with the scripts required before and after them, to a workflow, you can see the flow at a glance and monitor it.

    As an example, we'll create a very simple workflow and run it with a coordinator so that it runs once a day.


    1. Write shell action script

    This is an example of running spark-submit by adding a shell action to the workflow.
    Write a shell script to run. The SPARK_JAR built earlier takes the Elasticsearch node address as an argument, so find elasticsearch.hosts.inside-of-cluster in Quick links and use it.

    Caution

    You should use the address specified in elasticsearch.hosts when accessing the Elasticsearch app from outside of the cluster.

    spark_submit_action.sh

    #!/bin/sh 
    CLASS=$1
    SPARK_JAR=$2
    QUEUE=$3
    
    export HDP_VERSION=3.1.0.0-78
    
    # Authenticate Kerberos with keytab
    kinit -kt ./{enter your keytab} {user name}@KR.DF.NAVERNCP.COM
    
    # Spark configuration path
    export SPARK_HOME=./spark2.tar.gz/spark2
    export SPARK_CONF_DIR=./spark_conf.tar.gz/conf
    
    # Hadoop configuration path
    export HADOOP_CONF_DIR=./hadoop_conf.tar.gz/conf
    
    # Submit Spark job
    ${SPARK_HOME}/bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --queue $QUEUE \
    --num-executors 2 \
    --executor-memory 1G \
    --driver-memory 1G \
    --executor-cores 1 \
    --name $CLASS \
    --class $CLASS \
    $SPARK_JAR "{enter your elasticsearch.hosts.inside-of-cluster address}"
    

    Check the paths of SPARK_HOME, SPARK_CONF_DIR, and HADOOP_CONF_DIR in the script above. All files used in the script are referenced by the paths at which they are uploaded to the distributed cache. Data Forest does not provide a Spark package, so you have to archive the package and its configuration with tar and upload them to the distributed cache. Refer to Build and deploy directory for how to upload them.

    2. Write job.properties, workflow.xml, and coordinator.xml

    Define the parameters to be used in the XML files in job.properties. This file does not need to be deployed separately; it only needs to exist on the machine where you run the Oozie CLI.

    # cluster configuration 
    nameNode=hdfs://koya
    jobTracker=rm1
    
    # job directory information
    homeDir=${nameNode}/user/{user name}/myproject/oozie
    workflowDir=${homeDir}/workflow
    
    # oozie configuration
    # this is where you deploy coordinator.xml into
    oozie.coord.application.path=${homeDir}/coordinator
    
    # user information
    user.name={user name}
    
    # job configuration
    class=ecomm.BatchProducerRunnerSpark
    queueName=longlived
    shellActionScript=spark_submit_action.sh
    sparkJar=sample-analyzer-1-assembly-0.1.jar
    

    workflow.xml can define a shell action as shown below. You can see that the parameters defined in job.properties are used.

    <workflow-app name="exampleWorkflowJob" xmlns="uri:oozie:workflow:0.5">
    	<start to="exampleShellAction"/>
    
    	<action name="exampleShellAction">
    		<shell xmlns="uri:oozie:shell-action:0.1">
    			<job-tracker>${jobTracker}</job-tracker>
    			<name-node>${nameNode}</name-node>
    
    			<exec>./action.sh</exec>
          <argument>${class}</argument>
    			<argument>${sparkJar}</argument>
    			<argument>${queueName}</argument>
    
    			<file>${workflowDir}/${shellActionScript}#action.sh</file>
    			<file>${workflowDir}/${sparkJar}</file>
          		<file>${homeDir}/{enter your keytab}</file>
    			<archive>${homeDir}/hadoop_conf.tar.gz</archive>
    			<archive>${homeDir}/spark_conf.tar.gz</archive>
    			<archive>${homeDir}/spark2.tar.gz</archive>
    
    			<capture-output/>
    		</shell>
    		<ok to="end"/>
    		<error to="fail"/>
    	</action>
    
    	<kill name="fail">
    		<message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    	</kill>
    	<end name="end"/>
    </workflow-app>
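
    Optionally, you can check workflow.xml with the Oozie CLI before deploying it; depending on the Oozie version, the file is validated locally against the bundled schemas or by the Oozie server.

    $ oozie validate workflow.xml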
    

    coordinator.xml contains the details of the coordinator that schedules the workflow; here it is set up to run once a day. It uses the same XML format as the workflow. Set the start time as an option when running the Oozie CLI.

    <coordinator-app name="exampleCoordinatorJob" frequency="${coord:days(1)}" start="${start}" end="9999-12-31T23:59+0900" timezone="Asia/Seoul"
    		xmlns="uri:oozie:coordinator:0.4">
    	<controls>
    		<timeout>10</timeout>
    		<concurrency>1</concurrency>
    	</controls>
    	<action>
    		<workflow>
    			<app-path>${workflowDir}</app-path>
    		</workflow>
    	</action>
    </coordinator-app>
    

    3. Build and deploy directory

    1. Build the directory as follows (example).

      hdfs://koya/user/example/myproject/oozie # User's Oozie job directory
      ├── workflow
      │   ├── spark_submit_action.sh # Bash script to be used as the workflow shell action
      │   ├── sample-analyzer-1-assembly-0.1.jar # JAR to run in spark-submit
      │   └── workflow.xml
      ├── coordinator
      │   └── coordinator.xml 
      ├── spark2.tar.gz # JAR archive to run Spark
      ├── hadoop_conf.tar.gz # Hadoop configuration archive
      └── spark_conf.tar.gz # Spark configuration archive
      
    2. Create a directory to be used as the Oozie project home.

      $ hadoop fs -mkdir -p /user/example/myproject/oozie/workflow
      $ hadoop fs -mkdir -p /user/example/myproject/oozie/coordinator
      
    3. Archive the Spark package.

      $ tar cvfz spark2.tar.gz -C /usr/hdp/3.1.0.0-78/ spark2
      
    4. Archive the Spark and Hadoop configurations. The default configurations must be modified before archiving.

      # Users cannot modify the files under /etc/hadoop/conf and /etc/spark2/conf directly.
      # Make a copy of the files to edit.
      $ mkdir hadoop_conf
      $ cp -R /etc/hadoop/conf ./hadoop_conf
      $ mkdir spark_conf
      $ cp -R /etc/spark2/conf ./spark_conf
      
    5. Comment out the following line in ./spark_conf/conf/spark-env.sh.
      Since HADOOP_CONF_DIR must point to a path in the distributed cache, the path referenced below may not be found at runtime.

      # export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/3.1.0.0-78/hadoop/conf}
      
    6. Archive the configuration files once the editing is completed.

      $ tar cvfz hadoop_conf.tar.gz -C ./hadoop_conf conf
      $ tar cvfz spark_conf.tar.gz -C ./spark_conf conf
      
    7. Upload the archived packages and all files required to run the workflow to HDFS. You can verify the upload as shown below.

      $ hadoop fs -copyFromLocal -f spark2.tar.gz hadoop_conf.tar.gz \
      spark_conf.tar.gz df.test01.keytab /user/example/myproject/oozie
      
      $ hadoop fs -copyFromLocal -f workflow.xml spark_submit_action.sh sample-analyzer-1-assembly-0.1.jar \
       /user/example/myproject/oozie/workflow
      
       $ hadoop fs -copyFromLocal -f  coordinator.xml  /user/example/myproject/oozie/coordinator
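
      You can check the result of the upload by listing everything under the Oozie job directory used above.

      $ hadoop fs -ls -R /user/example/myproject/oozie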
      

    4. Run workflow

    The Dev app has the OOZIE_URL environment variable already set up.

    $ oozie job -oozie $OOZIE_URL -config job.properties -Dstart=`TZ="Asia/Seoul" date "+%Y-%m-%dT%H:00"`+"0900" -run
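
    You can also check the submitted jobs from the Dev app with the Oozie CLI. The job ID below is a placeholder; use the ID printed when the coordinator was submitted.

    # List recently submitted coordinator jobs
    $ oozie jobs -oozie $OOZIE_URL -jobtype coordinator -len 10
    # Show the status of a specific coordinator or workflow job
    $ oozie job -oozie $OOZIE_URL -info {job ID}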
    

    Access OOZIE_URL in a browser to view the workflow and coordinator you've just run. You can also access the Resource Manager to see the Spark application launched by the spark-submit shell action defined in the workflow.
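    If you prefer the command line, you can also list the running application from the Dev app with the YARN CLI; it appears under the name passed to spark-submit via --name.

    # List running YARN applications
    $ yarn application -list -appStates RUNNING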


    When the Spark application has ended, you can see the application's execution history in the [History Server].

    Step 4. Check data in Kibana app

    You will be able to find indices of the form ecomm_data.order.1d.${date} in the Kibana app.
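    As an optional check outside Kibana, you can also list the result indices with the Elasticsearch _cat API from the Dev app, as in Step 1; use the elasticsearch.hosts.inside-of-cluster address from Quick links.

    # List the indices produced by the batch job
    $ curl "{enter your elasticsearch.hosts.inside-of-cluster address}/_cat/indices/ecomm_data.order.1d.*?v"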

