Submitting Spark Scala job

This feature is available in a VPC environment.

This guide explains how to create Spark Scala jobs and submit them to a Cloud Hadoop cluster.

Write Scala code and compile

You can write a Spark application in Scala and package it into a JAR file in the following two ways.

  1. Use Scala in terminal
  2. Use IntelliJ SBT plugin

1. Use Scala in terminal

In this example, we'll write a simple Scala program that prints Hello, world! from the terminal and compile it into a .jar package.

Download Scala binary file

Download the Scala binary and unzip it.

If you're using Homebrew on macOS, you can install it by running the following command.

brew install scala

Configure environment variables

Use the following commands in your shell configuration file (e.g., .bashrc) to set the SCALA_HOME environment variable and add $SCALA_HOME/bin to the PATH.

export SCALA_HOME=/usr/local/share/scala
export PATH=$PATH:$SCALA_HOME/bin
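
You can verify the installation and the configured PATH by checking the Scala version; the exact output depends on the installed version.

❯ scala -version
# Scala code runner version ... -- Copyright ..., LAMP/EPFL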

Compose Spark application

The following describes how to compose a Spark application.

  1. Start the Scala REPL by running the scala command.
❯ scala
# Welcome to Scala version ...
# Type in expressions to have them evaluated.
# Type :help for more information.
# scala>
  2. Define the HelloWorld object as follows, then save the session and quit the REPL.
object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}
scala> :save HelloWorld.scala
scala> :q
  3. Compile it with scalac by running the following command.
❯ scalac HelloWorld.scala
  4. Use the ls command to check that the .class files were created and the compilation succeeded.
❯ ls HelloWorld*.class
HelloWorld$.class HelloWorld.class
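
Before packaging, you can optionally run the compiled class directly with the scala command to confirm that it works:

❯ scala HelloWorld
Hello, world!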

Create JAR file

You can create JAR files as follows.

Note

To execute the jar command, a Java SE Development Kit (JDK) must be installed.

  1. Go to the directory containing the HelloWorld*.class files, and use the following command to package the class files into a .jar file.
❯ jar cvfe HelloWorld.jar HelloWorld HelloWorld*.class
added manifest
adding: HelloWorld$.class(in = 670) (out= 432)(deflated 35%)
adding: HelloWorld.class(in = 645) (out= 524)(deflated 18%)
  2. Check MANIFEST.MF in the packaged JAR file to confirm that the HelloWorld class is set as the application's entry point.
❯ unzip -q -c HelloWorld.jar META-INF/MANIFEST.MF
Manifest-Version: 1.0
Created-By: 1.8.0_181 (Oracle Corporation)
Main-Class: HelloWorld # entry point
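
Since Main-Class is set, you can also run the packaged JAR directly to verify it:

❯ java -jar HelloWorld.jar
Hello, world!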

2. Use IntelliJ SBT plugin

This guide explains how to configure an environment for developing and debugging Spark applications in IntelliJ, building a Word Count job as an example.

  • Build manager: SBT
  • Example composition environment: macOS Mojave, IntelliJ Ultimate 2018.2.5

Create project

The following shows how to create a project.

  1. Start IntelliJ, and then click Configure > Plugins > Browse Repositories in order.

    chadoop-4-6-001_en.png

  2. Search for Scala on the Browse Repositories page and install it, and then restart IntelliJ to apply the plugin.

    chadoop-4-6-002_en.png

  3. Click Create New Project at the home screen.

    chadoop-4-6-003_en.png

  4. Select Scala > sbt, and then click the [Next] button.

    chadoop-4-6-004_en.png

  5. Set the creation information, and then click the [Finish] button.

    • Project name: Specify as WordCount
    • Select the versions for sbt and Scala

    chadoop-4-6-005_en.png

  6. Check if the project has been created successfully.

  • Once a project is created, the following directories and files are generated by default.

    • .idea: IntelliJ configuration files
    • project: files used for compiling
    • src: source code. Most of the application code should be in src/main; src/test is for test scripts.
    • target: the compiled project is saved in this location
    • build.sbt: SBT configuration file

    chadoop-4-6-006_en.png

Import SBT library

For IntelliJ to recognize Spark code, the spark-core library and its documentation must be imported.

Note
  • Each spark-core version is compatible with a specific Scala version, so check both the spark-core and Scala versions when importing the library.
  1. Check the Artifact ID of the spark-core library and its compatible Scala version in the mvn repository.

    chadoop-4-6-008_en.png

  2. Open build.sbt, and then add the following line to the editor window.

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
  3. Check the Build console to confirm that the library was imported successfully.

    chadoop-4-6-007_en.png

Note

When importing a library in SBT, use the following syntax.

groupId %% artifactId % revision
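
The %% operator automatically appends the project's Scala binary version to the artifact ID. For example, with Scala 2.11, the two declarations below are equivalent:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.6.0"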

Compose Spark application

Here, a text file containing Shakespeare's sonnets (shakespeare.txt) is used as the dataset, and we explain how to create an application that counts the words it contains.

  1. Download shakespeare.txt and save in src/main/resources.

    • When running this application on a Cloud Hadoop cluster, upload the dataset to an Object Storage bucket or HDFS.

    chadoop-4-6-009_en.png

  2. Select src > main to expand the directory, right-click the scala directory, and then click New > Scala Class.

  3. Create a class under WordCount/src/main/scala.

    • Kind: Object

    chadoop-4-6-010_en.png

  4. Write the sample code below in WordCount.scala and run it (Ctrl + Shift + R) to check that the configuration is set up correctly.

object WordCount {
  def main(args: Array[String]): Unit = {
    println("This is WordCount application")
  }
}
  5. Check that the message below is printed.

chadoop-4-6-011_en.png

  6. Delete the sample code in WordCount.scala, and write the code below to count the words in the Shakespeare sonnet text file.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {

  def main(args: Array[String]): Unit = {

    // Create a SparkContext to initialize Spark
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("Word Count")
    val sc = new SparkContext(conf)

    // Load the text into a Spark RDD, a distributed representation of each line of text
    val textFile = sc.textFile("src/main/resources/shakespeare.txt")

    // Count the words
    val counts = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.foreach(println)
    println("Total words: " + counts.count())
    counts.saveAsTextFile("/tmp/sonnetWordCount")
  }

}
Note

Master URLs
Master URLs differ depending on the Spark deployment environment.

  • Local (pseudo-cluster): local, local[N], local[*] (the value determines the number of threads used; "*" means as many threads as there are processors available to the JVM)

  • Clustered
    Spark Standalone: spark://host:port,host1:port1...
    Spark on Hadoop YARN: yarn
    Spark on Apache Mesos: mesos://
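
One common alternative, sketched below, is to omit setMaster() from the code and pass the master with the --master option of spark-submit instead; the same JAR then runs both locally and on the cluster. This is a variation on the example above, not part of the original code.

val conf = new SparkConf().setAppName("Word Count")
// No setMaster() here; supply it at submit time instead, e.g.:
//   spark-submit --master local[*] ...      (local test)
//   spark-submit --master yarn-cluster ...  (Spark 1.6 on YARN)
val sc = new SparkContext(conf)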

  7. Run WordCount.scala (Ctrl + Shift + R) and check the output.

    chadoop-4-6-012_en.png

Create JAR file

  1. Upload the dataset to an Object Storage bucket, and change the resource file paths in the source code as follows.
    • If you want to upload the dataset to HDFS and use it from there, use hdfs:// instead of s3a://.
// FROM
conf.setMaster("local")

// TO
conf.setMaster("yarn-cluster")

// FROM
val textFile = sc.textFile("src/main/resources/shakespeare.txt")
// TO
val textFile = sc.textFile("s3a://deepdrive-hue/tmp/shakespeare.txt")

// FROM
counts.saveAsTextFile("/tmp/sonnetWordCount")

// TO
counts.saveAsTextFile("s3a://deepdrive-hue/tmp/sonnetWordCount")
Note

As this example is based on Spark 1.6, you should specify yarn-cluster in conf.setMaster(). You can use yarn in Spark 2 or later.

  2. Go to the project home, and use the following command to package the updated code into a compiled JAR file that can be submitted to a Cloud Hadoop cluster.
    • The JAR file contains the application code and all the dependencies defined in build.sbt.
    • The sbt package command creates wordcount_2.11-0.1.jar under $PROJECT_HOME/target/scala-2.11.
cd ~/IdeaProjects/WordCount # PROJECT_HOME
sbt package
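
For reference, a minimal build.sbt consistent with this example might look like the following; the version values are assumptions and should match what you selected when creating the project.

name := "WordCount"
version := "0.1"
scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"

sbt package lowercases the project name for the artifact, which is why the output file is named wordcount_2.11-0.1.jar.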

Submit Spark job to Cloud Hadoop cluster

This section explains how to deploy and submit a Spark application (.jar) created on your local machine to the Cloud Hadoop cluster.

Upload JAR to Object Storage

Copy shakespeare.txt and the .jar file to the Object Storage bucket using HUE's S3 browser or the Object Storage console.

  • Please refer to Using HUE for more information about accessing and using HUE.
  • Please refer to Object Storage Guide for more information about Object Storage buckets.

chadoop-4-6-013_en.png

Submit job

This section explains two ways to submit the JAR file to a cluster.

Note

The following properties should be properly configured in spark-defaults.conf.

spark.hadoop.fs.s3a.access.key=OBJECT-STORAGE-ACCESS-KEY
spark.hadoop.fs.s3a.endpoint=http://kr.objectstorage.ncloud.com
spark.hadoop.fs.s3a.secret.key=OBJECT-STORAGE-SECRET-KEY
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
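
If you prefer not to edit spark-defaults.conf, the same properties can also be passed per job with the --conf option of spark-submit; a sketch based on this guide's example values:

spark-submit --class WordCount --master yarn-cluster \
  --conf spark.hadoop.fs.s3a.access.key=OBJECT-STORAGE-ACCESS-KEY \
  --conf spark.hadoop.fs.s3a.secret.key=OBJECT-STORAGE-SECRET-KEY \
  --conf spark.hadoop.fs.s3a.endpoint=http://kr.objectstorage.ncloud.com \
  s3a://deepdrive-hue/tmp/wordcount_2.11-0.1.jar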
  • Use HUE's Spark Submit Jar

chadoop-4-6-014_en.png

  • Submit from the Spark client node
  1. Execute the spark-submit command from a cluster node where the Spark client is installed, as shown below (commands for monitoring the job are shown after this list).
spark-submit --class WordCount --master yarn-cluster --deploy-mode cluster s3a://deepdrive-hue/tmp/wordcount_2.11-0.1.jar
  2. When the job is completed, check that the results are stored in the specified bucket path, as shown below.

chadoop-4-6-015_en.png
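
While the job is running or after it finishes, you can check its status and logs from a cluster node with the standard YARN CLI; the application ID is printed in the spark-submit output, and the value below is a placeholder.

yarn application -list
yarn logs -applicationId <application-id>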

Note

Deploy mode determines where the driver (SparkContext) runs in the deployment environment. It has the following options:

  • client (default): the driver runs on the machine where the Spark application is submitted
  • cluster: the driver runs on an arbitrary node in the cluster

You can set the mode with the --deploy-mode CLI option of the spark-submit command, or with the spark.submit.deployMode property in the Spark configuration.
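
For example, both commands below request cluster mode (using the yarn master form; this guide's Spark 1.6 examples use the equivalent yarn-cluster alias):

spark-submit --master yarn --deploy-mode cluster --class WordCount s3a://deepdrive-hue/tmp/wordcount_2.11-0.1.jar
spark-submit --master yarn --conf spark.submit.deployMode=cluster --class WordCount s3a://deepdrive-hue/tmp/wordcount_2.11-0.1.jar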
