Submitting Spark Scala job
Available in VPC
This guide explains how to create Spark Scala jobs and submit them to a Cloud Hadoop cluster.
Write Scala code and compile
You can write a Spark application in Scala and package it into a JAR file in the following two ways.
1. Use Scala in terminal
In this example, we'll write a simple Scala program that prints Hello, world! in the terminal and compile it into a .jar package.
Download Scala binary file
Download the Scala binary and unzip it.
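For example, on Linux you can download and extract a binary release as follows (the version and install path here are assumptions; pick the release matching your Spark build):
❯ wget https://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.tgz
❯ tar -xzf scala-2.11.12.tgz
❯ sudo mv scala-2.11.12 /usr/local/share/scala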
If you're using Homebrew on macOS, then you can install it by executing the following command.
brew install scala
Configure environment variables
Set the SCALA_HOME environment variable in your shell startup file (e.g., .bashrc), and add $SCALA_HOME/bin to the PATH.
export SCALA_HOME=/usr/local/share/scala
export PATH=$PATH:$SCALA_HOME/bin
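To apply the change to the current shell and confirm the installation, you can run the following (a quick check, assuming a Bash shell):
❯ source ~/.bashrc
❯ scala -version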
Compose Spark application
The following describes how to compose a Spark application.
- Run the Scala REPL by executing the scala command.
❯ scala
# Welcome to Scala version ...
# Type in expressions to have them evaluated.
# Type :help for more information.
# scala>
- Write the HelloWorld object and save it as HelloWorld.scala as follows.
object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}
scala> :save HelloWorld.scala
scala> :q
- Compile it with scalac by running the following command.
❯ scalac HelloWorld.scala
- Use the ls command to check that the .class files were generated and the compilation succeeded.
❯ ls HelloWorld*.class
HelloWorld$.class HelloWorld.class
Create JAR file
You can create JAR files as follows.
To execute the jar command, the Java SE Development Kit (JDK) must be installed.
- Go to the directory containing the HelloWorld*.class files, and use the following command to package the class files into a .jar file.
❯ jar cvfe HelloWorld.jar HelloWorld HelloWorld*.class  # c: create, v: verbose, f: output file name, e: entry point
added manifest
adding: HelloWorld$.class(in = 670) (out= 432)(deflated 35%)
adding: HelloWorld.class(in = 645) (out= 524)(deflated 18%)
- In the packaged JAR file, check that the HelloWorld class is set as the application entry point in MANIFEST.MF.
❯ unzip -q -c HelloWorld.jar META-INF/MANIFEST.MF
Manifest-Version: 1.0
Created-By: 1.8.0_181 (Oracle Corporation)
Main-Class: HelloWorld # entry point
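Since the entry point is set, you can verify the packaged JAR locally before submitting it (a quick check, assuming a local Java runtime):
❯ java -jar HelloWorld.jar
Hello, world!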
2. Use IntelliJ SBT plugin
This guide explains how to configure an environment for developing and debugging Spark applications in IntelliJ, using a WordCount job as an example.
- Build manager: SBT
- Example environment: Windows, IntelliJ IDEA Ultimate 2022.1.4
Create project
Here's how to create a project:
1. Run IntelliJ.
2. Search for Scala in the Plugins menu on the left and install it.
3. A restart is required for the plugin to take effect. Click the [Restart IDE] button to restart IntelliJ.
4. Click Projects in the menu on the left side of the home screen, and then click New Project.
5. Select Scala and sbt, and then click the [Create] button.
- Project name: specify as WordCount
- sbt and Scala: select the versions to use
Check if the project has been created successfully.
- Once a project is created, directories and files in the following structure can be seen by default.
  - .idea: IntelliJ configuration files
  - project: files used for compiling
  - src: source code; most of the application code should be in src/main, and src/test is a space for test scripts
  - target: the compiled project is saved in this location
  - build.sbt: SBT configuration file
Import SBT library
For IntelliJ to recognize Spark code, the spark-core library and documentation must be imported.
- Each spark-core release is compatible with a specific Scala version, so check both the spark-core and Scala versions when importing the library.
Check the Scala version compatible with the spark-core library, along with its artifact ID, in the Maven repository.
In the project pane, click build.sbt, and then add the following to the script window.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
- Check the Build console to confirm if the library is successfully imported.
When importing a library in SBT, use the following syntax.
Group Id %% Artifact Id % Revision
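The %% operator appends the project's Scala binary version to the artifact ID automatically. For example, with Scala 2.11 the two declarations below are equivalent (versions shown are illustrative):
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.6.0"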
Compose Spark application
Here, a text file of Shakespeare's sonnets (shakespeare.txt) is used as the dataset, and this section explains how to create an application that counts the words in it.
Download shakespeare.txt and save it in src/main/resources.
- When running this application in a Cloud Hadoop cluster, upload the dataset to an S3 bucket or HDFS.
Select src > main to expand the directory, right-click the scala directory, and then click New > Scala Class. Create a class under WordCount/src/main/scala.
- Kind: Object
Write the sample code below in WordCount.scala and run it to check if the configuration has been set up successfully.
object WordCount {
  def main(args: Array[String]): Unit = {
    println("This is WordCount application")
  }
}
Check that the message This is WordCount application is printed to the console.
Delete the sample code in WordCount.scala, and write the code below to count the words in the Shakespeare sonnet text file.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Create a SparkContext to initialize Spark
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("Word Count")
    val sc = new SparkContext(conf)

    // Load the text into a Spark RDD, which is a distributed representation of each line of text
    val textFile = sc.textFile("src/main/resources/shakespeare.txt")

    // Count the words
    val counts = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.foreach(println)
    println("Total words: " + counts.count())
    counts.saveAsTextFile("/tmp/sonnetWordCount")
  }
}
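Note that saveAsTextFile writes a directory containing one part file per RDD partition rather than a single file. After a local run you would see something like the following (illustrative listing):
❯ ls /tmp/sonnetWordCount
_SUCCESS  part-00000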
Master URLs
Master URLs differ depending on the Spark deployment environment.
- Local (pseudo-cluster): local, local[N], local[*] (depends on the number of threads used; * means as many threads as the number of processors available to the JVM)
- Clustered
  - Spark Standalone: spark://host:port,host1:port1...
  - Spark on Hadoop YARN: yarn
  - Spark on Apache Mesos: mesos://
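The master URL is what you pass to setMaster in the code above. For example (host names here are placeholders):
// Local mode, using all available cores
conf.setMaster("local[*]")
// Spark Standalone cluster
conf.setMaster("spark://master-host:7077")
// Spark on Hadoop YARN (Spark 2 or later)
conf.setMaster("yarn")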
- Run WordCount.scala and check the output result.
Create JAR file
- Upload the dataset to an Object Storage bucket, and change the resource file path in the source code as follows.
  - If you want to upload the dataset to HDFS and use it from there, use hdfs:// instead of s3a://.
// FROM
conf.setMaster("local")
// TO
conf.setMaster("yarn-cluster")
// FROM
val textFile = sc.textFile("src/main/resources/shakespeare.txt")
// TO
val textFile = sc.textFile("s3a://deepdrive-hue/tmp/shakespeare.txt")
// FROM
counts.saveAsTextFile("/tmp/sonnetWordCount");
// TO
counts.saveAsTextFile("s3a://deepdrive-hue/tmp/sonnetWordCount");
As this example is based on Spark 1.6, you should specify yarn-cluster in conf.setMaster(). In Spark 2 or later, you can use yarn.
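For reference, a minimal build.sbt for this project might look like the following sketch (the name, version, and Scala version are assumptions consistent with the wordcount_2.11-0.1.jar artifact mentioned below):
name := "WordCount"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"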
- Use the following command in the Terminal console to package the updated code into a compiled jar so that you can submit it to your Cloud Hadoop cluster.
- The JAR file contains the application code. Note that sbt package does not bundle the dependencies declared in build.sbt; the spark-core library is provided by the cluster at runtime.
- The sbt package command creates wordcount_2.11-0.1.jar under $PROJECT_HOME/target/scala-2.11.
> cd ~/IdeaProjects/WordCount # PROJECT_HOME
> sbt package
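You can list the contents of the packaged JAR to confirm the build (a quick check; the path follows the sbt output described above):
> jar tf target/scala-2.11/wordcount_2.11-0.1.jar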
Submit Spark job to Cloud Hadoop cluster
This section explains how to deploy and submit a Spark application (.jar) created on your local machine to the Cloud Hadoop cluster.
Upload JAR to Object Storage
Copy shakespeare.txt and the .jar file to the Object Storage bucket using HUE's S3 browser or the Object Storage console.
- Please refer to Using HUE for more information about accessing and using HUE.
- Please refer to Object Storage Guide for more information about Object Storage buckets.
Submit job
Two ways to submit JAR files to a cluster are explained.
The following properties should be properly configured in spark-defaults.conf.
spark.hadoop.fs.s3a.access.key=<OBJECT-STORAGE-ACCESS-KEY>
spark.hadoop.fs.s3a.endpoint=http://kr.objectstorage.ncloud.com
spark.hadoop.fs.s3a.secret.key=<OBJECT-STORAGE-SECRET-KEY>
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
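If you cannot edit spark-defaults.conf, the same properties can instead be passed per job with the --conf option of spark-submit (a sketch; replace the placeholder keys with your own credentials):
spark-submit --conf spark.hadoop.fs.s3a.access.key=<OBJECT-STORAGE-ACCESS-KEY> --conf spark.hadoop.fs.s3a.secret.key=<OBJECT-STORAGE-SECRET-KEY> ...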
Use HUE's Spark Submit Jar
Submit from the spark client nodes
- Execute the spark-submit command from a node in the cluster where the Spark client is installed, as shown below.
spark-submit --class WordCount --master yarn-cluster --deploy-mode cluster s3a://deepdrive-hue/tmp/wordcount_2.11-0.1.jar
- When the job is completed, check if the results are stored in the specified bucket path as shown below.
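For example, you can list the output directory with the Hadoop CLI from a cluster node (assuming the s3a properties above are configured):
❯ hadoop fs -ls s3a://deepdrive-hue/tmp/sonnetWordCount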
Deploy mode determines where the driver (SparkContext) runs in the deployment environment. The mode has the following options:
- client (default): the driver is executed on the machine where the Spark application is submitted
- cluster: the driver is executed on an arbitrary node in the cluster
You can set the mode with the --deploy-mode CLI option of the spark-submit command, or with the spark.submit.deployMode property in the Spark properties configuration.
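For example, the following two invocations are equivalent ways of requesting cluster mode (a sketch; the JAR path is the one used above):
spark-submit --class WordCount --master yarn --deploy-mode cluster s3a://deepdrive-hue/tmp/wordcount_2.11-0.1.jar
spark-submit --class WordCount --master yarn --conf spark.submit.deployMode=cluster s3a://deepdrive-hue/tmp/wordcount_2.11-0.1.jar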