Submitting Spark Scala job
    Available in VPC

    This guide explains how to create Spark Scala jobs and submit them to a Cloud Hadoop cluster.

    Write Scala code and compile

    You can write a Spark application in Scala and package it into a JAR file in the following two ways.

    1. Use Scala in terminal
    2. Use IntelliJ SBT plugin

    1. Use Scala in terminal

    In this example, we'll write simple Scala code in the terminal that prints a greeting, then compile it and package it into a .jar file.

    Download Scala binary file

    Download the Scala binary and unzip it.

    If you're using Homebrew on macOS, then you can install it by executing the following command.

    brew install scala
    

    Configure environment variables

    Set the SCALA_HOME environment variable in your shell configuration file (e.g., .bashrc) as follows, and add $SCALA_HOME/bin to the PATH.

    export SCALA_HOME=/usr/local/share/scala
    export PATH=$PATH:$SCALA_HOME/bin
    

    Compose Spark application

    The following describes how to compose a Spark application.

    1. Start the Scala REPL by running the scala command.
    ❯ scala
    # Welcome to Scala version ...
    # Type in expressions to have them evaluated.
    # Type :help for more information.
    # scala>
    
    2. Write the HelloWorld object in the REPL, and save it as HelloWorld.scala as follows.
    object HelloWorld {
    def main(args: Array[String]): Unit = {
    println("Hello, world!")
      }
    }
    
    scala> :save HelloWorld.scala
    scala> :q
    
    3. Compile it with scalac by running the following command.
    ❯ scalac  HelloWorld.scala
    
    4. Use the ls command to check that the .class files exist and the compilation succeeded.
    ❯ ls HelloWorld*.class
    HelloWorld$.class HelloWorld.class
    

    Create JAR file

    You can create JAR files as follows.

    Note

    To execute the jar command, the Java SE Development Kit (JDK) must be installed.

    1. Go to the directory containing the HelloWorld*.class files, and use the following command to package them into a .jar file.
    ❯ jar cvfe HelloWorld.jar HelloWorld HelloWorld*.class
    added manifest
    adding: HelloWorld$.class(in = 670) (out= 432)(deflated 35%)
    adding: HelloWorld.class(in = 645) (out= 524)(deflated 18%)
    
    2. In the packaged JAR file, check that the HelloWorld class is set as the entry point of the application in MANIFEST.MF.
    ❯ unzip -q -c HelloWorld.jar META-INF/MANIFEST.MF
    Manifest-Version: 1.0
    Created-By: 1.8.0_181 (Oracle Corporation)
    Main-Class: HelloWorld # entry point
    

    2. Use IntelliJ SBT plugin

    This section explains how to configure an environment for developing and debugging Spark applications in IntelliJ, and how to build an example WordCount job.

    • Build manager: SBT
    • Example composition environment: Windows OS, IntelliJ Ultimate 2022.1.4

    Create project

    Here's how to create a project:

    Run IntelliJ.

    1. Search for Scala in the Plugins menu on the left and install it.
      chadoop-4-6-002_ko.png

    2. A restart is required for the plugin to take effect. Click the [Restart IDE] button to restart IntelliJ.
      chadoop-4-6-restart_ko.png

    3. Click Projects in the menu on the left side of the home screen, then click New Project.
      chadoop-4-6-003_ko.png

    4. After selecting Scala and sbt as follows, click the [Create] button.

      • Project name: Specify as WordCount
      • Select the versions for sbt and Scala
        chadoop-4-6-004_ko.png
    5. Check if the project has been created successfully.

      • Once a project is created, directories and files in the following structure can be seen by default.
        • .idea: IntelliJ configuration file
        • project: file used for compiling
        • src: source code. Most of the application code should be in src/main. src/test is a space for test scripts.
        • target: Compiled project is saved in this location
        • build.sbt: SBT configuration file
          chadoop-4-6-006_ko.png

    Import SBT library

    For IntelliJ to recognize Spark code, the spark-core library and its documentation must be imported.

    Note
    • Each spark-core library version is built for a specific Scala version, so check both the spark-core and Scala versions when importing the library.
    1. Check the artifact ID and the Scala version compatible with the spark-core library in the Maven repository.

      chadoop-4-6-008_ko.png

    2. Open build.sbt in the project view, and then add the following in the editor window.

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
    
    3. Check the Build console to confirm that the library was imported successfully.
      chadoop-4-6-007_ko.png
    Note

    When importing a library in SBT, use the following syntax.

    Group Id %% Artifact Id % Revision
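
    For reference, a minimal build.sbt sketch is shown below. The project name, version, and exact Scala patch version are assumptions chosen to match the wordcount_2.11-0.1.jar file produced later in this guide; adjust them to your environment.

    // Minimal build.sbt sketch (assumed values; adjust to your environment).
    // With these settings, `sbt package` produces target/scala-2.11/wordcount_2.11-0.1.jar.
    name := "wordcount"
    version := "0.1"
    scalaVersion := "2.11.8"

    // "%%" appends the Scala binary version to the artifact ID, so the line below
    // is equivalent to: "org.apache.spark" % "spark-core_2.11" % "1.6.0"
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"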
    

    Compose Spark application

    Here, a text file containing Shakespeare's sonnets (shakespeare.txt) is used as the dataset, and this section explains how to create an application that counts the words in it.

    1. Download shakespeare.txt and save in src/main/resources.

      • When running this application in a Cloud Hadoop cluster, upload the dataset to an S3 bucket or HDFS.
        chadoop-4-6-009_ko.png
    2. Select src > main to expand the directory, right-click the scala directory, and then click New > Scala Class.

    3. Create a class under WordCount/src/main/scala.

      • Kind: Object
        chadoop-4-6-010_ko.png
    4. Write the sample code below in WordCount.scala and run it to check that the configuration works.

    object WordCount {
        def main(args: Array[String]): Unit = {
          println("This is WordCount application")
        }
    }
    
    5. Check that the message below is printed successfully.
      chadoop-4-6-011_ko.png

    6. Delete the sample code in WordCount.scala, and write the code below to count the words in the Shakespeare text file.

    import org.apache.spark.{SparkConf, SparkContext}
    
    object WordCount {
    
      def main(args: Array[String]) : Unit = {
    
        //Create a SparkContext to initialize Spark
        val conf = new SparkConf()
        conf.setMaster("local")
        conf.setAppName("Word Count")
        val sc = new SparkContext(conf)
    
        // Load the text into a Spark RDD, which is a distributed representation of each line of text
        val textFile = sc.textFile("src/main/resources/shakespeare.txt")
    
        //word count
        val counts = textFile.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
    
        counts.foreach(println)
        println("Total words: " + counts.count())
        counts.saveAsTextFile("/tmp/sonnetWordCount")
      }
    
    }
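
    The split on single spaces above keeps punctuation attached to words and also counts empty tokens produced by blank lines. As an optional refinement using the same RDD API, the counts definition in the code above could be replaced with the following, which lowercases the text, splits on non-word characters, and drops empty tokens:

    // Optional refinement (drop-in replacement for `val counts` above).
    val counts = textFile
      .flatMap(line => line.toLowerCase.split("\\W+"))
      .filter(word => word.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)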
    
    Note

    Master URLs
    Master URLs differ depending on the Spark deployment environment.

    • Local (pseudo-cluster): local, local[N], local[*] (the value determines the number of threads used; "*" means as many threads as there are processors available to the JVM)

    • Clustered
      Spark Standalone: spark://host:port,host1:port1...
      Spark on Hadoop YARN: yarn
      Spark on Apache Mesos: mesos://
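
    For example, only the master URL passed to setMaster needs to change between these environments; a short sketch (the standalone host and port below are placeholders):

    import org.apache.spark.SparkConf

    // Examples of master URLs with the SparkConf API used above.
    val conf = new SparkConf().setAppName("Word Count")
    conf.setMaster("local[*]")               // local pseudo-cluster, one thread per available processor
    // conf.setMaster("spark://host:7077")   // Spark Standalone (placeholder host and port)
    // conf.setMaster("yarn-cluster")        // YARN with Spark 1.6; "yarn" in Spark 2 or later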

    7. Run WordCount.scala and check the output result.
      chadoop-4-6-012_ko.png

    Create JAR file

    1. Upload the dataset to an Object Storage bucket, and change the source code's resource file path as follows.
      • If you want to upload the dataset to HDFS and use it from there, use hdfs:// instead of s3a://.
    // FROM
    conf.setMaster("local")
    // TO
    conf.setMaster("yarn-cluster")
    
    // FROM
    val textFile = sc.textFile("src/main/resources/shakespeare.txt")
    // TO
    val textFile = sc.textFile("s3a://deepdrive-hue/tmp/shakespeare.txt")
    
    // FROM
    counts.saveAsTextFile("/tmp/sonnetWordCount");
    // TO
    counts.saveAsTextFile("s3a://deepdrive-hue/tmp/sonnetWordCount");
    
    Note

    As this example is based on Spark 1.6, you should specify yarn-cluster in conf.setMaster(). You can use yarn in Spark 2 or later.
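
    For reference only, the same job written against Spark 2.x would use SparkSession instead of building a SparkConf and SparkContext directly; a minimal sketch, assuming a Spark 2.x dependency in build.sbt and the same bucket paths as above:

    import org.apache.spark.sql.SparkSession

    // Spark 2.x sketch of the same word count job (assumes a spark-sql 2.x dependency).
    object WordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("Word Count")
          .master("yarn")                      // "yarn" is sufficient in Spark 2 or later
          .getOrCreate()

        val counts = spark.sparkContext
          .textFile("s3a://deepdrive-hue/tmp/shakespeare.txt")
          .flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("s3a://deepdrive-hue/tmp/sonnetWordCount")
        spark.stop()
      }
    }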

    2. Use the following command in the terminal to package the updated code into a compiled JAR that you can submit to your Cloud Hadoop cluster.
      • The JAR file contains the application code; dependencies declared in build.sbt, such as spark-core, are provided by the cluster at runtime and are not bundled by sbt package.
      • The sbt package command creates wordcount_2.11-0.1.jar under $PROJECT_HOME/target/scala-2.11.
    > cd ~/IdeaProjects/WordCount # PROJECT_HOME
    > sbt package
    

    chadoop-4-6-terminal_ko.png

    Submit Spark job to Cloud Hadoop cluster

    This section explains how to deploy and submit a Spark application (.jar) created on your local machine to the Cloud Hadoop cluster.

    Upload JAR to Object Storage

    Copy shakespeare.txt and the .jar file to the Object Storage bucket using HUE's S3 browser or the Object Storage console.

    • Please refer to Using HUE for more information about accessing and using HUE.
    • Please refer to Object Storage Guide for more information about Object Storage buckets.
      chadoop-4-6-013_ko.png

    Submit job

    Two ways to submit JAR files to a cluster are explained.

    Note

    The following properties should be properly configured in spark-defaults.conf.

    spark.hadoop.fs.s3a.access.key=<OBJECT-STORAGE-ACCESS-KEY>
    spark.hadoop.fs.s3a.endpoint=http://kr.objectstorage.ncloud.com
    spark.hadoop.fs.s3a.secret.key=<OBJECT-STORAGE-SECRET-KEY>
    spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
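
    If editing spark-defaults.conf is not convenient, the same properties can also be set programmatically on the SparkConf before the SparkContext is created; a sketch, assuming the same property names (the access key and secret key values are placeholders):

    import org.apache.spark.SparkConf

    // Sketch: set the same S3A properties at runtime instead of in spark-defaults.conf.
    // <OBJECT-STORAGE-ACCESS-KEY> and <OBJECT-STORAGE-SECRET-KEY> are placeholders.
    val conf = new SparkConf()
      .setAppName("Word Count")
      .set("spark.hadoop.fs.s3a.access.key", "<OBJECT-STORAGE-ACCESS-KEY>")
      .set("spark.hadoop.fs.s3a.secret.key", "<OBJECT-STORAGE-SECRET-KEY>")
      .set("spark.hadoop.fs.s3a.endpoint", "http://kr.objectstorage.ncloud.com")
      .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")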
    
    • Use HUE's Spark Submit Jar
      chadoop-4-6-014_ko.png

    • Submit from the spark client nodes

    1. Execute the spark-submit command on the cluster node where the Spark client is installed, as shown below.
    spark-submit --class WordCount --master yarn-cluster --deploy-mode cluster s3a://deepdrive-hue/tmp/wordcount_2.11-0.1.jar
    
    2. When the job is completed, check that the results are stored in the specified bucket path, as shown below.
      chadoop-4-6-015_ko.png
    Note

    The deploy mode determines where the driver (SparkContext) runs in the deployment environment. It has the following options:

    • client (default): the driver runs on the machine from which the Spark application is submitted
    • cluster: the driver runs on a node inside the cluster

    You can set the mode with the --deploy-mode CLI option of the spark-submit command or with the spark.submit.deployMode property in the Spark configuration.

