Submitting Spark Scala job
    Available in VPC

    This guide explains how to create Spark Scala jobs and submit them to a Cloud Hadoop cluster.

    Write Scala code and compile

    You can write a Spark application in Scala and package it into a JAR file in the following two ways.

    1. Use Scala in terminal
    2. Use IntelliJ SBT plugin

    1. Use Scala in terminal

    In this example, we'll write simple Scala code in the terminal that prints a greeting, then compile it and package it into a .jar file.

    Download Scala binary file

    Download the Scala binary and unzip it.

    If you're using Homebrew on macOS, then you can install it by executing the following command.

    brew install scala
    

    Configure environment variables

    Set the SCALA_HOME environment variable in your shell configuration file (e.g., .bashrc) as follows, and add $SCALA_HOME/bin to the PATH.

    export SCALA_HOME=/usr/local/share/scala
    export PATH=$PATH:$SCALA_HOME/bin
    

    Compose Spark application

    The following describes how to compose a Spark application.

    1. Start the Scala REPL by running the scala command.
    ❯ scala
    # Welcome to Scala version ...
    # Type in expressions to have them evaluated.
    # Type :help for more information.
    # scala>
    
    2. Write the HelloWorld object in the REPL, and save it as HelloWorld.scala as follows.
    object HelloWorld {
    def main(args: Array[String]): Unit = {
    println("Hello, world!")
      }
    }
    
    scala> :save HelloWorld.scala
    scala> :q
    
    3. Compile it with scalac by running the following command.
    ❯ scalac  HelloWorld.scala
    
    4. Use the ls command to check that the .class files exist and the compilation succeeded.
    ❯ ls HelloWorld*.class
    HelloWorld$.class HelloWorld.class
    

    Create JAR file

    You can create JAR files as follows.

    Note

    To execute the jar command, the Java SE Development Kit (JDK) must be installed.

    1. Go to the directory containing the HelloWorld*.class files, and use the following command to package them into a .jar file.
    ❯ jar cvfe HelloWorld.jar HelloWorld HelloWorld*.class
    added manifest
    adding: HelloWorld$.class(in = 670) (out= 432)(deflated 35%)
    adding: HelloWorld.class(in = 645) (out= 524)(deflated 18%)
    
    2. In the packaged JAR file, check that the HelloWorld class is set as the entry point of the application in MANIFEST.MF.
    ❯ unzip -q -c HelloWorld.jar META-INF/MANIFEST.MF
    Manifest-Version: 1.0
    Created-By: 1.8.0_181 (Oracle Corporation)
    Main-Class: HelloWorld # entry point
    

    2. Use IntelliJ SBT plugin

    This section explains how to configure an environment for developing and debugging Spark applications in IntelliJ, and how to build an example WordCount job.

    • Build manager: SBT
    • Example composition environment: Windows OS, IntelliJ Ultimate 2022.1.4

    Create project

    Here's how to create a project:

    Run IntelliJ.

    1. Search for Scala in the Plugins menu on the left and install it.
      chadoop-4-6-002_ko.png

    2. A restart is required for the plugin to take effect. Click the [Restart IDE] button to restart IntelliJ.
      chadoop-4-6-restart_ko.png

    3. Click Projects in the menu on the left side of the home screen, then click New Project.
      chadoop-4-6-003_ko.png

    4. After selecting Scala and sbt as follows, click the [Create] button.

      • Project name: Specify as WordCount
      • Select the versions for sbt and Scala
        chadoop-4-6-004_ko.png
    5. Check if the project has been created successfully.

      • Once a project is created, directories and files in the following structure can be seen by default.
        • .idea: IntelliJ configuration file
        • project: file used for compiling
        • src: source code. Most of the application code should be in src/main. src/test is a space for test scripts.
        • target: Compiled project is saved in this location
        • build.sbt: SBT configuration file
          chadoop-4-6-006_ko.png

    Import SBT library

    For IntelliJ to recognize Spark code, the spark-core library and its documentation must be imported.

    Note
    • Each spark-core library version is built for a specific Scala version, so check both the spark-core and Scala versions when importing the library.
    1. Check the artifact ID and the Scala version compatible with the spark-core library in the Maven repository.

      chadoop-4-6-008_ko.png

    2. Open build.sbt in the project view, and then add the following in the editor window.

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
    
    3. Check the Build console to confirm that the library was imported successfully.
      chadoop-4-6-007_ko.png
    Note

    When importing a library in SBT, use the following syntax.

    Group Id %% Artifact Id % Revision
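
    For reference, a minimal build.sbt sketch is shown below. The project name, version, and exact Scala patch version are assumptions chosen to match the wordcount_2.11-0.1.jar file produced later in this guide; adjust them to your environment.

    // Minimal build.sbt sketch (assumed values; adjust to your environment).
    // With these settings, `sbt package` produces target/scala-2.11/wordcount_2.11-0.1.jar.
    name := "wordcount"
    version := "0.1"
    scalaVersion := "2.11.8"

    // "%%" appends the Scala binary version to the artifact ID, so the line below
    // is equivalent to: "org.apache.spark" % "spark-core_2.11" % "1.6.0"
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"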
    

    Compose Spark application

    Here, a text file containing Shakespeare's sonnets (shakespeare.txt) is used as the dataset, and this section explains how to create an application that counts the words in it.

    1. Download shakespeare.txt and save in src/main/resources.

      • When running this application in a Cloud Hadoop cluster, upload the dataset to an S3 bucket or HDFS.
        chadoop-4-6-009_ko.png
    2. Select src > main to expand the directory, right-click the scala directory, and then click New > Scala Class.

    3. Create a class under WordCount/src/main/scala.

      • Kind: Object
        chadoop-4-6-010_ko.png
    4. Write the sample code below in WordCount.scala and run it to check that the configuration works.

    object WordCount {
        def main(args: Array[String]): Unit = {
          println("This is WordCount application")
        }
    }
    
    5. Check that the message below is printed successfully.
      chadoop-4-6-011_ko.png

    6. Delete the sample code in WordCount.scala, and write the code below to count the words in the Shakespeare text file.

    import org.apache.spark.{SparkConf, SparkContext}
    
    object WordCount {
    
      def main(args: Array[String]) : Unit = {
    
        //Create a SparkContext to initialize Spark
        val conf = new SparkConf()
        conf.setMaster("local")
        conf.setAppName("Word Count")
        val sc = new SparkContext(conf)
    
        // Load the text into a Spark RDD, which is a distributed representation of each line of text
        val textFile = sc.textFile("src/main/resources/shakespeare.txt")
    
        //word count
        val counts = textFile.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
    
        counts.foreach(println)
        println("Total words: " + counts.count())
        counts.saveAsTextFile("/tmp/sonnetWordCount")
      }
    
    }
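
    The split on single spaces above keeps punctuation attached to words and also counts empty tokens produced by blank lines. As an optional refinement using the same RDD API, the counts definition in the code above could be replaced with the following, which lowercases the text, splits on non-word characters, and drops empty tokens:

    // Optional refinement (drop-in replacement for `val counts` above).
    val counts = textFile
      .flatMap(line => line.toLowerCase.split("\\W+"))
      .filter(word => word.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)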
    
    Note

    Master URLs
    Master URLs differ depending on the Spark deployment environment.

    • Local (pseudo-cluster): local, local[N], local[*] (the value determines the number of threads used; "*" means as many threads as there are processors available to the JVM)

    • Clustered
      Spark Standalone: spark://host:port,host1:port1...
      Spark on Hadoop YARN: yarn
      Spark on Apache Mesos: mesos://
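
    For example, only the master URL passed to setMaster needs to change between these environments; a short sketch (the standalone host and port below are placeholders):

    import org.apache.spark.SparkConf

    // Examples of master URLs with the SparkConf API used above.
    val conf = new SparkConf().setAppName("Word Count")
    conf.setMaster("local[*]")               // local pseudo-cluster, one thread per available processor
    // conf.setMaster("spark://host:7077")   // Spark Standalone (placeholder host and port)
    // conf.setMaster("yarn-cluster")        // YARN with Spark 1.6; "yarn" in Spark 2 or later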

    7. Run WordCount.scala and check the output result.
      chadoop-4-6-012_ko.png

    Create JAR file

    1. Upload the dataset to an Object Storage bucket, and change the source code's resource file path as follows.
      • If you want to upload the dataset to HDFS and use it from there, use hdfs:// instead of s3a://.
    // FROM
    conf.setMaster("local")
    // TO
    conf.setMaster("yarn-cluster")
    
    // FROM
    val textFile = sc.textFile("src/main/resources/shakespeare.txt")
    // TO
    val textFile = sc.textFile("s3a://deepdrive-hue/tmp/shakespeare.txt")
    
    // FROM
    counts.saveAsTextFile("/tmp/sonnetWordCount");
    // TO
    counts.saveAsTextFile("s3a://deepdrive-hue/tmp/sonnetWordCount");
    
    Note

    As this example is based on Spark 1.6, you should specify yarn-cluster in conf.setMaster(). You can use yarn in Spark 2 or later.
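
    For reference only, the same job written against Spark 2.x would use SparkSession instead of building a SparkConf and SparkContext directly; a minimal sketch, assuming a Spark 2.x dependency in build.sbt and the same bucket paths as above:

    import org.apache.spark.sql.SparkSession

    // Spark 2.x sketch of the same word count job (assumes a spark-sql 2.x dependency).
    object WordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("Word Count")
          .master("yarn")                      // "yarn" is sufficient in Spark 2 or later
          .getOrCreate()

        val counts = spark.sparkContext
          .textFile("s3a://deepdrive-hue/tmp/shakespeare.txt")
          .flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("s3a://deepdrive-hue/tmp/sonnetWordCount")
        spark.stop()
      }
    }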

    2. Use the following command in the terminal to package the updated code into a compiled JAR that you can submit to your Cloud Hadoop cluster.
      • The JAR file contains the application code; dependencies declared in build.sbt, such as spark-core, are provided by the cluster at runtime and are not bundled by sbt package.
      • The sbt package command creates wordcount_2.11-0.1.jar under $PROJECT_HOME/target/scala-2.11.
    > cd ~/IdeaProjects/WordCount # PROJECT_HOME
    > sbt package
    

    chadoop-4-6-terminal_ko.png

    Submit Spark job to Cloud Hadoop cluster

    This section explains how to deploy and submit a Spark application (.jar) created on your local machine to the Cloud Hadoop cluster.

    Upload JAR to Object Storage

    Copy shakespeare.txt and the .jar file to the Object Storage bucket using HUE's S3 browser or the Object Storage console.

    • Please refer to Using HUE for more information about accessing and using HUE.
    • Please refer to Object Storage Guide for more information about Object Storage buckets.
      chadoop-4-6-013_ko.png

    Submit job

    Two ways to submit JAR files to a cluster are explained.

    Note

    The following properties should be properly configured in spark-defaults.conf.

    spark.hadoop.fs.s3a.access.key=<OBJECT-STORAGE-ACCESS-KEY>
    spark.hadoop.fs.s3a.endpoint=http://kr.objectstorage.ncloud.com
    spark.hadoop.fs.s3a.secret.key=<OBJECT-STORAGE-SECRET-KEY>
    spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
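
    If editing spark-defaults.conf is not convenient, the same properties can also be set programmatically on the SparkConf before the SparkContext is created; a sketch, assuming the same property names (the access key and secret key values are placeholders):

    import org.apache.spark.SparkConf

    // Sketch: set the same S3A properties at runtime instead of in spark-defaults.conf.
    // <OBJECT-STORAGE-ACCESS-KEY> and <OBJECT-STORAGE-SECRET-KEY> are placeholders.
    val conf = new SparkConf()
      .setAppName("Word Count")
      .set("spark.hadoop.fs.s3a.access.key", "<OBJECT-STORAGE-ACCESS-KEY>")
      .set("spark.hadoop.fs.s3a.secret.key", "<OBJECT-STORAGE-SECRET-KEY>")
      .set("spark.hadoop.fs.s3a.endpoint", "http://kr.objectstorage.ncloud.com")
      .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")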
    
    • Use HUE's Spark Submit Jar
      chadoop-4-6-014_ko.png

    • Submit from the spark client nodes

    1. Execute the spark-submit command on the cluster node where the Spark client is installed, as shown below.
    spark-submit --class WordCount --master yarn-cluster --deploy-mode cluster s3a://deepdrive-hue/tmp/wordcount_2.11-0.1.jar
    
    2. When the job is completed, check that the results are stored in the specified bucket path, as shown below.
      chadoop-4-6-015_ko.png
    Note

    The deploy mode determines where the driver (SparkContext) runs in the deployment environment. It has the following options:

    • client (default): the driver runs on the machine from which the Spark application is submitted
    • cluster: the driver runs on a node inside the cluster

    You can set the mode with the --deploy-mode CLI option of the spark-submit command or with the spark.submit.deployMode property in the Spark configuration.

