Using Spark 3 version
    Available in VPC

    Users can configure an arbitrary Spark execution environment on Cloud Hadoop.

    This guide describes how to configure a Spark execution environment by installing Spark 3 on a Cloud Hadoop cluster.

    Preparations

    This example assumes an environment where a client is already configured, and proceeds from there.
    Follow Steps 1 to 4 of the preparation below only if you need to configure a client using a Server.

    1. Check communications between server and cluster

    Check whether the server and the cluster can communicate with each other.
    The server must be registered to the ACG in which the Cloud Hadoop cluster is configured.
    Please refer to ACG settings for more information about ACGs.
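
    The check itself can be done with standard tools; the following is a minimal sketch, where the masked IP is a placeholder for one of your cluster nodes and 8080 is assumed to be the Ambari port.

    # Check basic reachability of a cluster node (placeholder IP, as elsewhere in this guide).
    $ ping -c 3 1**.**.*.*
    # If ICMP is blocked by the ACG, check a known open port instead (Ambari uses 8080).
    $ nc -zv 1**.**.*.* 8080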

    2. Register host name and IP information for the Cloud Hadoop cluster

    Register the host name and IP information of the Cloud Hadoop cluster in /etc/hosts.
    This information can be viewed from the Ambari UI.
    Please refer to Ambari UI for more information about accessing and using the Ambari UI.

    • The method to register the host name and IP information of the Cloud Hadoop cluster in /etc/hosts is as follows.
    # root user
    # echo 'IP            host name'      >> /etc/hosts
    echo  '1**.**.*.*  e-001-*****-**'  >> /etc/hosts
    echo  '1**.**.*.*  m-001-*****-**'  >> /etc/hosts
    echo  '1**.**.*.*  m-002-*****-**'  >> /etc/hosts
    echo  '1**.**.*.*  d-001-*****-**'  >> /etc/hosts
    echo  '1**.**.*.*  d-002-*****-**'  >> /etc/hosts
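
    After adding the entries, you can optionally confirm that the registered host names resolve; a minimal sketch using one of the host names above:

    # getent reads /etc/hosts through NSS, so the registered IP should be printed.
    $ getent hosts m-001-*****-**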
    

    3. Configure Hadoop client

    Since Spark uses Hadoop's environment variables, you need to configure a Hadoop client.
    You can install the hadoop-client package with a simple repository configuration and the yum command.

    The following describes how to install the hadoop-client package.

    1. Use the following command to configure the /etc/yum.repos.d/ambari-hdp-1.repo file.

      $ cat /etc/yum.repos.d/ambari-hdp-1.repo
      [HDP-3.1-repo-1]
      name=HDP-3.1-repo-1
      baseurl=http://public-repo-1.hortonworks.com/HDP/centos7/3.x/updates/3.1.0.0
      path=/
      enabled=1
      gpgcheck=0
      [HDP-3.1-GPL-repo-1]
      name=HDP-3.1-GPL-repo-1
      baseurl=http://public-repo-1.hortonworks.com/HDP-GPL/centos7/3.x/updates/3.1.0.0
      path=/
      enabled=1
      gpgcheck=0
      [HDP-UTILS-1.1.0.22-repo-1]
      name=HDP-UTILS-1.1.0.22-repo-1
      baseurl=http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.22/repos/centos7
      path=/
      enabled=1
      
    2. Use the following commands to install the hadoop-client package and apply the cluster's client configuration, and then check that hadoop-client has been created under /usr/hdp/current/ (a verification sketch follows the commands).

      $ yum clean all 
      $ yum install hadoop-client
      $ curl -u $AMBARI_ID:$AMBARI_PASS -H "X-Requested-By: ambari" -X GET http://$AMBARI_URI:8080/api/v1/clusters/$CLUSTER_NAME/services/HDFS/components/HDFS_CLIENT?format=client_config_tar > hdfs_client_conf.tar.gz
      $ tar -xvf hdfs_client_conf.tar.gz
      $ cp ~hdfs_client_conf/conf/* /usr/hdp/current/hadoop-client/conf/
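
      Once the configuration files are in place, a couple of standard Hadoop commands can verify the client; this is a minimal sketch and assumes the server can already reach the cluster.

      # Print the installed Hadoop version.
      $ hadoop version
      # List the HDFS root directory to confirm that the client configuration works.
      $ hdfs dfs -ls /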
      

    4. Check the installation status of JDK and Python 3

    JDK and Python 3 must be installed in advance.
    Previous Spark versions could use Python 2, but starting with Spark 3.0.0, only Python 3 is supported.

    Run the following command to install Python 3.

    $ yum install -y python3
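
    You can confirm that both prerequisites are available with the version checks below; this is a minimal sketch, and the exact versions printed depend on your server image.

    # Confirm that the JDK and Python 3 are installed.
    $ java -version
    $ python3 --version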
    

    Apply Spark 3.0.1 version

    1. Download Spark package

    Use the following commands to download the Spark package you want to use onto the server and decompress it.

    • Spark 3.0.1 download page: https://archive.apache.org/dist/spark/spark-3.0.1/
    • Since we're executing it from an environment where the Hadoop client is already configured, download Pre-built with user-provided Apache Hadoop (spark-3.0.1-bin-without-hadoop.tgz) and decompress it in any directory.
    $ wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-without-hadoop.tgz
    $ tar xvfz spark-3.0.1-bin-without-hadoop.tgz
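
    As a quick sanity check, you can list the decompressed directory; the path below assumes the package was extracted in the current directory.

    # The decompressed package should contain bin/, conf/, jars/, and other Spark directories.
    $ ls spark-3.0.1-bin-without-hadoop/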
    

    2. Configure the Spark environment variable

    Use the following commands to set the Spark environment variables, copy the configuration files, and copy the Hadoop-related JARs into the decompressed package.

    # Specify the decompressed Spark directory.
    $ SPARK_HOME=/path/to/spark-3.0.1-bin-without-hadoop
    $ SPARK_CONF_DIR=/path/to/spark-3.0.1-bin-without-hadoop/conf
    
    
    # Copy config file
    $ cp /usr/hdp/current/spark2-client/conf/* $SPARK_CONF_DIR/
    
    
    # Copy the JAR related to Hadoop to the Spark jars directory.
    $ cp -n /usr/hdp/current/spark2-client/jars/hadoop-*.jar $SPARK_HOME/jars
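
    You can check that the configuration files and Hadoop-related JARs were copied as expected; a minimal sketch:

    # Confirm that the copied configuration files and Hadoop JARs are in place.
    $ ls $SPARK_CONF_DIR/
    $ ls $SPARK_HOME/jars/hadoop-*.jar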
    

    Set the following environment variables in the shell where spark-submit is executed.

    $ export SPARK_HOME=/path/to/spark-3.0.1-bin-without-hadoop
    $ export SPARK_CONF_DIR=/path/to/spark-3.0.1-bin-without-hadoop/conf
    $ export SPARK_SUBMIT_OPTS="-Dhdp.version=3.1.0.0-78"
    $ export PATH=$SPARK_HOME/bin:$PATH
    $ export SPARK_DIST_CLASSPATH=`$HADOOP_COMMON_HOME/bin/hadoop classpath`
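
    These exports apply only to the current shell session. If you want them to persist across sessions, you could append them to the shell profile of the user who runs spark-submit; this is an optional sketch that assumes a bash login shell.

    # Optional: persist the settings in the user's ~/.bashrc (single quotes keep the values literal).
    $ echo 'export SPARK_HOME=/path/to/spark-3.0.1-bin-without-hadoop' >> ~/.bashrc
    $ echo 'export SPARK_CONF_DIR=/path/to/spark-3.0.1-bin-without-hadoop/conf' >> ~/.bashrc
    $ echo 'export SPARK_SUBMIT_OPTS="-Dhdp.version=3.1.0.0-78"' >> ~/.bashrc
    $ echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc
    $ echo 'export SPARK_DIST_CLASSPATH=`$HADOOP_COMMON_HOME/bin/hadoop classpath`' >> ~/.bashrc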
    

    3. Check operation

    Use the following command to check that Spark runs with the installed version.
    If the output shows version 3.0.1, Spark 3.0.1 is ready to use.

    $ pyspark --version
    


    4. Grant owner permissions

    Use the following commands to create a dedicated user folder under /user in HDFS and grant ownership of it to the user.
    Spark jobs run normally only when a folder for the user account {USER} exists under /user in HDFS.

    In the following example, the user is sshuser.

    $ sudo -u hdfs hadoop fs -mkdir /user/sshuser
    $ sudo -u hdfs hadoop fs -chown -R sshuser:hdfs /user/sshuser/
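
    You can confirm the folder and its ownership with a simple listing; sshuser is the example user from above.

    # The entry for /user/sshuser should show sshuser:hdfs as its owner and group.
    $ hadoop fs -ls /user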
    

    5. Execute PySpark and spark-shell

    Here's how to run PySpark and spark-shell.

    1. When executing PySpark, run it with the following options.

      $ pyspark --conf spark.driver.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
      --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
      --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
      --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro
      
    2. Use the command below to execute spark-shell with the same options (a spark-submit smoke test sketch follows the command).

      $ spark-shell --conf spark.driver.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
      --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
      --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
      --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
      --conf spark.kerberos.access.hadoopFileSystems=hdfs://<specify the name node to be used>
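
    As a further smoke test, you can submit one of the example applications bundled with Spark using the same -Dhdp.version options; this is a minimal sketch, and the example JAR under $SPARK_HOME/examples/jars/ is matched with a glob because the exact file name may differ.

    # Run the bundled SparkPi example on YARN as a smoke test.
    $ spark-submit --master yarn \
    --conf spark.driver.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
    --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
    --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_*.jar 100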
      
