Using Spark 3 version

Available in VPC

Users can configure an arbitrary Spark execution environment.

This guide introduces the method for configuring a Spark execution environment by installing the Spark 3 version on Cloud Hadoop.

Preparations

This example assumes an environment where a client is already configured, and proceeds from there.
Follow preparation Steps 1 to 4 below only if you need to configure a client using the Server.

1. Check communications between server and cluster

Check if communications between the server and cluster are available.
The server needs to be registered to the ACG where the Cloud Hadoop cluster is configured.
Please refer to ACG settings for more information about ACG.
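
  • A simple way to check connectivity is to test whether a cluster node responds from the server, for example on the Ambari UI port (8080) used later in this guide. This is a minimal sketch; replace the placeholder with the actual address of a cluster node.
# Test basic reachability of a cluster node.
$ ping -c 3 <cluster node IP>

# Check that the Ambari UI port (8080) is reachable from the server.
$ curl -sv http://<cluster node IP>:8080 -o /dev/null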

2. Register host name and IP information for the Cloud Hadoop cluster

Register the host name and IP information of the Cloud Hadoop cluster to /etc/hosts.
This information can be viewed from the Ambari UI.
Please refer to Ambari UI for more information about accessing and using the Ambari UI.

  • The method to register the host name and IP information of the Cloud Hadoop cluster in /etc/hosts is as follows.
# root user
# echo 'IP            host name'      >> /etc/hosts
echo  '1**.**.*.*  e-001-*****-**'  >> /etc/hosts
echo  '1**.**.*.*  m-001-*****-**'  >> /etc/hosts
echo  '1**.**.*.*  m-002-*****-**'  >> /etc/hosts
echo  '1**.**.*.*  d-001-*****-**'  >> /etc/hosts
echo  '1**.**.*.*  d-002-*****-**'  >> /etc/hosts
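
  • You can also look up the host name and IP information from the command line through the Ambari REST API instead of the Ambari UI. This is a sketch; $AMBARI_ID, $AMBARI_PASS, $AMBARI_URI, and $CLUSTER_NAME are the same values used for the hadoop-client configuration in step 3 below.
# List the cluster hosts together with their IP addresses.
$ curl -u $AMBARI_ID:$AMBARI_PASS -H "X-Requested-By: ambari" \
  -X GET "http://$AMBARI_URI:8080/api/v1/clusters/$CLUSTER_NAME/hosts?fields=Hosts/ip"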

3. Configure Hadoop client

Since Spark uses Hadoop's environment variables, you need to configure a Hadoop client.
You can install the hadoop-client package with a simple repository configuration and the yum command.

The following describes how to install the hadoop-client package.

  1. Configure the /etc/yum.repos.d/ambari-hdp-1.repo file as shown below.

    $ cat /etc/yum.repos.d/ambari-hdp-1.repo
    [HDP-3.1-repo-1]
    name=HDP-3.1-repo-1
    baseurl=http://public-repo-1.hortonworks.com/HDP/centos7/3.x/updates/3.1.0.0
    path=/
    enabled=1
    gpgcheck=0
    [HDP-3.1-GPL-repo-1]
    name=HDP-3.1-GPL-repo-1
    baseurl=http://public-repo-1.hortonworks.com/HDP-GPL/centos7/3.x/updates/3.1.0.0
    path=/
    enabled=1
    gpgcheck=0
    [HDP-UTILS-1.1.0.22-repo-1]
    name=HDP-UTILS-1.1.0.22-repo-1
    baseurl=http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.22/repos/centos7
    path=/
    enabled=1
    
  2. Use the following commands to install the hadoop-client package and apply the client configuration downloaded from Ambari, then check that hadoop-client has been created under /usr/hdp/current/.

    $ yum clean all 
    $ yum install hadoop-client
    $ curl -u $AMBARI_ID:$AMBARI_PASS -H "X-Requested-By: ambari" -X GET http://$AMBARI_URI:8080/api/v1/clusters/$CLUSTER_NAME/services/HDFS/components/HDFS_CLIENT?format=client_config_tar > hdfs_client_conf.tar.gz
    $ tar -xvf hdfs_client_conf.tar.gz
    $ cp ~hdfs_client_conf/conf/* /usr/hdp/current/hadoop-client/conf/
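
  3. (Optional) Verify that the client can reach HDFS. The following is a minimal check.

    # Check the installed Hadoop client version.
    $ hadoop version

    # List the HDFS root directory to confirm that the client configuration works.
    $ hadoop fs -ls /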
    

4. Check the installation status of JDK and Python 3

JDK and Python 3 must be installed in advance.
Python 2 could be used with earlier Spark versions, but starting from Spark 3.0.0, only Python 3 can be used.

Run the following command to install Python 3.

$ yum install -y python3
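
You can check the installed versions with the following commands. The exact output depends on your environment.

$ java -version
$ python3 --version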

Apply Spark 3.0.1 version

1. Download Spark package

Use the following commands to download the Spark package you want to use to the server and decompress it.

  • Spark 3.0.1 download page: https://archive.apache.org/dist/spark/spark-3.0.1/
  • Since we're executing it from an environment where the Hadoop client is already configured, download Pre-built with user-provided Apache Hadoop (spark-3.0.1-bin-without-hadoop.tgz) and decompress it in any directory.
$ wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-without-hadoop.tgz
$ tar xvfz spark-3.0.1-bin-without-hadoop.tgz

2. Configure the Spark environment variable

Configure the Spark environment variables with the following commands, and copy the existing configuration files and Hadoop-related JARs from the spark2-client directory into the decompressed package.

# Specify the decompressed Spark directory.
$ SPARK_HOME=/path/to/spark-3.0.1-bin-without-hadoop
$ SPARK_CONF_DIR=/path/to/spark-3.0.1-bin-without-hadoop/conf


# Copy config file
$ cp /usr/hdp/current/spark2-client/conf/* $SPARK_CONF_DIR/


# Copy the JAR related to Hadoop to the Spark jars directory.
$ cp -n /usr/hdp/current/spark2-client/jars/hadoop-*.jar $SPARK_HOME/jars

Set the following environment variables in the location where spark-submit is executed.

$ export SPARK_HOME=/path/to/spark-3.0.1-bin-without-hadoop
$ export SPARK_CONF_DIR=/path/to/spark-3.0.1-bin-without-hadoop/conf
$ export SPARK_SUBMIT_OPTS="-Dhdp.version=3.1.0.0-78"
$ export PATH=$SPARK_HOME/bin:$PATH
$ export SPARK_DIST_CLASSPATH=`$HADOOP_COMMON_HOME/bin/hadoop classpath`
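
These exports only apply to the current shell session. If you want them to persist, you can append them to the shell profile of the account that runs spark-submit, for example as follows (a sketch assuming a bash login shell and that HADOOP_COMMON_HOME is set, as in the export above).

$ cat >> ~/.bashrc <<'EOF'
export SPARK_HOME=/path/to/spark-3.0.1-bin-without-hadoop
export SPARK_CONF_DIR=/path/to/spark-3.0.1-bin-without-hadoop/conf
export SPARK_SUBMIT_OPTS="-Dhdp.version=3.1.0.0-78"
export PATH=$SPARK_HOME/bin:$PATH
export SPARK_DIST_CLASSPATH=$($HADOOP_COMMON_HOME/bin/hadoop classpath)
EOF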

3. Check operation

Use the following command to check that Spark runs with the newly installed version.
If the output shows version 3.0.1, you can use Spark 3.0.1.

$ pyspark --version


4. Grant owner permissions

Use the following commands to create a dedicated user folder under /user in HDFS and grant ownership to that user.
Spark jobs run normally only when a folder for the user account {USER} exists under /user in HDFS.

In the following example, the user is sshuser.

$ sudo -u hdfs hadoop fs -mkdir /user/sshuser
$ sudo -u hdfs hadoop fs -chown -R sshuser:hdfs /user/sshuser/
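
You can check that the folder was created with the correct owner as follows.

$ hadoop fs -ls /user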

5. Execute PySpark and spark-shell

Here's how to run PySpark and spark-shell.

  1. When executing PySpark, run it with the following options.

    $ pyspark --conf spark.driver.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
    --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro
    
  2. Use the command below to execute spark-shell with the same options.

    $ spark-shell --conf spark.driver.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
    --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
    --conf spark.kerberos.access.hadoopFileSystems=hdfs://<specify the name node to be used>
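
The same options also apply when submitting a batch job with spark-submit. The following is a sketch that runs the SparkPi example bundled with the Spark distribution on YARN; the examples JAR file name is assumed from the 3.0.1 package and may differ depending on the package you downloaded.

$ spark-submit --master yarn --deploy-mode client \
--conf spark.driver.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
--conf spark.yarn.am.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
--class org.apache.spark.examples.SparkPi \
$SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 10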
    
