Using Spark 3 version
Available in VPC
Users can configure an arbitrary Spark execution environment.
This guide explains how to configure a Spark execution environment by installing Spark 3 on Cloud Hadoop.
Preparations
This example assumes an environment where the client is already configured, and proceeds from there.
Follow Steps 1 to 4 below only if you still need to configure a client using a server.
1. Check communications between server and cluster
Check if communications between the server and cluster are available.
The server needs to be registered to the ACG where the Cloud Hadoop cluster is configured.
Please refer to ACG settings for more information about ACG.
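Before moving on, it can help to confirm reachability from the command line. The following is a minimal sketch using bash's built-in /dev/tcp; the host name `m-001-example` and the ports (8080 for Ambari, 8020 for the HDFS NameNode) are illustrative placeholders, not values from this guide.

```shell
#!/usr/bin/env bash
# Minimal reachability check using bash's built-in /dev/tcp.
# Host name and ports below are examples; substitute your cluster's values.

check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "OK   ${host}:${port}"
  else
    echo "FAIL ${host}:${port}"
  fi
}

# Example: Ambari UI (8080) and HDFS NameNode RPC (8020) on a hypothetical host
check_port m-001-example 8080
check_port m-001-example 8020
```

If a check fails, revisit the ACG rules before continuing.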
2. Register host name and IP information for the Cloud Hadoop cluster
Register the host name and IP information of the Cloud Hadoop cluster in /etc/hosts.
This information can be viewed from the Ambari UI.
Please refer to Ambari UI for more information about accessing and using the Ambari UI.
- Register the host name and IP information of the Cloud Hadoop cluster in /etc/hosts as follows.
# root user
# echo 'IP host name' >> /etc/hosts
echo '1**.**.*.* e-001-*****-**' >> /etc/hosts
echo '1**.**.*.* m-001-*****-**' >> /etc/hosts
echo '1**.**.*.* m-002-*****-**' >> /etc/hosts
echo '1**.**.*.* d-001-*****-**' >> /etc/hosts
echo '1**.**.*.* d-002-*****-**' >> /etc/hosts
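The echo lines above append unconditionally, so running them twice leaves duplicate entries. A hedged sketch of an idempotent variant follows; it writes to a temporary file for demonstration, and the IP and host name are placeholders — point HOSTS_FILE at /etc/hosts (as root) and use your cluster's values for real.

```shell
#!/usr/bin/env bash
# Idempotent variant: add a host entry only if it is not already present.
# Uses a temp file for demonstration; set HOSTS_FILE=/etc/hosts when run as root.

HOSTS_FILE=$(mktemp)

add_host() {
  local entry="$1"
  grep -qxF "$entry" "$HOSTS_FILE" || echo "$entry" >> "$HOSTS_FILE"
}

add_host '10.0.0.11 e-001-example'
add_host '10.0.0.11 e-001-example'   # second call is a no-op

grep -c 'e-001-example' "$HOSTS_FILE"   # → 1
```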
3. Configure Hadoop client
Since Spark uses the environment variables of Hadoop, you need to configure a Hadoop client.
You can install the hadoop-client package with a simple repo configuration and the yum command.
The following describes how to install the hadoop-client package.
Use the following command to configure the /etc/yum.repos.d/ambari-hdp-1.repo file.

$ cat /etc/yum.repos.d/ambari-hdp-1.repo
[HDP-3.1-repo-1]
name=HDP-3.1-repo-1
baseurl=http://public-repo-1.hortonworks.com/HDP/centos7/3.x/updates/3.1.0.0
path=/
enabled=1
gpgcheck=0
[HDP-3.1-GPL-repo-1]
name=HDP-3.1-GPL-repo-1
baseurl=http://public-repo-1.hortonworks.com/HDP-GPL/centos7/3.x/updates/3.1.0.0
path=/
enabled=1
gpgcheck=0
[HDP-UTILS-1.1.0.22-repo-1]
name=HDP-UTILS-1.1.0.22-repo-1
baseurl=http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.22/repos/centos7
path=/
enabled=1
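If you prefer to script the repo setup, the definition can be written with a heredoc and sanity-checked before handing it to yum. This is a sketch: it writes to a temporary file and shows only the first stanza — on the real server the target is /etc/yum.repos.d/ambari-hdp-1.repo with all three stanzas.

```shell
#!/usr/bin/env bash
# Write an HDP repo stanza via heredoc, then sanity-check the baseurl line.
# Writes to a temp file here; use /etc/yum.repos.d/ambari-hdp-1.repo on the server.

REPO_FILE=$(mktemp)

cat > "$REPO_FILE" <<'EOF'
[HDP-3.1-repo-1]
name=HDP-3.1-repo-1
baseurl=http://public-repo-1.hortonworks.com/HDP/centos7/3.x/updates/3.1.0.0
path=/
enabled=1
gpgcheck=0
EOF

# Verify the section carries a baseurl before handing the file to yum
grep -c '^baseurl=' "$REPO_FILE"   # → 1
```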
Use the following commands, then check that hadoop-client has been created under /usr/hdp/current/.

$ yum clean all
$ yum install hadoop-client
$ curl -u $AMBARI_ID:$AMBARI_PASS -H "X-Requested-By: ambari" -X GET http://$AMBARI_URI:8080/api/v1/clusters/$CLUSTER_NAME/services/HDFS/components/HDFS_CLIENT?format=client_config_tar > hdfs_client_conf.tar.gz
$ tar -xvf hdfs_client_conf.tar.gz
$ cp ~/hdfs_client_conf/conf/* /usr/hdp/current/hadoop-client/conf/
4. Check the installation status of JDK and Python 3
JDK and Python 3 must be installed in advance.
Python 2 could be used with earlier Spark versions, but starting from Spark 3.0.0, only Python 3 can be used.
Run the following command to install Python 3.
$ yum install -y python3
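After installation, a quick check that the interpreter on PATH is actually Python 3 can save a confusing failure later. A minimal sketch:

```shell
#!/usr/bin/env bash
# Confirm that python3 is installed and reports major version 3.

if ! command -v python3 >/dev/null; then
  echo "python3 not found; install it first (e.g. yum install -y python3)"
  exit 1
fi

major=$(python3 -c 'import sys; print(sys.version_info[0])')
echo "Detected Python major version: $major"
```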
Apply Spark 3.0.1 version
1. Download Spark package
Use the following command to download and decompress the Spark package you want to use in the server.
- Spark 3.0.1 download page: https://archive.apache.org/dist/spark/spark-3.0.1/
- Since we're executing from an environment where the Hadoop client is already configured, download the Pre-built with user-provided Apache Hadoop package (spark-3.0.1-bin-without-hadoop.tgz) and decompress it in any directory.
$ wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-without-hadoop.tgz
$ tar xvfz spark-3.0.1-bin-without-hadoop.tgz
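Before extracting, it is good practice to verify the downloaded archive against a SHA-512 checksum (the Apache archive publishes one alongside each release; its exact format may differ from plain `sha512sum` output). The sketch below demonstrates the check on a locally created sample file so it is self-contained — substitute the real tarball and its checksum file in practice.

```shell
#!/usr/bin/env bash
# Sketch: verify an archive against a SHA-512 checksum file before extracting.
# Demonstrated on a locally created sample file; for the real package, fetch
# the checksum published next to spark-3.0.1-bin-without-hadoop.tgz.

cd "$(mktemp -d)"
echo 'sample archive contents' > sample.tgz
sha512sum sample.tgz > sample.tgz.sha512

if sha512sum -c sample.tgz.sha512; then
  echo "checksum OK, safe to extract"
fi
```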
2. Configure the Spark environment variable
Configure Spark environment variables using the following command, and copy the Hadoop jar to the decompressed package.
# Specify the decompressed Spark directory.
$ SPARK_HOME=/path/to/spark-3.0.1-bin-without-hadoop
$ SPARK_CONF_DIR=/path/to/spark-3.0.1-bin-without-hadoop/conf
# Copy config file
$ cp /usr/hdp/current/spark2-client/conf/* $SPARK_CONF_DIR/
# Copy the JAR related to Hadoop to the Spark jars directory.
$ cp -n /usr/hdp/current/spark2-client/jars/hadoop-*.jar $SPARK_HOME/jars
Set the following environment variables from the location where spark-submit is executed.
$ export SPARK_HOME=/path/to/spark-3.0.1-bin-without-hadoop
$ export SPARK_CONF_DIR=/path/to/spark-3.0.1-bin-without-hadoop/conf
$ export SPARK_SUBMIT_OPTS="-Dhdp.version=3.1.0.0-78"
$ export PATH=$SPARK_HOME/bin:$PATH
$ export SPARK_DIST_CLASSPATH=`$HADOOP_COMMON_HOME/bin/hadoop classpath`
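Rather than exporting these by hand in every session, the variables can be collected into one env file and sourced. This is a sketch using the placeholder paths from above; SPARK_DIST_CLASSPATH is omitted here because it needs the hadoop binary on the server, so keep that export alongside the others there.

```shell
#!/usr/bin/env bash
# Collect the Spark environment variables into one file and source it.
# Paths are the placeholders from above; adjust to your install location.
# On the server, also add: export SPARK_DIST_CLASSPATH=$($HADOOP_COMMON_HOME/bin/hadoop classpath)

ENV_FILE=$(mktemp)

cat > "$ENV_FILE" <<'EOF'
export SPARK_HOME=/path/to/spark-3.0.1-bin-without-hadoop
export SPARK_CONF_DIR=$SPARK_HOME/conf
export SPARK_SUBMIT_OPTS="-Dhdp.version=3.1.0.0-78"
export PATH=$SPARK_HOME/bin:$PATH
EOF

source "$ENV_FILE"
echo "$SPARK_HOME"   # → /path/to/spark-3.0.1-bin-without-hadoop
```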
3. Check operation
Use the following command to check that Spark runs with the installed version information.
If you see output like the following, Spark 3.0.1 is ready to use.
$ pyspark --version
4. Grant owner permissions
Use the following command to create a dedicated user folder under /user, and grant the owner permissions.
Spark jobs run normally only when the folder of the user account {USER} exists under /user in HDFS.
In the following example, the user is sshuser.
$ sudo -u hdfs hadoop fs -mkdir /user/sshuser
$ sudo -u hdfs hadoop fs -chown -R sshuser:hdfs /user/sshuser/
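When several accounts need home folders, the two commands above generalize naturally. The sketch below only prints the commands for each user so it can be reviewed (and run anywhere); drop the echoes, or pipe the output to a shell, to execute them on the cluster. The user name `analyst1` is a made-up example.

```shell
#!/usr/bin/env bash
# Generate the HDFS home-folder commands for a list of users.
# Prints the commands; pipe to a shell (or remove the echoes) to execute.

make_user_dir_cmds() {
  local user=$1
  echo "sudo -u hdfs hadoop fs -mkdir /user/${user}"
  echo "sudo -u hdfs hadoop fs -chown -R ${user}:hdfs /user/${user}/"
}

# "analyst1" is a hypothetical second account for illustration
for u in sshuser analyst1; do
  make_user_dir_cmds "$u"
done
```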
5. Execute PySpark and spark-shell
Here's how to run PySpark and spark-shell.
When executing PySpark, add the following options to run.
$ pyspark --conf spark.driver.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
    --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro
Use the command below to execute spark-shell as well.

$ spark-shell --conf spark.driver.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
    --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
    --conf spark.kerberos.access.hadoopFileSystems=hdfs://<specify the name node to be used>
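The same four --conf options appear in both commands above; collecting them in a bash array avoids drift between pyspark, spark-shell, and spark-submit invocations. A sketch, using the hdp.version value from this guide:

```shell
#!/usr/bin/env bash
# Keep the shared --conf options in one array and reuse them for
# pyspark, spark-shell, and spark-submit invocations.

HDP_VERSION=3.1.0.0-78

SPARK_CONFS=(
  --conf "spark.driver.extraJavaOptions=-Dhdp.version=${HDP_VERSION}"
  --conf "spark.yarn.am.extraJavaOptions=-Dhdp.version=${HDP_VERSION}"
  --conf "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro"
  --conf "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro"
)

# Usage (on the cluster): pyspark "${SPARK_CONFS[@]}"
echo "${#SPARK_CONFS[@]} arguments prepared"   # → 8 arguments prepared
```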