Using Spark 3 version

Available in VPC

Users can configure an arbitrary Spark execution environment.

This guide introduces the method for configuring a Spark execution environment by installing the Spark 3 version on Cloud Hadoop.

Preparations

This example assumes an environment where a client is already configured, and proceeds from there.
Follow preparation Steps 1 to 4 below only if you need to configure a client using the Server.

1. Check communications between server and cluster

Check if communications between the server and cluster are available.
The server needs to be registered to the ACG where the Cloud Hadoop cluster is configured.
Please refer to ACG settings for more information about ACG.
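
  • A simple way to check connectivity is to test whether a cluster node responds from the server, for example on the Ambari UI port (8080) used later in this guide. This is a minimal sketch; replace the placeholder with the actual address of a cluster node.
# Test basic reachability of a cluster node.
$ ping -c 3 <cluster node IP>

# Check that the Ambari UI port (8080) is reachable from the server.
$ curl -sv http://<cluster node IP>:8080 -o /dev/null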

2. Register host name and IP information for the Cloud Hadoop cluster

Register the host name and IP information of the Cloud Hadoop cluster to /etc/hosts.
This information can be viewed from the Ambari UI.
Please refer to Ambari UI for more information about accessing and using the Ambari UI.

  • The method to register the host name and IP information of the Cloud Hadoop cluster in /etc/hosts is as follows.
# root user
# echo 'IP            host name'      >> /etc/hosts
echo  '1**.**.*.*  e-001-*****-**'  >> /etc/hosts
echo  '1**.**.*.*  m-001-*****-**'  >> /etc/hosts
echo  '1**.**.*.*  m-002-*****-**'  >> /etc/hosts
echo  '1**.**.*.*  d-001-*****-**'  >> /etc/hosts
echo  '1**.**.*.*  d-002-*****-**'  >> /etc/hosts
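
  • You can also look up the host name and IP information from the command line through the Ambari REST API instead of the Ambari UI. This is a sketch; $AMBARI_ID, $AMBARI_PASS, $AMBARI_URI, and $CLUSTER_NAME are the same values used for the hadoop-client configuration in step 3 below.
# List the cluster hosts together with their IP addresses.
$ curl -u $AMBARI_ID:$AMBARI_PASS -H "X-Requested-By: ambari" \
  -X GET "http://$AMBARI_URI:8080/api/v1/clusters/$CLUSTER_NAME/hosts?fields=Hosts/ip"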

3. Configure Hadoop client

Since Spark uses Hadoop's environment variables, you need to configure a Hadoop client.
You can install the hadoop-client package with a simple repository configuration and the yum command.

The following describes how to install the hadoop-client package.

  1. Configure the /etc/yum.repos.d/ambari-hdp-1.repo file as shown below.

    $ cat /etc/yum.repos.d/ambari-hdp-1.repo
    [HDP-3.1-repo-1]
    name=HDP-3.1-repo-1
    baseurl=http://public-repo-1.hortonworks.com/HDP/centos7/3.x/updates/3.1.0.0
    path=/
    enabled=1
    gpgcheck=0
    [HDP-3.1-GPL-repo-1]
    name=HDP-3.1-GPL-repo-1
    baseurl=http://public-repo-1.hortonworks.com/HDP-GPL/centos7/3.x/updates/3.1.0.0
    path=/
    enabled=1
    gpgcheck=0
    [HDP-UTILS-1.1.0.22-repo-1]
    name=HDP-UTILS-1.1.0.22-repo-1
    baseurl=http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.22/repos/centos7
    path=/
    enabled=1
    
  2. Use the following commands to install the hadoop-client package and apply the client configuration downloaded from Ambari, then check that hadoop-client has been created under /usr/hdp/current/.

    $ yum clean all 
    $ yum install hadoop-client
    $ curl -u $AMBARI_ID:$AMBARI_PASS -H "X-Requested-By: ambari" -X GET http://$AMBARI_URI:8080/api/v1/clusters/$CLUSTER_NAME/services/HDFS/components/HDFS_CLIENT?format=client_config_tar > hdfs_client_conf.tar.gz
    $ tar -xvf hdfs_client_conf.tar.gz
    $ cp ~hdfs_client_conf/conf/* /usr/hdp/current/hadoop-client/conf/
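
  3. (Optional) Verify that the client can reach HDFS. The following is a minimal check.

    # Check the installed Hadoop client version.
    $ hadoop version

    # List the HDFS root directory to confirm that the client configuration works.
    $ hadoop fs -ls /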
    

4. Check the installation status of JDK and Python 3

JDK and Python 3 must be installed in advance.
Python 2 could be used with earlier Spark versions, but starting from Spark 3.0.0, only Python 3 can be used.

Run the following command to install Python 3.

$ yum install -y python3
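
You can check the installed versions with the following commands. The exact output depends on your environment.

$ java -version
$ python3 --version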

Apply Spark 3.0.1 version

1. Download Spark package

Use the following commands to download the Spark package you want to use to the server and decompress it.

  • Spark 3.0.1 download page: https://archive.apache.org/dist/spark/spark-3.0.1/
  • Since we're executing it from an environment where the Hadoop client is already configured, download Pre-built with user-provided Apache Hadoop (spark-3.0.1-bin-without-hadoop.tgz) and decompress it in any directory.
$ wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-without-hadoop.tgz
$ tar xvfz spark-3.0.1-bin-without-hadoop.tgz

2. Configure the Spark environment variable

Configure the Spark environment variables with the following commands, and copy the existing configuration files and Hadoop-related JARs from the spark2-client directory into the decompressed package.

# Specify the decompressed Spark directory.
$ SPARK_HOME=/path/to/spark-3.0.1-bin-without-hadoop
$ SPARK_CONF_DIR=/path/to/spark-3.0.1-bin-without-hadoop/conf


# Copy config file
$ cp /usr/hdp/current/spark2-client/conf/* $SPARK_CONF_DIR/


# Copy the JAR related to Hadoop to the Spark jars directory.
$ cp -n /usr/hdp/current/spark2-client/jars/hadoop-*.jar $SPARK_HOME/jars

Set the following environment variables in the location where spark-submit is executed.

$ export SPARK_HOME=/path/to/spark-3.0.1-bin-without-hadoop
$ export SPARK_CONF_DIR=/path/to/spark-3.0.1-bin-without-hadoop/conf
$ export SPARK_SUBMIT_OPTS="-Dhdp.version=3.1.0.0-78"
$ export PATH=$SPARK_HOME/bin:$PATH
$ export SPARK_DIST_CLASSPATH=`$HADOOP_COMMON_HOME/bin/hadoop classpath`
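
These exports only apply to the current shell session. If you want them to persist, you can append them to the shell profile of the account that runs spark-submit, for example as follows (a sketch assuming a bash login shell and that HADOOP_COMMON_HOME is set, as in the export above).

$ cat >> ~/.bashrc <<'EOF'
export SPARK_HOME=/path/to/spark-3.0.1-bin-without-hadoop
export SPARK_CONF_DIR=/path/to/spark-3.0.1-bin-without-hadoop/conf
export SPARK_SUBMIT_OPTS="-Dhdp.version=3.1.0.0-78"
export PATH=$SPARK_HOME/bin:$PATH
export SPARK_DIST_CLASSPATH=$($HADOOP_COMMON_HOME/bin/hadoop classpath)
EOF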

3. Check operation

Use the following command to check that Spark runs with the newly installed version.
If the output shows version 3.0.1, you can use Spark 3.0.1.

$ pyspark --version


4. Grant owner permissions

Use the following commands to create a dedicated user folder under /user in HDFS and grant ownership to that user.
Spark jobs run normally only when a folder for the user account {USER} exists under /user in HDFS.

In the following example, the user is sshuser.

$ sudo -u hdfs hadoop fs -mkdir /user/sshuser
$ sudo -u hdfs hadoop fs -chown -R sshuser:hdfs /user/sshuser/
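
You can check that the folder was created with the correct owner as follows.

$ hadoop fs -ls /user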

5. Execute PySpark and spark-shell

Here's how to run PySpark and spark-shell.

  1. When executing PySpark, run it with the following options.

    $ pyspark --conf spark.driver.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
    --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro
    
  2. Use the command below to execute spark-shell with the same options.

    $ spark-shell --conf spark.driver.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
    --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
    --conf spark.kerberos.access.hadoopFileSystems=hdfs://<specify the name node to be used>
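
The same options also apply when submitting a batch job with spark-submit. The following is a sketch that runs the SparkPi example bundled with the Spark distribution on YARN; the examples JAR file name is assumed from the 3.0.1 package and may differ depending on the package you downloaded.

$ spark-submit --master yarn --deploy-mode client \
--conf spark.driver.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
--conf spark.yarn.am.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
--class org.apache.spark.examples.SparkPi \
$SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 10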
    
