Using Spark History Server


Available in VPC

You can create a private Spark History Server with the Spark History Server app and view only the tasks that you have performed. Data Forest supports the SPARK-HISTORYSERVER-3.1.2 app type.

Check Spark History Server app details

Once the app is created, you can view its details. If the Status in the app details is Stable, the app is running normally.
To view app details:

  1. In the VPC environment on the NAVER Cloud Platform console, navigate to Services > Big Data & Analytics > Data Forest.
  2. Click Data Forest > Apps on the left.
  3. Select the account that owns the app.
  4. Click the app to view its details.
  5. Review the app details.
    • Quick links
      • Spark History REST API: REST API provided by the Spark History Server.
      • Spark History UI: URL for accessing the Spark History UI.
      • shell-shs-0: Web shell URL for the container where Spark History is installed. Log in using your account name and password.
      • supervisor-shs-0: Web shell URL for the container where Supervisor is installed. Log in using your account name and password.
    • Component: The SPARK-HISTORYSERVER-3.1.2 type consists of a single shs component.
      • shs: Requests 1 core/4 GB of memory by default to run.

Access Spark History Server

The following shows the Spark History UI interface accessed from Quick links.


Spark History Server also provides a REST API.
You can access it at the Spark History REST API URL shown in the Quick links list of the app details.

When using the REST API from the web shell, call the Spark History REST API address confirmed earlier, as shown below. The following example is for the dataforest-test user.

$ curl -i -u dataforest-test https://dataforest-test--sparkhs-new--shs--18080.proxy.kr.df.naverncp.com/api/v1/version

Enter host password for user 'dataforest-test':
HTTP/1.1 200 OK
Server: nginx/1.14.0
Date: Fri, 14 Oct 2022 08:14:24 GMT
Content-Type: application/json
Content-Length: 25
Connection: keep-alive
Set-Cookie: hadoop.auth="u=dataforest-test&p=dataforest-test&t=authnz-ldap&e=1665771263843&s=v37ewQQe7TSTjntpg5rqUfZsRrRuCvfQux0P2onFy7I="; HttpOnly
Cache-Control: no-cache, no-store, must-revalidate
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Vary: Accept-Encoding, User-Agent

{
  "spark" : "3.1.2-1"
}
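Other endpoints of Spark's standard monitoring REST API are available at the same base URL. As a sketch, the following reuses the dataforest-test example URL from above to list completed applications recorded by the history server (substitute your own account and app names):

```shell
# List completed applications recorded by the history server.
# /api/v1/applications is part of Spark's standard monitoring REST API;
# the URL below is the dataforest-test example from earlier in this guide.
$ curl -u dataforest-test \
    "https://dataforest-test--sparkhs-new--shs--18080.proxy.kr.df.naverncp.com/api/v1/applications?status=completed"
```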

Set Spark tasks

To use a private Spark History Server, configure the following properties for your Spark tasks.

  • spark.eventLog.enabled: true
  • spark.eventLog.dir: Same as the spark.history.fs.logDirectory setting of the Spark History Server app. The default value is hdfs://koya/user/{USER}/spark2-history/. Enter the user account name in {USER}.
  • spark.yarn.historyServer.address: Address of the history server. After creating the app, enter the URL of Spark History UI in Quick links.

Example:
The following example applies to the dataforest-test user.

Property Name                        Value
spark.eventLog.enabled               true
spark.eventLog.dir                   hdfs://koya/user/dataforest-test/spark2-history/
spark.yarn.historyServer.address     https://dataforest-test--spark-historyserver--shs--18080.proxy.kr.df.naverncp.com

After changing the settings and submitting a task, you can view the submitted task's information in your private Spark History Server.
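The same properties can also be passed on the command line at submit time instead of being written to spark-defaults.conf. A sketch for the dataforest-test example follows; com.example.MyApp and my-app.jar are placeholders for your own application:

```shell
# The three --conf values match the dataforest-test settings above.
# com.example.MyApp and my-app.jar are placeholders for your own application.
$ spark-submit \
    --conf spark.eventLog.enabled=true \
    --conf spark.eventLog.dir=hdfs://koya/user/dataforest-test/spark2-history/ \
    --conf spark.yarn.historyServer.address=https://dataforest-test--spark-historyserver--shs--18080.proxy.kr.df.naverncp.com \
    --class com.example.MyApp \
    my-app.jar
```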

Configure private Spark tasks

To add settings for a private Spark task:

$ vi $SPARK_CONF_DIR/spark-defaults.conf
...
spark.eventLog.dir hdfs://koya/user/dataforest-test/spark2-history/
spark.eventLog.enabled true
spark.yarn.historyServer.address {Spark History UI}
...
Note

When you use the web shell, edit the configuration file that was distributed in advance, as follows.

  $ cd ~/conf
  $ vi spark-defaults.conf # Change configuration

Run PySpark and spark-shell

To run PySpark and spark-shell:

  1. When running PySpark, add the following options.
$ pyspark --conf spark.driver.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
--conf spark.yarn.am.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro
  2. Run spark-shell with the following command.
spark-shell --conf spark.driver.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
--conf spark.yarn.am.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/usr/hdp:/usr/hdp:ro \
--conf spark.kerberos.access.hadoopFileSystems=hdfs://<indicate the name node to be used>
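After a job has run, you can confirm that event logs are landing where the history server reads them by listing the configured log directory. The path below assumes the dataforest-test example; adjust it to your own account:

```shell
# Lists the Spark event log directory configured in spark.eventLog.dir
# (dataforest-test example path; replace with your own account name).
$ hdfs dfs -ls hdfs://koya/user/dataforest-test/spark2-history/
```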

When using Zeppelin

You can use a private Spark History Server app in the Spark interpreter of Apache Zeppelin. Add the history server settings by referring to how spark.yarn.queue is set in Using Zeppelin > Interpreter settings.
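As a sketch, the same three properties from the table above would be added as Spark interpreter properties; the values shown are the dataforest-test example, so use your own account name and Spark History UI URL:

```properties
# dataforest-test example values; replace with your own account and app URL.
spark.eventLog.enabled            true
spark.eventLog.dir                hdfs://koya/user/dataforest-test/spark2-history/
spark.yarn.historyServer.address  https://dataforest-test--spark-historyserver--shs--18080.proxy.kr.df.naverncp.com
```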