SLURM Sbatch Script for Spark Applications

Spark applications use parallelisme by distributing computations to multiple executors under the supervision of a driver process. Each executor uses its cores for running independent tasks from the application. The executor processes are placed on the nodes in a cluster under the supervision of a cluster manager. The Apache Spark software provides components to set up the standalone cluster on a set of nodes. The Spark standalone cluster has a master for controlling the driver, running on one core of a node, called the master host, and one or more workers controlling the executors, running on a prescribed number of cores on different worker nodes. The master host can simultaneous be one of the worker nodes.

The SLURM workload management allowes to allocate a set of hardware resources. It will give exclusive use to a number of cores on a number of different nodes on GWDG's compute cluster. The sbatch <scriptfile> command allows to specify in <scriptfile> resources to be allocated and the commands to be executed on these resources.

The following scriptfile start-standalone.script will start a Spark stanalone cluster on resources allocated by the SLURM management system:

#!/bin/bash

#SBATCH --partition=medium
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2048
#SBATCH --time=60:00
#SBATCH --output=outfile-%J
 
. setenv.sh
$SPARK_HOME/sbin/start-all.sh
. wait-worker.sh
 

sleep infinity

The Spark standalone cluster will provide as many executors as specified in the --ntask option, with a number of cores per executor as specified in the --cpus-per-task option and an amount of memory per executor equal to the number of cores times the value specified in the --mem-per-cpu option. The SLURM system will allocate a number of nodes to supply the resources needed for the specified configuration of the Sparrk application. The number of workers in the Spark standalone cluster will be equal to the number of allocated nodes. The Spark standalone cluster will be alive for the time required by the option --time, or until the command scancel <slurm_job_id> is launched, where <slurm_job_id> is the job-id returned from the sbatch command.

This configuration of the Spark standalone cluster and the SparkContext is set by sourcing the script setenv.sh:

setenv.sh

##setenv.sh##
module load JAVA/jdk1.8.0_31 spark

export SPARK_CONF_DIR=~/SparkConf
mkdir -p $SPARK_CONF_DIR

env=$SPARK_CONF_DIR/spark-env.sh
echo "export SPARK_LOG_DIR=~/SparkLog" > $env
echo "export SPARK_WORKER_DIR=~/SparkWorker" >> $env
echo "export SLURM_MEM_PER_CPU=$SLURM_MEM_PER_CPU" >> $env
echo 'export SPARK_WORKER_CORES=`nproc`' >> $env
echo 'export SPARK_WORKER_MEMORY=$(( $SPARK_WORKER_CORES*$SLURM_MEM_PER_CPU ))M' >> $env

echo "export SPARK_HOME=$SPARK_HOME" > ~/.bashrc
echo "export JAVA_HOME=$JAVA_HOME" >> ~/.bashrc
echo "export SPARK_CONF_DIR=$SPARK_CONF_DIR" >> ~/.bashrc

scontrol show hostname $SLURM_JOB_NODELIST > $SPARK_CONF_DIR/slaves

conf=$SPARK_CONF_DIR/spark-defaults.conf
echo "spark.default.parallelism" $(( $SLURM_CPUS_PER_TASK * $SLURM_NTASKS ))> $conf
echo "spark.submit.deployMode" client >> $conf
echo "spark.master" spark://`hostname`:7077 >> $conf
echo "spark.executor.cores" $SLURM_CPUS_PER_TASK >> $conf
echo "spark.executor.memory" $(( $SLURM_CPUS_PER_TASK*$SLURM_MEM_PER_CPU ))M >> $conf

The Spark standalone cluster is started by the start-all.sh command, which returns after starting the master und worker processes. The standalone cluster will not be ready for executing Spark applications, until connections between master and workers have been established and the worker processes have been registerd with the master process. Sourcing the script wait-workers.sh will wait until the registering of all workers has succeeded:

wait-worker.sh

##wait-worker.sh##
. $SPARK_CONF_DIR/spark-env.sh
num_workers=`cat $SPARK_CONF_DIR/slaves|wc -l`
echo number of workers to be registered: $num_workers
master_logfile=`ls -tr ${SPARK_LOG_DIR}/*master* |tail -1`
worker_logfiles=`ls -tr ${SPARK_LOG_DIR}/*worker* |tail -$num_workers`
steptime=3
for i in {1..100}
do
  sleep $steptime
  num_reg=` grep 'registered' $worker_logfiles|wc -l`
  if [ $num_reg -eq $num_workers ]
  then
     break
  fi
done
echo registered workers after $((i * steptime)) seconds  :
for file in $worker_logfiles
do
  grep 'registered' $file
done
grep 'Starting Spark master' $master_logfile
grep 'Registering worker' $master_logfile|tail -$num_workers

The wait-worker.sh script will signal the completion of the standalone cluster by writing corresponding information lines into the output file specified in the --output option of the batch script:

number of workers to be registered: 3
registered workers after 114 seconds :
19/08/11 12:26:25 INFO Worker: Successfully registered with master spark://gwdd034.global.gwdg.cluster:7077
19/08/11 12:27:07 INFO Worker: Successfully registered with master spark://gwdd034.global.gwdg.cluster:7077
19/08/11 12:26:29 INFO Worker: Successfully registered with master spark://gwdd034.global.gwdg.cluster:7077
19/08/11 12:26:24 INFO Master: Starting Spark master at spark://gwdd034.global.gwdg.cluster:7077
19/08/11 12:26:25 INFO Master: Registering worker 10.108.102.34:43775 with 4 cores, 8.0 GB RAM
19/08/11 12:26:29 INFO Master: Registering worker 10.108.102.46:35564 with 2 cores, 4.0 GB RAM
19/08/11 12:27:07 INFO Master: Registering worker 10.108.102.47:37907 with 2 cores, 4.0 GB RAM

Only after finding these lines in the output file, Spark applications can be submitted from the same login node from which the SLURM batch jab had been started:

gwdu102 > module load JAVA spark
gwdu102 > SPARK_CONF_DIR=~/SparkConf
gwdu102 > spark-submit  ~/SparkApp/target/scala-2.11/sparkapp_2.11-0.1.0-SNAPSHOT.jar

For executig the Spark application as a batch job, the batch sript start-standalone.sh has to be modified by replacing the command sleep infinity by

spark-submit ~/SparkApp/target/scala-2.11/sparkapp_2.11-0.1.0-SNAPSHOT.jar
$SPARK_HOME/sbin/stop-all.sh

Values for the Spark environment variables $SPARK_CONF_DIR, SPARK_LOG_DIR and SPARK_WORKER_DIR have to be supplied in the setenv.sh script. The files from successive executions of Spark applications will be kept in the SPARK_LOG_DIR and SPARK_WORKER_DIR directories, they have to be deleted by hand if no longer needed. The second group of commands in the setenv.sh scriptfile creates a file ~/.bashrc. This file is sourced by the ssh commands to set the environment on the worker nodes before starting the worker processes. Any existing ~/.bashrc file therefore should be saved before sourcing the setenv.sh scriptfile.

An introduction to the Apache Spark system can be found in the chapter user provided application documentation of GWDG's HPC documentation. under the heading Parallel Processing with Spark on GWDG’s Scientific Compute Cluster . Detailed explanations for setting up a Spark standalone cluster on resources allocated by SLURM sbatch jobs are given in the sectionsSetting up a Spark Standalone Cluster, Submitting to the Spark Standalone Cluster and Running Spark Applications on GWDG's Scientific Compute Cluster.