Running Jobs with SLURM

The following sections describe the basic concepts.

Cluster
A collection of networked computers intended to provide compute capabilities.

Node
One of these computers, also called host.

Frontend
A special node provided to interact with the cluster via shell commands. gwdu101, gwdu102 and gwdu103 are our frontends.

Task or (Job-)Slot
Compute capacity for one process (or “thread”) at a time, usually one processor core, or CPU for short.

Job
A compute task consisting of one or several parallel processes.

Batch System
The management system distributing job processes across job slots. In our case Slurm, which is operated by shell commands on the frontends.

Serial job
A job consisting of one process using one job slot.

SMP job
A job with shared memory parallelization (often realized with OpenMP), meaning that all processes need access to the memory of the same node. Consequently an SMP job uses several job slots on the same node.

MPI job
A job with distributed memory parallelization, realized with MPI. It can use several job slots on several nodes and needs to be started with mpirun or a substitute.

Partition
A label to sort jobs by general requirements and intended execution nodes. Formerly called “queue”.

The ''sbatch'' Command: Submitting Jobs to the Cluster

sbatch submits information on your job to the batch system:

  • What is to be done? (path to your program and required parameters)
  • What are the requirements? (for example partition, number of processes, maximum runtime)

Slurm then matches the job’s requirements against the capabilities of the available job slots. Once enough suitable job slots are found, the job is started. Slurm considers jobs for starting in the order of their priority.

Available Partitions

We currently have the following meta partitions, corresponding to broad application profiles:

medium
This is our general purpose partition, usable for serial and SMP jobs with up to 20 tasks, but it is especially well suited for large MPI jobs. Up to 1024 cores can be used in a single MPI job, and the maximum runtime is 48 hours.

fat
This is the partition for SMP jobs, especially those requiring lots of memory. Serial jobs with very high memory requirements also belong in this partition. Up to 64 cores and 256 GB are available on one host. Maximum runtime is 48 hours.

fat+
This partition is meant for very memory intensive jobs. For more details see below (fat-fas+ and fat-fmz+).

These are called 'meta' partitions, because they are just collections of different partitions.
If you need more fine-grained control over which kind of nodes your job runs on, you can also directly use the underlying 'real' partitions:
medium-fas - Medium nodes at Faßberg
medium-fmz - Medium nodes at the Fernmeldezentrale
fat-fas - Fat nodes at Faßberg
fat-fmz - Fat nodes at the Fernmeldezentrale

fat-fas+ and fat-fmz+
These partitions are for jobs that require more than 256 GB RAM on a single node. Nodes in the fat+ partitions have 512 GB, 1.5 TB or 2 TB of RAM. Due to the limited number of such nodes, there are restrictions on using these partitions, which you can find on the page for experienced users.

gpu - A partition for nodes containing GPUs. Please refer to the section on GPU selection below.

Available QOS

If the default time limits are not sufficient for your jobs, you can use a “Quality of Service” or QOS to modify those limits on a per job basis. We currently have two QOS.

long
Here, the maximum runtime is increased to 120 hours. Job slot availability is limited, though, so expect longer waiting times.

short
Here, the maximum runtime is decreased to two hours. In turn, this QOS has a higher base priority, but it also has limited job slot availability. That means that as long as only a few jobs are submitted with this QOS, waiting times will be minimal. It is intended for testing and development, not for massive production.
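
For example, to give a hypothetical job script (the name is just a placeholder) up to 120 hours of runtime in the medium partition, you could submit it with the long QOS:

sbatch -p medium --qos=long -t 5-00:00:00 myjob.sh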

How to submit jobs

Slurm supports different ways to submit jobs to the cluster: interactively or in batch mode. We generally recommend using the batch mode. If you need to run a job interactively, you can find information about that in the section on interactive sessions below. Batch jobs are submitted to the cluster using the 'sbatch' command and a job script or a command:

sbatch <options> [jobscript.sh | --wrap=<command>]


sbatch can take a lot of options to give more information on the specifics of your job, e.g. where to run it, how long it will take and how many nodes it needs. We will examine a few of these options in the following paragraphs. For a full list of options, refer to the manual with 'man sbatch'.

"sbatch" options

-p <partition>
Specifies in which partition the job should run. Multiple partitions can be specified in a comma separated list.

--qos=<qos>
Submit the job using a special QOS.

-t <time>
Maximum runtime of the job. If this time is exceeded the job is killed. Acceptable <time> formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds” (example: 1-12:00:00 will request 1 day and 12 hours).

-o <file>
Store the job output in <file> (otherwise it is written to slurm-<jobid>.out). %J in the filename stands for the job ID.
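
As an illustration (the job script name and values are placeholders), a submission combining these options with a 30 minute time limit and a custom output file could look like this:

sbatch -p medium --qos=short -t 30:00 -o myjob-%J.out myjob.sh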

Resource Selection

CPU Selection

-n <tasks>
The number of tasks for this job. The default is one task per node.

-N <minNodes>[-<maxNodes>]
Minimum and maximum number of nodes that the job should be executed on. If only one number is given, it is used as both the minimum and maximum node count.

--ntasks-per-node=<ntasks>
Number of tasks per node. If both -n and --ntasks-per-node are specified, this option specifies the maximum number of tasks per node.

Memory Selection

By default, your available memory per node is the default memory per task times the number of tasks you have running on that node. You can get the default memory per task by looking at the DefMemPerCPU value reported by scontrol show partition <partition>.
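
For example, to look up the default memory per task of the medium partition (a sketch; the exact output fields may differ on your system), you could run:

scontrol show partition medium | grep DefMemPerCPU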

--mem=<size[units]>
Required memory per node. The unit can be one of [K|M|G|T] and defaults to M. If your processes exceed this limit, they will be killed.

--mem-per-cpu=<size[units]>
Required memory per task instead of per node. --mem and --mem-per-cpu are mutually exclusive.

Example

-n 10 -N 2 --mem=5G Distributes a total of 10 tasks over 2 nodes and reserves 5G of memory on each node.

--ntasks-per-node=5 -N 2 --mem=5G Allocates 2 nodes and puts 5 tasks on each of them. Also reserves 5G of memory on each node.

-n 10 -N 2 --mem-per-cpu=1G Distributes a total of 10 tasks over 2 nodes and reserves 1G of memory for each task. So the memory per node depends on where the tasks are running.
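
As a job script, the first of these examples could look like the following sketch (the program name is a placeholder and is assumed to be an MPI program):

#!/bin/bash
#SBATCH -p medium
#SBATCH -t 1:00:00
#SBATCH -n 10
#SBATCH -N 2
#SBATCH --mem=5G

mpirun ./myprogram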

The GWDG Scientific Compute Cluster

This scheme shows the basic cluster setup at GWDG. The cluster is distributed across two facilities, with the “ehemalige Fernmeldezentrale” facility hosting the older resources and the shared /scratch file system and the “Faßberg” facility hosting the latest resources and the shared /scratch2 file system. The shared /scratch and /scratch2 are usually the best choices for temporary data in your jobs, but /scratch is only available at the “Fernmeldezentrale” (fmz) resources (select it with -C scratch) and /scratch2 is only available at the Faßberg (fas) resources (select it with -C scratch2). The scheme also shows the queues and resources by which nodes are selected using the -p (partition) and -C (constraint) options of sbatch.

''sbatch'': Specifying node properties with ''-C''

-C scratch[2]
The node must have access to shared /scratch or /scratch2.
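
For example, a hypothetical job that needs access to the shared /scratch file system could be submitted like this (program name and time limit are placeholders):

sbatch -p medium -C scratch -t 1:00:00 --wrap="./mytask"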

Using Job Scripts

A job script is a shell script with a special comment section: in each line beginning with #SBATCH, the following text is interpreted as an sbatch option. Here is an example:

#!/bin/bash
#SBATCH -p medium
#SBATCH -t 10:00
#SBATCH -o outfile-%J

/bin/hostname


Job scripts are submitted by the following command:

sbatch <script name>  


Exclusive jobs

An exclusive job does use all of its allocated nodes exclusively, i.e., it never shares a node with another job. This is useful if you require all of a node's memory (but not all of its CPU cores), or for SMP/MPI hybrid jobs, for example.

Do not combine --exclusive and --mem=<x>. In that case you will get all available cores on the node, but your memory will still be limited to what you specified with --mem.

To submit an exclusive job add --exclusive to your sbatch options. For example, to submit a single task job, which uses a complete fat node, you could use:

sbatch --exclusive -p fat -t 12:00:00 --wrap="./mytask"

This allocates either a complete gwda node with 256 GB, or a complete dfa node with 512 GB.

For submitting an OpenMP/MPI hybrid job with a total of 8 MPI processes, spread evenly across 2 nodes, use:

export OMP_NUM_THREADS=4
sbatch --exclusive -p medium -N 2 --ntasks-per-node=4 --wrap="mpirun ./hybrid_job"

(each MPI process creates 4 OpenMP threads in this case).
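
The same hybrid job can also be written as a job script; the following is a sketch assuming an MPI/OpenMP binary called ./hybrid_job:

#!/bin/bash
#SBATCH --exclusive
#SBATCH -p medium
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH -t 12:00:00

# 4 OpenMP threads per MPI process
export OMP_NUM_THREADS=4
mpirun ./hybrid_job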

Disk Space Options

You have the following options for attributing disk space to your jobs:

/local
This is the local hard disk of the node. It is a fast - and in the case of the gwda, gwdd, dfa, dge, dmp, dsu and dte nodes even very fast, SSD based - option for storing temporary data. There is automatic file deletion for the local disks.

/scratch
This is the shared scratch space, available on gwda and gwdd nodes and frontends gwdu101 and gwdu102. You can use -C scratch to make sure to get a node with access to shared /scratch. It is very fast, there is no automatic file deletion, but also no backup! We may have to delete files manually when we run out of space. You will receive a warning before this happens.

/scratch2
This space is the same as scratch described above except it is ONLY available on the nodes dfa, dge, dmp, dsu and dte and on the frontend gwdu103. You can use -C scratch2 to make sure to get a node with access to that space.

$HOME
Your home directory is available everywhere, permanent, and comes with backup. Your attributed disk space can be increased. It is comparably slow, however.
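
As a sketch of using the local disk for temporary data (the directory layout under /local and the program name are assumptions), a job could do the following:

# create a private temporary directory on the node-local disk
mkdir -p /local/${USER}
MYLOCAL=$(mktemp -d /local/${USER}/job.XXXXXXXX)

# run the program with its temporary files on the local disk (hypothetical option)
./mytask --tmpdir=${MYLOCAL}

# clean up before the job ends
rm -rf ${MYLOCAL}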

Recipe: Using ''/scratch''

This recipe shows how to run Gaussian09 using /scratch for temporary files:

#!/bin/bash
#SBATCH -p fat
#SBATCH -N 1
#SBATCH -n 64
#SBATCH -C scratch
#SBATCH -t 1-00:00:00

export g09root="/usr/product/gaussian"
. $g09root/g09/bsd/g09.profile

mkdir -p /scratch/${USER}
MYSCRATCH=`mktemp -d /scratch/${USER}/g09.XXXXXXXX`
export GAUSS_SCRDIR=${MYSCRATCH}

g09 myjob.com myjob.log
 
rm -rf $MYSCRATCH


Using ''/scratch2''

Currently the latest nodes do NOT have access to /scratch. They only have access to the shared /scratch2.

If you use scratch space only for storing temporary data, and do not need to access data stored previously, you can request /scratch or /scratch2:

#SBATCH -C "scratch|scratch2"

In that case, /scratch is linked to /scratch2 on the latest nodes. You can simply use /scratch/${USER} for your temporary data (don't forget to create that directory first). On the latest nodes the data will then be stored in /scratch2 via the mentioned symlink.
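
A minimal sketch of a job script using this constraint (the program name is a placeholder) could look like this:

#!/bin/bash
#SBATCH -p medium
#SBATCH -C "scratch|scratch2"
#SBATCH -t 1:00:00

# /scratch works on both kinds of nodes; on the latest nodes it is a symlink to /scratch2
mkdir -p /scratch/${USER}
MYSCRATCH=$(mktemp -d /scratch/${USER}/job.XXXXXXXX)

./mytask ${MYSCRATCH}

rm -rf ${MYSCRATCH}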

Interactive session on the nodes

As stated before, sbatch is used to submit jobs to the cluster, but there is also the srun command, which can be used to execute a task directly on the allocated nodes. That command is helpful for starting an interactive session on a node. Use an interactive session to avoid running large tests on the frontend (a good idea!). You can get an interactive session (with the bash shell) on one of the medium nodes with

srun --pty -p medium -N 1 -n 16 /bin/bash


--pty requests support for an interactive shell, and -p medium the corresponding partition. -n 16 ensures that you get 16 cores on the node. You will get a shell prompt as soon as a suitable node becomes available. Single thread, non-interactive jobs can be run with

srun -p medium ./myexecutable

GPU selection

In order to use a GPU you should submit your job to the gpu partition and request a GPU count and, optionally, the model. The CPUs of the nodes in the gpu partition are evenly distributed among the GPUs. So if you request a single GPU on a node with 20 cores and 4 GPUs, you will get 5 cores reserved exclusively for you; the same applies to memory. For example, if you want 2 GPUs of model Nvidia GeForce GTX 1080, you can submit a job script with the following flags:

#SBATCH -p gpu
#SBATCH --gres=gpu:gtx1080:2

You can also omit the model selection, here is an example of selecting 1 GPU of any available model:

#SBATCH -p gpu
#SBATCH --gres=gpu:1

Currently we have several generations of NVidia GPUs in the cluster, namely:

gtx1080 : GeForce GTX 1080 
gtx980  : GeForce GTX 980
k40     : Nvidia Tesla k40

Most GPUs are commodity graphics cards, and only provide good performance for single precision calculations. If you need double precision performance, or error correcting memory (ECC RAM), you can select the Tesla GPUs with

#SBATCH -p gpu
#SBATCH --gres=gpu:k40:2

Our Tesla K40 cards are of the Kepler generation.
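
Putting this together, a minimal sketch of a complete GPU job script (the program name is a placeholder) could look like this:

#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:gtx1080:2
#SBATCH -t 2:00:00

# show the GPUs assigned to this job
nvidia-smi

./my_gpu_task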

Miscellaneous Slurm Commands

While sbatch is arguably the most important Slurm command, you may also find the following commands useful:

sinfo
Shows the current status of the cluster and its partitions.

squeue
Lists current jobs (default: all users). Useful options are: -u $USER, -p <partition>, -j <jobid>.

scontrol show job <jobid>
Full job information. Only available while the job is running and for a short time thereafter.

squeue --start -j <jobid>
Expected start time. This is a rough estimate.

sacct -j <jobid> --format=JobID,User,UID,JobName,MaxRSS,Elapsed,Timelimit
Get job Information even after the job has finished.

scancel
Cancels jobs. Examples:
scancel 1235 - Send the termination signal (SIGTERM) to job 1235
scancel --signal=KILL 1235 - Send the kill signal (SIGKILL) to job 1235
scancel --state=PENDING --user=$USER --partition=medium-fmz - Cancel all your pending jobs in partition medium-fmz

Have a look at the respective man pages of these commands to learn more about them!

LSF to Slurm Conversion Guide

This is a short guide on how to convert the most common options in your jobscripts from LSF to Slurm.

Description                     | LSF                  | Slurm                  | Comment
Submit job                      | bsub < job.sh        | sbatch job.sh          | No < in Slurm!
Scheduler comment in job script | #BSUB -…             | #SBATCH -…             |
Queue/Partition                 | -q <queue>           | -p <partition>         |
Walltime                        | -W 48:00             | -t 2-00:00:00          | -t 48:00 means 48 min.
Stdout                          | -o <outfile>         | -o <outfile>           | %J substituted for JobID
Stderr                          | -e <errfile>         | -e <errfile>           | %J substituted for JobID
#Jobslots                       | -n #                 | -n #                   |
One host                        | -R "span[hosts=1]"   | -N 1                   |
Process distribution            | -R "span[ptile=<x>]" | --ntasks-per-node=<x>  |
Exclusive node                  | -x                   | --exclusive            |
Scratch                         | -R scratch[2]        | -C "scratch[2]"        |

Queue → Partition Conversion

General purpose                 | -q mpi               | -p medium              |
                                | -q mpi-short         | -p medium --qos=short  |
                                | -q mpi-long          | -p medium --qos=long   |
                                | -q fat               | -p fat                 |
                                | -q fat-short         | -p fat --qos=short     |
                                | -q fat-long          | -p fat --qos=long      |
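
As a worked example (queue, core count, walltime and the executable are placeholders), a simple single-node LSF job script like this:

#!/bin/bash
#BSUB -q fat
#BSUB -n 16
#BSUB -R "span[hosts=1]"
#BSUB -W 24:00
#BSUB -o outfile-%J

./myexecutable

would become, after conversion to Slurm:

#!/bin/bash
#SBATCH -p fat
#SBATCH -n 16
#SBATCH -N 1
#SBATCH -t 1-00:00:00
#SBATCH -o outfile-%J

./myexecutable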

Getting Help

The following sections show you where you can get status information and support in case of problems.

Information sources

Using the GWDG Support Ticket System

Write an email to hpc@gwdg.de. In the body:

  • State that your question is related to the batch system.
  • State your user id ($USER).
  • If you have a problem with your jobs please always send the complete standard output and error!
  • If you have a lot of failed jobs send at least two outputs. You can also list the jobids of all failed jobs to help us even more with understanding your problem.
  • If you don’t mind us looking at your files, please state this in your request. You can limit your permission to specific directories or files.

Not yet migrated to SLURM

MPI jobs

Note that a single thread job submitted like above will share its execution host with other jobs. It is therefore expected that it does not use more than the memory available per core! On the mpi nodes this amount is 4 GB, as well as on the newer fat nodes. If your job requires more, you must assign additional cores. For example, if your single thread job requires 64 GB of memory, you must submit it like this:

bsub -q mpi -n 16 ./myexecutable


OpenMPI jobs can be submitted as follows:

bsub -q mpi -n 256 -a openmpi mpirun.lsf ./myexecutable


For Intel MPI jobs it suffices to use -a intelmpi instead of -a openmpi. Please note that LSF will not load the correct modules (compiler, library, MPI) for you. You either have to do that before executing bsub, in which case your setup will be copied to the execution hosts, or you will have to use a job script and load the required modules there.

A new feature in LSF is pinning support. Pinning (in its most basic version) means instructing the operating system to not apply its standard scheduling algorithms to your workloads, but instead keep processes on the CPU core they have been started on. This may significantly improve performance for some jobs, especially on the fat nodes with their high CPU core count. Pinning is managed via the MPI library, and currently only OpenMPI is supported. There is not much experience with this feature, so we are interested in your feedback. Here is an example:

bsub -R "select[np16] span[ptile=16] affinity[core(1):cpubind=core]" -q mpi -n 256 -a openmpi mpirun.lsf ./myexecutable


The affinity string “affinity[core(1):cpubind=core]” means that each task is using one core and that the binding should be done based on cores (as opposed to sockets, NUMA units, etc). Because this example is for a pure MPI application, x in core(x) is one. In an SMP/MPI hybrid job, x would be equal to the number of threads per task (e.g., equal to OMP_NUM_THREADS for OpenMP/MPI hybrid jobs).
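
For instance, for an OpenMP/MPI hybrid job with 4 threads per task on the same 16-core nodes, the core(x) value and the ptile would change accordingly; this is an untested sketch based on the example above (the executable is a placeholder):

export OMP_NUM_THREADS=4
bsub -R "select[np16] span[ptile=4] affinity[core(4):cpubind=core]" -q mpi -n 64 -a openmpi mpirun.lsf ./myexecutable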

SMP jobs

Shared memory parallelized jobs can be submitted with

bsub -q mpi -n 8,20 -R 'span[hosts=1]' -a openmp ./myexecutable


The span option is required; without it, LSF will assign cores to the job from several nodes if that is advantageous from the scheduling perspective.

Using the fat+ queue

Nodes with a lot of memory are very expensive and should not normally be used for jobs which could also run on our other nodes. Therefore, please note the following policies:

  • Your job must need more than 250 GB RAM.
  • Your job must use at least a full 512 GB node or half a 1.5 TB or 2 TB node:
  • For a full 512 GB node:
#BSUB -x
#BSUB -R "maxmem < 600000"
  • For half a 1.5 TB node (your job needs more than 500 GB RAM):
#BSUB -n 20
#BSUB -R span[hosts=1]
#BSUB -R "maxmem < 1600000 && maxmem > 600000"
  • For a full 1.5 TB node (your job needs more than 700 GB RAM):
#BSUB -x
#BSUB -R "maxmem < 1600000 && maxmem > 600000"
  • For half a 2 TB node (your job needs more than 700 GB RAM):
#BSUB -n 16
#BSUB -R span[hosts=1]
#BSUB -R "maxmem > 1600000"
  • For a full 2 TB node (your job needs more than 1.5 TB RAM):
#BSUB -x
#BSUB -R "maxmem > 1600000"

The 512 GB nodes are also available in the fat queue, without these restrictions. However, fat jobs on these nodes have a lower priority compared to fat+ jobs.

CPU architecture selection

Our cluster provides four generations of Intel CPUs and two generations of AMD CPUs. However, the main difference between these CPU types is whether they support Intel's AVX2 or not. For selecting this we have introduced the x64inlvl (for x64 instruction level) label:

x64inlvl=1 : Supports only AVX
x64inlvl=2 : Supports AVX and AVX2

In order to choose an AVX2 capable node you therefore have to include

#BSUB -R "x64inlvl=2"

in your submission script.

If you need to be more specific, you can also directly choose the CPU generation:

amd=1 : Interlagos
amd=2 : Abu Dhabi

intel=1 : Sandy Bridge
intel=2 : Ivy Bridge
intel=3 : Haswell
intel=4 : Broadwell

So, in order to choose any AMD CPU:

#BSUB -R amd

In order to choose an Intel CPU of at least Haswell generation:

#BSUB -R "intel>=3"

This is equivalent to x64inlvl=2.

Memory selection

Note that the following paragraph is about selecting nodes with enough memory for a job. The mechanism to actually reserve that memory does not change: The memory you are allowed to use equals memory per core times slots (-n option) requested.

You can select a node either by currently available memory (mem) or by maximum available memory (maxmem). If you request complete nodes, the difference is actually very small, as a free node's available memory is close to its maximum memory. All requests are in MB.

To select a node with more than about 500 GB available memory use:

#BSUB -R "mem>500000"

To select a node with more than about 6 GB maximum memory per core use:

#BSUB -R "maxmem/ncpus>6000"

(Yes, you can do basic math in the requirement string!)

It bears repeating: None of the above is a memory reservation. If you actually want to reserve “mem” memory, the easiest way is to combine -R "mem>…" with -x for an exclusive job.
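
For example (queue, walltime and program are placeholders), such an exclusive job on a node with more than about 500 GB of available memory could be submitted as:

bsub -x -R "mem>500000" -q fat -W 48:00 ./myexecutable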

Finally, note that the -M option just denotes the memory limit of your job per core (in KB). This is of no real consequence, as we do not enforce these limits and it has no influence on the host selection.

Besides the options shown in this article, you can of course use the options for controlling walltime limits (-W), output (-o), and your other requirements as usual. You can also continue to use job scripts instead of the command line (with the #BSUB <option> <value> syntax).

Please consult the LSF man pages if you need further information.

