OUTDATED Running Jobs

For the new version, see: Running Jobs Slurm

This section describes the basic concepts.

Cluster
A collection of networked computers intended to provide compute capabilities.

Node
One of these computers, also called host.

Frontend
A special node provided to interact with the cluster via shell commands. gwdu101 and gwdu102 are our frontends.

Job Slot
Compute capacity for one process (or “thread”) at a time, usually one processor core, “core” for short.

Job
A compute task consisting of one or several parallel processes.

Batch System
The management system distributing job processes across job slots. In our case Platform LSF, which is operated by shell commands on the frontends.

Serial job
A job consisting of one process using one job slot.

SMP job
A job with shared memory parallelization (often realized with OpenMP), meaning that all processes need access to the memory of the same node. Consequently an SMP job uses several job slots on the same node.

MPI job
A job with distributed memory parallelization, realized with MPI. It can use several job slots on several nodes and needs to be started with mpirun or a substitute.

Queue
A label to sort jobs by general requirements and intended execution nodes.

The ''bsub'' Command: Submitting Jobs to the Cluster

bsub submits information on your job to the batch system:

  • What is to be done? (path to your program and required parameters)
  • What are the requirements? (for example queue, process number, maximum runtime)

LSF then matches the job’s requirements against the capabilities of available job slots. Once sufficient suitable job slots are found, the job is started. LSF considers jobs to be started in the order of their priority.

Available Queues

We have three generally available base queues, corresponding to broad application profiles:

interactive queue

mpi
This is our general purpose queue, usable for serial and SMP jobs with up to 20 tasks, but it is especially well suited for large MPI jobs. Up to 1024 cores can be used in a single MPI job, and the maximum runtime is 48 hours.

fat
This is the queue for SMP jobs, especially those requiring lots of memory. Serial jobs with very high memory requirements also belong in this queue. Up to 64 cores and 256 GB are available on one host. Maximum runtime is 48 hours.

Both the “mpi” and “fat” queues are also available in variants, e.g. “mpi-long”, corresponding to special runtime requirements:

-long
Here, the maximum runtime is increased to 120 hours. Job slot availability is limited, though, so expect longer waiting times.

-short
Here, the maximum runtime is decreased to two hours. In turn the queue has a higher base priority, but it also has limited job slot availability. That means that as long as only a few jobs are submitted to the “-short” queues, there will be minimal waiting times. These queues are intended for testing and development, not for massive production.

fat+
This queue is for jobs that require more than 256 GB RAM on a single node. Nodes in the fat+ queue have 512 GB, 1.5 TB, or 2 TB of RAM. Because there are only a few such nodes, use of the queue is restricted; the restrictions are listed on the page for experienced users.

''bsub'' Syntax and Usage

bsub <bsub options> [mpirun.lsf] <path to program> <program parameters>

''bsub'' options for serial jobs

-q <queue>
Submission queue.

-W <hh:mm>
Maximum runtime. If this time is exceeded the job is killed.

-o <file>
Store the job output in “file” (otherwise it is sent by email). %J in the filename stands for the jobid.
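
As a minimal sketch of how the serial options combine, the following snippet builds a submission command and prints it for inspection. The program name ./myprog and its argument input.dat are hypothetical placeholders, and the queue and runtime values are only examples.

```shell
# Dry-run sketch: build the submission command and print it before using it.
# ./myprog and input.dat are hypothetical placeholders.
QUEUE=mpi          # submission queue (-q)
RUNTIME=00:10      # maximum runtime as hh:mm (-W)

# %J is replaced by LSF with the job id, so the output lands in out.<jobid>.
SUBMIT_CMD="bsub -q $QUEUE -W $RUNTIME -o out.%J ./myprog input.dat"

echo "$SUBMIT_CMD"
# On a frontend, run the printed command to submit the job.
```

Printing the command first is just a convenient way to check the options; on a frontend you would normally type the bsub line directly.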

''bsub'' options for parallel (SMP or MPI) jobs

-n <min>,<max>
The minimum and maximum process count. If “max” is left out, “min” is the exact number of job slots required.

-a <wrapper>
This option denotes a wrapper script required to run SMP or MPI jobs. The most important wrappers are openmp (for SMP jobs), intelmpi (for MPI jobs using the Intel MPI library), and openmpi (for MPI jobs using the OpenMPI library).

mpirun.lsf
LSF’s substitute for mpirun. In MPI jobs mpirun.lsf needs to be put in front of the program path.

''bsub'': Specifying process distribution with ''-R''

-R span[hosts=1]
This puts all processes on one host. You always want to use this with SMP jobs.

-R span[ptile=<x>]
x denotes the exact number of job slots to be used on each host. If the total process number is not divisible by x, the residual processes will be put on one host.
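
The distribution rule for span[ptile=<x>] can be checked with simple integer arithmetic. The following sketch uses illustrative values (20 processes, 8 slots per host) that are assumptions, not recommendations.

```shell
n=20   # total number of processes requested with -n (example value)
x=8    # job slots per host requested with span[ptile=8] (example value)

full_hosts=$(( n / x ))   # hosts that receive exactly x processes
residual=$(( n % x ))     # leftover processes, placed together on one more host

echo "$full_hosts hosts with $x processes each, $residual residual processes on one more host"
```

With these values the job occupies two hosts with 8 processes each and puts the remaining 4 processes on a third host.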

-R span[ptile='!']
-R same[model]
With this special notation, x will become the maximum number of cores available on the node type used for the job. In other words, using '!' will acquire all job slots on all nodes the job runs on, provided the total number of job slots requested is divisible by x (otherwise the residual will run on one shared host).

The GWDG Scientific Compute Cluster

This scheme shows the basic cluster setup at GWDG. The cluster is distributed across two facilities, with the “ehemalige Fernmeldezentrale” facility hosting the older resources and the shared /scratch file system and the “Faßberg” facility hosting the latest resources and the shared /scratch2 file system. The shared /scratch2 is usually the best choice for temporary data in your jobs, but it is only available at the Faßberg resources (selectable with -R scratch2). The scheme also shows the queues and resources by which nodes are selected using the -q and -R options of bsub.

''bsub'': Specifying node properties with ''-R''

-R scratch[2]
The node must have access to shared /scratch or /scratch2.

-R work
The node must have access to one of the shared /work directories.

-R "ncpus=<x>"
Choose only nodes with a job slot count of x. This is useful with span[ptile=<x>].

-R big
Choose the nodes with the maximum memory per core available in the queue. Currently only distinguishes gwdaxxx from gwdpxxx nodes.

-R latest
Always use the latest (and usually most powerful) nodes available in the queue. To get a list of current latest nodes run the command bhosts -R latest on one of the frontends. You can also check the Latest Nodes page for more information.

''bsub'': Using Job Scripts

A job script is a shell script with a special comment section: In each line beginning with #BSUB the following text is interpreted as a bsub option. Here is an example:

#BSUB -q mpi
#BSUB -W 00:10
#BSUB -o out.%J


Job scripts are submitted by the following command:

bsub < <script name>  
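
The following sketch shows one way to write such a job script to a file and check its payload locally before submitting it. The file name myjob.sh and the payload command (uname -n, which prints the node name) are illustrative assumptions. Since the #BSUB lines are ordinary comments to the shell, the script also runs without LSF.

```shell
# Write a minimal job script. The quoted 'EOF' keeps the content literal.
cat > myjob.sh <<'EOF'
#!/bin/sh
#BSUB -q mpi
#BSUB -W 00:10
#BSUB -o out.%J

# Payload: report the name of the node the job runs on.
uname -n
EOF

# Local sanity check of the payload (no LSF needed).
sh myjob.sh

# On a frontend, submit the script with:
#   bsub < myjob.sh
```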

Exclusive jobs

An exclusive job uses all of its allocated nodes exclusively, i.e., it never shares a node with another job. This is useful if you require all of a node's memory (but not all of its CPU cores), or for SMP/MPI hybrid jobs, for example.

To submit an exclusive job add -x to your bsub options. For example, to submit a single task job, which uses a complete fat node with 256 GB memory, you could use:

bsub -x -q fat -R big ./mytask

(-R big requests a 256 GB node, excluding the 128 GB nodes in the fat queue)

For submitting an OpenMP/MPI hybrid job with a total of 8 MPI processes, spread evenly across 2 nodes, use:

bsub -x -q mpi -n 8 -R span[ptile=4] -a intelmpi mpirun.lsf ./hybrid_job

(each MPI process creates 4 OpenMP threads in this case).

Please note that fairshare evaluation and accounting are based on the number of job slots allocated. So the first example would count as 64 slots for both fairshare and accounting.

Using exclusive jobs does not require reserving all of a node's slots explicitly (e.g., with span[ptile='!']) and subsequently using the MPI library's mpiexec or mpiexec.hydra to set the process number, as we explain in our introductory course. This makes submitting a hybrid job as exclusive job more straightforward.

However, there is a disadvantage: LSF will not reserve the additional job slots required to get a node exclusively. Therefore, when the cluster is very busy, an exclusive job needing a lot of nodes may wait significantly longer.

A Note On Job Memory Usage

LSF will try to fill up each node with processes up to its job slot limit. Therefore each process in your job must not use more memory than available per core! If your per core memory requirements are too high, you have to add more job slots in order to allow your job to use the memory of these slots as well. If your job's memory usage increases with the number of processes, you have to leave additional job slots empty, i.e., do not run processes on them.
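
The rule above can be turned into a small calculation. This sketch assumes a fat node with 256 GB and 64 job slots (the figures given for the fat queue) and a hypothetical process that needs 10 GB; adapt the numbers to your own job.

```shell
node_mem_gb=256   # memory of one node
node_slots=64     # job slots (cores) on that node
proc_mem_gb=10    # hypothetical memory requirement of one process

# Memory available per job slot if the node is filled completely.
mem_per_slot=$(( node_mem_gb / node_slots ))

# Slots to reserve per process (ceiling division).
slots_per_proc=$(( (proc_mem_gb + mem_per_slot - 1) / mem_per_slot ))

# Processes that fit on the node when the extra slots stay empty.
max_procs=$(( node_slots / slots_per_proc ))

echo "memory per slot: ${mem_per_slot} GB"
echo "slots per process: ${slots_per_proc}"
echo "processes per node: ${max_procs}"
```

Here each slot carries 4 GB, so a 10 GB process needs 3 slots, and at most 21 such processes fit on one node.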

Recipe: Reserving Memory for OpenMP

The following job script recipe demonstrates using empty job slots for reserving memory for OpenMP jobs:

#BSUB -q fat
#BSUB -W 00:10
#BSUB -o out.%J
#BSUB -n 64
#BSUB -R big
#BSUB -R "span[hosts=1]"
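
The directives above reserve all 64 job slots of one node. A minimal sketch of how the script body might continue is given below, assuming the program should run with only 8 OpenMP threads so that each thread can use the memory share of 8 slots. The thread count and the program name ./my_openmp_prog are illustrative assumptions.

```shell
# Run fewer threads than reserved slots; the remaining slots stay empty
# and only contribute their memory. The value 8 is an assumed example.
export OMP_NUM_THREADS=8

# ./my_openmp_prog is a hypothetical executable; replace it with your program.
if [ -x ./my_openmp_prog ]; then
    ./my_openmp_prog
fi
```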


Disk Space Options

You have the following options for attributing disk space to your jobs:

Local disk
This is the local hard disk of the node. It is a fast option for storing temporary data, and on the gwda, gwdd, dfa, dge, dmp, dsu and dte nodes a very fast, SSD-based one. Files on the local disks are deleted automatically.

/scratch
This is the shared scratch space, available on gwda and gwdd nodes and frontends gwdu101 and gwdu102. You can use -R scratch to make sure to get a node with access to shared /scratch. It is very fast, there is no automatic file deletion, but also no backup! We may have to delete files manually when we run out of space. You will receive a warning before this happens.

/scratch2
This space is the same as /scratch described above, except that it is ONLY available on the nodes dfa, dge, dmp, dsu and dte and on the frontend gwdu103. You can use -R scratch2 to make sure to get a node with access to that space.

$HOME
Your home directory is available everywhere, permanent, and comes with backup. Your attributed disk space can be increased. It is comparatively slow, however.

Recipe: Using ''/scratch''

This recipe shows how to run Gaussian09 using /scratch for temporary files:

#BSUB -q fat
#BSUB -n 64
#BSUB -R "span[hosts=1]"
#BSUB -R scratch
#BSUB -W 24:00
#BSUB -C 0
#BSUB -a openmp

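export g09root="/usr/product/gaussian"
. $g09root/g09/bsd/g09.profile

# Create a private temporary directory on /scratch and point Gaussian's scratch files to it.
mkdir -p /scratch/${USER}
MYSCRATCH=`mktemp -d /scratch/${USER}/g09.XXXXXXXX`
export GAUSS_SCRDIR=${MYSCRATCH}

g09 myjob.com myjob.log

# Remove the temporary directory once Gaussian has finished.
rm -rf ${MYSCRATCH}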

Using ''/scratch2''

Currently the latest nodes do NOT have access to /scratch. They only have access to the shared /scratch2.

If you use scratch space only for storing temporary data, and do not need to access data stored previously, you can request /scratch or /scratch2:

#BSUB -R "scratch||scratch2"

In that case /scratch2 is linked to /scratch on the latest nodes. You can just use /scratch/${USER} for the temporary data (don't forget to create it on /scratch2). On the latest nodes the data will then be stored in /scratch2 via the mentioned symlink.
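
A minimal sketch of a job script fragment that works with either scratch file system is shown below. The fallback to a local temporary directory exists only so the snippet can be tried outside the cluster; it is an assumption of this example and not part of the GWDG setup. The directory prefix job. is also just an example.

```shell
#BSUB -R "scratch||scratch2"

# Use /scratch where it exists (on the latest nodes it links to /scratch2).
# Off-cluster fallback for trying the sketch elsewhere (assumption).
SCRATCH_ROOT=/scratch
[ -d "$SCRATCH_ROOT" ] && [ -w "$SCRATCH_ROOT" ] || SCRATCH_ROOT=${TMPDIR:-/tmp}

# Make sure USER is set, then create a private, uniquely named directory.
: "${USER:=$(id -un)}"
mkdir -p "$SCRATCH_ROOT/$USER"
MYSCRATCH=$(mktemp -d "$SCRATCH_ROOT/$USER/job.XXXXXXXX")
echo "temporary data goes to $MYSCRATCH"

# Remove the directory when the script exits, since /scratch is not cleaned automatically.
trap 'rm -rf "$MYSCRATCH"' EXIT
```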

Miscellaneous LSF Commands

While bsub is arguably the most important LSF command, you may also find the following commands useful:

bjobs
Lists current jobs. Useful options are: -p, -l, -a, <jobid>, -u all, -q <queue>, -m <host>.

bhist
Lists older jobs. Useful options are: -l, -n, <jobid>.

bhosts
Status of cluster nodes. Useful options are: -l, <hostname>.

bqueues
Status of batch queues. Useful options are: -l, <queue>.

bhpart
Why do I have to wait? bhpart shows current user priorities. Useful options are: -r, <host partition>.

bkill
The Final Command. It has two use modes:

  1. bkill <jobid>: This kills a job with a specific jobid.
  2. bkill <selection options> 0: This kills all jobs fitting the selection options. Useful selection options are: -q <queue>, -m <host>.

Have a look at the respective man pages of these commands to learn more about them!

Getting Help

The following sections show you where you can get status information and where you can get support in case of problems.

Information sources

Using the GWDG Support Ticket System

Write an email to hpc@gwdg.de. In the body:

  • State that your question is related to the batch system.
  • State your user id ($USER).
  • If you have a problem with your jobs please always send the complete standard output and error!
  • If you have a lot of failed jobs send at least two outputs. You can also list the jobids of all failed jobs to help us even more with understanding your problem.
  • If you don’t mind us looking at your files, please state this in your request. You can limit your permission to specific directories or files.
