====== Running Jobs with SLURM ======
  
In the following, the basic concepts will be described.
  
**Frontend**\\
A special node provided to interact with the cluster via shell commands. gwdu101, gwdu102 and gwdu103 are our frontends.
  
**Task or (Job-)Slot**\\
  
**Batch System**\\
The management system distributing job processes across job slots. In our case [[https://slurm.schedmd.com|Slurm]], which is operated by shell commands on the frontends.
  
**Serial job**\\
  
**MPI job**\\
A job with distributed memory parallelization, realized with MPI. Can use several job slots on several nodes and needs to be started with ''mpirun'' or the Slurm substitute ''srun''.
  
**Partition**\\
  
**medium**\\
This is our general purpose partition, usable for serial and SMP jobs with up to 24 tasks, but it is especially well suited for large MPI jobs. Up to 1024 cores can be used in a single MPI job, and the maximum runtime is 48 hours.
  
**fat**\\
This is the partition for SMP jobs, especially those requiring lots of memory. Serial jobs with very high memory requirements also belong in this partition. Up to 64 cores and up to 512 GB are available on one host. Maximum runtime is 48 hours.\\
The nodes of the fat+ partition are also present in this partition, but will only be used if they are not needed for bigger jobs submitted to the fat+ partition.
  
**fat+**\\
This partition is meant for very memory intensive jobs that require more than 512 GB RAM on a single node. Nodes of the fat+ partition have 1.5 and 2 TB RAM. You are required to specify your memory needs on job submission to use these nodes (see [[en:services:application_services:high_performance_computing:running_jobs_slurm#resource_selection|resource selection]]).\\
As general advice: Try your jobs on the smaller nodes in the fat partition first, work your way up, and don't be afraid to ask for help here.
  
**gpu** - A partition for nodes containing GPUs. Please refer to [[en:services:application_services:high_performance_computing:running_jobs_slurm#gpu_selection|GPU selection]].
  
====  Runtime limits (QoS)  ====
If the default time limits are not sufficient for your jobs, you can use a "Quality of Service" or **QOS** to modify those limits on a per job basis. We currently have two QOS.
  
Jobs can be run interactively or in batch mode. We generally recommend using the batch mode. If you need to run a job interactively, you can find information about that in the [[en:services:application_services:high_performance_computing:running_jobs_slurm#interactive_session_on_the_nodes|corresponding section]].
Batch jobs are submitted to the cluster using the 'sbatch' command and a jobscript or a command.\\
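For example (''jobscript.sh'' and ''./myprogram'' are placeholders):
<code>
# submit a jobscript
sbatch jobscript.sh
# or wrap a single command
sbatch --wrap="./myprogram"
</code>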
You can find more options in the manual of the command with 'man sbatch'.
  
====  "sbatch" options  ====

**<nowiki>-A all</nowiki>**\\
Specifies the account 'all' for the job. This option is //mandatory// for users who have access to special hardware and want to use the general partitions.
  
**<nowiki>-p <partition></nowiki>**\\
Submit the job to the specified partition.
  
**<nowiki>--qos=<qos></nowiki>**\\
Submit the job using a special QOS.
  
**<nowiki>-t <time></nowiki>**\\
Maximum runtime of the job. If this time is exceeded the job is killed. Acceptable <time> formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds" (example: 1-12:00:00 will request 1 day and 12 hours).
  
**<nowiki>-o <file></nowiki>**\\
Store the job output in "file" (otherwise written to slurm-<jobid>). ''%J'' in the filename stands for the jobid.\\

**<nowiki>--noinfo</nowiki>**\\
Some meta-information about your job will be added to your output file. If you do not want that, you can suppress it with this flag.\\

**<nowiki>--mail-type=[ALL|BEGIN|END]</nowiki>\\
<nowiki>--mail-user=your@mail.com</nowiki>** \\
Receive mails when the job starts, ends, or both. There are even more options; refer to the sbatch man-page for more information about mail types. If you have a GWDG mail address, you do not need to specify the mail-user.\\
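For example, combining these options on the command line (''./myprogram'' is a placeholder):
<code>
sbatch -p medium -t 1-12:00:00 -o myjob-%J.out --mail-type=END --wrap="./myprogram"
</code>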
  
====  Resource Selection  ====
  
**<nowiki>-n <tasks></nowiki>**\\
The number of tasks for this job. The default is one task per node.
  
**<nowiki>-c <cpus per task></nowiki>**\\
The number of cpus per task. The default is one cpu per task.

**<nowiki>-c vs -n</nowiki>**\\
As a rule of thumb, if you run your code on a single node, use -c. For multi-node MPI jobs, use -n (see the example below).\\

**<nowiki>-N <minNodes[,maxNodes]></nowiki>**\\
Minimum and maximum number of nodes that the job should be executed on. If only one number is specified, it is used as the precise node count.\\
  
**<nowiki>--ntasks-per-node=<ntasks></nowiki>**\\
Number of tasks per node. If -n and <nowiki>--ntasks-per-node</nowiki> are both specified, this option specifies the maximum number of tasks per node.
\\
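A sketch contrasting the two (program names are placeholders): an SMP job with 8 cpus on one node, and an MPI job with 40 tasks across two nodes:
<code>
sbatch -p medium -N 1 -c 8 --wrap="./smp_program"
sbatch -p medium -n 40 -N 2 --wrap="mpirun ./mpi_program"
</code>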
=== Memory Selection ===
  
By default, your available memory per node is the default memory per task times the number of tasks you have running on that node. You can get the default memory per task by looking at the DefMemPerCPU metric as reported by ''scontrol show partition <partition>''.
  
  
**<nowiki>--mem-per-cpu=<size[units]></nowiki>**\\
Required memory per task instead of per node. <nowiki>--mem</nowiki> and <nowiki>--mem-per-cpu</nowiki> are mutually exclusive.\\
=== Example ===
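A minimal sketch (values and program name are illustrative): requesting 4 tasks on one node with 3 GB of memory each, i.e. 12 GB on the node:
<code>
sbatch -p medium -n 4 -N 1 --mem-per-cpu=3G --wrap="mpirun ./myprogram"
</code>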
  
{{ :en:services:scientific_compute_cluster:nodes-slurm.png?1000 |}}
  
This scheme shows the basic cluster setup at GWDG. The cluster is distributed across two facilities, with the "ehemalige Fernmeldezentrale" facility hosting the older resources and the shared /scratch file system and the "Faßberg" facility hosting the latest resources and the shared /scratch2 file system. The shared /scratch and /scratch2 are usually the best choices for temporary data in your jobs, but /scratch is only available at the "Fernmeldezentrale" (fmz) resources (select it with ''-C scratch'') and /scratch2 is only available at the Faßberg (fas) resources (select it with ''-C scratch2''). The scheme also shows the queues and resources by which nodes are selected using the ''-p'' (partition) and ''-C'' (constraint) options of ''sbatch''.
  
====  ''sbatch'': Specifying node properties with ''-C''  ====
  
**-C scratch[2]**\\
The node must have access to shared ''/scratch'' or ''/scratch2''.

**-C fmz / -C fas**\\
The node has to be at that location. This is pretty similar to -C scratch / -C scratch2, since the nodes in the FMZ have access to scratch and those at the Faßberg location have access to scratch2. This is mainly for easy compatibility with our old partition naming scheme.

**-C [architecture]**\\
Request a specific CPU architecture. Available options are: abu-dhabi, ivy-bridge, haswell, broadwell. See [[en:services:application_services:high_performance_computing:start#hardware_overview|this table]] for the corresponding nodes.
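Constraints can be combined with ''&'' (see the sbatch man page); for example, to request a broadwell node with access to ''/scratch2'' (the program is a placeholder):
<code>
sbatch -p medium -C "scratch2&broadwell" --wrap="./myprogram"
</code>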
  
  
====  Using Job Scripts  ====
  
A job script is a shell script with a special comment section: In each line beginning with ''#SBATCH'' the following text is interpreted as a ''sbatch'' option. These options have to be at the top of the script before any other commands are executed. Here is a minimal example (partition, runtime and program name are illustrative):
  
<code>
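#!/bin/bash
# A minimal sketch of a jobscript; the values below are placeholders.
#SBATCH -p medium
#SBATCH -t 10:00
#SBATCH -o job-%J.out

./myprogram
</code>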
A hybrid MPI/OpenMP job can, for example, be submitted with:
<code>
export OMP_NUM_THREADS=4
sbatch --exclusive -p medium -N 2 --ntasks-per-node=4 --wrap="mpirun ./hybrid_job"
</code>
(each MPI process creates 4 OpenMP threads in this case).
  
**/local**\\
This is the local hard disk of the node. It is a fast - and in the case of the ''gwda, gwdd, dfa, dge, dmp, dsu'' and ''dte'' nodes even very fast, SSD based - option for storing temporary data. There is automatic file deletion for the local disks.\\
A directory is automatically created for each job at ''/local/jobs/<jobID>'' and the path is exported as the environment variable ''$TMP_LOCAL''.
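A sketch of how this can be used in a jobscript (the program and file names are placeholders):
<code>
#!/bin/bash
#SBATCH -p medium
#SBATCH -t 2:00:00

# Work in the job's local scratch directory; /local is cleaned automatically
cd "$TMP_LOCAL"
"$SLURM_SUBMIT_DIR"/myprogram > result.dat

# Copy results to permanent storage before the job ends
cp result.dat "$SLURM_SUBMIT_DIR"/
</code>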
  
**/scratch**\\
For that case ''/scratch2'' is linked to ''/scratch'' on the latest nodes. You can just use ''/scratch/${USERID}'' for the temporary data (don't forget to create it on ''/scratch2''). On the latest nodes data will then be stored in ''/scratch2'' via the mentioned symlink.
  
==== Interactive session on the nodes ====

As stated before, ''sbatch'' is used to submit jobs to the cluster, but there is also the ''srun'' command, which can be used to execute a task directly on the allocated nodes. That command is helpful for starting an interactive session on a node. For example, to avoid running large tests on the frontend (a good idea!), you can get an interactive session (with the bash shell) on one of the ''medium'' nodes with

<code>srun --pty -p medium -N 1 -c 16 /bin/bash</code>
\\
''<nowiki>--pty</nowiki>'' requests support for an interactive shell, and ''-p medium'' the corresponding partition. ''-c 16'' ensures that you get 16 cpus on the node. You will get a shell prompt as soon as a suitable node becomes available. Single thread, non-interactive jobs can be run with
<code>srun -p medium ./myexecutable</code>

==== GPU selection ====

In order to use a GPU you should submit your job to the ''gpu'' partition and request a GPU count and optionally the model. The CPUs of each node in the gpu partition are evenly distributed among its GPUs. So if you request a single GPU on a node with 20 cores and 4 GPUs, you can get up to 5 cores reserved exclusively for you; the same applies to the memory. For example, if you want 2 GPUs of model Nvidia GeForce GTX 1080 with 10 CPUs, you can submit a job script with the following flags:
<code>
#SBATCH -p gpu
#SBATCH -n 10
#SBATCH -G gtx1080:2
</code>

You can also omit the model selection; here is an example of selecting 1 GPU of any available model:
<code>
#SBATCH -p gpu
#SBATCH -n 10
#SBATCH -G 1
</code>

There are different options to select the number of GPUs, such as ''%%--gpus-per-node%%'', ''%%--gpus-per-task%%'' and more. See the [[https://slurm.schedmd.com/sbatch.html|sbatch man page]] for details.
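As an illustrative sketch (node and GPU counts are placeholders), requesting 2 GPUs on each of 2 nodes with ''%%--gpus-per-node%%'':
<code>
#SBATCH -p gpu
#SBATCH -N 2
#SBATCH --gpus-per-node=2
</code>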

Currently we have several generations of NVidia GPUs in the cluster, namely:

<code>
gtx1080 : GeForce GTX 1080
gtx980  : GeForce GTX 980
k40     : Nvidia Tesla K40
</code>

Most GPUs are commodity graphics cards, and only provide good performance for single precision calculations. If you need double precision performance, or error correcting memory (ECC RAM), you can select the Tesla GPUs with
<code>
#SBATCH -p gpu
#SBATCH -G k40:2
</code>
Our Tesla K40 are of the Kepler generation.

<code>sinfo -p gpu --format=%N,%G</code> shows a list of hosts with GPUs, as well as their type and count.

=====  Miscellaneous Slurm Commands  =====
  
While ''sbatch'' is arguably the most important Slurm command, you may also find the following commands useful:
  
**<nowiki>sacct -j <jobid> --format=JobID,User,UID,JobName,MaxRSS,Elapsed,Timelimit</nowiki>**\\
Get job information even after the job has finished.\\
**Note on ''sacct'':** Depending on the parameters given, ''sacct'' chooses a time window in a rather unintuitive way. This is documented in the DEFAULT TIME WINDOW section of its man page. If you unexpectedly get no results from your ''sacct'' query, try specifying the start time with, e.g. ''<nowiki>-S 2019-01-01</nowiki>''.\\
The ''<nowiki>--format</nowiki>'' option knows many more fields, like **Partition**, **Start**, **End** or **State**; for the full list refer to the man page.
  
**<nowiki>scancel</nowiki>**\\
Cancels jobs. Examples:\\
''<nowiki>scancel 1235</nowiki>'' - Send the termination Signal (SIGTERM) to job 1235\\
''<nowiki>scancel --signal=KILL 1235</nowiki>'' - Send the kill Signal (SIGKILL) to job 1235\\
''<nowiki>scancel --state=PENDING --user=$USER --partition=medium-fmz</nowiki>'' - Cancel all your pending jobs in partition ''<nowiki>medium-fmz</nowiki>''
  
Have a look at the respective man pages of these commands to learn more about them!

====== LSF to Slurm Conversion Guide ======
This is a short guide on how to convert the most common options in your jobscripts from LSF to Slurm.

^ Description ^ LSF ^ Slurm ^ Comment ^
| Submit job | bsub < job.sh | sbatch job.sh | No < in Slurm! |
| Scheduler Comment in Jobscript | #BSUB -... | #SBATCH -... | |
| Queue/Partition | -q <queue> | -p <partition> | |
| Walltime | -W 48:00 | -t 2-00:00:00 | -t 48:00 means 48 min. |
| Job Name | -J <name> | -J <name> | |
| Stdout | -o <outfile> | -o <outfile> | %J substituted for JobID |
| Stderr | -e <errfile> | -e <errfile> | %J substituted for JobID |
| #Jobslots | -n # | -n # | |
| One Host | -R "span[hosts=1]" | -N 1 | |
| Process Distribution | -R "span[ptile=<x>]" | %%--ntasks-per-node x%% | |
| Exclusive Node | -x | %%--exclusive%% | |
| Scratch | -R scratch[2] | -C "scratch[2]" | |
^ Queue -> Partition Conversion ||||
| General Purpose | -q mpi | -p medium | |
| | -q mpi-short | -p medium %%--qos=short%% | |
| | -q mpi-long | -p medium %%--qos=long%% | |
| | -q fat | -p fat | |
| | -q fat-short | %%-p fat --qos=short%% | |
| | -q fat-long | %%-p fat --qos=long%% | |
| | -ISs -q int /bin/bash | -p int -n 20 -N 1 --pty bash | Don't forget -n 20 -N 1, otherwise you will only get access to a single core. For more detail see [[en:services:application_services:high_performance_computing:interactive_queue:|here]]. |
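As a worked example of the table (the values are placeholders), an LSF jobscript header and its Slurm equivalent:
<code>
# LSF (old)
#BSUB -q mpi
#BSUB -W 48:00
#BSUB -n 40
#BSUB -R "span[ptile=20]"
#BSUB -o out.%J

# Slurm (new)
#SBATCH -p medium
#SBATCH -t 2-00:00:00
#SBATCH -n 40
#SBATCH --ntasks-per-node=20
#SBATCH -o out.%J
</code>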
  
=====  Getting Help  =====
  *  If you have a lot of failed jobs send at least two outputs. You can also list the jobids of all failed jobs to help us even more with understanding your problem.
  *  If you don’t mind us looking at your files, please state this in your request. You can limit your permission to specific directories or files.
  
[[Kategorie: Scientific Computing]]