====== Running Jobs with SLURM ======
  
In the following the basic concepts will be described.
  
**Cluster**\\
A collection of networked computers intended to provide compute capabilities.
**Node**\\
One of these computers, also called a host.
**Frontend**\\
A special node provided to interact with the cluster via shell commands. gwdu101, gwdu102 and gwdu103 are our frontends.
**Task or (Job-)Slot**\\
Compute capacity for one process (or "thread") at a time, usually one processor core, or CPU for short.
**Job**\\
A compute task consisting of one or several parallel processes.
**Batch System**\\
The management system distributing job processes across job slots. In our case [[https://slurm.schedmd.com|Slurm]], which is operated by shell commands on the frontends.
**Serial job**\\
A job consisting of one process using one job slot.
**SMP job**\\
A job with shared memory parallelization (often realized with OpenMP), meaning that all processes need access to the memory of the same node. Consequently an SMP job uses several job slots //on the same node//.
**MPI job**\\
A job with distributed memory parallelization, realized with MPI. It can use several job slots on several nodes and needs to be started with ''mpirun'' or the Slurm substitute ''srun''.
**Partition**\\
A label to sort jobs by general requirements and intended execution nodes. Formerly called "queue".
  
=====  The ''sbatch'' Command: Submitting Jobs to the Cluster  =====
  
''sbatch'' submits information on your job to the batch system:
  
  *  What is to be done? (path to your program and required parameters)
  *  What are the requirements? (for example partition, process number, maximum runtime)
  
Slurm then matches the job's requirements against the capabilities of available job slots. Once sufficient suitable job slots are found, the job is started. Slurm considers jobs to be started in the order of their priority.
  
=====  Available Partitions  =====

We currently have two meta partitions, corresponding to broad application profiles:

**medium**\\
This is our general purpose partition, usable for serial and SMP jobs with up to 24 tasks, but it is especially well suited for large MPI jobs. Up to 1024 cores can be used in a single MPI job, and the maximum runtime is 48 hours.
**fat**\\
This is the partition for SMP jobs, especially those requiring lots of memory. Serial jobs with very high memory requirements also belong in this partition. Up to 64 cores and up to 512 GB are available on one host. Maximum runtime is 48 hours.\\
The nodes of the fat+ partition are also present in this partition, but they will only be used if they are not needed for bigger jobs submitted to the fat+ partition.
  
**fat+**\\
This partition is meant for very memory intensive jobs. It is for jobs that require more than 512 GB RAM on a single node. Nodes of the fat+ partition have 1.5 and 2 TB RAM. You are required to specify your memory needs on job submission to use these nodes (see [[en:services:application_services:high_performance_computing:running_jobs_slurm#resource_selection|resource selection]]).\\
As general advice: Try your jobs on the smaller nodes in the fat partition first, work your way up, and don't be afraid to ask for help here.
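For illustration, a minimal fat+ job script could look like the following sketch (the program name and the 800G value are placeholders; request the amount of memory your job actually needs, as described under resource selection):
<code>
#!/bin/bash
#SBATCH -p fat+
#SBATCH -t 24:00:00
#SBATCH --mem=800G

./my_memory_hungry_program
</code>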
  
**gpu** - A partition for nodes containing GPUs. Please refer to the [[en:services:application_services:high_performance_computing:running_jobs_slurm#gpu_selection|GPU selection]] section.
  
====  Runtime limits (QoS)  ====
If the default time limits are not sufficient for your jobs, you can use a "Quality of Service" or **QOS** to modify those limits on a per job basis. We currently have two QOS.
  
**long**\\
Here, the maximum runtime is increased to 120 hours. Job slot availability is limited, though, so expect longer waiting times.
  
**short**\\
Here, the maximum runtime is decreased to two hours. In turn, jobs submitted with this QOS get a higher base priority, but job slot availability is also limited. That means that as long as only few jobs are submitted with the short QOS, there will be minimal waiting times. This QOS is intended for testing and development, not for massive production.
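For example, to submit a job to the medium partition with an extended runtime, the long QOS can be combined with a longer time limit like this (the 96-hour value and the script name are placeholders):
<code>
sbatch -p medium --qos=long -t 96:00:00 jobscript.sh
</code>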
  
=====  How to submit jobs  =====

Slurm supports different ways to submit jobs to the cluster: interactively or in batch mode. We generally recommend using the batch mode. If you need to run a job interactively, you can find information about that in the [[en:services:application_services:high_performance_computing:running_jobs_slurm#interactive_session_on_the_nodes|corresponding section]].
Batch jobs are submitted to the cluster using the ''sbatch'' command and a jobscript or a command:\\
<code>sbatch <options> [jobscript.sh | --wrap=<command>]</code>\\

**sbatch** can take a lot of options to give more information on the specifics of your job, e.g. where to run it, how long it will take and how many nodes it needs. We will examine a few of the options in the following paragraphs. For a full list of options, refer to the manual with ''man sbatch''.
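As a quick illustration (the script name and the command are placeholders), both of the following submit a batch job:
<code>
# submit a job described by a job script
sbatch jobscript.sh

# submit a single command without writing a script
sbatch --wrap="hostname"
</code>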

====  ''sbatch'' options  ====

**<nowiki>-A all</nowiki>**\\
Specifies the account 'all' for the job. This option is //mandatory// for users who have access to special hardware and want to use the general partitions.

**<nowiki>-p <partition></nowiki>**\\
Specifies in which partition the job should run. Multiple partitions can be specified in a comma separated list.

**<nowiki>--qos=<qos></nowiki>**\\
Submit the job using a special QOS.

**<nowiki>-t <time></nowiki>**\\
Maximum runtime of the job. If this time is exceeded the job is killed. Acceptable <time> formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds" (example: 1-12:00:00 will request 1 day and 12 hours).

**<nowiki>-o <file></nowiki>**\\
Store the job output in "file" (otherwise it is written to slurm-<jobid>). ''%J'' in the filename stands for the jobid.

**<nowiki>--noinfo</nowiki>**\\
Some metainformation about your job will be added to your output file. If you do not want that, you can suppress it with this flag.

**<nowiki>--mail-type=[ALL|BEGIN|END]</nowiki>\\
<nowiki>--mail-user=your@mail.com</nowiki>** \\
Receive mails when the job starts, ends or both. There are even more options, refer to the sbatch man-page for more information about mail types. If you have a GWDG mail address, you do not need to specify the mail-user.
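To give a feel for how these options combine, here is a sketch of a typical submission line (the output file name and the script name are placeholders):
<code>
sbatch -p medium -t 12:00:00 -o myjob-%J.out --mail-type=END jobscript.sh
</code>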

====  Resource Selection  ====

=== CPU Selection ===
  
**<nowiki>-n <tasks></nowiki>**\\
The number of tasks for this job. The default is one task per node.

**<nowiki>-c <cpus per task></nowiki>**\\
The number of cpus per task. The default is one cpu per task.

**<nowiki>-c vs -n</nowiki>**\\
As a rule of thumb, if you run your code on a single node, use -c. For multi-node MPI jobs, use -n.

**<nowiki>-N <minNodes[,maxNodes]></nowiki>**\\
Minimum and maximum number of nodes that the job should be executed on. If only one number is specified, it is used as the precise node count.

**<nowiki>--ntasks-per-node=<ntasks></nowiki>**\\
Number of tasks per node. If -n and <nowiki>--ntasks-per-node</nowiki> are specified, this option specifies the maximum number of tasks per node.
\\
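As a small sketch of the rule of thumb above (the script names are placeholders), a single-node OpenMP job would request cpus per task, while a multi-node MPI job would request tasks:
<code>
# single-node OpenMP job: one task with 8 CPUs
# (set OMP_NUM_THREADS accordingly inside the job script)
sbatch -p medium -c 8 openmp_jobscript.sh

# multi-node MPI job: 40 tasks, distributed by Slurm
sbatch -p medium -n 40 mpi_jobscript.sh
</code>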
=== Memory Selection ===

By default, your available memory per node is the default memory per task times the number of tasks you have running on that node. You can get the default memory per task by looking at the DefMemPerCPU metric as reported by ''scontrol show partition <partition>''.

**<nowiki>--mem=<size[units]></nowiki>**\\
Required memory per node. The unit can be one of [K|M|G|T], but defaults to M. If your processes exceed this limit, they will be killed.

**<nowiki>--mem-per-cpu=<size[units]></nowiki>**\\
Required memory per task instead of per node. <nowiki>--mem</nowiki> and <nowiki>--mem-per-cpu</nowiki> are mutually exclusive.

=== Example ===

''<nowiki>-n 10 -N 2 --mem=5G</nowiki>'' distributes a total of 10 tasks over 2 nodes and reserves 5G of memory on each node.

''<nowiki>--ntasks-per-node=5 -N 2 --mem=5G</nowiki>'' allocates 2 nodes and puts 5 tasks on each of them. It also reserves 5G of memory on each node.

''<nowiki>-n 10 -N 2 --mem-per-cpu=1G</nowiki>'' distributes a total of 10 tasks over 2 nodes and reserves 1G of memory for each task. So the memory per node depends on where the tasks are running.
  
====  The GWDG Scientific Compute Cluster  ====
  
{{ :en:services:scientific_compute_cluster:nodes-slurm.png?1000 |}}
  
This scheme shows the basic cluster setup at GWDG. The cluster is distributed across two facilities, with the "ehemalige Fernmeldezentrale" facility hosting the older resources and the shared /scratch file system and the "Faßberg" facility hosting the latest resources and the shared /scratch2 file system. The shared /scratch and /scratch2 are usually the best choices for temporary data in your jobs, but /scratch is only available at the "Fernmeldezentrale" (fmz) resources (select it with ''-C scratch'') and /scratch2 is only available at the Faßberg (fas) resources (select it with ''-C scratch2''). The scheme also shows the partitions and resources by which nodes are selected using the ''-p'' (partition) and ''-C'' (constraint) options of ''sbatch''.
  
====  ''sbatch'': Specifying node properties with ''-C''  ====

**-C scratch[2]**\\
The node must have access to shared ''/scratch'' or ''/scratch2''.

**-C fmz / -C fas**\\
The node has to be at that location. It is pretty similar to -C scratch / -C scratch2, since the nodes in the FMZ have access to /scratch and those at the Faßberg location have access to /scratch2. This is mainly for easy compatibility with our old partition naming scheme.

**-C [architecture]**\\
Request a specific CPU architecture. Available options are: abu-dhabi, ivy-bridge, haswell, broadwell. See [[en:services:application_services:high_performance_computing:start#hardware_overview|this table]] for the corresponding nodes.
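Constraints can be combined with ''&'' (logical AND). As a sketch, a job restricted to Broadwell nodes that also have access to /scratch2 could be submitted like this (the script name is a placeholder):
<code>
sbatch -p medium -C "scratch2&broadwell" jobscript.sh
</code>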
  
  
====  Using Job Scripts  ====
  
A job script is a shell script with a special comment section: In each line beginning with ''#SBATCH'' the following text is interpreted as a ''sbatch'' option. These options have to be at the top of the script before any other commands are executed. Here is an example:
  
<code>
#!/bin/bash
#SBATCH -p medium
#SBATCH -t 10:00
#SBATCH -o outfile-%J
  
/bin/hostname
</code>

The job script is then submitted with:
  
<code>
sbatch <script name>
</code>
\\
====  Exclusive jobs  ====
An exclusive job uses all of its allocated nodes exclusively, i.e., it never shares a node with another job. This is useful if you require all of a node's memory (but not all of its CPU cores), or for SMP/MPI hybrid jobs, for example.
  
Do not combine ''<nowiki>--exclusive</nowiki>'' and ''<nowiki>--mem=<x></nowiki>''. In that case you will get all available tasks on the node, but your memory will still be limited to what you specified with ''<nowiki>--mem</nowiki>''.

To submit an exclusive job, add ''<nowiki>--exclusive</nowiki>'' to your sbatch options. For example, to submit a single task job which uses a complete fat node, you could use:
<code>
sbatch --exclusive -p fat -t 12:00:00 --wrap="./mytask"
</code>
This allocates either a complete gwda node with 256GB, or a complete dfa node with 512GB.
  
For submitting an OpenMP/MPI hybrid job with a total of 8 MPI processes, spread evenly across 2 nodes, use:
<code>
export OMP_NUM_THREADS=4
sbatch --exclusive -p medium -N 2 --ntasks-per-node=4 --wrap="mpirun ./hybrid_job"
</code>
(each MPI process creates 4 OpenMP threads in this case).
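The same hybrid job can also be expressed as a job script; here is a sketch (the program name and the runtime are placeholders):
<code>
#!/bin/bash
#SBATCH --exclusive
#SBATCH -p medium
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH -t 12:00:00

export OMP_NUM_THREADS=4
mpirun ./hybrid_job
</code>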
  
====  Disk Space Options  ====
  
**/local**\\
This is the local hard disk of the node. It is a fast - and in the case of the ''gwda, gwdd, dfa, dge, dmp, dsu'' and ''dte'' nodes even very fast, SSD based - option for storing temporary data. There is automatic file deletion for the local disks.\\
A directory is automatically created for each job at ''/local/jobs/<jobID>'' and the path is exported as the environment variable ''$TMP_LOCAL'' (see the sketch after this list).
**/scratch**\\
This is the shared scratch space, available on ''gwda'' and ''gwdd'' nodes and the frontends ''gwdu101'' and ''gwdu102''. You can use ''-C scratch'' to make sure to get a node with access to shared /scratch. It is very fast, there is no automatic file deletion, but also no backup! We may have to delete files manually when we run out of space. You will receive a warning before this happens.
**/scratch2**\\
This space is the same as scratch described above except it is **ONLY** available on the nodes ''dfa, dge, dmp, dsu'' and ''dte'' and on the frontend ''gwdu103''. You can use ''-C scratch2'' to make sure to get a node with access to that space.
**$HOME**\\
Your home directory is available everywhere, permanent, and comes with backup. Your attributed disk space can be increased. It is comparably slow, however.
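As a small sketch of how the node-local disk described above can be used (all program and file names are placeholders), a job script can stage its temporary data in ''$TMP_LOCAL'' and copy the results back at the end:
<code>
#!/bin/bash
#SBATCH -p medium
#SBATCH -t 2:00:00

# work on the fast node-local disk, then copy results back home
cd $TMP_LOCAL
cp $HOME/input.dat .
$HOME/bin/my_program input.dat > result.out
cp result.out $HOME/
</code>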
The following example job script runs Gaussian09 and stores its temporary data in ''/scratch'':
  
<code>
#!/bin/bash
#SBATCH -p fat
#SBATCH -N 1
#SBATCH -n 64
#SBATCH -C scratch
#SBATCH -t 1-00:00:00
  
export g09root="/usr/product/gaussian"
# ...
</code>
If you use scratch space only for storing temporary data, and do not need to access data stored previously, you can request /scratch or /scratch2:
<code>
#SBATCH -C "scratch|scratch2"
</code>
For that case ''/scratch2'' is linked to ''/scratch'' on the latest nodes. You can just use ''/scratch/${USERID}'' for the temporary data (don't forget to create it on ''/scratch2''). On the latest nodes data will then be stored in ''/scratch2'' via the mentioned symlink.
  
==== Interactive session on the nodes ====
  
As stated before, ''sbatch'' is used to submit jobs to the cluster, but there is also the ''srun'' command, which can be used to execute a task directly on the allocated nodes. That command is helpful to start an interactive session on a node. To avoid running large tests on the frontend (a good idea!), you can get an interactive session (with the bash shell) on one of the ''medium'' nodes with
  
<code>srun --pty -p medium -N 1 -c 16 /bin/bash</code>
\\
''<nowiki>--pty</nowiki>'' requests support for an interactive shell, and ''-p medium'' the corresponding partition. ''-c 16'' ensures that you get 16 cpus on the node. You will get a shell prompt as soon as a suitable node becomes available. Single thread, non-interactive jobs can be run with
<code>srun -p medium ./myexecutable</code>
  
==== GPU selection ====
  
In order to use a GPU you should submit your job to the ''gpu'' partition and request the GPU count and, optionally, the model. The CPUs of the nodes in the gpu partition are evenly distributed among the GPUs. So if you are requesting a single GPU on a node with 20 cores and 4 GPUs, you can get up to 5 cores reserved exclusively for you; the same applies to memory. So, for example, if you want 2 GPUs of model Nvidia GeForce GTX 1080 with 10 CPUs, you can submit a job script with the following flags:
<code>
#SBATCH -p gpu
#SBATCH -n 10
#SBATCH -G gtx1080:2
</code>
  
  
You can also omit the model selection; here is an example of selecting 1 GPU of any available model:
<code>
#SBATCH -p gpu
#SBATCH -n 10
#SBATCH -G 1
</code>
  
There are different options to select the number of GPUs, such as ''%%--gpus-per-node%%'', ''%%--gpus-per-task%%'' and more. See the [[https://slurm.schedmd.com/sbatch.html|sbatch man page]] for details.
  
Currently we have several generations of NVidia GPUs in the cluster, namely:
  
<code>
gtx1080  GeForce GTX 1080
gtx980   GeForce GTX 980
k40      Nvidia Tesla K40
 </​code>​ </​code>​
  
-Soin order to choose any AMD CPU:+Most GPUs are commodity graphics cardsand only provide good performance for single precision calculations. If you need double precision performance,​ or error correcting memory (ECC RAM), you can select the Tesla GPUs with
 <​code>​ <​code>​
-#BSUB -R amd+#SBATCH ​-p gpu 
 +#SBATCH -G k40:2
 </​code>​ </​code>​
-In order to choose an Intel CPU of at least Haswell ​generation+Our Tesla K40 are of the Kepler ​generation.
-<​code>​ +
-#BSUB -R "​intel>​=3"​ +
-</​code>​ +
-This is equivalent to ''​x64inlvl=2''​.+
  
-===== 3CPU slot count =====+<​code>​ sinfo -p gpu --format=%N,%G </​code>​ shows a list of host with GPUs, as well as their type and count.
  
-The npxx resource requirement should be substituted by ncpus=xx. For example, in order to choose a node with 20 CPU slots, your submission script should contain +=====  Miscellaneous Slurm Commands ​ =====
=====  Miscellaneous Slurm Commands  =====
  
While ''sbatch'' is arguably the most important Slurm command, you may also find the following commands useful:
  
**<nowiki>sinfo</nowiki>**\\
Shows the current status of the cluster and its partitions.
  
**<nowiki>squeue</nowiki>**\\
Lists current jobs (default: all users). Useful options are: ''<nowiki>-u $USER, -p <partition>, -j <jobid></nowiki>''.
  
**<nowiki>scontrol show job <jobid></nowiki>**\\
Full job information. Only available while the job is running and for a short time thereafter.
  
**<nowiki>squeue --start -j <jobid></nowiki>**\\
Expected start time. This is a rough estimate.
  
**<nowiki>sacct -j <jobid> --format=JobID,User,UID,JobName,MaxRSS,Elapsed,Timelimit</nowiki>**\\
Get job information even after the job has finished.\\
**Note on ''sacct'':** Depending on the parameters given, ''sacct'' chooses a time window in a rather unintuitive way. This is documented in the DEFAULT TIME WINDOW section of its man page. If you unexpectedly get no results from your ''sacct'' query, try specifying the start time with, e.g. ''<nowiki>-S 2019-01-01</nowiki>''.\\
The ''<nowiki>--format</nowiki>'' option knows many more fields like **Partition**, **Start**, **End** or **State**; for the full list refer to the man page.
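For instance, a sketch of an ''sacct'' query that avoids the time window issue (the job id and the date are placeholders):
<code>
sacct -j 1235 -S 2019-01-01 --format=JobID,State,Elapsed,MaxRSS
</code>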
  
**<nowiki>scancel</nowiki>**\\
Cancels jobs. Examples:\\
''<nowiki>scancel 1235</nowiki>'' - Send the termination signal (SIGTERM) to job 1235\\
''<nowiki>scancel --signal=KILL 1235</nowiki>'' - Send the kill signal (SIGKILL) to job 1235\\
''<nowiki>scancel --state=PENDING --user=$USER --partition=medium-fmz</nowiki>'' - Cancel all your pending jobs in partition ''<nowiki>medium-fmz</nowiki>''
  
Have a look at the respective man pages of these commands to learn more about them!
  
====== LSF to Slurm Conversion Guide ======
This is a short guide on how to convert the most common options in your jobscripts from LSF to Slurm. A worked conversion example follows after the table.
  
^ Description ^ LSF ^ Slurm ^ Comment ^
| Submit job | bsub < job.sh | sbatch job.sh | No < in Slurm! |
| Scheduler Comment in Jobscript | #BSUB -... | #SBATCH -... | |
| Queue/Partition | -q <queue> | -p <partition> | |
| Walltime | -W 48:00 | -t 2-00:00:00 | -t 48:00 means 48 min. |
| Job Name | -J <name> | -J <name> | |
| Stdout | -o <outfile> | -o <outfile> | %J substituted for JobID |
| Stderr | -e <errfile> | -e <errfile> | %J substituted for JobID |
| #Jobslots | -n # | -n # | |
| One Host | -R "span[hosts=1]" | -N 1 | |
| Process Distribution | -R "span[ptile=<x>]" | %%--ntasks-per-node x%% | |
| Exclusive Node | -x | %%--exclusive%% | |
| Scratch | -R scratch[2] | -C "scratch[2]" | |
^ Queue -> Partition Conversion ||||
| General Purpose | -q mpi | -p medium | |
| | -q mpi-short | -p medium %%--qos=short%% | |
| | -q mpi-long | -p medium %%--qos=long%% | |
| | -q fat | -p fat | |
| | -q fat-short | %%-p fat --qos=short%% | |
| | -q fat-long | %%-p fat --qos=long%% | |
| | -ISs -q int /bin/bash | -p int -n 20 -N 1 --pty bash | Don't forget -n 20 -N 1, otherwise you will only get access to a single core. For more detail see [[en:services:application_services:high_performance_computing:interactive_queue:|here]]. |
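As a concrete sketch of such a conversion (the program name and resource numbers are placeholders), an old LSF job script like
<code>
#!/bin/bash
#BSUB -q mpi
#BSUB -W 48:00
#BSUB -n 24
#BSUB -o out.%J

mpirun.lsf ./myprog
</code>
would become the following Slurm job script:
<code>
#!/bin/bash
#SBATCH -p medium
#SBATCH -t 2-00:00:00
#SBATCH -n 24
#SBATCH -o out.%J

mpirun ./myprog
</code>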
  
=====  Getting Help  =====
The following sections show you where you can get status information and where you can get support in case of problems.
====  Information sources  ====
-#BSUB -R span[hosts=1] +
-#BSUB -R "​maxmem < 1600000 && maxmem > 600000"​ +
-</​code>​+
  
  *  HPC announce mailing list
    *  [[https://listserv.gwdg.de/mailman/listinfo/hpc-announce]]
  
====  Using the GWDG Support Ticket System  ====
  
Write an email to <hpc@gwdg.de>. In the body:
  *  State that your question is related to the batch system.
  *  State your user id (''$USER'').
  *  If you have a problem with your jobs please //always send the complete standard output and error//!
  *  If you have a lot of failed jobs send at least two outputs. You can also list the jobids of all failed jobs to help us even more with understanding your problem.
  *  If you don't mind us looking at your files, please state this in your request. You can limit your permission to specific directories or files.
  
[[Kategorie: Scientific Computing]]
  