  
**MPI job**\\
A job with distributed memory parallelization, realized with MPI. It can use several job slots on several nodes and needs to be started with ''mpirun'' or the Slurm substitute ''srun''.
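For illustration, inside a job allocation such a program is typically started like this (a minimal sketch; ''./myprogram'' is a placeholder for your own executable):
<code>
mpirun ./myprogram
# or, using the Slurm substitute:
srun ./myprogram
</code>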
  
**Partition**\\
  
**medium**\\
This is our general purpose partition, usable for serial and SMP jobs with up to 24 tasks, but it is especially well suited for large MPI jobs. Up to 1024 cores can be used in a single MPI job, and the maximum runtime is 48 hours.
  
**fat**\\
This is the partition for SMP jobs, especially those requiring lots of memory. Serial jobs with very high memory requirements also belong in this partition. Up to 64 cores and up to 512 GB are available on one host. Maximum runtime is 48 hours.\\
The nodes of the fat+ partition are also present in this partition, but they will only be used if they are not needed for bigger jobs submitted to the fat+ partition.
  
**fat+**\\
This partition is meant for very memory intensive jobs that require more than 512 GB RAM on a single node. Nodes of the fat+ partition have 1.5 and 2 TB RAM. You are required to specify your memory needs on job submission to use these nodes (see [[en:services:application_services:high_performance_computing:running_jobs_slurm#resource_selection|resource selection]]).\\
As general advice: Try your jobs on the smaller nodes in the fat partition first, work your way up, and don't be afraid to ask for help here.
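For illustration, a request for a fat+ node might look like this (a minimal sketch; the memory value is a placeholder):
<code>
#SBATCH -p fat+
#SBATCH -N 1
#SBATCH --mem=800G
</code>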
  
**gpu** - A partition for nodes containing GPUs. Please refer to [[en:services:application_services:high_performance_computing:running_jobs_slurm#gpu_selection|GPU selection]].

====  Runtime limits (QoS)  ====
If the default time limits are not sufficient for your jobs, you can use a "Quality of Service" or **QOS** to modify those limits on a per job basis. We currently have two QOS.
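A QOS is requested with the ''%%--qos%%'' option of ''sbatch''; a minimal sketch, where ''<qos-name>'' stands for the name of one of the available QOS:
<code>
#SBATCH --qos=<qos-name>
</code>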
  
Jobs can be run interactively or in batch mode. We generally recommend using the batch mode. If you need to run a job interactively, you can find information about that in the [[en:services:application_services:high_performance_computing:running_jobs_slurm#interactive_session_on_the_nodes|corresponding section]].
Batch jobs are submitted to the cluster using the ''sbatch'' command and a jobscript or a command:\\
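Two minimal sketches of both variants (the script name and the wrapped command are placeholders):
<code>
sbatch myjobscript.sh
sbatch -p medium --wrap="hostname"
</code>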
  
====  "sbatch" options  ====

**<nowiki>-A all</nowiki>**\\
Specifies the account ''all'' for the job. This option is //mandatory// for users who have access to special hardware and want to use the general partitions.
  
**<nowiki>-p <partition></nowiki>**\\
Specifies the partition the job should run in (see the partition list above).
**<nowiki>-o <file></nowiki>**\\
Store the job output in "file" (otherwise written to slurm-<jobid>). ''%J'' in the filename stands for the jobid.\\

**<nowiki>--noinfo</nowiki>**\\
Some meta-information about your job will be added to your output file. If you do not want that, you can suppress it with this flag.\\

**<nowiki>--mail-type=[ALL|BEGIN|END]</nowiki>\\
<nowiki>--mail-user=your@mail.com</nowiki>** \\
Receive mails when the job starts, ends, or both. There are even more options; refer to the sbatch man page for more information about mail types. If you have a GWDG mail address, you do not need to specify the mail-user.\\
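Putting several of these options together on the command line might look like this (a sketch; the output file, mail address and script name are placeholders):
<code>
sbatch -p medium -o job-%J.out --mail-type=END --mail-user=your@mail.com myjobscript.sh
</code>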
  
====  Resource Selection  ====
**<nowiki>-c <cpus per task></nowiki>**\\
The number of cpus per task. The default is one cpu per task.

**<nowiki>-c vs -n</nowiki>**\\
As a rule of thumb, if you run your code on a single node, use -c. For multi-node MPI jobs, use -n.\\
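As an illustration, the two cases side by side (a sketch; the numbers are placeholders, and a real script would contain only one of the two):
<code>
# single-node SMP program: one task with 8 cpus
#SBATCH -n 1
#SBATCH -c 8

# multi-node MPI program: 64 tasks with one cpu each
#SBATCH -n 64
</code>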
  
**<nowiki>-N <minNodes[,maxNodes]></nowiki>**\\
  
**<nowiki>--mem-per-cpu=<size[units]></nowiki>**\\
Required memory per cpu instead of per node. <nowiki>--mem</nowiki> and <nowiki>--mem-per-cpu</nowiki> are mutually exclusive.\\
=== Example ===
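A sketch of a possible resource request, with placeholder values: a 10-task MPI job in the medium partition with 2 GB of memory per core:
<code>
#SBATCH -p medium
#SBATCH -n 10
#SBATCH --mem-per-cpu=2G
</code>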
  
====  Using Job Scripts  ====
  
A job script is a shell script with a special comment section: In each line beginning with ''#SBATCH'' the following text is interpreted as a ''sbatch'' option. These options have to be at the top of the script, before any other commands are executed. Here is an example:
  
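(The script below is only a minimal sketch; the partition, resource values and the program name ''./myprogram'' are placeholders.)
<code>
#!/bin/bash
#SBATCH -p medium
#SBATCH -N 1
#SBATCH -n 10
#SBATCH -o job-%J.out

./myprogram
</code>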

A hybrid OpenMP/MPI job can, for example, be submitted like this:
<code>
export OMP_NUM_THREADS=4
sbatch --exclusive -p medium -N 2 --ntasks-per-node=4 --wrap="mpirun ./hybrid_job"
</code>
(each MPI process creates 4 OpenMP threads in this case).
==== GPU selection ====
  
In order to use a GPU you should submit your job to the ''gpu'' partition and request the GPU count and optionally the model. The CPUs of the nodes in the gpu partition are evenly distributed among the GPUs. So if you request a single GPU on a node with 20 cores and 4 GPUs, you can get up to 5 cores reserved exclusively for you; the same holds for the memory. For example, if you want 2 GPUs of model Nvidia GeForce GTX 1080 with 10 CPUs, you can submit a job script with the following flags:
<code>
#SBATCH -p gpu
#SBATCH -n 10
#SBATCH -G gtx1080:2
</code>
  
You can also omit the model selection. Here is an example of selecting 1 GPU of any available model:
<code>
#SBATCH -p gpu
#SBATCH -n 10
#SBATCH -G 1
</code>

There are different options to select the number of GPUs, such as ''%%--gpus-per-node%%'', ''%%--gpus-per-task%%'' and more. See the [[https://slurm.schedmd.com/sbatch.html|sbatch man page]] for details.
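For instance, a two-node job requesting two GPUs on each node could look like this (a sketch with placeholder values):
<code>
#SBATCH -p gpu
#SBATCH -N 2
#SBATCH --gpus-per-node=2
</code>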
  
Currently we have several generations of NVidia GPUs in the cluster.
For example, to request two of our Tesla K40 cards:
<code>
#SBATCH -p gpu
#SBATCH -G k40:2
</code>
Our Tesla K40 cards are of the Kepler generation.
<code> sinfo -p gpu --format=%N,%G </code> shows a list of hosts with GPUs, as well as their type and count.
  
=====  Miscellaneous Slurm Commands  =====
  
While ''sbatch'' is arguably the most important Slurm command, you may also find the following commands useful:
**<nowiki>sacct -j <jobid> --format=JobID,User,UID,JobName,MaxRSS,Elapsed,Timelimit</nowiki>**\\
Get job information even after the job has finished.\\
**Note on ''sacct'':** Depending on the parameters given, ''sacct'' chooses a time window in a rather unintuitive way. This is documented in the DEFAULT TIME WINDOW section of its man page. If you unexpectedly get no results from your ''sacct'' query, try specifying the start time with, e.g., ''<nowiki>-S 2019-01-01</nowiki>''.\\
The ''<nowiki>--format</nowiki>'' option knows many more fields, like **Partition**, **Start**, **End** or **State**; for the full list, refer to the man page.
  
**<nowiki>scancel</nowiki>**\\
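Cancels jobs; a minimal usage sketch (the job ID is a placeholder):
<code>
scancel <jobid>
</code>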
  *  If you have a lot of failed jobs, send at least two outputs. You can also list the jobids of all failed jobs to help us even more with understanding your problem.
  *  If you don't mind us looking at your files, please state this in your request. You can limit your permission to specific directories or files.

====== The following parts of the documentation have not yet been migrated to SLURM ======

===== MPI jobs =====

Note that a single-thread job submitted as above will share its execution host with other jobs. It is therefore expected that it does not use more than the memory available per core! On the ''mpi'' nodes this amount is 4 GB, as well as on the newer ''fat'' nodes. If your job requires more, you must assign additional cores. For example, if your single-thread job requires 64 GB of memory, you must submit it like this:

<code>
bsub -q mpi -n 16 ./myexecutable</code>
\\
OpenMPI jobs can be submitted as follows:

<code>
bsub -q mpi -n 256 -a openmpi mpirun.lsf ./myexecutable</code>
\\
For Intel MPI jobs it suffices to use ''-a intelmpi'' instead of ''-a openmpi''. Please note that LSF will not load the correct modules (compiler, library, MPI) for you. You either have to do that before executing ''bsub'', in which case your setup will be copied to the execution hosts, or you will have to use a job script and load the required modules there.

A new feature in LSF is ''pinning'' support. ''Pinning'' (in its most basic version) means instructing the operating system not to apply its standard scheduling algorithms to your workloads, but instead to keep processes on the CPU core they have been started on. This may significantly improve performance for some jobs, especially on the ''fat'' nodes with their high CPU core count. ''Pinning'' is managed via the MPI library, and currently only OpenMPI is supported. There is not much experience with this feature yet, so we are interested in your feedback. Here is an example:

<code>
bsub -R "select[np16] span[ptile=16] affinity[core(1):cpubind=core]" -q mpi -n 256 -a openmpi mpirun.lsf ./myexecutable</code>
\\
The affinity string ''"affinity[core(1):cpubind=core]"'' means that each task is using one core and that the binding should be done based on cores (as opposed to sockets, NUMA units, etc.). Because this example is for a pure MPI application, x in ''core(x)'' is one. In an SMP/MPI hybrid job, x would be equal to the number of threads per task (e.g., equal to ''OMP_NUM_THREADS'' for OpenMP/MPI hybrid jobs).

===== SMP jobs =====

Shared memory parallelized jobs can be submitted with

<code>
bsub -q mpi -n 8,20 -R 'span[hosts=1]' -a openmp ./myexecutable</code>
\\
The ''span'' option is required; without it, LSF will assign cores to the job from several nodes, if that is advantageous from the scheduling perspective.

===== Using the fat+ queue =====

Nodes with a lot of memory are very expensive and should not normally be used for jobs which could also run on our other nodes. Therefore, please note the following policies:

  * Your job must need more than 250 GB RAM.
  * Your job must use at least a full 512 GB node or half a 1.5 TB or 2 TB node:

  * For a full 512 GB node:
<code>
#BSUB -x
#BSUB -R "maxmem < 600000"
</code>

  * For half a 1.5 TB node (your job needs more than 500 GB RAM):
<code>
#BSUB -n 20
#BSUB -R span[hosts=1]
#BSUB -R "maxmem < 1600000 && maxmem > 600000"
</code>

  * For a full 1.5 TB node (your job needs more than 700 GB RAM):
<code>
#BSUB -x
#BSUB -R "maxmem < 1600000 && maxmem > 600000"
</code>

  * For half a 2 TB node (your job needs more than 700 GB RAM):
<code>
#BSUB -n 16
#BSUB -R span[hosts=1]
#BSUB -R "maxmem > 1600000"
</code>

  * For a full 2 TB node (your job needs more than 1.5 TB RAM):
<code>
#BSUB -x
#BSUB -R "maxmem > 1600000"
</code>

The 512 GB nodes are also available in the fat queue, without these restrictions. However, fat jobs on these nodes have a lower priority compared to fat+ jobs.

===== CPU architecture selection =====

Our cluster provides four generations of Intel CPUs and two generations of AMD CPUs. However, the main difference between these CPU types is whether they support Intel's AVX2 or not. For selecting this we have introduced the x64inlvl (for x64 instruction level) label:

<code>
x64inlvl=1 : Supports only AVX
x64inlvl=2 : Supports AVX and AVX2
</code>

In order to choose an AVX2 capable node you therefore have to include
<code>
#BSUB -R "x64inlvl=2"
</code>
in your submission script.

If you need to be more specific, you can also directly choose the CPU generation:

<code>
amd=1 : Interlagos
amd=2 : Abu Dhabi

intel=1 : Sandy Bridge
intel=2 : Ivy Bridge
intel=3 : Haswell
intel=4 : Broadwell
</code>

So, in order to choose any AMD CPU:
<code>
#BSUB -R amd
</code>
In order to choose an Intel CPU of at least Haswell generation:
<code>
#BSUB -R "intel>=3"
</code>
This is equivalent to ''x64inlvl=2''.

===== Memory selection =====

Note that the following paragraph is about **selecting** nodes with enough memory for a job. The mechanism to actually **reserve** that memory does not change: The memory you are allowed to use equals memory per core times slots (-n option) requested.

You can select a node either by currently available memory (mem) or by maximum available memory (maxmem). If you request complete nodes, the difference is actually very small, as a free node's available memory is close to its maximum memory. All requests are in MB.

To select a node with more than about 500 GB available memory use:
<code>
#BSUB -R "mem>500000"
</code>
To select a node with more than about 6 GB maximum memory per core use:
<code>
#BSUB -R "maxmem/ncpus>6000"
</code>
(Yes, you can do basic math in the requirement string!)

It bears repeating: None of the above is a memory reservation. If you actually want to reserve "mem" memory, the easiest way is to combine ''-R "mem>...'' with ''-x'' for an exclusive job.

Finally, note that the ''-M'' option just denotes the memory limit of your job per core (in KB). This is of no real consequence, as we do not enforce these limits and it has no influence on the host selection.

Besides the options shown in this article, you can of course use the options for controlling walltime limits (-W), output (-o), and your other requirements as usual. You can also continue to use job scripts instead of the command line (with the ''#BSUB <option> <value>'' syntax).

Please consult the LSF man pages if you need further information.
  
[[Kategorie: Scientific Computing]]