  
**MPI job**\\
A job with distributed memory parallelization, realized with MPI. It can use several job slots on several nodes and needs to be started with ''mpirun'' or the Slurm substitute ''srun''.
  
**Partition**\\
A label to sort jobs by general requirements and intended execution nodes. Formerly called "queue".
  
=====  The sbatch Command: Submitting Jobs to the Cluster  =====
  
''sbatch'' submits information on your job to the batch system:
  
**medium**\\
This is our general purpose partition, usable for serial and SMP jobs with up to 24 tasks, but it is especially well suited for large MPI jobs. Up to 1024 cores can be used in a single MPI job, and the maximum runtime is 48 hours.
  
**fat**\\
This is the partition for SMP jobs, especially those requiring lots of memory. Serial jobs with very high memory requirements also belong in this partition. Up to 24 cores and up to 512 GB are available on one host. Maximum runtime is 48 hours.\\
The nodes of the fat+ partition are also present in this partition, but they will only be used if they are not needed for bigger jobs submitted to the fat+ partition.
  
**fat+**\\
This partition is meant for very memory intensive jobs that require more than 512 GB of RAM on a single node. Nodes of the fat+ partition have 1.5 and 2 TB of RAM. You are required to specify your memory needs on job submission to use these nodes (see [[en:services:application_services:high_performance_computing:running_jobs_slurm#resource_selection|resource selection]]).\\
As general advice: try your jobs on the smaller nodes in the fat partition first, work your way up, and don't be afraid to ask for help here.
  
**gpu**\\
A partition for nodes containing GPUs. Please refer to [[en:services:application_services:high_performance_computing:running_jobs_slurm#gpu_selection|GPU selection]] below.
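
For example (a minimal sketch; the memory value is only an illustration), a job that needs more than 512 GB of RAM on a single node could request the fat+ partition together with an explicit memory reservation:
<code>
#SBATCH -p fat+
#SBATCH --mem=1500G
</code>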
  
====  Runtime limits (QoS)  ====
If the default time limit of 48 hours is not sufficient for your jobs, you can use a "Quality of Service" or **QOS** to modify those limits on a per-job basis. We currently have two QOS.
  
**long**\\
Here, the maximum runtime is increased to 120 hours. Job slot availability is limited, though, so expect longer waiting times. Be aware that using this QOS does not automatically increase the time limit of your job; you still have to set it yourself.
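
For example (a minimal sketch; the partition and script name are only illustrations), a 100-hour job could be submitted as:
<code>
sbatch -p medium --qos=long -t 100:00:00 jobscript.sh
</code>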
  
**short**\\
Interactively or in batch mode. We generally recommend using the
batch mode. If you need to run a job interactively, you can find
information about that in the [[en:services:application_services:high_performance_computing:running_jobs_slurm#interactive_session_on_the_nodes|corresponding section]].
Batch jobs are submitted to the cluster using the ''sbatch'' command
and a job script or a command:\\
manual of the command with 'man sbatch'.
  
====  sbatch/srun options  ====

**<nowiki>-A all</nowiki>**\\
Specifies the account 'all' for the job. This option is //mandatory// for users who have access to special hardware and want to use the general partitions.
  
**<nowiki>-p <partition></nowiki>**\\
**<nowiki>-o <file></nowiki>**\\
Store the job output in "file" (otherwise written to slurm-<jobid>). ''%J'' in the filename stands for the jobid.\\

**<nowiki>--noinfo</nowiki>**\\
Some meta information about your job will be added to your output file. If you do not want that, you can suppress it with this flag.\\

**<nowiki>--mail-type=[ALL|BEGIN|END]</nowiki>\\
<nowiki>--mail-user=your@mail.com</nowiki>** \\
Receive mails when the job starts, ends, or both. There are even more options; refer to the sbatch man page for more information about mail types. If you have a GWDG mail address, you do not need to specify the mail-user.\\
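
Putting some of these options together (a minimal sketch; the script name is only an illustration), a submission could look like this:
<code>
sbatch -p medium -t 12:00:00 -o myjob-%J.out --mail-type=END jobscript.sh
</code>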
  
====  Resource Selection  ====
**<nowiki>-c <cpus per task></nowiki>**\\
The number of CPUs per task. The default is one CPU per task.

**<nowiki>-c vs -n</nowiki>**\\
As a rule of thumb, if you run your code on a single node, use -c. For multi-node MPI jobs, use -n.\\
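
Two illustrative (and mutually exclusive) request patterns following this rule of thumb, shown as minimal sketches rather than complete scripts:
<code bash>
# single-node, multi-threaded program: one task with 8 CPUs
#SBATCH -N 1
#SBATCH -c 8

# alternatively, a multi-node MPI program: 32 tasks with one CPU each
#SBATCH -n 32
</code>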
  
**<nowiki>-N <minNodes[,maxNodes]></nowiki>**\\
  
**<nowiki>--mem-per-cpu=<size[units]></nowiki>**\\
Required memory per task instead of per node. <nowiki>--mem</nowiki> and <nowiki>--mem-per-cpu</nowiki> are mutually exclusive.\\

**<nowiki>--mem-per-gpu=<size[units]></nowiki>**\\
Required memory per GPU instead of per node. <nowiki>--mem</nowiki> and <nowiki>--mem-per-gpu</nowiki> are mutually exclusive.\\
=== Example ===
  
====  The GWDG Scientific Compute Cluster  ====
  
{{ :en:services:application_services:high_performance_computing:partitions.png?1000 |}}
  
This scheme shows the basic cluster setup at GWDG. The cluster is distributed across two facilities, with the "ehemalige Fernmeldezentrale" facility hosting the older resources and the shared /scratch file system, and the "Faßberg" facility hosting the latest resources and the shared /scratch2 file system. The shared /scratch and /scratch2 are usually the best choices for temporary data in your jobs, but /scratch is only available at the "modular data center" (mdc) resources (select it with ''-C scratch'') and /scratch2 is only available at the Faßberg (fas) resources (select it with ''-C scratch2''). The scheme also shows the queues and resources by which nodes are selected using the ''-p'' (partition) and ''-C'' (constraint) options of ''sbatch''.
  
====  ''sbatch'': Specifying node properties with ''-C''  ====
**-C scratch[2]**\\
The node must have access to shared ''/scratch'' or ''/scratch2''.

**-C local**\\
The node must have local SSD storage at ''/local''.

**-C mdc / -C fas**\\
The node has to be at that location. It is pretty similar to -C scratch / -C scratch2, since the nodes in the MDC have access to scratch and those at the Fassberg location have access to scratch2. This is mainly for easy compatibility with our old partition naming scheme.

**-C [architecture]**\\
Request a specific CPU architecture. Available options are: haswell, broadwell, cascadelake. See [[en:services:application_services:high_performance_computing:start#hardware_overview|this table]] for the corresponding nodes.
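
Constraints can also be combined with ''&'' (AND) or ''|'' (OR). For example (a minimal sketch), to request a Cascade Lake node with access to ''/scratch2'':
<code>
#SBATCH -C "cascadelake&scratch2"
</code>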
  
  
====  Using Job Scripts  ====
  
A job script is a shell script with a special comment section: in each line beginning with ''#SBATCH'' the following text is interpreted as an ''sbatch'' option. These options have to be at the top of the script, before any other commands are executed. Here is an example:
  
<code>
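#!/bin/bash
# A minimal illustrative job script ("myprogram" is a hypothetical binary;
# adapt the partition, runtime and output file to your needs):
#SBATCH -p medium
#SBATCH -t 1:00:00
#SBATCH -o outfile-%J

./myprogram
</code>

A hybrid MPI/OpenMP job can, for example, be submitted directly from the command line with ''%%--wrap%%'':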
<code>
export OMP_NUM_THREADS=4
sbatch --exclusive -p medium -N 2 --ntasks-per-node=4 --wrap="mpirun ./hybrid_job"
</code>
(each MPI process creates 4 OpenMP threads in this case).
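
The same request can also be written as a job script (a minimal sketch; here ''-c'' is used to reserve the CPUs for the OpenMP threads of each task):
<code bash>
#!/bin/bash
#SBATCH --exclusive
#SBATCH -p medium
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH -c 4

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun ./hybrid_job
</code>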
  
**/local**\\
This is the local hard disk of the node. It is a fast, SSD based option for storing temporary data, available on ''dfa, dge, dmp, dsu'' and ''dte''. There is automatic file deletion for the local disks. If you require your nodes to have access to fast, node-local storage, you can use the ''-C local'' constraint for Slurm (see the example after this overview).\\
A directory is automatically created for each job at ''/local/jobs/<jobID>'' and the path is exported as the environment variable ''$TMP_LOCAL''.
  
**/scratch**\\
This is the shared scratch space, available on ''amp'', ''agq'' and ''agt'' nodes and the frontend ''login-mdc.hpc.gwdg.de''. You can use ''-C scratch'' to make sure to get a node with access to shared /scratch. It is very fast, there is no automatic file deletion, but also no backup! We may have to delete files manually when we run out of space. You will receive a warning before this happens. To copy data there, you can use the machine transfer-mdc.hpc.gwdg.de, but have a look at [[en:services:application_services:high_performance_computing:transfer_data|Transfer Data]] first.
  
**/scratch2**\\
This space is the same as scratch described above, except it is **ONLY** available on the nodes ''dfa, dge, dmp, dsu'' and ''dte'' and on the frontend ''login-fas.hpc.gwdg.de''. You can use ''-C scratch2'' to make sure to get a node with access to that space. To copy data there, you can use the machine transfer-fas.hpc.gwdg.de, but have a look at [[en:services:application_services:high_performance_computing:transfer_data|Transfer Data]] first.
  
**$HOME**\\
Your home directory is available everywhere, permanent, and comes with backup. Your allotted disk space can be increased. It is comparably slow, however. You can find more information about the $HOME file system [[en:services:storage_services:file_service:fileservice_unix:start|here]].
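
As an example for the node-local storage (a minimal sketch; the program and file names are hypothetical), a job can request ''-C local'' and use the automatically created directory via ''$TMP_LOCAL'':
<code bash>
#!/bin/bash
#SBATCH -p medium
#SBATCH -C local
#SBATCH -t 2:00:00

# work in the automatically created node-local directory
cd $TMP_LOCAL
cp $HOME/input.dat .
$HOME/bin/myprogram input.dat > result.out

# copy the results back before the job ends
cp result.out $HOME/
</code>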
  
====  Recipe: Using ''/scratch''  ====
  
This recipe shows how to run Gaussian09 using ''/scratch2'' for temporary files:
  
<code bash>
#!/bin/bash
#SBATCH -p fat
#SBATCH -N 1
#SBATCH -n 24
#SBATCH -C scratch2
#SBATCH -t 1-00:00:00

# Set up the temporary scratch directory:
TMP_SCRATCH=/scratch/users/$USER/$SLURM_JOB_ID
mkdir -p $TMP_SCRATCH

# Set up the Gaussian environment
export g09root="/usr/product/gaussian"
. $g09root/g09/bsd/g09.profile

# Use $TMP_SCRATCH as the scratch dir for Gaussian
export GAUSS_SCRDIR=$TMP_SCRATCH

# Run the calculation
g09 myjob.com myjob.log

# Remove the temporary scratch dir
rm -rf $TMP_SCRATCH
</code>
\\
  
Please check out the software manual on how to set the directory for temporary files. Many programs have some flags for it, or read environment variables to determine the location, which you can set in your job script.

====  Using ''/scratch2''  ====

If you use scratch space only for storing temporary data, and do not need to access data stored previously, you can request /scratch or /scratch2:
<code>
#SBATCH -C "scratch|scratch2"
</code>
In that case ''/scratch2'' is linked to ''/scratch'' on the latest nodes. You can just use ''/scratch/users/${USERID}'' for the temporary data (don't forget to create it on ''/scratch2''). On the latest nodes the data will then be stored in ''/scratch2'' via the mentioned symlink.
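
For example (a minimal sketch following the recipe above), the temporary directory can then be created under ''/scratch/users'' regardless of which of the two file systems the node actually provides:
<code bash>
# works on nodes with either /scratch or /scratch2 (via the symlink)
TMP_SCRATCH=/scratch/users/$USER/$SLURM_JOB_ID
mkdir -p $TMP_SCRATCH
</code>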
  
==== Interactive session on the nodes ====
==== GPU selection ====
  
In order to use a GPU you should submit your job to the ''gpu'' partition, and request GPU count and optionally the model. CPUs of the nodes in the gpu partition are evenly distributed for every GPU. So if you are requesting a single GPU on a node with 20 cores and 4 GPUs, you can get up to 5 cores reserved exclusively for you; the same holds for memory. For example, if you want 2 GPUs of model Nvidia GeForce GTX 1080 with 10 CPUs, you can submit a job script with the following flags:
<code>
#SBATCH -p gpu
#SBATCH -n 10
#SBATCH -G gtx1080:2
</code>
  
You can also omit the model selection. Here is an example of selecting 1 GPU of any available model:
<code>
#SBATCH -p gpu
#SBATCH -n 10
#SBATCH -G 1
</code>

There are different options to select the number of GPUs, such as ''%%--gpus-per-node%%'', ''%%--gpus-per-task%%'' and more. See the [[https://slurm.schedmd.com/sbatch.html|sbatch man page]] for details.
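
For instance (a minimal sketch), an MPI job that wants one GPU for each of its tasks could request:
<code>
#SBATCH -p gpu
#SBATCH -n 4
#SBATCH --gpus-per-task=1
</code>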
  
Currently we have several generations of NVidia GPUs in the cluster, namely:
<code>
#SBATCH -p gpu
#SBATCH -G k40:2
</code>
Our Tesla K40 are of the Kepler generation.
<code> sinfo -p gpu --format=%N,%G </code> shows a list of hosts with GPUs, as well as their type and count.
  
=====  Miscellaneous Slurm Commands  =====
  
While ''sbatch'' is arguably the most important Slurm command, you may also find the following commands useful:
**<nowiki>sacct -j <jobid> --format=JobID,User,UID,JobName,MaxRSS,Elapsed,Timelimit</nowiki>**\\
Get job information even after the job has finished.\\
**Note on ''sacct'':** Depending on the parameters given, ''sacct'' chooses a time window in a rather unintuitive way. This is documented in the DEFAULT TIME WINDOW section of its man page. If you unexpectedly get no results from your ''sacct'' query, try specifying the start time with, e.g. ''<nowiki>-S 2019-01-01</nowiki>''.\\
The ''<nowiki>--format</nowiki>'' option knows many more fields like **Partition**, **Start**, **End** or **State**; for the full list refer to the man page.
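
For example (a sketch; replace the job ID with your own), to additionally see where a job ran, when it started and ended, and how it finished:
<code>
sacct -j <jobid> --format=JobID,JobName,Partition,State,Start,End,Elapsed
</code>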
  
**<nowiki>scancel</nowiki>**\\