Previous revision: en:services:application_services:high_performance_computing:running_jobs_slurm:signals [2021/05/25 14:27] – [Trapping the Signal] mboden
Current revision: en:services:application_services:high_performance_computing:running_jobs_slurm:signals [2021/05/25 14:38] (current) – [Trapping the Signal] mboden
 +====== Job Signals ======
 +Slurm supports sending [[https://en.wikipedia.org/wiki/Signal_(IPC)|signals]] to running jobs before the time limit is reached. This can be used to save the current state of a calculation and to copy any checkpoint data from local storage to the home or scratch file system, so that the results of the job are not lost.
 +
 +===== Sending the Signal =====
 +
 +The signal is controlled with the ''<nowiki>--signal</nowiki>'' sbatch parameter. The syntax is as follows:
 +<code>--signal=B:<sig_num>@<sig_time></code>
 +You can find all options in the [[https://slurm.schedmd.com/sbatch.html#OPT_signal|sbatch manpage]]. This sends the signal with the number <sig_num> to the job <sig_time> seconds before the time limit is reached. As a concrete example:
 +<code>--signal=B:12@600</code>
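 +This sends signal 12 (SIGUSR2 on most Linux systems, most likely not used by your program) to the batch shell 10 minutes before the job runs into the time limit. The ''B:'' prefix means that only the batch shell receives the signal, not the job steps or the other processes of the job. Slurm may send the signal up to 60 seconds earlier than specified.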
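
Because signal numbers differ between platforms, it can help to look up the name that belongs to a number on the system in question. The following is a minimal sketch using the shell's ''kill -l''; the variable name ''sig_name'' is chosen for illustration only. According to the sbatch manpage, the signal may be given either as a number or as a name.

```shell
#!/bin/bash
# Look up the name of signal 12 on this system (without the SIG prefix).
sig_name=$(kill -l 12)
echo "signal 12 is SIG${sig_name} on this system"

# The same directive written with the name instead of the number.
echo "#SBATCH --signal=B:${sig_name}@600"
```

On typical x86-64 Linux systems the lookup yields ''USR2''; on other platforms the number 12 may map to a different signal, in which case using the name in the directive is the safer choice.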
 +
 +===== Trapping the Signal =====
 +
 +Just sending a signal is not enough: the job also needs to know what to do when it arrives. The easiest way to achieve this is a [[https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#index-trap|trap]]. This command defines the steps that are taken when the signal is received, for example:
 +<code bash>trap 'mkdir -p ${HOME}/job_${SLURM_JOBID}; cp -af ${TMP_LOCAL}/* ${HOME}/job_${SLURM_JOBID}/; exit 12' 12</code>
 +This traps signal 12 and runs the given commands, which create a folder named after the JobID in the home directory and copy all files from the local disk (located at ''$TMP_LOCAL'') into it. The final ''exit 12'' ends the batch script with exit code 12. Some more examples of using signals and traps can be found [[https://www.tutorialspoint.com/unix/unix-signals-traps.htm|here]].
 +
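 +If you have a multi-node job, you will have to use srun to run the copy command on every node of the job. The trap runs in the batch shell on the first node, so a glob such as ''${TMP_LOCAL}/*'' would be expanded there and not on the other nodes. Copying ''${TMP_LOCAL}/.'' copies the directory contents without a glob and avoids this:
 +<code bash>trap 'mkdir -p ${HOME}/job_${SLURM_JOBID}; srun -n ${SLURM_JOB_NUM_NODES} --ntasks-per-node=1 cp -af ${TMP_LOCAL}/. ${HOME}/job_${SLURM_JOBID}/; exit 12' 12</code>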
 +
 +One more modification is necessary: Bash executes a trap only after the currently running foreground command has finished((See [[https://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html#sect_12_02_02|section 12.2.2]])). With the time limit approaching, that would be too late, so the calculation has to be started in the background, for example:
 +<code bash>./long_calculation.py &
 +wait</code>
 +The Python program runs in the background, and ''wait'' waits for it to finish while still allowing the trap to be executed as soon as the signal arrives.
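
The interplay of ''trap'', a background process and ''wait'' can be tried out locally without Slurm. The following is a minimal sketch under stated assumptions: a subshell plays the role of the batch script, ''sleep 30'' stands in for the long calculation, and a one-second timer stands in for Slurm sending the signal. The names ''run_demo'', ''self'', ''work_pid'', ''demo_output'' and ''demo_status'' are illustrative only.

```shell
#!/bin/bash
# Local sketch: the subshell plays the batch script, a timer plays Slurm.
run_demo() (
    self=$BASHPID                      # PID of this subshell (the "batch shell")

    # On SIGUSR2: report, stop the stand-in calculation, leave with code 12.
    trap 'echo "signal received, saving state"; kill "$work_pid"; exit 12' USR2

    sleep 30 &                         # stand-in for ./long_calculation.py
    work_pid=$!

    ( sleep 1; kill -USR2 "$self" ) &  # stand-in for --signal=B:12@600

    wait "$work_pid"                   # interrupted as soon as the signal arrives
    echo "calculation finished normally"
)

demo_output=$(run_demo)
demo_status=$?
echo "output: ${demo_output}"
echo "exit status: ${demo_status}"
```

The sketch returns after about one second instead of thirty, because ''wait'' is interrupted by the trapped signal. If ''sleep 30'' were run in the foreground instead, the trap would only be executed after the full thirty seconds.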
 +===== Example =====
 +<code bash>
 +#!/bin/bash
 +#SBATCH -p medium
 +#SBATCH -t 24:00:00
 +#SBATCH -c 10
 +#SBATCH -N 1
 +#SBATCH --signal=B:12@600
 +
 +module load python
 +cd $TMP_LOCAL
 +
 +trap 'mkdir -p ${HOME}/job_${SLURM_JOBID}; cp -af ${TMP_LOCAL}/* ${HOME}/job_${SLURM_JOBID}/; exit 12' 12
 +
 +./big_calculation.py &
 +wait
 +</code>