How to connect to a running sbatch job

Slurm allows you to connect, with a standard terminal shell, to the nodes where your jobs are running. From those shells you can use the same computational resources allocated to your jobs. You can also run a command directly from a login node against the resources allocated to a separate, already running job.

This feature can be useful for:

  • monitoring your running sbatch jobs in real time (you can use commands like "top" or "ps" to check your processes, or inspect the contents of the local /tmp folder on the compute node)

  • using resources allocated to a running sbatch job that you know are idle at a given time, for example GPU or CPU computing power that is temporarily unused

  • getting immediate access to some computational resources when you urgently need them. For example, you might be running a GPU job that leaves enough VRAM (GPU memory) free on the card to run a separate process; see the example after this list.
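
As an illustration of the last point, you can run a one-off command such as nvidia-smi against a running job's allocation to check how much VRAM is free (the srun --jobid syntax is explained below; the jobid 12345678 is a placeholder):

srun --jobid=12345678 nvidia-smi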

The syntax to use is srun --jobid=<jobid_number> followed by your desired command.

The shell or commands started via srun --jobid are constrained to use only the resources allocated to that job.
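
For example, to open an interactive shell inside the allocation of job 12345678 (a placeholder jobid), or to run a single command there:

srun --jobid=12345678 --pty /bin/bash
srun --jobid=12345678 ps -ef

Note that on some Slurm versions the new job step may wait if the running step is already using all allocated resources; in that case adding the --overlap flag lets the new step share resources with the running steps.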

Below is an example of how you can connect from a login node to a running sbatch job.

First, we submit a standard sbatch job, in this example a script called my_sbatch_job.sh:

#!/bin/bash
#SBATCH -c 1           # Number of cores requested
#SBATCH -t 4:00:00     # Wall-time
#SBATCH -p priority    # Partition
#SBATCH --mem=2G       # Memory per node

# your sbatch job commands here
python3 python_sleep.py
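
The file python_sleep.py is not shown here; a minimal stand-in that simply keeps the job alive long enough to connect to it could be created like this (hypothetical content):

cat > python_sleep.py <<'EOF'
import time
time.sleep(3600)   # keep the job running for one hour
EOF

After submitting the script and finding its jobid with squeue, you can connect to the job as described above (12345678 is a placeholder):

sbatch my_sbatch_job.sh
squeue -u $USER          # note the JOBID of my_sbatch_job.sh
srun --jobid=12345678 --pty /bin/bash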