O2 GPU Re-Queue Partition

Since July 1, 2021, the gpu_requeue partition has been available only to users working for a PI with a primary or secondary appointment in a pre-clinical department.


The O2 cluster includes a set of lab-contributed GPU nodes. These nodes were purchased by individual labs but are made available to the rest of the O2 community through the GPU re-queue partition, gpu_requeue.

This partition currently comprises:

  • 28 Nvidia RTX6000    24GB VRAM  Single Precision
  • 12 Nvidia L40S       48GB VRAM  Single Precision
  •  2 Nvidia L40        48GB VRAM  Single Precision
  •  2 Nvidia Tesla M40  12GB VRAM  Single Precision
  • 12 Nvidia A100       80GB VRAM  Double Precision
  •  2 Nvidia A100       40GB VRAM  Double Precision

Most of the cards in this partition (those listed as Single Precision above, including the RTX6000 cards) are not well suited to double-precision GPU jobs. If your job needs double precision, add the flag --constraint=gpu_doublep when submitting it.
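
As an illustration, a job script that requests a double-precision card might look like the sketch below. The time limit and the GPU report at the end are assumptions added for the example; only the --constraint flag comes from the text above.

```shell
#!/bin/bash
#SBATCH -p gpu_requeue
#SBATCH -t 1:00:00                  # example time limit (assumption)
#SBATCH --requeue
#SBATCH --gres=gpu:1
#SBATCH --constraint=gpu_doublep    # only run on double-precision capable cards

# Record which GPU model the job landed on; fall back to "unknown"
# if nvidia-smi is missing or fails.
gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null) || gpu_name="unknown"
echo "Running on GPU: ${gpu_name}"
```

Logging the GPU model at the start of the job makes it easy to confirm afterwards that the constraint was honored.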

To see the currently available resources under the gpu_requeue partition you can use the command below:

sinfo --Format=nodehost,cpusstate,memory,statelong,gres:35 -p gpu_requeue
HOSTNAMES          CPUS(A/I/O/T)  MEMORY   STATE  GRES
compute-gc-17-241  20/92/0/112    1000000  mixed  gpu:l40s:4,vram:no_consume:48G
compute-gc-17-242  4/28/0/32      500000   mixed  gpu:l40s:4,vram:no_consume:48G
compute-gc-17-243  7/25/0/32      500000   mixed  gpu:l40s:4,vram:no_consume:48G
compute-gc-17-244  20/44/0/64     1030000  mixed  gpu:l40:2,vram:no_consume:45G
compute-gc-17-245  44/4/0/48      383000   mixed  gpu:rtx6000:10,vram:no_consume:24G
compute-gc-17-246  40/8/0/48      383000   mixed  gpu:rtx6000:10,vram:no_consume:24G
compute-gc-17-247  42/6/0/48      383000   mixed  gpu:rtx6000:8,vram:no_consume:24G
compute-gc-17-252  32/32/0/64     1000000  mixed  gpu:a100:4,vram:no_consume:80G
compute-gc-17-253  32/32/0/64     1000000  mixed  gpu:a100:4,vram:no_consume:80G
compute-gc-17-254  38/26/0/64     1000000  mixed  gpu:a100:4,vram:no_consume:80G
compute-g-16-197   0/20/0/20      257548   idle   gpu:teslaM40:2,vram:no_consume:12G
compute-gc-17-249  0/48/0/48      1000000  idle   gpu:a100:2,vram:no_consume:40G
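
To get a quick per-model GPU total from a listing like the one above, the output can be piped through a small awk filter. This is a sketch, not an O2-provided tool: the function name count_gpus_by_model is hypothetical, and it assumes the GRES field has the form gpu:<model>:<count>,... shown in the listing.

```shell
#!/bin/bash
# Sum the GPU counts per model from sinfo output read on stdin.
# Cards that share a model name (for example the 40GB and 80GB A100)
# are added together.
count_gpus_by_model() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^gpu:[^:]+:[0-9]+/) {
                    split($i, gres, ",")        # drop the trailing vram entry
                    split(gres[1], part, ":")   # part[2] = model, part[3] = count
                    total[part[2]] += part[3]
                }
            }
        }
        END { for (m in total) print m, total[m] }
    ' | sort
}

# Only query the scheduler when sinfo is actually available.
if command -v sinfo >/dev/null 2>&1; then
    sinfo --Format=nodehost,cpusstate,memory,statelong,gres:35 -p gpu_requeue | count_gpus_by_model
fi
```

The totals count all GPUs configured on the nodes, not only the idle ones; the CPUS(A/I/O/T) and STATE columns remain the place to check how busy each node is.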

How Preemption Works

The labs that purchased these nodes have preemption priority on their own hardware. If the nodes are full and a researcher from one of those labs submits a job, one or more GPU jobs running in the gpu_requeue partition may be killed and re-queued to free resources for the lab's job. That is, the gpu_requeue job will be cancelled, as if you had run the scancel command, and resubmitted (provided you originally submitted it with the --requeue flag).

Please note that this can happen at any time, without warning, and more than once to the same job.
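
Since a job can be preempted repeatedly, it can be useful for the job script to log whether it is starting for the first time or after a requeue. The sketch below relies on the standard Slurm variable SLURM_RESTART_COUNT, which Slurm sets on restarted or requeued jobs; the helper name restart_count is hypothetical, and the behavior should be verified on O2 before relying on it.

```shell
#!/bin/bash
# Print how many times Slurm has restarted this job (0 on the first run).
# SLURM_RESTART_COUNT is unset until the job has been requeued or restarted.
restart_count() {
    echo "${SLURM_RESTART_COUNT:-0}"
}

if [ "$(restart_count)" -gt 0 ]; then
    echo "Job ${SLURM_JOB_ID:-unknown} has been restarted $(restart_count) time(s)."
else
    echo "Job ${SLURM_JOB_ID:-unknown} is running for the first time."
fi
```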

Preempted jobs appear in the Slurm accounting database (queried with sacct) with the state PR or PREEMPTED until they are requeued. The scheduler does not assign a new job ID to a preempted job; it resubmits the job under the same job ID. To see the full history of a preempted and requeued job, add the -D flag to your sacct query.
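
For example, the -D query can be wrapped in a small helper. This is a sketch: the function name job_history and the selection of output columns are choices made for illustration, and the job ID is whatever ID sbatch reported for your job.

```shell
#!/bin/bash
# Show every recorded run of a job, including attempts that were preempted.
# -D (--duplicates) keeps the earlier records that share the same job ID.
# Usage: job_history <jobid>
job_history() {
    sacct -D -j "$1" --format=JobID,State,Start,End,Elapsed,NodeList
}
```

Each preempted attempt should appear as its own set of rows under the same job ID, followed by the requeued run.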

Jobs that fail for other reasons (such as running longer than the reserved time, an error in the code or data, or a node failure) will not be requeued automatically.

How to Submit to the gpu_requeue Partition

To submit jobs to gpu_requeue, specify the partition with the -p flag and add the --requeue flag. Without --requeue, jobs can still be preempted and killed, but they will not be requeued automatically.

For example:

#!/bin/bash
#SBATCH -p gpu_requeue
#SBATCH -t 6:00:00
#SBATCH --requeue
#SBATCH --gres=gpu:1


<your job goes here>


The gpu_requeue partition currently has a maximum wall time of 24 hours, and its jobs consume fairshare at 10% of the rate of jobs in the regular gpu partition.

How to Efficiently Use the gpu_requeue Partition

IMPORTANT: 

To work properly, any job submitted to gpu_requeue that writes intermediate files must be restartable either from the beginning (overwriting partially completed files) or from the last saved checkpoint. Researchers are responsible for choosing jobs that can be run this way.

  1. The gpu_requeue partition works well for interactive GPU jobs while developing or testing code, when an interruption won't cause a significant loss of data or work.
  2. Some researchers run large batches of "embarrassingly parallel", independent jobs, which take less than a couple of hours each and can be performed in any order. Using the gpu_requeue partition provides access to substantial extra resources, and not much work is lost if a small percentage of these jobs are killed. The requeue functionality also means that preempted jobs are restarted automatically.
  3. Non-interactive jobs can be built to restart automatically from the last saved checkpoint. A properly designed job follows an algorithm that determines automatically whether a restart is possible, for example:


if a <restart_file> is present
  restart from <restart_file>
else
  start from beginning

while job is executing
  periodically save a <restart_file>

if job completes
  remove temporary files, including <restart_file>
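
The restart logic above can be sketched as a job script. This is a minimal illustration under stated assumptions, not a drop-in implementation: the names RESTART_FILE, TOTAL_STEPS, do_step, and run_with_checkpoints are hypothetical, and do_step is a placeholder for one unit of real work (for example, one training epoch) whose state you would save alongside the step counter.

```shell
#!/bin/bash
#SBATCH -p gpu_requeue
#SBATCH -t 6:00:00
#SBATCH --requeue
#SBATCH --gres=gpu:1

# Hypothetical settings: adjust the file name and step count to your workflow.
RESTART_FILE="${RESTART_FILE:-restart.chk}"
TOTAL_STEPS="${TOTAL_STEPS:-10}"

# Placeholder for one unit of real work.
do_step() {
    echo "running step $1"
}

run_with_checkpoints() {
    # Resume after the last completed step if a restart file is present.
    if [ -f "$RESTART_FILE" ]; then
        step=$(cat "$RESTART_FILE")
        echo "restarting after step $step"
    else
        step=0
        echo "starting from the beginning"
    fi

    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        step=$((step + 1))
        do_step "$step"
        # Write to a temporary file and rename it, so a job killed mid-write
        # never leaves a truncated restart file behind.
        echo "$step" > "$RESTART_FILE.tmp"
        mv "$RESTART_FILE.tmp" "$RESTART_FILE"
    done

    # The job completed: remove the restart file.
    rm -f "$RESTART_FILE"
}

run_with_checkpoints
```

Because a requeued job runs the same script again, the check for the restart file at the top is what turns a preemption into a resume rather than a full rerun. How often to checkpoint is a trade-off between the cost of saving state and the amount of work you are willing to repeat.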