Note
From July 1st 2021 the gpu_requeue partition is available only to users working for a PI with a primary or secondary appointment in a pre-clinical department.

...

This partition currently comprises:

28 Nvidia RTX6000 single precision cards

...

2 Nvidia A100 cards with 40GB of VRAM
12 Nvidia A100 cards with 80GB of VRAM
2 Nvidia M40 Tesla cards

...

Most of the cards in this partition (the RTX6000 cards) are not ideal for GPU double precision jobs; if you need to run in double precision you should add the flag --constraint=gpu_doublep when submitting your jobs.
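As an illustration of the flag mentioned above, the sketch below assembles a submission command that requests a double-precision-capable card. The job script name my_job.sh and the --gres=gpu:1 request are assumptions made for the example, not values taken from this page.

```shell
#!/bin/bash
# Hypothetical job script name; --gres=gpu:1 is an assumed resource request.
job_script="my_job.sh"

# --constraint=gpu_doublep restricts the job to nodes tagged for double precision.
sbatch_cmd="sbatch -p gpu_requeue --requeue --gres=gpu:1 --constraint=gpu_doublep $job_script"

# Print the command for review; run it on a login node to submit for real.
echo "$sbatch_cmd"
```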

To see the currently available resources under the gpu_requeue partition you can use the command below:

sinfo --Format=nodehost,cpusstate,memory,statelong,gres -p gpu_requeue
HOSTNAMES           CPUS(A/I/O/T)       MEMORY              STATE               GRES                
compute-g-16-197    0/20/0/20           257548              idle                gpu:teslaM40:2,vram:
compute-gc-17-245   8/40/0/48           383000              mixed               gpu:rtx6000:10,vram:
compute-gc-17-246   18/30/0/48          383000              mixed               gpu:rtx6000:10,vram:
compute-gc-17-247   0/48/0/48           383000              idle                gpu:rtx6000:8,vram:2
compute-gc-17-249   0/48/0/48           1000000             idle                gpu:a100:2,vram:40G 
compute-gc-17-252   1/63/0/64           1000000             mixed               gpu:a100:4,vram:80G 
compute-gc-17-253   0/64/0/64           1000000             idle                gpu:a100:4,vram:80G 
compute-gc-17-254   0/64/0/64           1000000             idle                gpu:a100:4,vram:80G 
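
To summarize output like the listing above, the sketch below totals GPUs per card type from the GRES column. The function name count_gpus_by_type is hypothetical, and the sketch assumes the GRES field is the last whitespace-separated field on each line and has the form gpu:TYPE:COUNT, as in the sample output.

```shell
#!/bin/bash
# Sum GPUs per card type from sinfo-style lines whose last field looks like
# "gpu:a100:4,vram:80G". Lines without a gpu:TYPE:COUNT entry (such as the
# header) are ignored.
count_gpus_by_type() {
  awk '{
    n = split($NF, parts, ",")
    for (i = 1; i <= n; i++) {
      if (split(parts[i], f, ":") >= 3 && f[1] == "gpu") total[f[2]] += f[3]
    }
  }
  END { for (t in total) print t, total[t] }' | sort
}

# Demo on two rows copied from the sample output. On the cluster you could
# instead pipe in: sinfo --noheader --Format=nodehost,gres -p gpu_requeue
printf '%s\n' \
  'compute-g-16-197    gpu:teslaM40:2,vram:' \
  'compute-gc-17-253   gpu:a100:4,vram:80G ' | count_gpus_by_type
```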

How Preemption Works

The labs that purchased these nodes have preemption priority on their own hardware. If the nodes are full and a researcher from one of those labs submits a job, one or more GPU jobs running on the gpu_requeue partition may be killed and requeued to free resources for that lab's job. That is, the gpu_requeue job is cancelled, as if you had run the scancel command, and then resubmitted (provided you originally submitted it with the --requeue flag).
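
A job can check whether it is running for the first time or after being requeued. The sketch below relies on the SLURM_RESTART_COUNT environment variable, which Slurm sets once a job has been restarted or requeued. The function name describe_run is hypothetical.

```shell
#!/bin/bash
# Report whether this run is a fresh start or a restart after a requeue.
# SLURM_RESTART_COUNT is unset on the first run, so default it to 0.
describe_run() {
  if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
    echo "requeued ${SLURM_RESTART_COUNT} time(s)"
  else
    echo "first run"
  fi
}

describe_run
```

A job script could branch on this result, for example to skip one-time setup when the job restarts after preemption.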

...

How to Submit to the gpu_requeue Partition

To submit jobs to gpu_requeue, specify the partition with the -p flag and add the --requeue flag. Without the --requeue flag, jobs will still be killed when preempted but will not be automatically requeued.
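
A minimal batch script along these lines might look like the sketch below. Only -p gpu_requeue and --requeue come from the text above. The GPU count, time limit, memory request, output file name, and the program name my_gpu_program are assumptions chosen for illustration.

```shell
#!/bin/bash
# Submit to the gpu_requeue partition.
#SBATCH -p gpu_requeue
# Ask Slurm to requeue the job automatically if it is preempted.
#SBATCH --requeue
# Assumed request: one GPU, 2 hours (D-HH:MM), 8 GB of memory.
#SBATCH --gres=gpu:1
#SBATCH -t 0-02:00
#SBATCH --mem=8G
# Hypothetical output file name; %j expands to the job ID.
#SBATCH -o slurm-%j.out

# Record where the job landed; the values fall back to "none" outside Slurm.
job_info="job=${SLURM_JOB_ID:-none} partition=${SLURM_JOB_PARTITION:-none}"
echo "$job_info"

# Replace with the real workload (hypothetical program name):
# ./my_gpu_program
```

Saved under a name of your choice, the script would be submitted by passing that file to sbatch.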

...

How to Efficiently Use the gpu_requeue Partition

IMPORTANT:

In order to work properly, any job submitted to gpu_requeue that writes intermediate files must be restartable either from the beginning (overwriting partially completed files) or from the last saved checkpoint. Researchers are responsible for choosing jobs that can be run in this way.
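
One way to make a job restartable from its last checkpoint is sketched below. It is a generic pattern, not a method prescribed by this page. The file name checkpoint.txt, the step count, and the function name run_with_checkpoints are all hypothetical, and the echo line stands in for the real work.

```shell
#!/bin/bash
# Hypothetical checkpoint file and number of work steps.
checkpoint_file="${CHECKPOINT_FILE:-checkpoint.txt}"
total_steps="${TOTAL_STEPS:-5}"

run_with_checkpoints() {
  step=0
  # Resume after the last completed step if a checkpoint exists.
  if [ -f "$checkpoint_file" ]; then
    step=$(cat "$checkpoint_file")
    step=${step:-0}
  fi
  while [ "$step" -lt "$total_steps" ]; do
    step=$((step + 1))
    echo "processing step $step"   # stand-in for the real work
    # Write to a temporary file, then rename it, so that a preemption
    # cannot leave a half-written checkpoint behind.
    echo "$step" > "${checkpoint_file}.tmp"
    mv "${checkpoint_file}.tmp" "$checkpoint_file"
  done
}

run_with_checkpoints
```

If the job is killed after step 3 and requeued, the next run reads 3 from the checkpoint file and continues with step 4 instead of starting over.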

...
