Note |
---|
From July 1st 2021 the gpu_requeue partition is available only to users working for a PI with a primary or secondary appointment in a pre-clinical department. |
...
This partition currently comprises:
- 28 Nvidia RTX6000 single precision cards
...
- 2 Nvidia A100 cards with 40GB of VRAM
- 12 Nvidia A100 cards with 80GB of VRAM
- 2 Nvidia M40 Tesla cards
...
Most of the cards in this partition (the RTX6000 cards) are not ideal for GPU double precision jobs; if you need to run in double precision you should add the flag --constraint=gpu_doublep when submitting your jobs.
To see the resources currently available in the gpu_requeue partition, you can use the command below:
Code Block |
---|
sinfo --Format=nodehost,cpusstate,memory,statelong,gres -p gpu_requeue |
HOSTNAMES          CPUS(A/I/O/T)  MEMORY   STATE  GRES |
compute-g-16-197   0/20/0/20      257548   idle   gpu:teslaM40:2,vram: |
compute-gc-17-245  8/40/0/48      383000   mixed  gpu:rtx6000:10,vram: |
compute-gc-17-246  18/30/0/48     383000   mixed  gpu:rtx6000:10,vram: |
compute-gc-17-247  0/48/0/48      383000   idle   gpu:rtx6000:8,vram:2 |
compute-gc-17-249  0/48/0/48      1000000  idle   gpu:a100:2,vram:40G |
compute-gc-17-252  1/63/0/64      1000000  mixed  gpu:a100:4,vram:80G |
compute-gc-17-253  0/64/0/64      1000000  idle   gpu:a100:4,vram:80G |
compute-gc-17-254  0/64/0/64      1000000  idle   gpu:a100:4,vram:80G |
How Preemption Works
The labs that purchased these nodes have preemption priority on their own hardware. If the nodes are full and a researcher from one of those labs submits a job, one or more GPU jobs running on the gpu_requeue partition might be killed and requeued to free resources for the lab's job. That is, the gpu_requeue job will be cancelled, as if you had run the scancel command, and resubmitted (provided you initially submitted it with the flag --requeue).
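A job script can tell a first run from a rerun after preemption by checking how many times Slurm has restarted it. The sketch below relies on the SLURM_RESTART_COUNT environment variable, which Slurm sets for restarted or requeued jobs; the function name was_requeued is illustrative only and not part of any cluster tooling.

```shell
#!/bin/bash
# Succeeds (exit status 0) only when this run follows at least one restart.
# SLURM_RESTART_COUNT is treated as 0 when unset, which is the case on a
# job's first run or when the script is executed outside Slurm.
was_requeued() {
    local restarts="${SLURM_RESTART_COUNT:-0}"
    [ "$restarts" -gt 0 ]
}

if was_requeued; then
    echo "Job was restarted ${SLURM_RESTART_COUNT} time(s); resuming"
else
    echo "First run of this job"
fi
```

Branching on this check lets the same script either start fresh or pick up from previously saved output.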
...
How to Submit to the gpu_requeue Partition
To submit jobs to gpu_requeue you need to specify that partition with the flag -p and add the flag --requeue. Without the --requeue flag, jobs will still be killed when preempted but will not be automatically requeued.
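A minimal batch script sketch for such a submission follows. Only -p gpu_requeue and --requeue come from the text above; the GPU request, time limit, memory, output file name, and the describe_node payload are placeholder assumptions to adapt to your own job.

```shell
#!/bin/bash
#SBATCH -p gpu_requeue        # partition described on this page
#SBATCH --requeue             # resubmit automatically if preempted
#SBATCH --gres=gpu:1          # assumed: one GPU of any type
#SBATCH -t 0-02:00            # assumed: 2-hour wall-time limit
#SBATCH --mem=8G              # assumed memory request
#SBATCH -o job_%j.out         # output file; %j expands to the job ID

# Hypothetical placeholder payload: report where the job landed and,
# if nvidia-smi is present, which GPUs the job can see.
describe_node() {
    echo "Running on $(hostname)"
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,memory.total --format=csv \
            || echo "nvidia-smi could not query GPUs"
    else
        echo "nvidia-smi not found on this node"
    fi
}

describe_node
```

Saved under a name such as my_job.sh (hypothetical), the script would be submitted with sbatch my_job.sh. For double precision work, --constraint=gpu_doublep can be added as a further #SBATCH line.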
...
How to Efficiently Use the gpu_requeue Partition
IMPORTANT:
To work properly, any job submitted to gpu_requeue that writes intermediate files must be restartable either from the beginning (overwriting partially completed files) or from the last saved checkpoint. Researchers are responsible for choosing jobs that can be run in this way.
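One way to meet this requirement is to record progress in a checkpoint file and resume from it after a requeue. The sketch below illustrates the idea in plain bash; the file name checkpoint.txt, the step count, and the run_step function are hypothetical stand-ins for a real workload.

```shell
#!/bin/bash
# Hypothetical checkpoint/restart loop for a requeue-safe job.
CHECKPOINT="${CHECKPOINT:-checkpoint.txt}"   # assumed checkpoint location
TOTAL_STEPS="${TOTAL_STEPS:-5}"              # assumed amount of work

# Stand-in for one unit of real work (for example, one training epoch).
run_step() {
    echo "step $1 done"
}

# Print the step to resume from: one past the last completed step,
# or 1 if no checkpoint has been written yet.
next_step() {
    if [ -s "$CHECKPOINT" ]; then
        echo $(( $(cat "$CHECKPOINT") + 1 ))
    else
        echo 1
    fi
}

run_job() {
    local step
    step=$(next_step)
    while [ "$step" -le "$TOTAL_STEPS" ]; do
        run_step "$step"
        # Write to a temporary file and rename it, so that a kill in the
        # middle of the write should not leave a truncated checkpoint.
        echo "$step" > "$CHECKPOINT.tmp"
        mv "$CHECKPOINT.tmp" "$CHECKPOINT"
        step=$(( step + 1 ))
    done
}

run_job
```

If the job is killed after step 3 and requeued, the next run reads 3 from the checkpoint and continues with step 4 instead of repeating completed work.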
...