
...


Note

From July 1st, 2021, the gpu_requeue partition is available only to users working for a PI with a primary or secondary appointment in a pre-clinical department.


The O2 cluster includes a set of lab-contributed GPU nodes. These nodes were purchased by individual labs but are made available to the rest of the O2 community through the requeueable GPU partition called gpu_requeue.

This partition currently comprises:

  • 28 Nvidia RTX6000 single precision cards
  • 2 Nvidia A100 cards with 40GB of VRAM
  • 12 Nvidia A100 cards with 80GB of VRAM
  • 2 Nvidia M40 Tesla cards

Most of the cards in this partition (the RTX6000 cards) are not ideal for double-precision GPU jobs; if you need to run in double precision, add the flag --constraint=gpu_doublep when submitting your jobs.
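
For example, the constraint can be added directly on the command line together with the partition and requeue flags described on this page; the script name and the single-GPU request below are illustrative placeholders.

Code Block
# Sketch: request a double-precision-capable GPU in the gpu_requeue partition
# (my_job.sh is a placeholder for your own batch script)
sbatch -p gpu_requeue --requeue --constraint=gpu_doublep --gres=gpu:1 my_job.sh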

To see the currently available resources under the gpu_requeue partition you can use the command below:

Code Block
sinfo --Format=nodehost,cpusstate,memory,statelong,gres -p gpu_requeue
HOSTNAMES           CPUS(A/I/O/T)       MEMORY              STATE               GRES                
compute-g-16-197    0/20/0/20           257548              idle                gpu:teslaM40:2,vram:
compute-gc-17-245   8/40/0/48           383000              mixed               gpu:rtx6000:10,vram:
compute-gc-17-246   18/30/0/48          383000              mixed               gpu:rtx6000:10,vram:
compute-gc-17-247   0/48/0/48           385218              idle                gpu:rtx6000:8,vram:
compute-gc-17-249   0/48/0/48           1000000             idle                gpu:a100:2,vram:40G
compute-gc-17-252   1/63/0/64           1000000             mixed               gpu:a100:4,vram:80G
compute-gc-17-253   0/64/0/64           1000000             idle                gpu:a100:4,vram:80G
compute-gc-17-254   0/64/0/64           1000000             idle                gpu:a100:4,vram:80G
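
If a job needs a specific card model, the GRES type names shown in the output above (for example a100 or rtx6000) can typically be used in a typed GRES request. This relies on standard Slurm typed-GRES syntax rather than anything stated on this page, so treat it as a sketch.

Code Block
# Sketch: request one A100 card by its GRES type name
# (my_gpu_job.sh is a placeholder for your own batch script)
sbatch -p gpu_requeue --requeue --gres=gpu:a100:1 my_gpu_job.sh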

How Preemption Works

The labs that purchased these nodes have preemption priority on their own hardware. If those nodes are full and a researcher from one of those labs submits a job, one or more GPU jobs running in the gpu_requeue partition might be killed and requeued in order to free resources for that lab's job. That is, the gpu_requeue job is cancelled, as if you had run the scancel command, and re-submitted (as long as you initially submitted it with the flag --requeue).

...

How to Submit to the gpu_requeue Partition

To submit jobs to the gpu_requeue partition you need to specify that partition with the flag -p and add the flag --requeue. Without the --requeue flag, preempted jobs will still be killed but will not be automatically requeued.
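
A minimal submission script might look like the sketch below. The -p gpu_requeue and --requeue flags come from this page, and --gres=gpu:1 is the standard Slurm way to request a single GPU; the time, memory, output file, module names, and application are illustrative placeholders.

Code Block
#!/bin/bash
#SBATCH -p gpu_requeue            # lab-contributed GPU partition
#SBATCH --requeue                 # allow the job to be requeued after preemption
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH -t 0-06:00                # walltime (placeholder)
#SBATCH --mem=8G                  # memory (placeholder)
#SBATCH -o myjob_%j.out           # stdout file (placeholder)

# Placeholder commands: load whatever modules your application needs
# and run your own GPU program here.
module load gcc cuda
./my_gpu_program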

...

How to Efficiently Use the gpu_requeue Partition

IMPORTANT: 

In order to work properly, any job submitted to gpu_requeue that writes intermediate files must be restartable either from the beginning (overwriting partially completed files) or from the last saved checkpoint. Researchers are responsible for choosing jobs that can be run in this way.
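
One common pattern for making a job requeue-safe is to have the batch script look for the most recent checkpoint and resume from it, so that a requeued job picks up where the preempted run left off. The sketch below assumes your application writes a checkpoint file and offers some resume option; the file name, program name, and --resume flag are placeholders for your own code, not O2 features.

Code Block
#!/bin/bash
#SBATCH -p gpu_requeue
#SBATCH --requeue
#SBATCH --gres=gpu:1

# Hypothetical checkpoint logic: my_training_app and its --resume option
# stand in for whatever restart mechanism your own application provides.
CKPT=checkpoint_latest.pt
if [ -f "$CKPT" ]; then
    ./my_training_app --resume "$CKPT"
else
    ./my_training_app
fi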

...