The O2 cluster includes a set of lab-contributed GPU nodes. These nodes were purchased by individual labs, but they are made available to the rest of the O2 community through the requeueable GPU partition gpu_requeue.

This partition currently comprises 28 Nvidia RTX6000 single-precision cards and 2 Nvidia Tesla M40 cards. It is not suitable for double-precision GPU jobs.

...

The labs that purchased these nodes have preemption priority on their own hardware. If the nodes are full and a researcher from one of those labs submits a job, one or more GPU jobs running in the gpu_requeue partition may be killed and requeued in order to free resources for that lab's job. That is, the gpu_requeue job is cancelled, as if you had run the scancel command, and resubmitted (provided you originally submitted it with the --requeue flag).

Please note that this may happen at any time, without warning, and possibly multiple times for the same job.

Preempted jobs will show in the Slurm sacct database as PR or PREEMPTED until they are requeued. The scheduler does not assign a new jobid to a preempted job; it resubmits the job with the same jobid. To see the full history of a preempted and requeued job, add the -D flag to the sacct query.
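
For example, assuming a hypothetical jobid 12345678, the duplicate records created by each requeue can be listed with:

    sacct -D -j 12345678 --format=JobID,Partition,State,Start,End,Elapsed

Without -D, sacct shows only the most recent incarnation of the job; with -D, every preempted and requeued run appears as a separate entry under the same jobid.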

...

How to Submit to the gpu_requeue Partition

To submit jobs to gpu_requeue you need to select that partition with the -p flag and add the --requeue flag. Without --requeue, jobs can still be killed by preemption but will not be automatically requeued.
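
As a minimal sketch of such a submission script (the GPU count, walltime, memory, module name, and application command are placeholders to adapt to your own job):

    #!/bin/bash
    #SBATCH -p gpu_requeue            # submit to the requeueable GPU partition
    #SBATCH --requeue                 # allow the job to be requeued after preemption
    #SBATCH --gres=gpu:1              # request one GPU
    #SBATCH -t 0-06:00                # walltime (days-hours:minutes)
    #SBATCH --mem=8G                  # memory request

    module load cuda                  # placeholder; load whatever your application needs
    ./my_gpu_application              # placeholder for your GPU program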

...

How to Efficiently Use the gpu_requeue Partition

IMPORTANT: 

In order to work properly, any job submitted to gpu_requeue that writes intermediate files must be restartable, either from the beginning (overwriting partially completed files) or from its last saved checkpoint. Researchers are responsible for choosing jobs that can be run in this way.
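
As a sketch of the restart-from-checkpoint pattern (the checkpoint file name and the application's --checkpoint-to/--resume-from options below are hypothetical, not part of O2 or Slurm; substitute whatever mechanism your own software provides):

    #!/bin/bash
    #SBATCH -p gpu_requeue
    #SBATCH --requeue
    #SBATCH --gres=gpu:1
    #SBATCH -t 0-12:00

    CKPT=checkpoint.dat                              # hypothetical checkpoint file written by the application

    if [ -f "$CKPT" ]; then
        # a requeued run finds the checkpoint and resumes from it
        ./my_gpu_application --resume-from "$CKPT"
    else
        # a fresh run starts from the beginning and writes checkpoints as it goes
        ./my_gpu_application --checkpoint-to "$CKPT"
    fi

Because the script behaves the same whether it is a first submission or a requeued run, a preemption costs only the work done since the last checkpoint.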

...