GPU Resources in O2
There are 34 GPU nodes with a total of 147 GPU cards available on the O2 cluster. The nodes are accessible through three GPU partitions: gpu, gpu_quad, and gpu_requeue.
The gpu partition includes 32 double precision GPU cards: 16 Tesla K80 cards with 12GB of VRAM, 8 Tesla M40 cards with 12GB or 24GB of VRAM, and 8 Tesla V100 cards with 16GB of VRAM.
The gpu_quad partition includes 71 GPUs: 47 single precision RTX 8000 cards with 48GB of VRAM, 8 single precision A40 cards with 48GB of VRAM, 24 double precision Tesla V100s cards with 32GB of VRAM, 4 double precision A100 cards with 80GB of VRAM, and 8 A100 MIG cards with 40GB of VRAM.
The gpu_requeue partition includes 44 GPUs: 28 single precision RTX 6000 cards with 24GB of VRAM, 2 double precision Tesla M40 cards with 12GB of VRAM, 2 A100 cards with 40GB of VRAM, and 12 A100 cards with 80GB of VRAM.
To list current information about all the nodes and cards available in a specific partition, use the command sinfo --Format=nodehost,available,memory,statelong,gres:40 -p <partition>, for example:
```
login02:~ sinfo --Format=nodehost,available,memory,statelong,gres:40 -p gpu,gpu_quad,gpu_requeue
HOSTNAMES           AVAIL     MEMORY    STATE      GRES
compute-g-16-175    up        257548    mixed      gpu:teslaM40:4,vram:24G
compute-g-16-176    up        257548    mixed      gpu:teslaM40:4,vram:12G
compute-g-16-177    up        257548    mixed      gpu:teslaK80:8,vram:12G
compute-g-16-194    up        257548    mixed      gpu:teslaK80:8,vram:12G
compute-g-16-254    up        373760    mixed      gpu:teslaV100:4,vram:16G
compute-g-16-255    up        373760    mixed      gpu:teslaV100:4,vram:16G
compute-g-17-145    up        770000    mixed      gpu:rtx8000:10,vram:48G
compute-g-17-146    up        770000    mixed      gpu:rtx8000:10,vram:48G
compute-g-17-147    up        383000    mixed      gpu:teslaV100s:4,vram:32G
compute-g-17-148    up        383000    mixed      gpu:teslaV100s:4,vram:32G
compute-g-17-149    up        383000    mixed      gpu:teslaV100s:4,vram:32G
compute-g-17-150    up        383000    mixed      gpu:teslaV100s:4,vram:32G
compute-g-17-151    up        383000    mixed      gpu:teslaV100s:4,vram:32G
compute-g-17-152    up        383000    mixed      gpu:teslaV100s:4,vram:32G
compute-g-17-153    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-154    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-155    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-156    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-157    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-158    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-159    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-160    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-161    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-162    up        500000    mixed      gpu:a40:4,vram:48G
compute-g-17-163    up        500000    mixed      gpu:a40:4,vram:48G
compute-g-17-164    up        500000    mixed      gpu:a100:4,vram:80G
compute-g-17-165    up        500000    mixed      gpu:a100.mig:8,vram:40G
compute-g-16-197    up        257548    mixed      gpu:teslaM40:2,vram:12G
compute-gc-17-245   up        383000    allocated  gpu:rtx6000:10,vram:24G
compute-gc-17-246   up        383000    mixed      gpu:rtx6000:10,vram:24G
compute-gc-17-247   up        383000    mixed      gpu:rtx6000:8,vram:24G
compute-gc-17-249   up        1000000   mixed      gpu:a100:2,vram:40G
compute-gc-17-252   up        1000000   mixed      gpu:a100:4,vram:80G
compute-gc-17-253   up        1000000   mixed      gpu:a100:4,vram:80G
compute-gc-17-254   up        1000000   idle       gpu:a100:4,vram:80G
```
GPU Partition
The gpu partition is open to all O2 users; to run jobs on the gpu partition, use the flag -p gpu.
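For reference, a minimal sketch of how such a job could be started interactively; the wall time, core count, and number of cards below are placeholder values to adjust to your needs:

```
# request an interactive shell on a gpu partition node with 1 GPU card,
# 2 cores, and a 1-hour wall time (values chosen only as an example)
srun --pty -p gpu --gres=gpu:1 -c 2 -t 1:00:00 /bin/bash
```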
...
The gpu_quad partition is open to any user working for a PI with a primary or secondary appointment in a pre-clinical department; to run jobs on the gpu_quad partition, use the flag -p gpu_quad. If you work at an affiliate institution but are collaborating with an on-Quad PI, please contact Research Computing to gain access.
...
The O2 cluster includes several contributed GPU cards, purchased and owned directly by HMS labs. When idle, those GPU resources are made available to all eligible users through the gpu_requeue partition. However, if a member of the purchasing lab submits a job that needs those cards, your job may be killed and resubmitted (requeued) at any time.
Note: Starting from July 1st, 2021, the gpu_requeue partition is available only to users working for a PI with a primary or secondary appointment in a pre-clinical department.
For detailed information about the gpu_requeue partition, see O2 GPU Re-Queue Partition.
...
The amount of GPU resources that can be used by each user at a given time is measured in GPU hours per user. Currently there is an active limit of 200 GPU hours for each user.
For example, at any time each user can allocate* at most 2 GPU cards for 100 hours, 20 GPU cards for 10 hours, or any other combination that does not exceed the total GPU hours limit. (If you use just 1 GPU card, the partition's maximum wall time will limit you to 120 hours.)
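To make the accounting concrete, a single job requesting 4 GPU cards with a 50-hour wall time would count as 4 × 50 = 200 GPU hours, which by itself reaches the limit.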
...
The gpu_quad and gpu_requeue partitions are not affected by these limits.
How to submit a GPU job
Warning: You must not reassign the variable CUDA_VISIBLE_DEVICES; the Slurm scheduler presets the correct value of CUDA_VISIBLE_DEVICES for each job, and altering the preset value will likely cause your job to run without a GPU card.
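If you want to check which cards Slurm assigned to your job, you can read the variable (without modifying it) from inside the job; assuming the NVIDIA driver utilities are available on the GPU node, nvidia-smi gives a similar view:

```
# print the GPU indexes preset by Slurm for this job (read-only; do not reassign)
echo $CUDA_VISIBLE_DEVICES

# optional: list the GPU devices visible to the job
nvidia-smi
```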
Most GPU applications require access to the CUDA Toolkit libraries, so before submitting a job you will likely need to load one of the available CUDA modules, for example:
```
login01:~ module load gcc/9.2.0 cuda/11.7
```
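To confirm which toolkit ended up on your path, you can query the CUDA compiler after loading the modules; the reported version should match the cuda module you loaded:

```
# show the version of the CUDA compiler provided by the loaded module
nvcc --version
```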
Note that if you are running a precompiled GPU application, for example a pip-installed TensorFlow, you will need to load the same version of CUDA that was used to compile your application (TensorFlow 2.2.0, for example, was compiled against CUDA 10.1).
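As a quick sanity check, assuming a TensorFlow 2.x installation and a job running on a GPU node with a matching CUDA module loaded, you can ask TensorFlow whether it sees the allocated card; an empty list usually indicates a CUDA version mismatch or a job submitted without a GPU request:

```
# list the GPU devices visible to TensorFlow from within the job
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```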
...
```
login01:sbatch gpujob.sh
Submitted batch job 6900310
```
where gpujob.sh contains
```
#!/bin/bash

#SBATCH -c 4
#SBATCH -t 6:00:00
#SBATCH -p gpu_quad
#SBATCH --gres=gpu:2

module load gcc/9.2.0
module load cuda/11.7

./deviceQuery    # this is just an example
```
It is also possible to request a specific type of GPU card with the --gres flag. For example, --gres=gpu:teslaM40:3 can be used to request 3 Tesla M40 GPU cards.
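As a sketch, the corresponding lines in a batch script could look as follows, here assuming one RTX 8000 card on the gpu_quad partition (the card type names are the ones shown in the GRES column of the sinfo output above):

```
#SBATCH -p gpu_quad
#SBATCH --gres=gpu:rtx8000:1
```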
How to compile and run CUDA programs
...
```
#!/bin/bash

#SBATCH -c 4
#SBATCH -t 6:00:00
#SBATCH -p gpu_quad
#SBATCH --gres=gpu:2

module load gcc/9.2.0
module load cuda/11.7

/n/cluster/bin/job_gpu_monitor.sh &

./deviceQuery    # this is just an example
```
...