GPU Resources in O2

There are 34 GPU nodes with a total of 147 GPU cards available on the O2 cluster. The nodes are accessible in three GPU partitions: gpu, gpu_quad, and gpu_requeue.

The gpu partition includes 32 double precision GPU cards: 16 Tesla K80 with 12GB of VRAM, 8 Tesla M40 with 12GB and 24GB of VRAM, and 8 Tesla V100 with 16GB of VRAM.

The gpu_quad partition includes 71 GPUs: 47 single precision RTX 8000 cards with 48GB of VRAM, 8 single precision A40 cards with 48GB of VRAM, 24 double precision Tesla V100s cards with 32GB of VRAM, 4 double precision A100 cards with 80GB of VRAM, and 8 A100 MIG cards with 40GB of VRAM.

The gpu_requeue partition includes 44 GPUs: 28 single precision RTX 6000 cards with 24GB of VRAM, 2 double precision Tesla M40 cards, 2 A100 cards with 40GB of VRAM, and 12 A100 cards with 80GB of VRAM.

To list current information about all the nodes and cards available for a specific partition, use the command sinfo --Format=nodehost,available,memory,statelong,gres:40 -p <partition>, for example:

Code Block
languagetext
login02:~ sinfo  --Format=nodehost,available,memory,statelong,gres:40 -p gpu,gpu_quad,gpu_requeue
HOSTNAMES           AVAIL               MEMORY              STATE               GRES
compute-g-16-175    up                  257548              mixed               gpu:teslaM40:4,vram:24G
compute-g-16-176    up                  257548              mixed               gpu:teslaM40:4,vram:12G
compute-g-16-177    up                  257548              mixed               gpu:teslaK80:8,vram:12G
compute-g-16-194    up                  257548              mixed               gpu:teslaK80:8,vram:12G
compute-g-16-254    up                  373760              mixed               gpu:teslaV100:4,vram:16G
compute-g-16-255    up                  373760              mixed               gpu:teslaV100:4,vram:16G
compute-g-17-145    up                  770000              mixed               gpu:rtx8000:10,vram:48G
compute-g-17-146    up                  770000              mixed               gpu:rtx8000:10,vram:48G
compute-g-17-147    up                  383000              mixed               gpu:teslaV100s:4,vram:32G
compute-g-17-148    up                  383000              mixed               gpu:teslaV100s:4,vram:32G
compute-g-17-149    up                  383000              mixed               gpu:teslaV100s:4,vram:32G
compute-g-17-150    up                  383000              mixed               gpu:teslaV100s:4,vram:32G
compute-g-17-151    up                  383000              mixed               gpu:teslaV100s:4,vram:32G
compute-g-17-152    up                  383000              mixed               gpu:teslaV100s:4,vram:32G
compute-g-17-153    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-154    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-155    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-156    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-157    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-158    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-159    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-160    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-161    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-162    up                  500000              mixed               gpu:a40:4,vram:48G
compute-g-17-163    up                  500000              mixed               gpu:a40:4,vram:48G
compute-g-17-164    up                  500000              mixed               gpu:a100:4,vram:80G
compute-g-17-165    up                  500000              mixed               gpu:a100.mig:8,vram:40G
compute-g-16-197    up                  257548              mixed               gpu:teslaM40:2,vram:12G
compute-gc-17-245   up                  383000              allocated           gpu:rtx6000:10,vram:24G
compute-gc-17-246   up                  383000              mixed               gpu:rtx6000:10,vram:24G
compute-gc-17-247   up                  383000              mixed               gpu:rtx6000:8,vram:24G
compute-gc-17-249   up                  1000000             mixed               gpu:a100:2,vram:40G
compute-gc-17-252   up                  1000000             mixed               gpu:a100:4,vram:80G
compute-gc-17-253   up                  1000000             mixed               gpu:a100:4,vram:80G
compute-gc-17-254   up                  1000000             mixed               gpu:a100:4,vram:80G

GPU Partition

The gpu partition is open to all O2 users; to run jobs on the gpu partition, use the flag -p gpu.
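For instance, a minimal batch script targeting the gpu partition could look like the sketch below. The file name, resource values, and the nvidia-smi payload are illustrative assumptions, not an O2-prescribed template; adjust cores, wall time, and GPU count to your workload.

```shell
# Write a minimal GPU job script (hypothetical file name gpujob_basic.sh).
cat > gpujob_basic.sh <<'EOF'
#!/bin/bash
#SBATCH -p gpu               # run on the gpu partition
#SBATCH --gres=gpu:1         # request one GPU card
#SBATCH -c 2                 # two CPU cores
#SBATCH -t 1:00:00           # one hour of wall time

module load gcc/9.2.0 cuda/11.7
nvidia-smi                   # placeholder payload: report the allocated GPU
EOF

# Sanity-check the script's shell syntax before submitting.
bash -n gpujob_basic.sh && echo "syntax OK"
```

The script would then be submitted with sbatch gpujob_basic.sh.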

...

The gpu_quad partition is open to any users working for a PI with a primary or secondary appointment in a pre-clinical department; to run jobs on the gpu_quad partition use the flag -p gpu_quad. If you work at an affiliate institution but are collaborating with an on-Quad PI, please contact Research Computing to gain access.

...

The amount of GPU resources that can be used by each user at a given time is measured in GPU hours per user. Currently there is an active limit of 200 GPU hours for each user.

For example, at any time each user can allocate* at most 2 GPU cards for 100 hours, 20 GPU cards for 10 hours, or any other combination that does not exceed the total GPU hours limit. (If you use just 1 GPU card, the partition's maximum wall time will limit you to 120 hours.)
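The arithmetic behind the limit can be sketched with a small shell helper (a hypothetical function for illustration, not an O2 tool): a request of N cards for H hours consumes N × H GPU hours and must stay at or below the 200 GPU hour cap.

```shell
#!/bin/bash
# Hypothetical helper: does a request of N cards for H hours fit the cap?
GPU_HOUR_LIMIT=200

fits_limit() {
    local cards=$1 hours=$2
    local total=$(( cards * hours ))   # GPU hours consumed by the request
    if (( total <= GPU_HOUR_LIMIT )); then
        echo "ok: ${total} GPU hours"
    else
        echo "over limit: ${total} GPU hours"
    fi
}

fits_limit 2 100   # 2 cards x 100 h = 200 -> ok
fits_limit 20 10   # 20 cards x 10 h = 200 -> ok
fits_limit 4 80    # 4 cards x 80 h = 320 -> over limit
```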

...

The gpu_quad and gpu_requeue partitions are not affected by these limits.

How to submit a GPU job


Warning

You must not reassign the variable CUDA_VISIBLE_DEVICES; the Slurm scheduler presets the correct value for CUDA_VISIBLE_DEVICES, and altering the preset value will likely cause your job to run without a GPU card.

Most GPU applications require access to the CUDA Toolkit libraries, so before submitting a job you will likely need to load one of the available CUDA modules, for example:

Code Block
login01:~ module load gcc/9.2.0 cuda/11.7


Note that if you are running a precompiled GPU application, for example a pip-installed Tensorflow, you will need to load the same version of CUDA that was used to compile your application (e.g., Tensorflow==2.2.0 was compiled using CUDA 10.1).

...

Code Block
languagetext
login01:sbatch gpujob.sh
Submitted batch job 6900310


where gpujob.sh contains


#-----------------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -c 4
#SBATCH -t 6:00:00
#SBATCH -p gpu_quad
#SBATCH --gres=gpu:2

module load gcc/9.2.0
module load cuda/11.7

./deviceQuery  #this is just an example 


#-----------------------------------------------------------------------------------------


It is also possible to request a specific type of GPU card by using the --gres flag. For example, --gres=gpu:teslaM40:3 can be used to request 3 Tesla M40 GPU cards.

Currently the GPU flags available are: teslaK80, teslaM40, teslaV100, teslaV100s, rtx6000, rtx8000, a100; however, each partition might only have a subset of those card types, as indicated in the first paragraph.
It is also possible to request a minimum amount of VRAM on the GPU card allocated for the job. This can be done using the vram gres; for example, the flag --gres=gpu:1,vram:15G requests a GPU card that has at least 15G of VRAM. To see the VRAM of each card type in O2, use the Slurm command sinfo -p gpu,gpu_quad,gpu_requeue --Format=nodehost,gres:40
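Putting the two together, a job script can combine a partition choice with a VRAM floor. The sketch below is illustrative (hypothetical file name, and the 15G floor is just an example value); deviceQuery stands in for your own application as in the earlier example.

```shell
# Write a hypothetical job script requesting one card with at least 15G of VRAM.
cat > gpujob_vram.sh <<'EOF'
#!/bin/bash
#SBATCH -p gpu_quad
#SBATCH -c 4
#SBATCH -t 6:00:00
#SBATCH --gres=gpu:1,vram:15G    # any card type with at least 15G of VRAM

module load gcc/9.2.0 cuda/11.7
./deviceQuery                    # placeholder application
EOF

# Sanity-check the script's shell syntax before submitting with sbatch.
bash -n gpujob_vram.sh && echo "syntax OK"
```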

How to compile and run CUDA programs

...

Code Block
languagetext
#!/bin/bash
#SBATCH -c 4
#SBATCH -t 6:00:00
#SBATCH -p gpu_quad
#SBATCH --gres=gpu:2

module load gcc/9.2.0
module load cuda/11.7

/n/cluster/bin/job_gpu_monitor.sh &

./deviceQuery  #this is just an example 

...