GPU Resources in O2
There are 34 GPU nodes with a total of 147 GPU cards available on the O2 cluster. The nodes are accessible through three GPU partitions: gpu, gpu_quad, and gpu_requeue.
The gpu partition includes 32 double precision GPU cards: 16 Tesla K80 cards with 12GB of VRAM, 8 Tesla M40 cards with 12GB or 24GB of VRAM, and 8 Tesla V100 cards with 16GB of VRAM.
The gpu_quad partition includes 71 GPUs: 47 single precision RTX 8000 cards with 48GB of VRAM, 8 single precision A40 cards with 48GB of VRAM, 24 double precision Tesla V100s cards with 32GB of VRAM, 4 double precision A100 cards with 80GB of VRAM, and 8 A100 MIG cards with 40GB of VRAM.
The gpu_requeue partition includes 44 GPUs: 28 single precision RTX 6000 cards with 24GB of VRAM, 2 double precision Tesla M40 cards with 12GB of VRAM, 2 A100 cards with 40GB of VRAM, and 12 A100 cards with 80GB of VRAM.
To list current information about all the nodes and cards available in a specific partition, use the command sinfo --Format=nodehost,available,memory,statelong,gres:40 -p <partition>, for example:
```
login02:~ sinfo --Format=nodehost,available,memory,statelong,gres:40 -p gpu,gpu_quad,gpu_requeue
HOSTNAMES           AVAIL     MEMORY    STATE      GRES
compute-g-16-175    up        257548    mixed      gpu:teslaM40:4,vram:24G
compute-g-16-176    up        257548    mixed      gpu:teslaM40:4,vram:12G
compute-g-16-177    up        257548    mixed      gpu:teslaK80:8,vram:12G
compute-g-16-194    up        257548    mixed      gpu:teslaK80:8,vram:12G
compute-g-16-254    up        373760    mixed      gpu:teslaV100:4,vram:16G
compute-g-16-255    up        373760    mixed      gpu:teslaV100:4,vram:16G
compute-g-17-145    up        770000    mixed      gpu:rtx8000:10,vram:48G
compute-g-17-146    up        770000    mixed      gpu:rtx8000:10,vram:48G
compute-g-17-147    up        383000    mixed      gpu:teslaV100s:4,vram:32G
compute-g-17-148    up        383000    mixed      gpu:teslaV100s:4,vram:32G
compute-g-17-149    up        383000    mixed      gpu:teslaV100s:4,vram:32G
compute-g-17-150    up        383000    mixed      gpu:teslaV100s:4,vram:32G
compute-g-17-151    up        383000    mixed      gpu:teslaV100s:4,vram:32G
compute-g-17-152    up        383000    mixed      gpu:teslaV100s:4,vram:32G
compute-g-17-153    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-154    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-155    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-156    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-157    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-158    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-159    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-160    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-161    up        383000    mixed      gpu:rtx8000:3,vram:48G
compute-g-17-162    up        500000    mixed      gpu:a40:4,vram:48G
compute-g-17-163    up        500000    mixed      gpu:a40:4,vram:48G
compute-g-17-164    up        500000    mixed      gpu:a100:4,vram:80G
compute-g-17-165    up        500000    mixed      gpu:a100.mig:8,vram:40G
compute-g-16-197    up        257548    mixed      gpu:teslaM40:2,vram:12G
compute-gc-17-245   up        383000    allocated  gpu:rtx6000:10,vram:24G
compute-gc-17-246   up        383000    mixed      gpu:rtx6000:10,vram:24G
compute-gc-17-247   up        383000    mixed      gpu:rtx6000:8,vram:24G
compute-gc-17-249   up        1000000   mixed      gpu:a100:2,vram:40G
compute-gc-17-252   up        1000000   mixed      gpu:a100:4,vram:80G
compute-gc-17-253   up        1000000   mixed      gpu:a100:4,vram:80G
compute-gc-17-254   up        1000000   idle       gpu:a100:4,vram:80G
```
GPU Partition
The gpu partition is open to all O2 users; to run jobs on the gpu partition, use the flag -p gpu.
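For reference, a minimal sketch of how such a job could be started interactively; the wall time, core count, and number of cards below are placeholder values to adjust to your needs:

```
# request an interactive shell on a gpu partition node with 1 GPU card,
# 2 cores, and a 1-hour wall time (values chosen only as an example)
srun --pty -p gpu --gres=gpu:1 -c 2 -t 1:00:00 /bin/bash
```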
...
The gpu_quad partition is open to any user working for a PI with a primary or secondary appointment in a pre-clinical department; to run jobs on the gpu_quad partition, use the flag -p gpu_quad. If you work at an affiliate institution but are collaborating with an on-Quad PI, please contact Research Computing to gain access.
...
The O2 cluster includes several contributed GPU cards, purchased and owned directly by HMS labs. When idle, those GPU resources are made available to all eligible users through the gpu_requeue partition. However, if a member of the purchasing lab submits a job that needs those cards, your job may be killed and resubmitted (requeued) at any time.
Note: Starting from July 1st, 2021, the gpu_requeue partition is available only to users working for a PI with a primary or secondary appointment in a pre-clinical department.
For detailed information about the gpu_requeue partition, see O2 GPU Re-Queue Partition.
...
The amount of GPU resources that can be used by each user at a given time is measured in GPU hours per user. Currently there is an active limit of 200 GPU hours for each user.
For example, at any time each user can allocate* at most 2 GPU cards for 100 hours, 20 GPU cards for 10 hours, or any other combination that does not exceed the total GPU hours limit. (If you use just 1 GPU card, the partition's maximum wall time will limit you to 120 hours.)
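To make the accounting concrete, a single job requesting 4 GPU cards with a 50-hour wall time would count as 4 × 50 = 200 GPU hours, which by itself reaches the limit.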
...
The gpu_quad and gpu_requeue partitions are not affected by these limits.
How to submit a GPU job
Warning: You must not reassign the variable CUDA_VISIBLE_DEVICES; the Slurm scheduler presets the correct value of CUDA_VISIBLE_DEVICES for each job, and altering the preset value will likely cause your job to run without a GPU card.
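If you want to check which cards Slurm assigned to your job, you can read the variable (without modifying it) from inside the job; assuming the NVIDIA driver utilities are available on the GPU node, nvidia-smi gives a similar view:

```
# print the GPU indexes preset by Slurm for this job (read-only; do not reassign)
echo $CUDA_VISIBLE_DEVICES

# optional: list the GPU devices visible to the job
nvidia-smi
```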
Most GPU applications require access to the CUDA Toolkit libraries, so before submitting a job you will likely need to load one of the available CUDA modules, for example:
```
login01:~ module load gcc/9.2.0 cuda/11.7
```
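To confirm which toolkit ended up on your path, you can query the CUDA compiler after loading the modules; the reported version should match the cuda module you loaded:

```
# show the version of the CUDA compiler provided by the loaded module
nvcc --version
```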
Note that if you are running a precompiled GPU application, for example a pip-installed TensorFlow, you will need to load the same version of CUDA that was used to compile your application (TensorFlow 2.2.0, for example, was compiled against CUDA 10.1).
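As a quick sanity check, assuming a TensorFlow 2.x installation and a job running on a GPU node with a matching CUDA module loaded, you can ask TensorFlow whether it sees the allocated card; an empty list usually indicates a CUDA version mismatch or a job submitted without a GPU request:

```
# list the GPU devices visible to TensorFlow from within the job
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```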
...
```
login01:sbatch gpujob.sh
Submitted batch job 6900310
```
where gpujob.sh contains
```
#!/bin/bash

#SBATCH -c 4
#SBATCH -t 6:00:00
#SBATCH -p gpu_quad
#SBATCH --gres=gpu:2

module load gcc/9.2.0
module load cuda/11.7

./deviceQuery    # this is just an example
```
It is also possible to request a specific type of GPU card with the --gres flag. For example, --gres=gpu:teslaM40:3 can be used to request 3 Tesla M40 GPU cards.
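As a sketch, the corresponding lines in a batch script could look as follows, here assuming one RTX 8000 card on the gpu_quad partition (the card type names are the ones shown in the GRES column of the sinfo output above):

```
#SBATCH -p gpu_quad
#SBATCH --gres=gpu:rtx8000:1
```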
How to compile and run CUDA programs
...
```
#!/bin/bash

#SBATCH -c 4
#SBATCH -t 6:00:00
#SBATCH -p gpu_quad
#SBATCH --gres=gpu:2

module load gcc/9.2.0
module load cuda/11.7

/n/cluster/bin/job_gpu_monitor.sh &

./deviceQuery    # this is just an example
```
...