GPU Resources in O2
There are 34 GPU nodes with a total of 167 GPU cards available on the O2 cluster. The nodes are accessible through three GPU partitions: gpu, gpu_quad, and gpu_requeue.
The gpu partition includes 32 double precision GPU cards: 16 Tesla K80 with 12GB of VRAM, 8 Tesla M40 with either 12GB or 24GB of VRAM, and 8 Tesla V100 with 16GB of VRAM.
The gpu_quad partition includes 91 GPUs: 47 single precision RTX 8000 cards with 48GB of VRAM, 8 single precision A40 cards with 48GB of VRAM, 24 double precision Tesla V100s cards with 32GB of VRAM, 4 double precision A100 cards with 80GB of VRAM, and 8 A100 MIG cards with 40GB of VRAM.
The gpu_requeue partition includes 44 GPUs: 28 single precision RTX 6000 cards with 24GB of VRAM, 2 double precision Tesla M40 cards, 2 A100 cards with 40GB of VRAM, and 12 A100 cards with 80GB of VRAM.
To list current information about all the nodes and cards available in a specific partition, use the command sinfo --Format=nodehost,available,memory,statelong,gres:40 -p <partition>, for example:
```
login02:~ sinfo --Format=nodehost,available,memory,statelong,gres:40 -p gpu,gpu_quad,gpu_requeue
HOSTNAMES           AVAIL               MEMORY              STATE               GRES
compute-g-16-175    up                  257548              mixed               gpu:teslaM40:4,vram:24G
compute-g-16-176    up                  257548              mixed               gpu:teslaM40:4,vram:12G
compute-g-16-177    up                  257548              mixed               gpu:teslaK80:8,vram:12G
compute-g-16-194    up                  257548              mixed               gpu:teslaK80:8,vram:12G
compute-g-16-254    up                  373760              mixed               gpu:teslaV100:4,vram:16G
compute-g-16-255    up                  373760              mixed               gpu:teslaV100:4,vram:16G
compute-g-17-145    up                  770000              mixed               gpu:rtx8000:10,vram:48G
compute-g-17-146    up                  770000              mixed               gpu:rtx8000:10,vram:48G
compute-g-17-147    up                  383000              mixed               gpu:teslaV100s:4,vram:32G
compute-g-17-148    up                  383000              mixed               gpu:teslaV100s:4,vram:32G
compute-g-17-149    up                  383000              mixed               gpu:teslaV100s:4,vram:32G
compute-g-17-150    up                  383000              mixed               gpu:teslaV100s:4,vram:32G
compute-g-17-151    up                  383000              mixed               gpu:teslaV100s:4,vram:32G
compute-g-17-152    up                  383000              mixed               gpu:teslaV100s:4,vram:32G
compute-g-17-153    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-154    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-155    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-156    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-157    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-158    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-159    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-160    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-161    up                  383000              mixed               gpu:rtx8000:3,vram:48G
compute-g-17-162    up                  500000              mixed               gpu:a40:4,vram:48G
compute-g-17-163    up                  500000              mixed               gpu:a40:4,vram:48G
compute-g-17-164    up                  500000              mixed               gpu:a100:4,vram:80G
compute-g-17-165    up                  500000              mixed               gpu:a100.mig:8,vram:40G
compute-g-16-197    up                  257548              mixed               gpu:teslaM40:2,vram:12G
compute-gc-17-245   up                  383000              allocated           gpu:rtx6000:10,vram:24G
compute-gc-17-246   up                  383000              mixed               gpu:rtx6000:10,vram:24G
compute-gc-17-247   up                  383000              mixed               gpu:rtx6000:8,vram:24G
compute-gc-17-249   up                  1000000             mixed               gpu:a100:2,vram:40G
compute-gc-17-252   up                  1000000             mixed               gpu:a100:4,vram:80G
compute-gc-17-253   up                  1000000             mixed               gpu:a100:4,vram:80G
compute-gc-17-254   up                  1000000             mixed               gpu:a100:4,vram:80G
```
GPU Partition
The gpu partition is open to all O2 users; to run jobs on the gpu partition, use the flag -p gpu, for example:
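A minimal sketch of a batch submission targeting the gpu partition (the core count, wall time, and script name jobscript.sh are illustrative placeholders):

```
# Hypothetical example: submit a batch script requesting one GPU
# on the gpu partition, 2 CPU cores, and a 4-hour wall time.
sbatch -p gpu --gres=gpu:1 -c 2 -t 4:00:00 jobscript.sh
```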
...
The gpu_quad partition is open to any user working for a PI with a primary or secondary appointment in a pre-clinical department; to run jobs on the gpu_quad partition, use the flag -p gpu_quad. If you work at an affiliate institution but are collaborating with an on-Quad PI, please contact Research Computing to gain access.
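As a sketch, an interactive session with one GPU on gpu_quad could be requested as follows (the one-hour wall time is illustrative):

```
# Hypothetical example: request an interactive shell with one GPU
# on the gpu_quad partition for one hour.
srun --pty -p gpu_quad --gres=gpu:1 -t 1:00:00 bash
```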
...
The amount of GPU resources that can be used by each user at a given time is measured in GPU hours per user. Currently there is an active limit of 200 GPU hours for each user.
For example, at any time each user can allocate* at most 2 GPU cards for 100 hours, 20 GPU cards for 10 hours, or any other combination that does not exceed the total GPU hours limit. (If you use just 1 GPU card, the partition's maximum wall time will limit you to 120 hours.)
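To make the accounting concrete, the GPU hours of a request are simply the number of GPUs multiplied by the wall time. A sketch of a request that sits exactly at the 200 GPU-hour limit (the program name is a placeholder):

```
#!/bin/bash
# Illustrative only: 2 GPUs x 100 hours = 200 GPU hours,
# which is exactly the current per-user limit on the gpu partition.
#SBATCH -p gpu
#SBATCH --gres=gpu:2
#SBATCH -t 100:00:00

./my_gpu_program   # placeholder for your own application
```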
...
The gpu_quad and gpu_requeue partitions are not affected by these limits.
How to submit a GPU job
Warning: you must not reassign the variable CUDA_VISIBLE_DEVICES; the Slurm scheduler presets the correct value of CUDA_VISIBLE_DEVICES, and altering the preset value will likely cause your job to run without a GPU card.
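It is safe to read (but never set) the variable inside your job script if you want to confirm which devices Slurm assigned, for example:

```
# Safe: print the value Slurm assigned; do not overwrite it.
echo "Slurm assigned GPU(s): $CUDA_VISIBLE_DEVICES"
```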
Most GPU applications require access to the CUDA Toolkit libraries, so before submitting a job you will likely need to load one of the available CUDA modules, for example:
```
login01:~ module load gcc/9.2.0 cuda/11.7
```
Note that if you are running a precompiled GPU application, for example a pip-installed TensorFlow, you will need to load the same version of CUDA that was used to compile your application (TensorFlow==2.2.0 was compiled using CUDA 10.1).
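As a sketch, newer TensorFlow builds (2.3 and later) can report the CUDA version they were compiled against, which tells you which CUDA module to load; the module version below is illustrative:

```
# Hypothetical check: ask TensorFlow which CUDA version it was built
# against, then load the matching module (adjust to what it reports).
python -c "import tensorflow as tf; print(tf.sysconfig.get_build_info()['cuda_version'])"
module load cuda/11.2
```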
...
```
login01:~ sbatch gpujob.sh
Submitted batch job 6900310
```

where gpujob.sh contains:

```
#!/bin/bash
#SBATCH -c 4
#SBATCH -t 6:00:00
#SBATCH -p gpu_quad
#SBATCH --gres=gpu:2

module load gcc/9.2.0
module load cuda/11.7

./deviceQuery    # this is just an example
```
It is also possible to request a specific type of GPU card by using the --gres flag. For example, --gres=gpu:teslaM40:3 can be used to request 3 Tesla M40 GPU cards.
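A one-line sketch of such a request (the wall time and script name are placeholders):

```
# Hypothetical example: request three Tesla M40 cards on the gpu partition.
sbatch -p gpu --gres=gpu:teslaM40:3 -t 12:00:00 jobscript.sh
```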
How to compile and run CUDA programs
...
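As a minimal sketch, a CUDA source file can be compiled after loading matching gcc and CUDA modules; deviceQuery.cu below stands in for your own source file:

```
# Hypothetical compile step; deviceQuery.cu is a placeholder source file.
module load gcc/9.2.0 cuda/11.7
nvcc -o deviceQuery deviceQuery.cu
```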
```
#!/bin/bash
#SBATCH -c 4
#SBATCH -t 6:00:00
#SBATCH -p gpu_quad
#SBATCH --gres=gpu:2

module load gcc/9.2.0
module load cuda/11.7

# run the GPU monitoring script in the background
/n/cluster/bin/job_gpu_monitor.sh &

./deviceQuery    # this is just an example
```
...