O2 Command CheatSheet
Note: many of the format fields for these commands are interchangeable and can be combined. For detailed information about a command, run man <command> in the bash terminal (for example, man sacct).
sbatch
sbatch slurm_script.sh | submit the batch job slurm_script.sh (preferred way, see our wiki for an example of a slurm script) |
sbatch -c 4 -p priority -t 1-00:00:00 --wrap="<Command>" | submit a job to partition priority, requesting 4 cores and a 24-hour time limit, to execute <Command> |
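As a rough illustration of the preferred script-based submission, the sketch below writes a minimal slurm_script.sh and submits it. The core count, partition, and time limit mirror the --wrap example above; the memory request, output file name, and the echo command are placeholder assumptions, not recommended settings.

```shell
# Write a minimal batch script; the #SBATCH lines mirror the flags shown above.
cat > slurm_script.sh <<'EOF'
#!/bin/bash
#SBATCH -c 4                 # number of cores
#SBATCH -p priority          # partition
#SBATCH -t 1-00:00:00        # time limit (D-HH:MM:SS)
#SBATCH --mem=8G             # total memory (illustrative value)
#SBATCH -o slurm_%j.out      # output file; %j expands to the job ID

# Placeholder: replace the line below with the real command to run.
echo "Running on $(hostname) with ${SLURM_CPUS_PER_TASK:-1} core(s)"
EOF

# Submit only where Slurm is available.
if command -v sbatch >/dev/null 2>&1; then
    sbatch slurm_script.sh
fi
```

Options given on the sbatch command line take precedence over the matching #SBATCH lines, so the same script can be reused with different resources.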
srun
srun -p interactive --pty -t 4:00:00 -c 2 bash | submit a 2-core interactive job to partition interactive with a 4-hour time limit |
srun -p short --pty -t 4:00:00 -c 1 --mem=64G --test-only | obtain an estimate of when a job with the specified resources might be dispatched, without actually submitting the job |
squeue
squeue -u $USER -t PD | list all pending jobs |
squeue -u $USER -t R | list all running jobs |
squeue -u $USER -t PD --Format=jobid,reasonlist,starttime | list reason and expected start time (if available) for pending jobs |
squeue -u $USER -t R --Format=jobid,partition,state,timelimit,starttime | list general information for running jobs |
squeue -u $USER --Format=jobid:10,partition:15 | list jobid and partition with custom column widths (10 and 15 characters) |
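The squeue filters above can also be combined with standard text tools. The following is a minimal sketch that counts jobs per state; count_by_state is a hypothetical helper name, and it assumes one state name per input line, as produced by squeue -h -o "%T" (-h suppresses the header, %T prints the job state).

```shell
# Count how many input lines carry each job state (one state name per line).
count_by_state() {
    awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# On a system with Slurm, feed it the states of your own jobs.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -h -o "%T" | count_by_state
fi
```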
sacct
sacct -u $USER --format=jobid,state,ExitCode,Timelimit,Elapsed | list past job status, exit code, requested walltime and actual used runtime |
sacct -u $USER --format=jobid,Submit,Start,End | list submit, start, and end times of past jobs |
sacct -u $USER --format=jobid,ReqTRES,MaxRSS --units=G | list memory requested and memory used by past jobs |
sacct -u $USER --format=jobid,CPUTime,TotalCPU | list allocated CPU time versus CPU time actually used by past jobs |
sacct -u $USER --format=jobid,CPUTime%20,TotalCPU%20 | same as above, with custom column widths (20 characters each) |
sacct -e | list the fields that can be used in a --format statement |
add -j JOBID_NUMBER to any of the above commands to get information for a specific job only
add -S YYYY-MM-DD to a sacct command to show only jobs in any state after YYYY-MM-DD
add -E YYYY-MM-DD to a sacct command to show only jobs in any state before YYYY-MM-DD
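The CPUTime and TotalCPU fields can be turned into a rough CPU efficiency figure. The sketch below uses two hypothetical helpers, slurm_time_to_seconds and cpu_efficiency; it assumes durations look like [DD-]HH:MM:SS or MM:SS[.mmm] and drops fractional seconds.

```shell
# Convert a Slurm duration ([DD-]HH:MM:SS or MM:SS[.mmm]) to whole seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; t = $0
        sub(/\..*$/, "", t)                  # drop fractional seconds
        dash = index(t, "-")
        if (dash > 0) { days = substr(t, 1, dash - 1); t = substr(t, dash + 1) }
        n = split(t, f, ":")
        if (n == 3)      secs = f[1] * 3600 + f[2] * 60 + f[3]
        else if (n == 2) secs = f[1] * 60 + f[2]
        else             secs = f[1]         # single field: treated as seconds
        print days * 86400 + secs
    }'
}

# Print used CPU time as an integer percentage of allocated CPU time.
# Arguments: CPUTime (allocated), TotalCPU (used).
cpu_efficiency() {
    local alloc used
    alloc=$(slurm_time_to_seconds "$1")
    used=$(slurm_time_to_seconds "$2")
    if [ "$alloc" -eq 0 ]; then echo "n/a"; return 0; fi
    echo "$(( 100 * used / alloc ))%"
}

cpu_efficiency 04:00:00 01:00:00   # prints 25%
```

The two inputs can be taken from sacct -j <jobid> -n -P --format=CPUTime,TotalCPU, which prints them separated by a | character (one line per job step). A low percentage suggests that fewer cores could be requested next time.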
sinfo
sinfo -s | print a summary of the available partitions and their states |
sinfo -p <name> | print detailed information about partition <name> |
scontrol
scontrol update JobId=JobNumber TimeLimit=<time> | update job 'JobNumber' requested timelimit to <time> (can't go beyond partition limit) |
scontrol update JobId=JobNumber Dependency=<dependency_list> | update dependencies to <dependency_list> |
scontrol update JobId=JobNumber Partition=<name> | update requested partition to <name> |
scontrol update JobId=JobNumber MinMemoryCPU=<megabytes> | update mem_per_cpu (required memory per CPU) to <megabytes> |
scontrol update JobId=JobNumber MinMemoryNode=<megabytes> | update mem (required memory per node) to <megabytes> |
scontrol update JobId=JobNumber NumCPUs=<count> | update the number of cores requested by the job to <count> |
scontrol show partition | list detailed information about available partitions |
Note that most of those properties can only be changed while jobs are pending.
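Because the memory fields above are given in megabytes while submission flags such as --mem=64G use unit suffixes, a small conversion step can help. This is a minimal sketch with a hypothetical helper, to_megabytes, assuming whole-number values and 1024-based units.

```shell
# Convert a size such as 64G, 2T, 512M, or 4000 (plain megabytes) to megabytes.
to_megabytes() {
    local number
    number=${1%[KkMmGgTt]}            # strip a trailing unit letter, if present
    case $1 in
        *[Kk]) echo $(( number / 1024 )) ;;
        *[Gg]) echo $(( number * 1024 )) ;;
        *[Tt]) echo $(( number * 1024 * 1024 )) ;;
        *)     echo "$number" ;;      # no suffix or M: already megabytes
    esac
}

to_megabytes 64G   # prints 65536

# Example use with a pending job (replace <jobid> with a real job ID):
#   scontrol update JobId=<jobid> MinMemoryNode="$(to_megabytes 64G)"
```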
scancel
scancel <jobid> | cancel job <jobid> |
scancel -u $USER | cancel all running and pending jobs |
scancel -n <name> | cancel all jobs with jobname <name> |
scancel -p <name> | cancel all jobs in the specified partition <name> |
scancel -t <PENDING | RUNNING | SUSPENDED> | cancel all jobs in the specified state |