NOTICE: Full O2 Cluster Outage, January 3 - January 10

O2 will be completely offline for a planned HMS IT data center relocation from Friday, Jan 3, 6:00 PM, through Friday, Jan 10.

  • On Jan 3 (5:30-6:00 PM): O2 login access will be turned off.
  • On Jan 3 (6:00 PM): O2 systems will start being powered off.

This project will relocate existing services, consolidate servers, reduce power consumption, and decommission outdated hardware to improve efficiency, enhance resiliency, and lower costs.

Specifically:

  • The O2 Cluster will be completely offline, including O2 Portal.
  • All data on O2 will be inaccessible.
  • Any jobs still pending when the outage begins will need to be resubmitted after O2 is back online.
  • Websites on O2 will be completely offline, including all web content.

More details at: https://harvardmed.atlassian.net/l/cp/1BVpyGqm & https://it.hms.harvard.edu/news/upcoming-data-center-relocation

O2 Command CheatSheet


Note: many of the format fields for these commands are interchangeable and can be combined. To see detailed information about a command, run man <command> in the bash terminal (for example, man sacct).

sbatch

sbatch slurm_script.sh

submit the batch job slurm_script.sh (preferred way, see our wiki for an example of a slurm script)

sbatch -c 4 -p priority -t 1-00:00:00 --wrap="<Command>"

submit a job to partition priority, requesting 4 cores and a 24-hour time limit, to execute <Command>
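For reference, a batch script such as slurm_script.sh might look like the minimal sketch below. It is an illustration, not the official wiki example: the memory request, the output file name, and the commented-out my_program line are assumptions, and the other values mirror the --wrap example above. Outside Slurm, the #SBATCH lines are ordinary comments, so the script also runs in a plain bash shell.

```shell
#!/bin/bash
# Request 4 cores
#SBATCH -c 4
# Partition to submit to
#SBATCH -p priority
# Time limit of 1 day (D-HH:MM:SS)
#SBATCH -t 1-00:00:00
# Memory for the job (assumed value, adjust as needed)
#SBATCH --mem=8G
# Output file; %j expands to the job ID
#SBATCH -o slurm_%j.out

# Slurm sets SLURM_CPUS_PER_TASK when -c is given; fall back to 1 elsewhere.
cores="${SLURM_CPUS_PER_TASK:-1}"
echo "Job ${SLURM_JOB_ID:-not-in-slurm} on ${HOSTNAME:-unknown} with ${cores} core(s)"

# Replace the line below with the actual command (hypothetical placeholder).
# my_program --threads "${cores}"
```

Options given on the sbatch command line take precedence over the #SBATCH lines in the script, so the same script can be reused with different resources.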

srun

srun -p interactive --pty -t 4:00:00 -c 2 bash

submit a 2-core interactive job to partition interactive with a time limit of 4 hours

srun -p short --pty -t 4:00:00 -c 1 --mem=64G --test-only

Obtain an estimate of when a job with the specified resources might get dispatched, without actually submitting the job

squeue

squeue -u $USER -t PD

list all your pending jobs

squeue -u $USER -t R

list all your running jobs

squeue -u $USER -t PD --Format=jobid,reasonlist,starttime

list reason and expected start time (if available) for pending jobs

squeue -u $USER -t R --Format=jobid,partition,state,timelimit,starttime

list general information for running jobs

squeue -u $USER --Format=jobid:10,partition:15

list jobid and partition with custom column widths (here 10 and 15 characters)
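As a small sketch of combining these options, the helper below summarizes how many jobs are in each state. The function name count_states is hypothetical, and the sample input is made up so the snippet can be tried without access to Slurm; on O2 the input would come from squeue instead.

```shell
#!/bin/bash
# Count how many lines report each job state (one state per input line).
count_states() {
    sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# On O2 (requires Slurm); -h suppresses the header line:
#   squeue -u "$USER" -h --Format=state | count_states

# Offline demonstration with made-up sample states:
printf 'PENDING\nRUNNING\nPENDING\n' | count_states
```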

sacct

sacct -u $USER --format=jobid,state,ExitCode,Timelimit,Elapsed

list past job status, exit code, requested walltime and actual used runtime

sacct -u $USER --format=jobid,Submit,Start,End

list submit, start, and end times of past jobs

sacct -u $USER --format=jobid,ReqTRES,MaxRSS --units=G

list past job information about memory requested and memory used

sacct -u $USER --format=jobid,CPUTime,TotalCPU

list past job information about allocated CPU time versus CPU time actually used

sacct -u $USER --format=jobid,CPUTime%20,TotalCPU%20

list past job information about allocated CPU time versus CPU time actually used, with custom column widths (here 20 characters)

sacct -e

list the fields you can use in a --format statement



add -j JOBID_NUMBER to any of the above commands to get information only for a specific job

add -S YYYY-MM-DD to a sacct command to look only for jobs in any state after YYYY-MM-DD

add -E YYYY-MM-DD to a sacct command to look only for jobs in any state before YYYY-MM-DD
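The CPUTime and TotalCPU fields above can be turned into a rough CPU efficiency figure. The sketch below is one way to do this by hand, under the assumption that the values use Slurm's [D-]HH:MM:SS or MM:SS.mmm time formats; the function names to_seconds and cpu_efficiency are hypothetical helpers, not Slurm commands, and fractional seconds are truncated.

```shell
#!/bin/bash
# Convert a duration such as 1-02:03:04, 02:03:04 or 00:01.234 to whole seconds.
to_seconds() {
    echo "$1" | awk '{
        days = 0; t = $0
        # Split off an optional leading "D-" day count.
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        n = split(t, p, ":")
        if (n == 3)      secs = p[1] * 3600 + p[2] * 60 + p[3]
        else if (n == 2) secs = p[1] * 60 + p[2]
        else             secs = p[1]
        printf "%d\n", days * 86400 + secs
    }'
}

# Usage: cpu_efficiency <TotalCPU> <CPUTime>
# Prints used CPU time as an integer percentage of allocated CPU time.
cpu_efficiency() {
    eff_used=$(to_seconds "$1")
    eff_allocated=$(to_seconds "$2")
    if [ "$eff_allocated" -eq 0 ]; then
        echo "n/a"
        return
    fi
    echo "$(( 100 * eff_used / eff_allocated ))%"
}

# Example with made-up values: 1.5 CPU-hours used out of 4 allocated.
cpu_efficiency "01:30:00" "04:00:00"
```

The two arguments can be copied from the output of sacct -j JOBID_NUMBER --format=jobid,CPUTime,TotalCPU. A low percentage suggests the job requested more cores than it used.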

sinfo

sinfo -s

print a summary of the available partitions and their states

sinfo -p <name>

print detailed information about partition <name>

scontrol

scontrol update JobId=JobNumber TimeLimit=<time>

update the requested time limit of job JobNumber to <time> (cannot exceed the partition limit)

scontrol update JobId=JobNumber Dependency=<dependency_list>

update dependencies to <dependency_list>

scontrol update JobId=JobNumber Partition=<name>

update requested partition to <name>

scontrol update JobId=JobNumber MinMemoryCPU=<megabytes>

update mem-per-cpu (required memory per CPU core) to <megabytes>

scontrol update JobId=JobNumber MinMemoryNode=<megabytes>

update mem (required memory per node) to <megabytes>

scontrol update JobId=JobNumber NumCPUs=<count>

update the job's number of requested cores to <count>

scontrol show partition

list detailed information about available partitions



Note that most of those properties can only be changed while jobs are pending.
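Because the memory fields above take megabytes, a value expressed in gigabytes has to be converted first. The sketch below assumes the 1 GB = 1024 MB convention; the helper name gb_to_mb and the job ID 12345 are hypothetical. It only prints the resulting scontrol command so that it can be reviewed before being run.

```shell
#!/bin/bash
# Convert whole gigabytes to megabytes (assumes 1 GB = 1024 MB).
gb_to_mb() {
    echo $(( $1 * 1024 ))
}

job_id=12345              # hypothetical job ID
mem_mb=$(gb_to_mb 16)     # 16 GB expressed in megabytes

# Print the command instead of running it, so it can be checked first.
echo "scontrol update JobId=${job_id} MinMemoryNode=${mem_mb}"
```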

scancel

scancel <jobid>

cancel job <jobid>

scancel -u $USER

cancel all your running and pending jobs

scancel -n <name>

cancel all jobs with jobname <name>

scancel -p <name>

cancel all jobs in the specified partition <name>

scancel -t <PENDING | RUNNING | SUSPENDED>

cancel all jobs in the specified state