
The basic process of running jobs:

  1. You log in via SSH (secure shell) to the host o2.hms.harvard.edu

    1. If you are connecting to O2 from outside of the HMS network, you will be required to use two-factor authentication.

    2. Please reference this wiki page for detailed instructions on how to log in to the cluster.

  2. If necessary, you copy your data to the cluster from your desktop or another location. See the File Transfer page for data transfer instructions, and the example commands after this list. You may also want to copy large data inputs to the scratch3 filesystem for faster processing.

  3. You submit your job - for example, a program to map DNA reads to a genome - specifying how long your job will take to run and which partition to run in. You can modify your job submission in many ways, like requesting a large amount of memory, as described below; a sample batch script follows this list.

  4. Your job sits in a partition (job status PENDING), and when it's your turn to run, SLURM finds a computer that's not too busy.

  5. Your job runs on that computer (also known as a "compute node"), and goes to job status RUNNING. While it's running, you don't interact with the program. If you're running a program that requires user input or pointing and clicking, see Interactive Sessions below.

  6. The job finishes running (job status COMPLETED, or FAILED if it had an error). You get an email when the job is finished if you specified --mail-type (and optionally --mail-user) in your job submission command.

  7. If necessary, copy data back from the scratch filesystem to a backed-up location, or to your desktop.
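
For steps 1 and 2, the commands look roughly like the following sketch. The file name and destination path are placeholders, and the File Transfer page may recommend a different transfer method or host:

  ssh <userid>@o2.hms.harvard.edu                                        # log in; two-factor prompt appears if you are outside the HMS network
  scp reads.fastq <userid>@o2.hms.harvard.edu:/path/to/your/directory/   # copy an input file from your desktop to the cluster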
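
For steps 3 and 6, a minimal batch script (saved as, say, my_job.sh) might look like the sketch below. The partition, runtime, memory, and program are illustrative placeholders, not recommendations:

  #!/bin/bash
  #SBATCH -p short                      # partition to run in
  #SBATCH -t 0-06:00                    # runtime limit in D-HH:MM
  #SBATCH -c 1                          # number of CPU cores
  #SBATCH --mem=8G                      # memory requested
  #SBATCH --mail-type=END,FAIL          # email when the job completes or fails
  #SBATCH --mail-user=you@example.com   # replace with your email address

  # Load any modules your program needs, then run it. my_program and its
  # arguments are placeholders for your actual command, e.g. a read mapper.
  ./my_program --input reads.fastq --output aligned.sam

You would then submit the script with: sbatch my_job.sh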

Definitions

Here's some of the terminology you'll see used in relation to SLURM in this and other documents:

...

Note:  Any time <userid> is mentioned in this document, it should be replaced with your HMS account, formerly called an eCommons ID (and omit the <>). Likewise, <jobid> should be replaced with an actual job ID, such as 12345. The name of a batch job submission script should be inserted wherever <jobscript> is mentioned.

SLURM command, sample command syntax, and meaning:

  • sbatch: sbatch <jobscript>
    Submit a batch (non-interactive) job.

  • srun: srun --pty -t 0-0:5:0 -p interactive /bin/bash
    Start an interactive session for five minutes in the interactive queue, with the default 1 CPU core and 4 GB of memory.

  • squeue: squeue -u <userid>
    View the status of your jobs in the queue. Only non-completed jobs will be shown. We have an easier-to-use alternative command called O2squeue.

  • scontrol: scontrol show job <jobid>
    Look at a running job in detail. For more information about the job, add the -dd parameter.

  • scancel: scancel <jobid>
    Cancel a job. scancel can also be used to kill job arrays or job steps.

  • scontrol: scontrol hold <jobid>
    Pause a job.

  • scontrol: scontrol release <jobid>
    Release a held job (allow it to run).

  • sacct: sacct -j <jobid>
    Check job accounting data. Running sacct is most useful for completed jobs. We have an easier-to-use alternative command called O2sacct.

  • sinfo: sinfo
    See node and partition information. Use the -N parameter to see information per node.
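
Putting these commands together, a typical monitoring sequence looks like the following sketch; my_job.sh and the job ID 12345 are placeholders:

  sbatch my_job.sh            # submit; SLURM prints the job ID it assigned
  squeue -u <userid>          # is the job PENDING or RUNNING?
  scontrol show job 12345     # inspect the job's requested resources (add -dd for more detail)
  sacct -j 12345              # after it finishes, check its accounting data
  scancel 12345               # or cancel it, if it was submitted by mistake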

...

Workflow, best queue to use, and time limit:

  • 1 or 2 jobs at a time: priority (1 month)

  • More than 2 short jobs: short (12 hours)

  • More than 2 medium jobs: medium (5 days)

  • More than 2 long jobs: long (1 month)

  • MPI parallel job using multiple nodes (with sbatch -a and mpirun.sh): mpi (5 days)

  • GPU job: gpu, gpu_quad, gpu_mpi_quad, gpu_requeue (5 days and 1 day)

  • Interactive work (e.g. MATLAB graphical interface, editing/testing code) rather than a batch job: interactive (12 hours)
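
To target a queue from the list above, pass it with -p and request a runtime under its limit with -t. As a sketch, with placeholder script name and times:

  sbatch -p short -t 12:00:00 my_job.sh              # batch job in the short partition, up to its 12-hour limit
  srun --pty -p interactive -t 0-12:00 /bin/bash     # interactive shell in the interactive partition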

...

...