This page gives a basic introduction to the O2 cluster for new cluster users. Reading this page will help you to submit interactive and batch jobs to the Slurm scheduler on O2, as well as teach you how to monitor and troubleshoot your jobs as needed.

...

You login via SSH (secure shell) to the host: o2.hms.harvard.edu
1. If you are connecting to O2 from outside of the HMS network, you will be required to use two-factor authentication.
2. Please reference this wiki page for detailed instructions on how to login to the cluster.
If necessary, you copy your data to the cluster from your desktop or another location. See the File Transfer page for data transfer instructions. You may also want to copy large data inputs to the scratch3 filesystem for faster processing.
You submit your job - for example, a program to map DNA reads to a genome - specifying how long your job will take to run and what partition to run in. You can modify your job submission in many ways, like requesting a large amount of memory, as described below.
Your job sits in a partition (job status PENDING), and when it's your turn to run, SLURM finds a computer that's not too busy.
Your job runs on that computer (also known as a "compute node"), and goes to job status RUNNING. While it's running, you don't interact with the program. If you're running a program that requires user input or pointing and clicking, see Interactive Sessions below.
The job finishes running (job status COMPLETED, or FAILED if it had an error). You get an email when the job is finished if you specified --mail-type (and optionally --mail-user) in your job submission command.
If necessary, you might want to copy data back from the scratch filesystem to a backed-up location, or to your desktop.

...

host, node, and server: All just fancy words for a computer.
core: Cluster nodes can run multiple jobs at a time. Currently, most of the nodes on O2 have 32 cores, meaning they can run 32 simple jobs at a time. Some jobs can utilize more than one core.
master host: The system that performs overall coordination of the SLURM cluster. In our case we have a primary server named slurm-prod01 and one backup.
submission host: When logging in to o2.hms.harvard.edu, you'll actually end up on a "login node" called login01, login02, login03, login04 or login05. You submit your SLURM jobs from there. Do not run any computationally heavy processes on login nodes!
execution host: (or "compute node") A system where SLURM jobs will actually run. In O2, compute nodes are named compute-x-yy-zz. x, yy and zz tell the location of the machine in the computer room.
partition: Partitions can also be referred to as "queues." When submitting a job, you'll need to specify a partition to submit your job to. If you're not sure which partition to use, see the How to choose a partition in O2 page. We aimed to name the partitions logically, so the partition name will tell you the type of jobs that can be run through them, what resources those jobs can access, who can submit jobs to a given partition, and so forth.
filesystem: From your perspective, just a fancy word for a big disk drive. The different filesystems O2 sees have different characteristics, in terms of the speed of reading/writing data, whether they're backed up, etc. See Filesystems.

...

We can also get more granular in talking about where to store your files, specifically what directories a cluster user can write to on O2. Every cluster user has a /home directory with a 100GiB quota. When you login to the cluster, you will be placed in your home directory. This directory is named like /home/user, where "user" is replaced with your eCommons in lowercase. Additionally, each cluster user can use /n/scratch3 for storage of temporary or intermediary files. The per-user scratch3 quota is 10TiB or 1 million files/directories. Any file on scratch3 that is not accessed for 30 days will be deleted, and there are no backups of scratch3 data. A cluster user must create their scratch3 directory using a provided script. Finally, there are lab or group directories. These are located under /n/groups, /n/data1, or /n/data2. The quota for group directories is shared for all group members, and there is not a standard quota that all groups have. Any of these directory options (/home, /n/scratch3, group directory) can be used to store data that will be computed against on O2. When you submit a job to the SLURM scheduler on O2, you will need to mention in which directory your data (that you want to compute on) is stored and where you want the output data to be stored.

...

The current partition configuration:

Partition	Specification *
short	12 hours
medium	5 days
long	30 days
interactive	2 job limit, 12 hours
mpi	5 days limit
priority	2 job limit, 30 day limit
transfer	4 cores max, 5 days limit, 5 concurrent cores per user
gpu, gpu_requeue, gpu_quad, gpu_mpi_quad	see Using O2 GPU resources

Check our How to choose a partition in O2 chart to see which partition you should use.

...

Basic SLURM commands

Note: Any time <userid> is mentioned in this document, it should be replaced with your HMS account, formerly called an eCommons ID (and omit the <>). Likewise, <jobid> should be replaced with an actual job ID, such as 12345. The name of a batch job submission script should be inserted wherever <jobscript> is mentioned.

SLURM command	Sample command syntax	Meaning
sbatch	`sbatch <jobscript>`	Submit a batch (non-interactive) job.
srun	`srun --pty -t 0-0:5:0 -p interactive /bin/bash`	Start an interactive session for five minutes in the interactive queue.
squeue	`squeue -u <userid>`	View status of your jobs in the queue. Only non-completed jobs will be shown. We have an easier-to-use alternative command called O2squeue.
scontrol	`scontrol show job <jobid>`	Look at a running job in detail. For more information about the job, add the `-dd` parameter.
scancel	`scancel <jobid>`	Cancel a job. `scancel` can also be used to kill job arrays or job steps.
scontrol	`scontrol hold <jobid>`	Pause a job
scontrol	`scontrol release <jobid>`	Release a held job (allow it to run)
sacct	`sacct -j <jobid>`	Check job accounting data. Running `sacct` is most useful for completed jobs. We have an easier-to-use alternative command called O2sacct.
sinfo	`sinfo`	See node and partition information. Use the `-N` parameter to see information per node.

Submitting Jobs

There are two commands to submit jobs on O2: sbatch or srun.

sbatch is used for submitting batch jobs, which are non-interactive. The sbatch command requires writing a job script to use in job submission. When invoked, sbatch creates a job allocation (resources such as nodes and processors) before running the commands specified in the job script.

srun is used for starting interactive sessions (commonly used on O2) or job steps (less commonly used on O2). When used for an interactive session, srun will create a new job allocation. It can also be used from within a pre-existing allocation, such as within an sbatch script.

Tip
If you would like to submit a job with a script, use `sbatch`. If you want to start an interactive session, use `srun`.

The `sbatch` command

You submit jobs to SLURM using the sbatch command, followed by the script you'd like to run. When you submit the job, sbatch will give you a numeric JobID that you can later use to monitor your job. The SLURM scheduler will then find a computer that has an open slot matching any specifications you gave (see below on requesting resources), and tell that computer to run your job.

sbatch also accepts option arguments to configure your job, and O2 users should specify all of the following with each job, at a minimum:

the partition (using -p)
a runtime limit, i.e., the maximum hours and minutes (-t 2:30:00) the job will run. The job will be killed if it runs longer than this limit, so it's better to overestimate.
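
As a minimal sketch of these two required options, the snippet below writes a tiny job script that sets -p and -t through #SBATCH directives and then submits it. The file name hello_o2.sh and the echoed message are placeholders chosen for illustration; the submission step is skipped when sbatch is not on the PATH.

```shell
#!/bin/bash
# Write a small job script; hello_o2.sh is a placeholder name.
jobscript="hello_o2.sh"
cat > "$jobscript" <<'EOF'
#!/bin/bash
#SBATCH -p short        # partition
#SBATCH -t 0-0:05       # runtime limit: 0 days, 0 hours, 5 minutes
echo "Hello from $(hostname)"
EOF

# Submit only where Slurm is available (for example, on an O2 login node).
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$jobscript"
else
    echo "sbatch not found; would run: sbatch $jobscript"
fi
```

The same options could instead be given on the command line, as in sbatch -p short -t 0-0:05 hello_o2.sh.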

Most users will submit jobs to the short, medium, long, priority, or interactive partitions, and some to mpi. Here is a guide to choosing the right partition:

Workflow	Best queue to use	Time limit
1 or 2 jobs at a time	`priority`	1 month
> 2 short jobs	`short`	12 hours
> 2 medium jobs	`medium`	5 days
> 2 long jobs	`long`	1 month
MPI parallel job using multiple nodes (with sbatch `-a` and `mpirun.sh`)	`mpi`	5 days
GPU job	`gpu`, `gpu_quad`, `gpu_mpi_quad`, `gpu_requeue`	5 days and 1 day
Interactive work (e.g. MATLAB graphical interface, editing/testing code) rather than a batch job	`interactive`	12 hours

`sbatch` options quick reference

The sbatch command can take many, many different flags. This is just a quick description of the most popular flags. They are described in more detail further down the page.

Only -p and -t are required. If you are running a multi-threaded or parallel job, which splits a job in pieces over multiple cores/threads/processors to run faster, -c or -n is required.

-n 1

Number of tasks requested. By default this value is 1, and each task is allocated only 1 core. This flag is typically used to request resources when running MPI parallel jobs or other multi-tasked jobs.

-c 4

Number of cores requested per task. Here, we request that the job run on four cores. (Some programs use "processors" or "threads" for this idea). For all jobs on O2, you will be restricted to the number of cores you request. If you accidentally request -c 1, but tell your program to use 4 cores, your job will still run but it will be limited to the 1 allocated core only. This is the flag you need to use for the typical parallel job that runs on shared memory systems.
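
One way to keep the program's thread count in step with the -c request is to read SLURM_CPUS_PER_TASK, which Slurm sets inside the job when -c is given. The sketch below assumes a hypothetical program called my_program with a hypothetical --threads flag; substitute your tool's own option.

```shell
#!/bin/bash
#SBATCH -p short
#SBATCH -t 0-1:00
#SBATCH -c 4

# Slurm sets SLURM_CPUS_PER_TASK when -c is given; fall back to 1 elsewhere.
threads="${SLURM_CPUS_PER_TASK:-1}"
echo "Allocated cores: ${threads}"

# Hypothetical program and flag, shown only to illustrate passing the value on:
# my_program --threads "${threads}" input.dat
```

Reading the value from the environment means the -c line is the only place the core count has to be changed.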

-e errfile

Send errors (STDERR) to file errfile.

--mail-user=emailAddress

Send email to this address. If this option is not used, by default the email is sent to the address listed in ~/.forward. Must be used with --mail-type.

We recommend using commands such as O2squeue or O2sacct for job monitoring instead of relying on the email notifications, which do not contain much information.

--mail-type=FAIL

This option specifies when to send job emails. The available mail types include: NONE, BEGIN, END, FAIL, ALL (equivalent to BEGIN, END, FAIL, REQUEUE, and STAGE_OUT), STAGE_OUT (burst buffer stage out and teardown completed), TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 (reached 80 percent of time limit), and TIME_LIMIT_50 (reached 50 percent of time limit).

Again, we would recommend leveraging commands like O2squeue or O2sacct for job monitoring instead of the SLURM email notifications.

--mem=100M

Reserve 100 MiB of memory. Use either this option OR --mem-per-cpu. To avoid confusion, it's best to include a size unit, one of K, M, G, T for kibibyte, mebibyte, etc. If you don't specify a unit, you'll be requesting in mebibyte by default.

--mem-per-cpu=100M

Resource request. Reserve 100 MiB of memory per core. Use either this option OR --mem. Best practice is to include a size unit, such as K, M, G, or T for kibibyte, mebibyte, etc.

--open-mode=append

For error and output files, choose either to append to them or truncate them.

-N 1

Run on this number of nodes. Unless your job is running in the MPI queue, always use 1.

-o outfile

Send screen output (STDOUT) to file outfile.

-p short

Partition to run your job in. All partitions but interactive can be used for sbatch jobs.

-t 30

Job will be killed if it runs longer than 30 minutes (0-5:30 means 0 days, five hours and thirty minutes)
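
To make the -t formats concrete, here is a small helper that converts a time specification into seconds. It is an illustration only, not an O2 or Slurm tool, and the function name slurm_time_to_seconds is made up. It assumes the formats listed in the sbatch documentation: minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, and days-hours:minutes:seconds.

```shell
#!/bin/bash
# Convert a Slurm -t specification into seconds (illustrative helper).
slurm_time_to_seconds() {
    local spec="$1" days=0 h=0 m=0 s=0
    if [[ "$spec" == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days="${spec%%-*}"
        IFS=: read -r h m s <<< "${spec#*-}"
    else
        local -a parts
        IFS=: read -r -a parts <<< "$spec"
        case "${#parts[@]}" in
            1) m="${parts[0]}" ;;                                   # minutes
            2) m="${parts[0]}"; s="${parts[1]}" ;;                  # minutes:seconds
            3) h="${parts[0]}"; m="${parts[1]}"; s="${parts[2]}" ;; # hours:minutes:seconds
            *) echo "unrecognized time format: $spec" >&2; return 1 ;;
        esac
    fi
    h="${h:-0}"; m="${m:-0}"; s="${s:-0}"
    # 10# forces base 10 so values such as 08 are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

slurm_time_to_seconds 30       # prints 1800  (30 minutes)
slurm_time_to_seconds 0-5:30   # prints 19800 (5 hours 30 minutes)
```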

...

We can specify all the options of sbatch in a script and submit. An example batch script is seen below:

...

You will still get the same result as above, even though you have not exported DESTFILE inside your submission script. The value of DESTFILE was exported along with the rest of your environment at submission time, because it was already set. However, for obvious reasons, it is recommended that you make such exports inside your submission script for documentation purposes so that you have a convenient record of what you set those variables to if you return to your submission some time later.

The `srun` command

Like sbatch, srun can be used to submit jobs under the SLURM scheduler. sbatch and srun even share many of the same options! However, srun is implemented a bit differently. You can use srun to create job steps (simply put srun in front of the commands you want to run) within an sbatch script, or to start an interactive session. If srun is used within an sbatch script, it will use the preexisting resource allocation. If srun is submitted independent of sbatch, you will need to specify resources to be allocated to your job.

...

You can submit "interactive" jobs under SLURM, which allows you to actually log in to a compute node. You can then test commands interactively, watch your code compile, run programs that require user input while the command is running, or use a graphical interface. (Matlab and R can run either in batch mode or graphical mode, for example.) These jobs are intended to be run in the interactive partition, but currently can be used in any partition. It may take longer to be assigned an interactive job in a partition other than interactive, due to the priority of the partition. See How to choose a partition in O2 for more detail.

...

You can then run your commands, and logout when you're done. The interactive queue has a limit of 12 hours.

You can request extra memory or multiple cores (up to 20) for an interactive session by adding the corresponding options to your srun command.
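
For example, an interactive session with more resources might be requested as sketched below. The values (2 hours, 4 cores, 8 GiB) are arbitrary illustrations; the command is only printed when srun is not available.

```shell
#!/bin/bash
# Build the command as an array so it can be printed or executed unchanged.
interactive_cmd=(srun --pty -p interactive -t 0-2:00 -c 4 --mem=8G /bin/bash)

if command -v srun >/dev/null 2>&1; then
    "${interactive_cmd[@]}"
else
    echo "srun not found; would run: ${interactive_cmd[*]}"
fi
```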

Note: It is not recommended to submit srun commands from within an interactive job. These commands will be executed on the same resource allocation as the interactive job. If you want to submit other jobs from within an interactive job, use sbatch commands instead.

Within `sbatch` scripts

With the SLURM scheduler, it is encouraged to put srun in front of every command you want to run in an sbatch script. However, this is not required!

Using srun will denote job steps, so when you monitor your job, each srun command will be shown as a different step. Additionally, if your job fails, job steps will help you troubleshoot better by narrowing down which command had the issue.

...

If we save this script as srun_in_sbatch.sh, it can be submitted by sbatch srun_in_sbatch.sh. After the job completes, you can see the job statistics (which will be broken down by numbered job steps) by running sacct -j <jobid> or by using O2sacct <jobid> .

Job Arrays

Job arrays can be leveraged to quickly submit a number of similar jobs. For example, you can use job arrays to start multiple instances of the same program on different input files, or with different input parameters. A job array is technically one job, but with multiple tasks.

When using job arrays, you write an sbatch script including resource requests using #SBATCH directives and then the generic command you want to run. The requested resources like cores, memory, walltime, etc. will all be on a per-task basis, not for the overall job array.

The only difference between a job array script and a standard sbatch script is that you need to state how many copies of the analysis you want to run with the --array parameter.

There are two ways to submit job arrays. The first method is to add in the --array parameter on the command line:

Code Block
sbatch --array=1-30 submit.sh

Or you can insert this line in the script instead:

Code Block
#SBATCH --array=1-30

Job arrays have two job IDs: the overall job ID for the whole array, and IDs for each task in the array. You can use %a and %A in your output and error filenames to substitute the array index and the job array ID, respectively. Using these filename patterns will mean you can figure out which job array ID and task in the array the file corresponds to based upon its name. It is preferable to use %A and %a instead of %j; %j will give a unique job ID for each task in the array.

Job arrays do not have to be submitted for consecutive indexes. You can submit only certain indexes, or use a step size. Running specific indexes is helpful, for example, when only those tasks failed in a prior run.

Code Block
# submit job with index values of 7, 47, and 89
sbatch --array=7,47,89 submit.sh
# submit job with index values of 2, 4, 6, 8, 10, 12
sbatch --array=2-12:2 submit.sh
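
To check which task indexes a given --array specification selects before submitting, a small helper like the following can expand it. This is a sketch for illustration, not a Slurm command, and the function name expand_array_spec is made up. It handles comma-separated lists, ranges, step sizes, and ignores a trailing %N concurrency limit.

```shell
#!/bin/bash
# Expand an --array specification (e.g. "1-4", "7,47,89", "2-12:2", "1-30%5")
# into the task indexes it selects. A %N suffix only limits how many tasks
# run at once, so it is dropped here.
expand_array_spec() {
    local spec="${1%%\%*}" part range step first last i
    local -a parts out=()
    IFS=, read -r -a parts <<< "$spec"
    for part in "${parts[@]}"; do
        step=1
        range="$part"
        if [[ "$part" == *:* ]]; then
            step="${part##*:}"
            range="${part%%:*}"
        fi
        first="${range%%-*}"
        last="${range##*-}"
        for (( i = first; i <= last; i += step )); do
            out+=("$i")
        done
    done
    echo "${out[*]}"
}

expand_array_spec 2-12:2    # prints: 2 4 6 8 10 12
```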

If files are cleverly numbered, you can reference them with ${SLURM_ARRAY_TASK_ID} which fetches the array ID, e.g. you need to process 30 fastq files, and they are named (something like) fastq1.fastq, fastq2.fastq, etc.:

Code Block
<command that processes fastq files> /path/to/fastq"${SLURM_ARRAY_TASK_ID}".fastq

A different input file will go to each job in the array, mapped 1-to-1.
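
When input files are not numbered consecutively, one common workaround (an assumption about your workflow, not something O2 requires) is to list the files in a text file and have task N process line N. In the sketch below, the list contents and file names are made up.

```shell
#!/bin/bash
# Return line N (1-based) of a file list; N would normally be SLURM_ARRAY_TASK_ID.
nth_input() {
    local list="$1" n="$2"
    sed -n "${n}p" "$list"
}

# Demo list with made-up file names.
list="$(mktemp)"
printf '%s\n' liver.fastq kidney.fastq brain.fastq > "$list"

task_id="${SLURM_ARRAY_TASK_ID:-2}"   # defaults to 2 outside of a job array
input="$(nth_input "$list" "$task_id")"
echo "Task ${task_id} would process ${input}"
```

With a three-line list like this one, the array would be submitted with --array=1-3 so that every line is covered exactly once.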

The full set of environment variables when a job array is submitted:

ENV_VAR	function
SLURM_JOB_ID	the jobID of each job in the array (distinct). Passable with `%j`.
SLURM_ARRAY_JOB_ID	the jobID of the whole array (the same for every job in the array; equal to the SLURM_JOB_ID of the first job dispatched in the array). Passable with `%A`.
SLURM_ARRAY_TASK_ID	index of the job in the array (distinct). Passable with `%a`.

To control how many jobs can be executed at a time, specify this inside the --array flag with %. To modify the above sbatch command to only allow 5 running jobs in the array at a time:

Code Block
sbatch --array=1-30%5 submit.sh

This also works with the #SBATCH directive.

Putting all this together, here is a full example job array script to run an array with 4 tasks:

Code Block

#!/bin/bash
#SBATCH -c 1                 # request one core
#SBATCH -t 0-0:10            # 10 minute run time
#SBATCH -p short             # use short partition
#SBATCH --mem 500            # use 500 MiB
#SBATCH -o fastqc_%A_%a.out  # Send output for each job to file named with "fastqc", with overall job ID and task ID in the filename
#SBATCH --array=1-4          # Run array for indexes 1-4, effectively running the command 4x (but with different input files)

module load fastqc/0.11.5

fastqc sample_"${SLURM_ARRAY_TASK_ID}"_R1.fastq

The script would be submitted with sbatch like any other batch job. Our files in this example would be named consecutively, like sample_1_R1.fastq, sample_2_R1.fastq, and so on. There will be 4 tasks to run, one fastqc run for each of the fastq files. ${SLURM_ARRAY_TASK_ID} will be replaced with the actual task ID. For the first task, the full analysis command would be fastqc sample_1_R1.fastq. The output would be written to files with unique names (of the form fastqc_%A_%a.out), as %A would be replaced with the job array ID and %a with the array indexes 1-4.

To cancel specific jobs in the array:

...

Other dependency parameters:

parameter	usage
after:jobid[:jobid...]	asynchronous execution (begin after <jobid>(s) have begun)
afterany:jobid[:jobid...]	begin after <jobid>(s) have terminated (EXIT or DONE)
afterok:jobid[:jobid...]	begin after <jobid>(s) have successfully finished with an exit code of 0
afternotok:jobid[:jobid...]	begin after <jobid>(s) have failed
singleton	begin after any jobs with the same name and user have terminated
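
As a rough sketch of how a dependency string is assembled, the helper below joins a dependency type and one or more job IDs. The function name dependency_spec, the script names step1.sh and step2.sh, and the job ID 12345 are placeholders. The real submission path relies on sbatch --parsable, which prints only the job ID (followed by ;cluster on multi-cluster setups).

```shell
#!/bin/bash
# Join a dependency type and job IDs, e.g. "afterok 101 102" -> afterok:101:102
dependency_spec() {
    local type="$1"; shift
    local spec="$type" jobid
    for jobid in "$@"; do
        spec+=":${jobid}"
    done
    echo "$spec"
}

if command -v sbatch >/dev/null 2>&1 && [[ -f step1.sh && -f step2.sh ]]; then
    first="$(sbatch --parsable step1.sh)"
    first="${first%%;*}"    # drop a ";cluster" suffix if present
    sbatch --dependency="$(dependency_spec afterok "$first")" step2.sh
else
    echo "would run: sbatch --dependency=$(dependency_spec afterok 12345) step2.sh"
fi
```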

Using ? instead of , as the separator between dependencies means the job can start once any one of them is satisfied, rather than requiring all of them. It is possible to chain multiple dependencies together:

...

There are several commands you may wish to use to see the status of your jobs, including:

squeue - by default squeue will show information about all users' jobs. Use -u <userid> to get information just about yours. An easier-to-use alternative command to squeue is called O2squeue.
scontrol - most scontrol options can't be invoked by regular users, but scontrol show job <jobid> is a useful command that gives detailed job information. This command only works for currently running jobs.
sstat - shows status information for currently running jobs. Many fields can be requested using the --format parameter. Reference the job status fields in the sstat documentation for more information.
sacct - reports accounting information for jobs and job steps. This works for both running or completed jobs, but it is most useful for completed jobs. Many fields can be requested using the --format parameter. Check the job accounting fields in the sacct documentation for more information. An easier-to-use alternative command to sacct is called O2sacct.

...

There are several ways to modify your submitted jobs in SLURM, such as:

canceling a job using scancel
pausing a job using scontrol hold
releasing a job using scontrol release

Example job monitoring commands

...

By default, you will not receive an execution report by e-mail when a job you have submitted to Slurm completes. If you would like to receive such notifications, please use --mail-type and optionally --mail-user in your job submission. Currently, the SLURM emails contain minimal information (jobid, job name, run time, status, and exit code); this was an intentional design feature from the SLURM developers. At this time, we suggest running sacct/O2sacct queries for more detailed information than the job emails provide.

...

We recommend monitoring your jobs using commands on O2 such as squeue or O2squeue, and sacct or O2sacct.

More information for a job can be found by running sacct -j <jobid>, or by using O2sacct <jobid>. The following command can be used to obtain accounting information for a completed job:

Code Block
sacct -j <jobid> --format JobId,NNodes,Partition,NCPUs,State,ReqMem,MaxRSS,Elapsed,CPUTime,TimeLimit,ExitCode,Start,End

If your job reports COMPLETED status, then your job worked as expected. If your job reports FAILED status, there was a problem with the tool used or the job submission script.

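If you get a TIMEOUT error, the job reached the -t limit you specified: the Elapsed time reported will be at or just over the TimeLimit. (CPUTime is Elapsed multiplied by the number of allocated cores, so it will be at least as large.) You will need to rerun the job with a longer time limit.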

Similarly, if your job used too much memory, you will receive an error like: Job <jobid> exceeded memory limit <memorylimit>, being killed. For this job, sacct or O2sacct will report a larger MaxRSS than ReqMem, and OUT_OF_MEMORY job status. You will need to rerun the job, requesting more memory.
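
The states described above can be turned into a quick triage step. The following is a sketch only: suggest_fix is a made-up helper name and the advice strings simply paraphrase this section. It queries sacct for the job's state when a job ID is passed as the first argument.

```shell
#!/bin/bash
# Map a job state reported by sacct to a suggested next step.
suggest_fix() {
    case "$1" in
        COMPLETED)     echo "no action needed" ;;
        TIMEOUT)       echo "rerun with a longer -t limit" ;;
        OUT_OF_MEMORY) echo "rerun with a larger --mem request" ;;
        FAILED)        echo "check the tool and the job script for errors" ;;
        *)             echo "see sacct output for details" ;;
    esac
}

jobid="${1:-}"   # pass a job ID as the first argument
if [[ -n "$jobid" ]] && command -v sacct >/dev/null 2>&1; then
    # -X: one line for the job allocation; --parsable2: no column padding.
    state="$(sacct -j "$jobid" -X --noheader --parsable2 --format=State | head -n 1)"
    echo "${jobid}: ${state}: $(suggest_fix "$state")"
fi
```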

...

If you would prefer, you can direct the output by specifying the -o (stdout) and -e (stderr) options to sbatch:

Code Block
sbatch -o myjob.out -e myjob.err myjob.sh

You can add the execution node name and jobids to the output and error files using %N and %j:

Code Block
sbatch -o %N_%j.out -e %N_%j.err myjob.sh

For other available "filename patterns" (what %N and %j are called by the SLURM developers), please reference the SLURM sbatch documentation.

...

You will need to connect to O2 with the -XY flags in your SSH command. Substitute <username> for your eCommons:

...

To enable graphics forwarding within an srun job, add the --x11 flag to your job submission. For example:

...

For graphics forwarding within sbatch job submissions no additional parameter is required, provided that you submitted a job after having logged in with the corresponding ssh flag.

Running jobs on multiple cores

...

First, you need to tell SLURM the number of cores you want to run on. You do this with the -c flag. The number of cores you request from the program should always be the same as the number of cores you request from SLURM. Note that different programs have different options to specify multiple cores. For example, tophat -p 8 tells the Tophat aligner to use eight cores. So you might have a job submission script that has a line to request 8 cores:

...

Each job has a runtime limit, which is the maximum number of seconds that the job can be in RUNNING state. This is also known as "wall clock time". You specify this limit with the -t parameter with sbatch or srun. If your job does not complete prior to the runtime limit, then your job will be killed. Each partition has a maximum time limit. See below for details:

Queue	Maximum time limit
short	12 hours
medium	5 days
long	30 days
interactive	12 hours
mpi	5 days
priority	30 days
transfer	5 days
gpu, gpu_requeue, gpu_quad, gpu_mpi_quad	see Using O2 GPU resources
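
Based only on the time limits in the table above, the smallest general-purpose partition that fits a job could be picked as sketched here. The function name pick_partition is made up, and the sketch deliberately ignores everything else that matters when choosing a partition (job counts, MPI, GPU, interactive use).

```shell
#!/bin/bash
# Pick the smallest of short/medium/long whose maximum time limit covers
# the requested runtime, given in whole hours.
pick_partition() {
    local hours="$1"
    if   (( hours <= 12 ));      then echo "short"
    elif (( hours <= 5 * 24 ));  then echo "medium"
    elif (( hours <= 30 * 24 )); then echo "long"
    else
        echo "no partition allows ${hours} hours" >&2
        return 1
    fi
}

pick_partition 48    # prints: medium
```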

Requesting resources

You may want to request a node with specific resources for your job. For example, your job may require 4 GB of free memory in order to run. Or you might want one of the nodes set aside for file transfers or other purposes.

...

Note that without any units appended, memory reservation values are specified in MiB by default. You can change the size unit of the memory request by appending K, G, or T for kibibytes, gibibytes, or tebibytes.
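
To illustrate how the unit suffixes relate, this small helper converts a memory request to MiB. It is a sketch with a made-up function name, mem_to_mib, and it assumes whole-number values.

```shell
#!/bin/bash
# Convert a memory request such as 500, 100M, 4G, or 1T to MiB.
# A bare number is already in MiB. Assumes whole-number values.
mem_to_mib() {
    local spec="$1" value unit
    value="${spec%[KMGTkmgt]}"     # strip one trailing unit letter, if any
    unit="${spec#"$value"}"        # whatever was stripped
    case "$unit" in
        K|k)    echo $(( (value + 1023) / 1024 )) ;;   # round up to whole MiB
        M|m|"") echo "$value" ;;
        G|g)    echo $(( value * 1024 )) ;;
        T|t)    echo $(( value * 1024 * 1024 )) ;;
    esac
}

mem_to_mib 4G    # prints 4096
```

The same arithmetic shows how --mem and --mem-per-cpu relate: a job with -c 4 and --mem-per-cpu=2G reserves 4 x 2048 = 8192 MiB in total, the same as --mem=8G.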

Filesystem Resources

Filesystem resources make sure that your job is only dispatched to a node that has the appropriate network file system mounted and accessible to it. All users are encouraged to use filesystem resources. During planned maintenance events or unplanned filesystem outages, using a filesystem resource requirement will keep your job from being unnecessarily dispatched and subsequently failing.

The file system resources are:

groups (/n/groups)
log (/n/log - for web hosting users)
files (/n/files - accessible through transfer partition, but you must request access to run jobs in this partition)
scratch3 (/n/scratch3 - for storage of temporary or intermediary files)

Example Usage:

sbatch --constraint="files"

If your job requires multiple filesystems:

sbatch --constraint="files&groups"
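
The constraint can also be placed in the job script as an #SBATCH directive. The sketch below combines the groups and scratch3 filesystem resources listed above; the partition, time limit, and the directory check in the body are illustrative choices only.

```shell
#!/bin/bash
#SBATCH -p short
#SBATCH -t 0-0:10
#SBATCH --constraint=groups&scratch3   # run only where both filesystems are mounted

# Report which of the two mount points are visible on the node running this script.
for dir in /n/groups /n/scratch3; do
    if [[ -d "$dir" ]]; then
        echo "${dir} is available"
    else
        echo "${dir} is NOT available"
    fi
done
```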

Feedback Wanted

We expect to make adjustments to the SLURM configuration as we learn more about how O2 is used. Your feedback will be the best way for us to learn what adjustments we should be considering to make the cluster most useful to the widest audience. We're also interested in feedback on what information is most useful for new users, so that we can expand and improve this documentation. Please email Research Computing if you have any feedback, questions, comments or concerns!
