Moving from Orchestra to O2
Deprecated Page
This documentation was written in 2016 specifically for those transitioning from our previous cluster named Orchestra to the then-brand-new O2 cluster. As the Orchestra cluster was retired in March 2018, we would strongly recommend that any new O2 cluster user review the "Using Slurm Basic" page instead of this page. This page is no longer being updated.
The Orchestra cluster used the LSF scheduler. LSF takes bsub commands and dispatches jobs to cluster nodes.
O2 uses a different scheduler called Slurm. Most of the concepts are the same, but the command names and flags are different. For example, in LSF you submit a bowtie job to a queue with a time limit using bsub -q mcore -W 10:0 -n 4 bowtie -p 4 ...; in Slurm you submit the same job to a partition using sbatch -p short -t 10:0:0 -c 4 -N 1 --wrap="bowtie -p 4 ...". (There is more than one way to submit jobs through Slurm; more details on the changes can be found below.)
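Side by side, that same submission looks like this (the bowtie arguments are elided, as in the example above):

# LSF (Orchestra)
bsub -q mcore -W 10:0 -n 4 bowtie -p 4 ...

# Slurm (O2)
sbatch -p short -t 10:0:0 -c 4 -N 1 --wrap="bowtie -p 4 ..."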
Logging in to O2
Note: if you have only logged into Orchestra before, you need a separate O2 account. Contact Research Computing to get one.
- ssh login to the hostname: o2.hms.harvard.edu
- If you're connecting to O2 from outside of the HMS network, you will need to use two-factor authentication.
- Use PuTTY on Windows or Terminal on Mac, or whatever method you used for Orchestra. You will land on a machine named something like login01, login02, etc.
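For example, from a terminal (replace <userid> with your eCommons ID):

ssh <userid>@o2.hms.harvard.edu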
Where are my files?
In general, O2 reads the same file paths as Orchestra. If you change a file in /home/ab123/blah on Orchestra and then log into O2, that same file will be changed.
IMPORTANT NOTE: /groups on Orchestra is called /n/groups on O2! Everything underneath that should look the same, though. So, for example, many bowtie2 databases are in /n/groups/shared_databases/bowtie2_indexes.
Where is my software?
We have switched from the "environment modules" system to "Lmod". For the average user, this won't make much difference, although you will need to change the names of the modules you load. The modules are no longer in categories like "seq" and "dev".
Like on Orchestra, typing module avail will tell you what software is available, and module load samtools/1.3.1 loads the samtools module.
Typing module spider will give you a list of all software modules available. There is also a search tool, module keyword, which can be used with a search term to return matching modules.
As Lmod is a hierarchical system, you may need to load a prerequisite module to be able to load the one you need (or to see it in module avail). For example, if you'd like to load the Cython module but don't know the prerequisite modules, run module spider cython. Then you can load the prerequisites (listed in the module spider output), and you will be able to load Cython.
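A sketch of that workflow, assuming the prerequisites turn out to be gcc and python modules (the specific module names and versions below are illustrative and may differ on O2):

module spider cython                    # shows which modules must be loaded first
module load gcc/6.2.0 python/2.7.12     # hypothetical prerequisites reported by module spider
module load cython                      # Cython is now visible and loadable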
NOTE: If you are used to running programs with full paths like /opt/samtools/..., that won't work, because there is no /opt on O2. The apps are installed under /n/app, but we'd prefer that you use modules when possible.
Beta test note: You need to module load gcc/6.2.0 to see many of the available bioinformatics modules. module spider will list all of the available modules, even if you do not have gcc/6.2.0 loaded.
What are nodes like?
The majority of nodes on O2 have 32 cores and 256 GB RAM each.
NOTE: you might be tempted to run your jobs with as many cores as possible (20 is the maximum number of cores you can request in all queues except mpi). Most multi-threaded programs (like many NGS applications) will not be 20 times faster on 20 cores. You might want to experiment with 1 core, 4, 8, etc., to see what kind of speedups you get. Asking for more cores will often lead to a substantially longer pend time, which can remove any benefit of a slight speedup in run time.
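If you want to benchmark, one illustrative approach is to submit the same command with a few different core counts and compare the elapsed times afterwards with sacct (the bowtie command below is just a placeholder):

for n in 1 4 8; do
    sbatch -p short -t 2:00:00 -c $n --wrap="bowtie -p $n ..."
done
# afterwards, compare run times per job:
# sacct -j <jobid> --format JobID,NCPUS,Elapsed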
SLURM accounts
Though you can log in to O2 login nodes with your eCommons authentication credentials, in order to submit jobs to the SLURM scheduler, your eCommons ID needs to be assigned to a SLURM account. To check whether this has already been done, please run the command:
sshare -Unu $USER
If you get output similar in form to this:
simpson_hjs42              abc12         1   0.001106           0      0.000000   1.000000
...then you're ready to start submitting jobs to SLURM.
Getting no output from this command, on the other hand, means that your eCommons ID is not yet assigned to a SLURM account. As a result, if you attempt to submit a job to SLURM, you will see an error like one of the following:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
In order to associate your eCommons ID with a SLURM account, and thereby be permitted to submit jobs to the SLURM scheduler (and avoid such errors), please submit a request to us using the Account Request form.
What if I am using both Orchestra and O2?
If you have customized your environment settings (e.g. in the .bashrc file in your home directory, aka ~/.bashrc), you may notice error messages when logging into O2, or when starting an interactive session. Orchestra and O2 read the same ~/.bashrc, which means the commands in the file will execute regardless of which cluster you're on. For example, if you load modules in your ~/.bashrc, those same modules may not exist on O2. Likewise, if you add a directory in /groups to your PATH, it will not work because /groups is now /n/groups on O2.
If you want to adjust your shell initialization files (~/.bashrc, ~/.bash_profile, etc.) so that they work properly on all the RC clusters (O2, Orchestra, and Transfer), you can have their code test for the value of the HMS_CLUSTER environment variable, which is automatically set upon login to one of the values o2, orchestra, or transfer. For example:

if [[ "$HMS_CLUSTER" = o2 ]]
then
    # O2-specific settings
elif [[ "$HMS_CLUSTER" = orchestra ]]
then
    # Orchestra-specific settings
fi
Translate LSF commands
This table gives rough equivalents for the LSF commands you're used to using. There's more information below for some of them. You can also read the man pages for the Slurm commands.
(Note: Below and throughout the document, text like <jobid> should be replaced with an actual job ID, like 12345, and <userid> should be replaced by your eCommons ID, without the <> around it.)
LSF command | Slurm analogue | sample command syntax |
---|---|---|
bsub | sbatch (submit a batch job); srun (for interactive, or starting a job step) | sbatch <jobscript> |
bjobs | squeue | squeue -j <jobid>; squeue -u <userid> (all jobs by a user); scontrol show job <jobid> (like bjobs -l) |
bkill | scancel | scancel <jobid> |
bstop | scontrol hold | scontrol hold <jobid> |
bresume | scontrol release | scontrol release <jobid> |
bqueues | sinfo | sinfo -p <partition> |
bhosts | sinfo -N (or scontrol show nodes) | |
bhist | sacct | sacct -j <jobid> |
Partitions (aka Queues)
The partitions on O2 are like the shared queues on Orchestra, including short, medium, long, priority, interactive, mpi, transfer, and gpu.
Changes from Orchestra and other notes:
- The medium queue allows jobs up to 5 days
- Jobs in the long queue (for now) will not be suspended
- There is no mcore queue. You can run multi-core jobs in the regular queues
- The mpi partition is for MPI jobs only, not multi-threaded shared-memory jobs (this is true of Orchestra too). Please check our dedicated mpi wiki page for additional information
- The gpu partition limit is based on the total amount of GPU-hours allocated for each user
Check our How to choose a partition in O2 chart to see which partition you should use.
Submitting jobs
The sbatch command
Like LSF, Slurm allows you to submit jobs from the command line, or to create scripts that get submitted. We encourage users to use scripts, which are more reproducible and easier to troubleshoot. The #SBATCH directives at the top of the script are like the #BSUB directives at the top of LSF scripts.
A typical (basic) batch job submission script looks like the following:
#!/bin/bash
#SBATCH -c 1                       # Number of cores requested
#SBATCH -t 5                       # Runtime in minutes
                                   # Or use HH:MM:SS or D-HH:MM:SS, instead of just number of minutes
#SBATCH -p short                   # Partition (queue) to submit to
#SBATCH --mem-per-cpu=8G           # 8 GB memory needed (memory PER CORE)
#SBATCH --open-mode=append         # append adds to outfile, truncate deletes it first
### In filenames, %j=jobid, %a=index in job array
#SBATCH -o %j.out                  # Standard out goes to this file
#SBATCH -e %j.err                  # Standard err goes to this file
#SBATCH --mail-type=END            # Mail when the job ends

# write command-line commands below this line
hostname
You can run this script by saving it as myjob.sh and running sbatch myjob.sh on the command line.
Useful variable substitutions:
Sub | Meaning | LSF analogue |
---|---|---|
%j | jobid | %J |
%a | job array id | %I (capital i) |
%A | master jobid for array | %J |
%N | node name | |
%u | userid | |
Slurm job arrays use %A and %a to distinguish between the jobs in an array, so it is suggested to use these variables in output filenames.
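For example, in a job array script you might use (the filenames are illustrative):

#SBATCH -o myarray_%A_%a.out    # one output file per array task, e.g. myarray_12345_7.out
#SBATCH -e myarray_%A_%a.err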
You don't have to use a separate script. You can use the --wrap option to run a single command. However, we discourage running jobs this way because they are harder to troubleshoot (SLURM job accounting doesn't retain commands used with --wrap), and certain complex commands (that include piping, or using |) will not be interpreted properly.
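If you do want a quick one-off job anyway, a minimal --wrap submission looks like this (the partition and runtime values are just for illustration):

sbatch -p short -t 5 -c 1 --wrap="hostname"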
Here are some interesting sbatch flags and their LSF analogues. Most flags have long and short versions. For example, sbatch --ntasks=4 is the same as sbatch -n 4.
Note the --mem-per-cpu flag, which reserves memory. The default memory reservation on O2 is only 1 gigabyte, which is less than on Orchestra. So if you get an error that says "Exceeded job memory limit", you'll need to use this option.
flag | usage | LSF analogue | notes |
---|---|---|---|
--ntasks (-n) | -n <num_cores> | -n <num_cores> | number of tasks. (Use with -N). Currently, we suggest that the -c parameter be used for most jobs, not -n. |
--cpus-per-task (-c) | -c <num_cores> | -n <num_cores> | cores requested PER TASK, this is what is typically used for multithreaded parallel jobs. |
--nodes (-N) | -N <num_nodes> | -R "span[hosts=<num>]" | Automatically set to 1 except for MPI queue |
--time (-t) | -t <runtime> | -W <runtime> | Specified in minutes by default; other accepted formats are MM:SS (one colon), HH:MM:SS, and D-HH:MM:SS. Required for all job submissions. |
--partition (-p) | -p <queue_name> | -q <queue_name> | Queues are called "Partitions" in Slurm |
--mem-per-cpu | --mem-per-cpu=<memory> | -R "rusage[mem=<memory>]" | RAM memory per core. To avoid confusion, it's best to include a size unit, one of K, M, G, T for kilobytes, megabytes, etc. So e.g. --mem-per-cpu=8G asks for 8 GB per core. |
--open-mode | --open-mode=[append|truncate] | -o vs -oo; -e vs -eo | Choose whether to append to or truncate output files. Default action on Orchestra was like append. |
--output (-o) | -o <output_file> | -o <output_file> | default = stdout and stderr are in same file, named "slurm-%j.out", where the "%j" is replaced by the job ID. The default file name for job arrays is "slurm-%A_%a.out", where "%A" is the job ID and "%a" is the array index. |
--error (-e) | -e <err_file> | -e <err_file> | |
--job-name (-J) | --job-name=<jobname> | -J <jobname> | default = name of batch script |
--mail-type | --mail-type=<type>[,<type>...] | -B (BEGIN) -N (END) | types: NONE, BEGIN, END, FAIL, REQUEUE, ALL (equivalent to BEGIN, END, FAIL, REQUEUE, and STAGE_OUT), STAGE_OUT (burst buffer stage out and teardown completed), TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 (reached 80 percent of time limit), and TIME_LIMIT_50 (reached 50 percent of time limit) BEGIN, END, FAIL will only send once if used with a job array. |
--mail-user | --mail-user=<email> | -u <email> | default = user that submitted the job (~/.forward files work like on Orchestra) used with --mail-type |
--dependency (-d) | --dependency=<deps> | -w <deps> | See below for details on specifying dependencies |
--wrap | --wrap='echo "hello"' | bsub echo "hello" | Runs a command without needing to make a shell script |
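Pulling several of these flags together, a command-line submission might look like the following (the values are illustrative, and the bowtie arguments are elided as before):

sbatch -p short -t 0-02:00 -c 4 -N 1 --mem-per-cpu=4G \
       -o bowtie_%j.out -e bowtie_%j.err \
       --mail-type=END --wrap="bowtie -p 4 ..."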
Job Customization
If you need some flexibility in your script (e.g. you need to submit the same job but with multiple input files), submission scripts can take command line arguments as well as inherit environment variables (your entire current environment is exported with the job when it submits). Here is a shell script that takes a single command line argument:

#!/bin/bash
#SBATCH -c 1                              # Request one core
#SBATCH -N 1                              # Request one node (if you request more than one core with -c, also using
                                          # -N 1 means all cores will be on the same node)
#SBATCH -t 0-00:05                        # Runtime in D-HH:MM format
#SBATCH -p short                          # Partition to run in
#SBATCH --mem=100                         # Memory total in MB (for all cores)
#SBATCH -o hostname_%j.out                # File to which STDOUT will be written, including job ID
#SBATCH -e hostname_%j.err                # File to which STDERR will be written, including job ID
#SBATCH --mail-type=FAIL                  # Type of email notification: BEGIN, END, FAIL, ALL
#SBATCH --mail-user=abc123@hms.harvard.edu   # Email to which notifications will be sent

echo $1 > file.txt
If you submit this script like so:
sbatch arguments.sh hello
When the job finishes, you should have an output file called file.txt that contains the text hello. If you are savvy at bash, you can use this functionality to write loops that vary the parameters sent to your submission script, submitting lots of similar jobs at the same time without having to manually modify the script itself.
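For example, a hypothetical loop that submits one copy of arguments.sh per fastq file in the current directory:

for f in *.fastq; do
    sbatch arguments.sh "$f"    # "$f" becomes $1 inside arguments.sh
done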
You can also use environment variables. Say you run the following command:
export DESTFILE=file.txt
Then make the following change to arguments.sh before submitting it as above:
echo $1 > $DESTFILE
You will still get the same result as above, even though you have not exported DESTFILE inside your submission script. The value of DESTFILE was exported along with the rest of your environment at submission time, because it was already set. However, it is recommended that you make such exports inside your submission script for documentation purposes, so that you have a convenient record of what you set those variables to if you return to your submission some time later.
The srun command
The srun command is used for job submissions and shares many options with sbatch. However, srun is usually used for job steps, not batch jobs like sbatch. Some example use cases for srun are starting an interactive session, or running job steps (such as within a script that will be submitted with sbatch). Depending on how srun is invoked, you may need to request resources to be allocated.
Interactive Sessions
In the dev cluster, interactive sessions are currently configured to be submitted to any partition, not just interactive. To submit an interactive session (with a run time of 12 minutes and 8 GB of memory):
srun -p interactive --pty --mem 8000 -t 12:00 /bin/bash
For all batch jobs, sbatch should be used instead.
Note: It is not recommended to submit srun commands from within an interactive session. These commands will be executed on the same resource allocation as the interactive job. If you want to submit other jobs from within an interactive job, use sbatch commands instead.
Within sbatch scripts
With SLURM, srun commands can be used in scripts to denote job steps. When monitoring your job, each command prefixed with srun will be shown as a different step. Additionally, if your job fails, job steps will help you troubleshoot by narrowing down which command had the issue.
Here is an example script that will give the hostname of the node it was executed on, and return information about the partitions and nodes in O2.

#!/bin/bash
#SBATCH -c 1                        # 1 core
#SBATCH -t 0-00:05                  # Runtime of 5 minutes, in D-HH:MM format
#SBATCH -p short                    # Run in short partition
#SBATCH -o hostname_sinfo_%j.out    # File to which STDOUT + STDERR will be written, including job ID in filename
#SBATCH --mail-type=FAIL            # Send email notification on job FAIL
#SBATCH --mail-user=abc123@hms.harvard.edu   # Email to which notifications will be sent

srun hostname
srun sinfo
If we save this script as srun_in_sbatch.sh, it can be submitted with sbatch srun_in_sbatch.sh. After the job completes, you can see the job statistics (broken down by job step) by running sacct -j <jobid>.
Getting Job Information
There are multiple ways to find information about your jobs: the squeue command (works on currently pending/running jobs only), emailed job reports (which contain less detail than those from Orchestra), and the sacct command (works for running and completed jobs). More information about job monitoring is detailed in Using Slurm Basic.
Monitoring pending or running jobs
squeue gives information about jobs currently in the scheduler (pending, running, etc.). Unlike bjobs, though, by default it shows jobs from ALL users (equivalent to bjobs -u all). So you will usually want to run squeue -u <userid>.
            JOBID PARTITION    NAME    USER ST      TIME NODES NODELIST(REASON)
              279    short blast.sb   ak150 R      0:02     1 compute-a-16-28
The ST column gives the job state: R means running, PD means pending, etc. A full list of job states can be found in the SLURM squeue documentation.
If you have pending jobs and you'd like to know approximately when they will start, use squeue --start -u <userid>. The START_TIME column reports the estimated time your job will begin. However, this estimate assumes that all other jobs on the cluster will run to the maximum runtime limits that users requested (with sbatch -t). So your job will often begin well before the reported START_TIME.
       JOBID PARTITION   NAME   USER ST     START_TIME NODES SCHEDNODES      NODELIST(REASON)
      1289470   long   wrap  kmk34 PD 2017-05-20T17:19:58   1 compute-a-16-96   (Resources)
Useful squeue options:
squeue option | bjobs equivalent | Notes |
---|---|---|
-j | <jobid> | Report on a specific job ID (or comma-separated set of job IDs) |
-n | -J | Job name |
-o / -O | | Lets you ask for specific types of information about the job. See the squeue man page |
-p | -q | Only jobs in a given partition |
--start | | Report expected start time of pending jobs |
-t | -r, -p, -d | Show jobs in a particular state (e.g. `-t R` is like `bjobs -r`) or comma-separated set of states |
-u | -u | Note that you can also give a comma-separated list of IDs |
-w | -m | Show jobs on a particular node |
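These options can be combined; for instance, to show only your pending jobs in the short partition:

squeue -u <userid> -p short -t PD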
Job completion emails or output/error files
Unlike on Orchestra, job completion emails are not sent by default. You must specify --mail-type and optionally --mail-user in your job submission to receive these notifications. Also, your job output will not be emailed to you, even if you request email notifications. Slurm will automatically put job error and output in a file called slurm-<jobid>.out in the directory you submitted the job from. If you would prefer, you can direct the output by specifying the -o (stdout) and -e (stderr) options in the job submission command. Use sacct queries for more detailed information.
Job accounting data using sacct
Emailed job reports only report the exit code, status, and run time of your job. More information for a job can be found by running sacct -j <jobid>. By default, sacct only outputs a limited amount of information. You can specify additional output fields using the --format option. You can see a full list of possible fields by running sacct --helpformat.
Here is an example command to obtain accounting information for a completed job:
sacct -j <jobid> --format JobId,NNodes,Partition,NCPUs,State,ReqMem,MaxRSS,Elapsed,CPUTime,TimeLimit,ExitCode,Start,End
If you are contacting Research Computing because a job did not behave as expected, it's often helpful to include sacct output in your request. Just attach it to the email, or paste it into the form.
Job Arrays
There are two ways to run job arrays. You can write the job script just as you would a typical batch script, then submit (for example) 30 copies of it:
sbatch --array=1-30 submit.sh
Or you can insert this line in the script instead:
#SBATCH --array=1-30
(It may be desirable to use %a and %A in output filenames in this case.)
If files are cleverly numbered, you can reference them with ${SLURM_ARRAY_TASK_ID}, which fetches the array task ID. For example, if you need to process 30 fastq files named (something like) fastq1.fastq, fastq2.fastq, etc.:

<command that processes fastq files> /path/to/fastq"${SLURM_ARRAY_TASK_ID}".fastq

A different input file will go to each job in the array, mapped 1-to-1.
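Putting this together, a minimal job array script might look like the sketch below (the fastq path and the processing command are placeholders, as above). Saved as, say, fastq_array.sh, it would be submitted with sbatch --array=1-30 fastq_array.sh.

#!/bin/bash
#SBATCH -c 1
#SBATCH -t 0-01:00
#SBATCH -p short
#SBATCH -o fastq_%A_%a.out     # one output file per array task

<command that processes fastq files> /path/to/fastq"${SLURM_ARRAY_TASK_ID}".fastq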
The full set of environment variables when a job array is submitted:
ENV_VAR | function |
---|---|
SLURM_JOB_ID | the jobID of each job in the array (distinct). |
SLURM_ARRAY_JOB_ID | the jobID of the whole array (the same for every job in the array; equal to the SLURM_JOB_ID of the first job dispatched in the array). Passable with %A. |
SLURM_ARRAY_TASK_ID | index of the job in the array (distinct). Passable with %a. |
To control how many jobs can execute at a time, specify this inside the --array flag with %. To modify the above sbatch command so that only 5 jobs in the array run at a time:
sbatch --array=1-30%5 submit.sh
This also works with the #SBATCH directive.
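For example, inside the script (using the same array size of 30 and a throttle of 5):

#SBATCH --array=1-30%5    # at most 5 array tasks run at once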
To cancel specific jobs in the array:
# cancel job array 20, entries 5-10
scancel 20_[5-10]

# cancel job array 20, entries 15 and 17
scancel 20_15 20_17

# Cancel the current job or job array element (if job array)
if [[ -z $SLURM_ARRAY_JOB_ID ]]; then
    scancel $SLURM_JOB_ID
else
    scancel ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
fi
Job Dependencies
If you need to submit a job that relies on a previous job or jobs (e.g. <jobid> must complete successfully first), use:
sbatch --dependency=afterok:<jobid>[:<jobid>...] submit.sh
Other dependency parameters:
parameter | usage |
---|---|
after:jobid[:jobid...] | asynchronous execution (begin after <jobid>(s) has begun) |
afterany:jobid[:jobid...] | begin after <jobid>(s) has terminated (EXIT or DONE) |
afterok:jobid[:jobid...] | begin after <jobid>(s) have successfully finished with exit code of 0 |
afternotok:jobid[:jobid...] | begin after <jobid>(s) has failed |
singleton | begin after any jobs with the same name and user have terminated |
Using ? with a dependency allows it to be satisfied no matter what. It is possible to chain multiple dependencies together:
sbatch --dependency=afterok:1:2,afterany:3:4,?afternotok:5 submit.sh
This job will start only after jobs 1 and 2 have completed successfully, AND after 3 and 4 have terminated, AND after 5 has failed (or not failed, because of the ?). If any requirement is not met (e.g. job 1 fails), the job will never run. The squeue command will show these jobs with unmet requirements as PENDING status, with DependencyNeverSatisfied as the reason why the job is waiting for execution.
To allow Slurm to manage dependencies for you, provide the following flag:
--kill-on-invalid-dep=<yes|no>
If set to yes, the job will automatically cancel itself if a dependency fails. Our current configuration already removes jobs whose dependencies will never be met, but it is probably best to always include this flag when submitting a job with dependencies (in case our configuration changes in the future).
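For example, an illustrative submission that depends on job 12345 and removes itself if that dependency can never be satisfied:

sbatch --dependency=afterok:12345 --kill-on-invalid-dep=yes submit.sh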