NOTICE: FULL O2 Cluster Outage, January 3 - January 10th

O2 will be completely offline for a planned HMS IT data center relocation from Friday, Jan 3, 6:00 PM, through Friday, Jan 10.

  • on Jan 3 (5:30-6:00 PM): O2 login access will be turned off.
  • on Jan 3 (6:00 PM): O2 systems will start being powered off.

This project will relocate existing services, consolidate servers, reduce power consumption, and decommission outdated hardware to improve efficiency, enhance resiliency, and lower costs.

Specifically:

  • The O2 Cluster will be completely offline, including O2 Portal.
  • All data on O2 will be inaccessible.
  • Any jobs still pending when the outage begins will need to be resubmitted after O2 is back online.
  • Websites on O2 will be completely offline, including all web content.

More details at: https://harvardmed.atlassian.net/l/cp/1BVpyGqm & https://it.hms.harvard.edu/news/upcoming-data-center-relocation

Frequently Asked Questions and Answers



Accounts and logging in

How do I request an O2 account?

The prerequisite for obtaining an O2 account is an HMS account (formerly called an HMS eCommons ID), as O2 uses these credentials for cluster authentication. You can request an HMS account at https://harvardmed.service-now.com/stat?id=account if you do not already have one. You can also reach out to the HMS IT Service Desk at itservicedesk@hms.harvard.edu or 617-432-2000 with any problems registering or managing your HMS account.

You can request an O2 account using the O2 Account Request Form, which requires an HMS account to access. You will receive an email notification once we have created your account.

I have an O2 account. How do I login to the O2 cluster?

You can connect to O2 using ssh (secure shell) at the hostname: o2.hms.harvard.edu. If you're on Linux or Mac, you can use the native terminal application. If you're on Windows, you will need to install a program to connect to O2; we recommend MobaXterm. In either terminal or MobaXterm, type the following command:

ssh yourHMSaccount@o2.hms.harvard.edu

where you replace yourHMSaccount with your actual HMS account ID (something like js123 if your name is John Smith). Make sure your HMS account is in lowercase. You will be prompted to enter your HMS account password. The cursor will not advance as you type your password; this is a security feature of Linux. If you're not on an HMS network, you'll need to go through Two Factor Authentication. See the Two Factor Authentication on O2 and Two Factor Authentication FAQ pages for more information on setting up two-factor authentication to connect to O2.

Once you successfully authenticate with your HMS account password and through two-factor (if required), you'll be on one of the O2 login servers.

For more details on how to login to the cluster, please reference this wiki page.

I can't login to O2!

All cluster logins from outside the HMS network require two-factor authentication. For more details, please reference Two Factor Authentication on O2 and Two Factor Authentication FAQ. Please contact us if you are having trouble with two-factor authentication on O2. 

Please do NOT send us or anyone else your password. Ever. We can assist you without knowing your password, and sharing accounts on the cluster is prohibited by Harvard security policy.

If you're having difficulty logging in to O2, make sure you're using your HMS account (in lowercase) and HMS account password to log in. If that does not resolve the problem, you can reset your HMS account password. Contact the HMS IT Service Desk (itservicedesk@hms.harvard.edu, or 617-432-2000) if you're unable to log in to the HMS account management website. Your HMS account may also have been locked due to too many failed login attempts on the O2 cluster; once you are able to log in to the HMS account management site, wait 1 hour and then try logging in to O2 again. If you're still facing problems, send in a ticket to us.

Files, Storage, Quotas

Where can I put my data?

There are several different filesystems (or locations to store data) that each researcher will have access to. See Filesystems, which starts with a basic rundown of which kind of data belongs where.

Are my files automatically backed up?

It depends. See Filesystems. Temporary filesystems (like the scratch filesystem in /n/scratch, or /tmp, which is a hard drive on individual compute nodes) are not backed up, and are occasionally purged of data. We strongly encourage you to use a backed-up filesystem for important data. Don't store the only copy of your data on your desktop unless it is reliably backed up. 

Help! I deleted a file/directory/thesis!

See the Restoring backups section of Filesystems. As that section describes, IF the data was on a backed-up filesystem, you can actually restore the data yourself. If you run into trouble, contact Research Computing, and we'll do our best to help you. We strongly encourage you to use a backed-up filesystem for important data. For example, even Research Computing has no way to restore deleted data on the scratch filesystem.

How do I get data to/from O2?

See File Transfer.

How much can I store on O2?

It depends. See Filesystem Quotas.

Starting Jobs

I just want to run a job!

If you want to run a program called Analyze that you would run like this from the command line:

Analyze -i input.fasta -o output.txt

then to run it on the cluster you would need to create an sbatch script to submit it as a batch job.

For example, an sbatch script called analyze.sh contains:

#!/bin/bash
#SBATCH -p priority
#SBATCH -t 0-1
#SBATCH -o analyze.%j.out
#SBATCH --mem 2G
#SBATCH -c 1

Analyze -i input.fasta -o output.txt

The job will run in the priority partition for one hour, using 1 core and 2 GiB of memory. The output of the job will go to a file called analyze.%j.out, where %j will be replaced with the numeric SLURM job ID.

Note: A job gets 1 GiB of memory if you don't explicitly ask for more (or less), and 1 GiB is plenty for many applications. The less memory you ask for, the faster your job will start. So only ask for extra memory if you need it – i.e. if your job dies with an error that it went over the memory limit.

You submit the sbatch script to the SLURM scheduler with the sbatch command.
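For example, for the analyze.sh script above:

sbatch analyze.sh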

Please reference the Using Slurm Basic page for a longer introduction to what it means to submit jobs to O2.



If you want to debug or compile code, where you'll need to run a bunch of different programs one at a time, the fastest way to get started is to request an interactive job:
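A minimal sketch of such a request, assuming the interactive partition and a bash login shell (adjust the options to your needs):

srun --pty -p interactive -t 0-2 /bin/bash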

This will start an interactive job with a two hour time limit. From here, you can compile applications or run programs. Note that there are still limits on the amount of CPU or memory resources available to you. Your job will be limited to the actual number of core(s) you request; unless specified, a job will be allocated 1 core by default. Additionally, your job will be killed if you try to use more memory than what you have requested.

How do I use the --account argument when submitting jobs?

It is important to know that your HMS ID account (formerly known as eCommons) is not the same as your SLURM account. To check which SLURM accounts are associated with your HMS ID, you can use the sshare -U -u $USER command. For example, the following user is associated with two SLURM accounts, rccg and lab1.
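For example (the user name and output formatting are illustrative; sshare prints one line per account association):

sshare -U -u $USER
  Account    User
  -------    -----
  rccg       js123
  lab1       js123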

If the user wants to submit jobs using the lab1 SLURM account, then

  • From the command line: sbatch --account=lab1 [..other job parameters..]

  • In an sbatch script: #SBATCH --account=lab1

How do I choose which partition to run in?

See How to choose a partition in O2.

There are thousands of jobs PENDING (or PD) in a partition. Will my job take forever to start?

Probably not, though the dispatch time of the job depends on the job priority, the resources you've requested, and the current availability of cluster resources. For example, if the cluster is very busy and you need 250 GiB of RAM for your analysis, your job will pend for a while, as you're essentially asking for a whole compute node to be empty. You can reference the Job Priority page, which details the six factors contributing to a job's priority. The most important factors for the average O2 user are: age (increases the longer your job sits in the queue), partition (jobs in the interactive and priority partitions are dispatched first), and fairshare (tracks the resources you have recently used and compares them with the fair share of computational resources available to each user). Your fairshare will deplete with more usage, but will fully rebound within two days of no cluster usage. When the cluster is busy, the lower your fairshare, the longer your jobs will pend.

Short vs. Long jobs: Is it better to run 48 separate 30 minute jobs or 1 single 24 hour job?

The best job submission strategy depends on many factors. For example, if each of the 48 jobs requires multiple CPUs (more than 2) and a large amount of memory (20-40 GiB or more), then a single longer job is preferable to 48 shorter jobs. If each job only requires 1 CPU core and 1 GiB or less of memory, then running the jobs separately will usually be faster. If you have any questions about optimizing your workflows, please contact us!

My O2 jobs are very important. How can I guarantee that I will be able to run them when I need to?

Please contact us and we'll be happy to work with you on your needs.

How am I supposed to know how long my job will take?

You can use the O2_jobs_report command to get detailed information on resource usage of your completed jobs, including how long the job took. See here for an introduction on using O2_jobs_report.

By running test versions of your workflow (to make sure that the process is correct), you can get a sense of how long the full workflow will run. Remember that jobs can die for a variety of reasons so it's always best to design your workflow so you can quickly recover if it gets interrupted. Please contact us if you would like help with this. Especially if you are just running something once, it's fine to overestimate the runtime limit.

How long should my job(s) take?

The O2 cluster is not designed to work with extremely short jobs (<1 minute). The minimum run time that you can aim for is ~10-15 minutes, based upon the scheduler and the cluster's underlying configuration. If you have a large batch of very short running jobs, the time to process the job submissions will be substantially longer than actually running the jobs, and this may slow the cluster down for everyone. You can write a script to batch sets of jobs together. Please contact us if you want assistance with this process. Another option is to start a session in the interactive partition, and run the many short running jobs in that session.
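As a sketch of the batching approach, assuming a hypothetical process_one command and a directory of input files (partition, runtime, and paths are illustrative):

#!/bin/bash
#SBATCH -p short
#SBATCH -t 0-4
#SBATCH -c 1
#SBATCH --mem 2G
#SBATCH -o batch.%j.out

# Run many short tasks one after another inside a single allocation,
# instead of submitting each one as its own job.
for f in inputs/*.fasta; do
    process_one "$f" > "results/$(basename "$f").out"
done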

My job has to run on a node that has 16GiB of memory free. How can I make sure it goes to the right node?

Use the --mem parameter in your job submission command with the amount of memory you want to request. See the Using Slurm Basic page for more information.
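For example, to request 16 GiB for a batch job (using the analyze.sh script from above as an illustration):

sbatch --mem=16G analyze.sh

or, inside the sbatch script itself:

#SBATCH --mem=16G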

Problems with jobs

Why am I getting the following error “You have more than one Slurm account, must specify --account=NAME where NAME is a Slurm account name. Use sshare -U -u $USER to list your Slurm accounts associations” after submitting jobs to O2?

This is related to a new change on O2. For more information, check the question - How do I use the --account argument when submitting jobs?

Why hasn't my job started yet? Why has it been in PEND state for so long?

Running O2squeue followed by the jobid will give you the expected start time of your job (START_TIME column) and the reason why your job is pending (in the NODELIST(REASON) column). See this page for information on using O2squeue. For further troubleshooting tips for pending jobs, please refer to the "Slurm Job Reasons" and "Jobs that never start" sections of the Troubleshooting Slurm Jobs page.
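For example, following the description above (the job ID is illustrative):

O2squeue 12345678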

Why did my job exit before it finished the analysis?

Probably because it ran too long or used too much memory. If your job runs longer than the runtime limit you give with -t, it will be killed with the TIMEOUT state. This can also occur if your job uses resources incorrectly (e.g., a multi-threaded job that doesn't use sbatch -c, or a badly-behaved Matlab job). See the "Exceeded run time" section of Troubleshooting Slurm jobs for more information on avoiding this error. 

If you use more memory than you reserved (or more than the default memory of 1GiB, if you didn't explicitly ask for a certain amount) then your job will be killed. In output from the sacct command, this job will be seen as CANCELLED by 0, which differentiates it as a job killed by the scheduler instead of a job killed by the user (you will see another number representing your user id instead of 0 in this case). See the "Exceeded requested memory" section of Troubleshooting Slurm Jobs for more information.
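A minimal sketch of checking a finished job with sacct (the job ID is illustrative; the format fields are standard Slurm options):

sacct -j 12345678 --format=JobID,State,ExitCode,Elapsed,Timelimit,MaxRSS,ReqMem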

What does "oom-kill event" mean?

If you're seeing that in a job output, that means your job was killed because it exceeded the memory allocation you requested. Your job should also be in OUT_OF_MEMORY state, which hopefully is fairly clear. Simply request more memory (with --mem or --mem-per-cpu), and eventually your job will complete successfully (or you'll run out of available memory to request, in which case you should contact Research Computing for next steps). See "Exceeded Requested Memory" section of Troubleshooting Slurm Jobs for more information about this specific error.

Why am I getting a "permission denied" error on a previously writable directory? OR Why have all my jobs since a certain time failed when they used to run fine?

Either of these problems can be due to going over the 100GiB quota in your home directory, or over the set quota in a shared group directory. Please read here for more information.

Why are jobs that exceed a partition's time limit killed instead of just being moved to a partition with a longer time limit?

If we moved jobs that exceeded a time limit, users could inappropriately take advantage of the scheduling system by always submitting to the shortest time-limited partition to get quick job execution, after which their long-running jobs would cascade through the partitions with longer limits.

An interactive job works, but running it as batch doesn't

This is likely because you're using sbatch --wrap, or sbatch without writing a script. The --wrap option is not foolproof; more complex commands, such as those that use |, are not interpreted correctly with this job submission method. We recommend that you package up the commands as a script and submit the script using sbatch instead. See the "Submitting Jobs" section of the Using Slurm Basic page for more information on sbatch scripts.

Why can't I plot to files or use graphical user interfaces in my job?

If you are plotting (e.g. in R or Matlab) or trying to use a graphical user interface (GUI) on the cluster, you must set up an X11 session. This is a multi-step process that involves running an X11 server on your desktop/laptop, connecting with ssh -XY, and using srun --x11 for an interactive job (no additional parameter required for sbatch jobs).
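As a rough sketch of that workflow (the partition, time limit, and shell are illustrative):

# On your own machine, with an X11 server (e.g. XQuartz or MobaXterm) running:
ssh -XY yourHMSaccount@o2.hms.harvard.edu

# On the O2 login node, start an interactive job with X11 forwarding:
srun --x11 --pty -p interactive -t 0-2 /bin/bash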

Please keep in mind that we do not recommend using GUI applications (like the graphical mode of Matlab) on the O2 cluster, as you can experience laggy performance as graphics are forwarded over X11 in real time. 

Why did I receive an email that says my priority will be /was lowered due to inefficient jobs?

It means your jobs requested sizeable resources, most of which were not actually needed. Please check Improving O2 Efficiency by Changing User Priority for more information.

Specific Programs or Programming Languages

Can I run Matlab on O2?

Yes! Matlab in particular is so popular that we have a whole separate page for Using MATLAB on O2. As a brief summary:

  • You can run Matlab in graphical mode (editing your program, graphing, etc.) or batch mode (simply running a script)
  • Many Matlab programs written on the desktop can be copied directly to the cluster and work with minimal or no changes
  • O2 has many Matlab toolkits available
  • O2 is particularly useful if you want to run jobs that require a lot of memory (RAM) or processing power. Many jobs can be split into pieces and run in parallel on O2 for a substantial speedup

Can I use RStudio on O2?

Yes, RStudio is available on O2 using X11 forwarding. For more information, please check RStudio on O2. Also, the BioGrids module provides a static version of RStudio as a remote GUI console on O2.

Can I run Jupyter notebooks on O2?

Yes, we have instructions for setting up Jupyter notebooks here. However, note that this process can be prone to failure. We are investigating the feasibility of offering a more robust solution in the future.

How do I run a particular version of Matlab, Java, Perl, Python, R, or some bioinformatics program?

Many programs have multiple versions installed. See which versions of Java, R, or STAR are available with commands like module spider java or module spider R or module spider star, for example. See Using Applications on O2 for more detail.
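For example, a typical sequence for R (the version number is illustrative; module spider will tell you which versions actually exist and whether any prerequisite modules must be loaded first):

module spider R
module spider R/4.2.1
module load R/4.2.1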

How do I get a library for R, Perl, Python, etc.?

We might already have it under a different version of the language. (See Using Applications on O2 for more detail.) You can also install R, Perl, or Python packages in your directory. See Personal R Packages, Personal Perl Packages, and Personal Python Packages.
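As one example, a minimal sketch of installing a Python package into your home directory (the package name is illustrative; see Personal Python Packages for the recommended setup, such as virtual environments):

pip install --user some-package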

Can I do deep learning/machine learning/GPU analyses on O2?

Yes, please reference this page for more information on GPU resources on O2.

Where can I find databases or references on O2?

Many programs available on O2 require databases or reference files to run. We make commonly requested databases and reference files available on O2 under /n/shared_db. For details on how to find the database you’re looking for, please reference the Public Databases page.