O2 Cluster Status

This page shows all service outages for the O2 cluster, including planned maintenance and unplanned events.

ONLINE

Scheduled Maintenance and Current Outages:

Date

Service

Issue

Previous Service Outages:

Date

Service

Issue

2025-11-12

O2 Portal

Multi-Factor Authentication change on the O2 Portal

  • This update applies only to the O2 Portal. All other access to the O2 cluster, including Secure Shell (SSH) and file transfer (SFTP) applications, will continue to require two-factor authentication (2FA) via the Duo Mobile app. 

--

To create a more consistent and secure sign-in experience, HMS IT updated the O2 Portal to use Okta Verify instead of the Duo Mobile app for authentication on Wednesday, November 12, 2025. After the update, you must sign in using Okta Verify when you access the O2 Portal. 

Set up Okta Verify for your HMS account to ensure uninterrupted access. If you already use Okta Verify for other HMS services, no action is needed. 

If you have questions or concerns about using Okta Verify, visit the HMS IT Service Portal. You can also contact HMS IT at 617-432-2000 or itservicedesk@hms.harvard.edu.

2025-10-22

O2 Portal

Some users were unable to log in to the O2 Portal, while others were still able to do so. HMS IT identified the root cause of the problem, which was resolved on the morning of Thursday, Oct 23.

2025-10-18

O2 and related services

The O2 high-performance compute cluster and related services, such as storage and HMS-hosted servers and databases, were unavailable due to an unplanned infrastructure outage.

2025-08-12

O2 HPC Cluster

The Slurm job scheduler was temporarily degraded, which may have affected the ability to submit new jobs.

2025-07-29

O2 Portal

July 29: We are pleased to announce that the major upgrade of the O2 Portal has been successfully completed. The system is functional, including the new version 4 features.

Our IT team is finalizing a few remaining tasks to ensure everything is functioning optimally. Researchers experiencing problems may need to log out and back in, or clear their browser cache. Please let us know if specific applications are not working as expected.


To improve performance and provide new user-friendly features, HMS Research Computing will upgrade the O2 Portal software on Tuesday, July 29, 2025, from 7:00 AM to 9:00 AM EDT (UTC-4).

During this upgrade, the O2 Portal will be briefly unavailable. To minimize disruption, plan your portal-dependent tasks accordingly:

  • There will be no impact on batch jobs that are already running on O2.

  • A user working in the O2 Portal will lose their connection after their browser refreshes.

  • Users will need to relaunch and re-establish the connection to any previously running applications.

  • New connections to O2 Portal will not work until the upgrade is complete.

Otherwise, existing applications running in the O2 Portal will continue operating normally and will be accessible again after the upgrade is complete. You can still access O2 jobs, login nodes, and the transfer cluster without interruption through a terminal.  

This upgrade moves us from version 2 to version 4 of the Open OnDemand software. Version 4 offers a more modern and intuitive interface, quicker file editing, improved interactive application management, enhanced insights into job efficiency, and additional security enhancements. These improvements aim to make your experience smoother and more productive. To learn more, see the detailed Open OnDemand v4.0 Release Notes.

2025-07-22 → 2025-07-23

O2 HPC Cluster

Tue., July 22: Around 9 PM, O2 started experiencing authentication issues, causing terminals to repeatedly ask for a password.

Wed. July 23: Authentication issues resolved, normal O2 terminal login function restored.

2025-07-11

O2 HPC Cluster

Mon., July 14: O2 cluster capacity has been restored.

Sat., July 12 (9am) update: O2 cluster capacity has been largely restored. While users may encounter additional wait times for jobs to run because of reduced capacity, fixed nodes are processing jobs normally. Jobs that were running before nodes became unresponsive will need to be resubmitted. 

HMS IT is working to bring remaining nodes up as quickly as possible, with additional capacity expected to become available by Monday.

Fri., July 11: Around 4pm this afternoon, many O2 cluster nodes became unresponsive.  

Researchers may see their submitted jobs not working as expected, Slurm scheduler commands (such as squeue) failing, and interactive sessions closing or freezing up. 

We're working to resolve this issue as quickly as possible, and we'll follow up with any critical updates.

April-June 2025:

COMPLETED

O2 HPC cluster migration to Red Hat Enterprise Linux (RHEL) 9.x

OVERVIEW: O2 Cluster Linux version update

To ensure ongoing security and functionality, we have updated O2 to run under Red Hat Enterprise Linux (RHEL) version 9.x. The O2 cluster’s previous Linux operating system, CentOS 7, had reached its end of life and was no longer able to receive updates or security patches.

ACTION REQUIRED!

Any software compiled by end users on O2 under the old OS (before June 2025) will need to be recompiled, and then reinstalled in the new RHEL environment, including R, Python, and Conda packages. We understand this may cause some inconvenience, but it is essential for modernizing our IT infrastructure. 

  • Please see our guide for help in recompiling your applications under RHEL.
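For example, a Python virtual environment built under CentOS 7 can be recreated under RHEL 9 roughly as follows; this is a minimal sketch, and the module versions and environment path are assumptions (use module spider python to see what is actually installed):

    # Load a RHEL 9 toolchain and Python module (versions shown are assumptions)
    module purge
    module load gcc/9.2.0 python/3.10.11

    # Recreate the virtual environment from scratch under the new OS
    rm -rf ~/envs/myproject            # hypothetical path to the old environment
    python3 -m venv ~/envs/myproject
    source ~/envs/myproject/bin/activate

    # Reinstall the packages your workflow needs
    pip install --upgrade pip
    pip install -r requirements.txt    # or list packages explicitly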

After the update, user groups will have slightly different names under RHEL due to a required change which will further standardize O2’s authentication.

This will not affect how you log in. Your O2 username is still your HMS ID.

Changes:

  1. Your O2 account’s personal group, the default group associated with your user, will now have a “.group” appended. e.g. User abc123 will have a personal group of abc123.group instead of the current “abc123”.

  2. If your lab has a shared folder, in most cases the lab’s group name will have a prefix of “hpc_”. e.g. group smith will now be named hpc_smith.

  3. Some groups, mostly those used under /n/files, will have whitespace characters. e.g. group GENETICS_smith will now be named GENETICS smith.

Group names are mostly relevant for those who need to modify the group owner of files (e.g. using chgrp), such as lab data managers. This change does not affect your group memberships or access to data, only what the groups are called.
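As an illustration, a lab data manager who previously set group ownership with the old name would switch to the new name (the group smith follows the example above; the folder path is hypothetical):

    # Before the update the shared folder's group might have been set with:
    #   chgrp -R smith /n/groups/smith/shared_data
    # After the update, the same operation uses the renamed group:
    chgrp -R hpc_smith /n/groups/smith/shared_data

    # To list the groups your account belongs to under the new names:
    groups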

After the update, most role / application users will have slightly different names under RHEL due to a required change which will further standardize O2’s authentication.

Change:

  • A role user will have a prefix of “hpc_”. e.g. User smithweb will now be named hpc_smithweb.

Role users are not tied to individuals, and do things like run background processes on a server. Most O2 researchers are not affected by this change.

If you use a container on O2 and plan to continue using it after this OS update, please send us its full path so we can pre-install it for you: rchelp@hms.harvard.edu
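To find the full path of a container image for that request, you can run (the filename below is hypothetical):

    # Print the absolute path of a container image file
    readlink -f mytool.sif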

  • Please note: the O2 login process no longer requires lower case for the letters in the username.

  • If you encounter any issues using RHEL, please contact Research Computing at rchelp@hms.harvard.edu

2025-06-23 → 2025-06-26

O2 HPC Cluster

Planned outage for the O2 cluster migration to Red Hat Enterprise Linux (RHEL) 9.x

START: Monday, June 23, 2025, at 9:00 AM (UTC-4)

END: Thursday, June 26, 2025 at 5:00 PM (UTC-4)

  • Websites hosted on O2 will remain online, but will not be able to submit cluster jobs.

2025-06-17

O2 Cluster

June 28: Issue was resolved with the update to Red Hat Enterprise Linux.

June 18: HMS IT is still investigating the root cause, so Slurm errors may still occur, but job scheduling is running more smoothly overall.

June 17: Around 10:30 AM this morning, O2 nodes began to close and Slurm controllers became unresponsive. 
These issues may cause users to experience errors with job submission, failing Slurm commands, and internal errors in the O2 Portal. 

HMS IT is investigating the cause of these issues.

2025-04-03

O2 Cluster software: /n/app

A software freeze is in place, meaning that HMS Research Computing will no longer update any O2 applications accessible as a "module", and HMS IT will no longer update CentOS 7 software.

2025-03-18 → 2025-03-27

O2 Cluster

03-18: A brief storage outage around 9-11 PM on 03/13 put a significant portion of the O2 Cluster into a state where the affected nodes will not accept additional jobs. Our DevOps and Research Computing teams are working on restarting these nodes to resume normal operation. 
If your job has landed on one of these nodes, it may experience some latency or not work properly. To work around this, cancel the impacted job and resubmit it so it lands on a different node.
03-19: After some investigation, our teams found an issue with the network file system connection on nodes. Addressing this will require a gradual reboot of most nodes on the cluster. To continue working around this issue, cancel the impacted job and resubmit it so it lands on a different node (see the example below).

We don't currently have a time estimate for when this will be complete. However, reboots may need to be run until next week (03/24-03/28). We will update this timeline as more information becomes available. 
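A minimal sketch of that workaround, assuming a job ID of 12345678 and an affected node named compute-a-16-01 (both hypothetical):

    # Find your affected job and the node it landed on
    squeue -u $USER

    # Cancel the impacted job
    scancel 12345678

    # Resubmit, excluding the problem node so the job lands elsewhere
    sbatch --exclude=compute-a-16-01 my_job.sh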

03-27: A batch of high-I/O jobs contributed to nodes becoming slow to respond, causing Slurm to eventually refuse jobs on these nodes. These jobs have been discontinued, and the affected nodes have been restarted. 

O2 performance issues have now been resolved.

2025-01-22 → 2025-01-23

/n/files

To improve performance and keep our storage systems updated, HMS IT will migrate data on the research.files.med.harvard.edu server to a new storage array.

Outage window: Wednesday, January 22 (5pm) to Thursday, January 23 (approx. 9am-10am), 2025.

  • This will only affect the O2 filesystem /n/files, which is only accessible from the transfer servers (transfer.rc.hms.harvard.edu) and transfer compute nodes.
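For reference, /n/files is normally reached through the transfer environment; a minimal sketch of access once the migration is complete (the username abc123 and folder names are hypothetical):

    # Log in to the transfer server, where /n/files is mounted
    ssh abc123@transfer.rc.hms.harvard.edu

    # Or copy data out of /n/files with rsync
    rsync -av abc123@transfer.rc.hms.harvard.edu:/n/files/myfolder/ ./myfolder/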

2025-01-03 → 2025-01-10

Full O2 outage

HMS IT is undertaking a project to relocate our data center within the Markley Data Center to optimize our IT infrastructure. This project will relocate existing services, consolidate servers, reduce power consumption, and decommission outdated hardware to improve efficiency, enhance resiliency, and lower costs. Users can bookmark the web page for this project, which will be updated as more information becomes available.

O2 will be impacted from Friday, January 3, 5:30 PM EST, to Friday, January 10, 2025.

  • Jan 3 (5:30-6:00 PM): O2 login access will be turned off.

  • Jan 3 (6:00 PM): O2 systems will start being powered off.

During this time:

  • The O2 Cluster will be completely offline, including O2 Portal.

  • All data on O2 will be inaccessible.

  • Websites hosted on O2 will be completely offline, including all web content, not just job submissions.

  • Other services reliant on O2 will be affected.

2024-12-04

filesystem /n/standby

HMS IT is moving to a new Standby storage system designed to efficiently manage large amounts of data on Wednesday December 4th, 9am – 5pm EST. Standby storage will be inaccessible during this period. 

  • In the O2 environment, this maintenance only affects /n/standby on O2's file transfer systems. No other O2 research data or services should be affected. 

Be sure to save any work stored in these affected directories before the migration begins. Do not try to access files in these locations during the migration. The files will be available again after the migration is complete.  

2024-07-17

O2 web, filesystem /n/www

As part of the HMS Research Data Migration Project, HMS IT will migrate the /n/www filesystem to a new storage cluster:

START: Wednesday, July 17, 2024, at 9:00 AM (UTC-4)

END: Wednesday, July 17, 2024, at 12:00 PM (UTC-4). The maintenance took longer than anticipated but completed successfully by 5:15 PM.

The O2 Cluster will be online during this time because this change does not impact the Slurm job scheduler. 

During this outage on July 17 (9:00 AM to 12:00 PM):

  • Websites and services hosted on O2 Web Hosting will be unavailable for the duration of the migration.

  • The genomebrowser-uploads website will be unavailable.

  • The /n/www filesystem will be inaccessible.

In preparation, on Tuesday July 16, at 4:00 PM EDT (the day before the outage):

  • O2 compute nodes will not have access to /n/www. Any O2 cluster job relying on /n/www will fail.

  • Until the full outage begins on July 17, however, websites will remain up and /n/www will still be available on O2 login nodes.

2024-07-03

O2

A performance issue affected a number of O2 services, including:

  • O2 Portal

  • Jupyter Notebooks

  • RStudio

  • MATLAB

2024-06-25 → 2024-06-28

filesystem /n/groups

As part of the HMS Research Data Migration Project, HMS IT will migrate the /n/groups filesystem to a new storage cluster:

START: Tuesday, June 25, 2024, at 5:00 PM (UTC-4)

END: Friday, June 28, 2024, at 12:00 PM (noon, UTC-4)

The O2 Cluster will be online during this time because this change does not impact the Slurm job scheduler. 

However,  /n/groups will be inaccessible during this period.

** Any running jobs which rely on accessing this filesystem will fail once the maintenance begins.

2024-05-29 → 2024-05-31

filesystems /n/data2 /n/no_backup2

As part of the HMS Research Data Migration Project, HMS IT will migrate the /n/data2 and /n/no_backup2 filesystems to a new storage cluster:

START: Wednesday, May 29, 2024, at 9:00 AM (UTC-4)

END: Friday, May 31, 2024, at 5:00 PM (UTC-4)

The O2 Cluster will be online during this time because this change does not impact the Slurm job scheduler. 

However,  /n/data2 and /n/no_backup2 will be inaccessible during this period.

** Any running jobs which rely on accessing these filesystems will fail once the maintenance begins.

2024-04-13 → 2024-04-16

filesystems /n/data1 /n/cluster /n/shared_db

As part of the HMS Research Data Migration Project, HMS IT will migrate the /n/data1, /n/cluster, and /n/shared_db filesystems to a new storage cluster:

  • START: Saturday, April 13, 2024, at 9:00 AM

  • END: Tuesday, April 16, 2024 at 10:00 AM

  • The migration was completed on Monday Apr 15 at approximately 7:30 PM

The O2 Cluster will be online during this time because this change does not impact the Slurm job scheduler. 

However, /n/data1, /n/cluster, and /n/shared_db will be inaccessible during this period. Keep in mind:

  • /n/data1 contains shared folders for some groups that use O2.

  • /n/cluster contains tools including: quota-v2, O2sacct, O2squeue, O2usage, scratch_create_directory.sh

  • /n/shared_db contains public databases for research use

** Any running jobs which rely on accessing these filesystems will fail once the maintenance begins.

2024-02-13 → 2024-02-15

O2 Cluster

After a successful storage migration, pending jobs on O2 were allowed to dispatch as of 10AM, while login services started coming online as of 10:30 AM.

To provide more robust and reliable storage, HMS IT will migrate all O2 Home folders and the Slurm job scheduler software to a new storage cluster during the following window:

START: Tuesday, February 13, 2024, at 5:00 PM EST (UTC-5). 

END: Thursday, February 15, 2024, at 10:00 AM EST (UTC-5). 

During this time, the O2 Cluster will be offline. This means: 

  • No jobs will run during the outage.

  • O2 sign-in services will be unavailable.

  • Jobs submitted to O2 through websites will not run until after the outage.

  • Globus file transfer will be unavailable.

  • Other services reliant on O2 will be affected.

Jobs scheduled to run during the outage will be postponed with the reason “ReqNodeNotAvail, Reserved for maintenance”; they will start after the upgrade is complete. 

If a job needs to be completed before the upgrade, schedule it as soon as possible.
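To check whether a pending job is being held for the maintenance window, you can display the scheduler's pending reason with squeue (the job ID and name below are illustrative):

    # Show job ID, name, state, and the reason a job is pending
    squeue -u $USER -o "%.12i %.20j %.8T %r"

    # Example output (illustrative):
    #        JOBID                 NAME    STATE REASON
    #     12345678             align.sh  PENDING ReqNodeNotAvail, Reserved for maintenance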

2024-02-07

O2 Cluster

An issue with the O2 storage environment affected access to O2:

  • O2 logins are unavailable

  • O2 Portal is unavailable

  • O2 transfer cluster is unavailable

  • Currently running jobs may be affected

2024-02-05

O2 Cluster

There was an HMS-wide network outage on the morning of Feb 5, which affected access to the O2 cluster as well as most other HMS services.

Please note that O2 jobs running during the network outage may have been affected, depending on the type of job and the nature of the network outage, which is still being determined.

2023-12-08 → 2024-01-16

O2 scratch storage

To provide more robust and reliable storage, HMS IT has deployed a new storage cluster, designated as /n/scratch, to replace the current /n/scratch3.

The /n/scratch3 filesystem is being retired on Jan 16, 2024.

The timeline for this update is:  

  1. November 13 – Beta access to new /n/scratch for preliminary testing. 

  2. December 8 – Full access to /n/scratch for all users. The existing /n/scratch3 will temporarily remain available. 

  3. January 8 – /n/scratch3 becomes read-only. 

  4. January 16 – /n/scratch3 is retired. The path /n/scratch3 will no longer exist on O2, and no data will be recoverable from the old /n/scratch3.

Please update your workflow to use /n/scratch by January 8, and see our documentation about changes in the new /n/scratch: https://harvardmed.atlassian.net/wiki/spaces/O2/pages/2652045313

  • HMS IT is not migrating any data from scratch3 to the new /n/scratch.

  • Please copy any required data to the new space by January 16, when /n/scratch3 is retired.  
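A minimal sketch of copying data ahead of the retirement, assuming the per-user directory layout /n/scratch3/users/<initial>/<user> carries over to the new filesystem (the username abc123 and project folder are hypothetical):

    # Copy a project folder from the old scratch space to the new one
    rsync -av /n/scratch3/users/a/abc123/myproject/ /n/scratch/users/a/abc123/myproject/

    # Then point job scripts at the new location, e.g. replace
    #   SCRATCH=/n/scratch3/users/a/abc123
    # with
    #   SCRATCH=/n/scratch/users/a/abc123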

If you have any questions or concerns, contact Research Computing at rchelp@hms.harvard.edu

2023-12-06 → 2023-12-07

O2 Cluster

To enhance your experience with our network-based storage and prepare for future growth, HMS IT will make upgrades during the following window:

  • START: Wednesday, December 6, 2023, at 8:00 AM

  • END: Thursday, December 7, 2023, at 8:00 AM EST (UTC-5).

    • O2 services were restored at approximately 7:00 PM on Wednesday December 6.

During this time, the O2 Cluster will be offline. This means: 

  • No jobs will run during the outage 

  • O2 login services will be unavailable 

  • Jobs submitted to O2 through websites will not run until after the outage 

  • Other services reliant on O2 will be affected  

Jobs scheduled to run during the outage will be postponed; they will start after the upgrade is complete.

If a job needs to be completed before the upgrade, schedule it as soon as possible. 

If you have any questions or concerns, contact Research Computing at rchelp@hms.harvard.edu.

2023-09-18 → 2023-09-22

Standby Storage

HMS IT performed a gradual storage server upgrade on the HMS Standby storage server.

No impact is expected, but O2 users should avoid doing any large data transfers involving the Standby filesystem (/n/standby), just to allow the upgrade to proceed as smoothly as possible.

2023-08-21

O2 Portal, Group and website Storage

A storage outage affected the availability of the following filesystems:

  • /n/data1

  • /n/data2

  • /n/log

  • /n/no_backup2

  • /n/shared_db

  • /n/www

If your O2 jobs access any of these filesystems, they may fail and need to be re-run after the outage is resolved. You may also have problems cd’ing into or seeing data in certain directories. The data is safe; it’s just the access to the data from O2 that is not working. This outage may also affect O2 logins and access to the O2 Portal.

2023-08-01

filesystems /n/data1 /n/data2 /n/www /n/nobackup /n/shared_db /n/standby

Several storage filesystems serving the O2 cluster and related services were not responding. We temporarily suspended all pending and running jobs.

The Storage team investigated and resolved the issue.

2023-07-16

/n/groups filesystem unavailable

Start Time: Thursday, July 13 at 7:00 PM

IMPACT:

Scheduled migration of /n/groups storage filesystem to the new hardware is taking longer than expected.

Due to this delay, the filesystem /n/groups is not available at this time on O2.
Any new job submitted to O2 that requires the /n/groups filesystem will fail. Please do not submit new jobs requiring access to /n/groups.

We will notify you once the storage migration is completed and /n/groups is available in O2.

If you have any questions or concerns, contact Research Computing at rchelp@hms.harvard.edu

--

Access to /n/groups has been restored on O2 login, compute, and transfer nodes. All files should be in the same state they were in on Thursday when the outage began. IT is taking steps to prevent the type of situation that led to this issue from happening again.

2023-07-13 → 2023-07-16

PLANNED FULL O2 CLUSTER OUTAGE

To increase the efficiency and security of the O2 cluster, HMS DevOps will upgrade the Slurm job scheduler.

Maintenance Window

  • START: Thursday, July 13 at 7:00 PM

  • END: Sunday, July 16 at noon (completed on 07/17/2023 at 1 PM)

This upgrade will require the O2 cluster to be offline, and as a result, no new jobs will be accepted during this period. To prevent disruption to your work, ensure all running jobs are complete before the upgrade begins.
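One way to check whether your running jobs will finish in time is to compare each job's elapsed time against its requested time limit with squeue:

    # %M = time used so far, %l = time limit requested for the job
    squeue -u $USER -t RUNNING -o "%.12i %.20j %.10M %.10l"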

Certain services related to the O2 cluster will be affected during the upgrade period. In particular:

  • the O2 login servers at o2.hms.harvard.edu will be offline

  • the O2 Portal will be offline

  • you will not be able to submit or execute jobs, including from websites.

  • the filesystem /n/groups will also be offline