O2 Cluster Status
This page shows all service outages for the O2 cluster, including planned maintenance and unplanned events.
ONLINE
Scheduled Maintenance and Current Outages:
| Date | Service | Issue |
|---|---|---|
Previous Service Outages:
| Date | Service | Issue |
|---|---|---|
| 2025-11-12 | O2 Portal | Multi-Factor Authentication change on the O2 Portal. To create a more consistent and secure sign-in experience, HMS IT updated the O2 Portal to use Okta Verify instead of the Duo Mobile app for authentication on Wednesday, November 12, 2025. After the update, you must sign in using Okta Verify when you access the O2 Portal. Set up Okta Verify for your HMS account to ensure uninterrupted access. If you already use Okta Verify for other HMS services, no action is needed. If you have questions or concerns about using Okta Verify, visit the HMS IT Service Portal, or contact HMS IT at 617-432-2000 or itservicedesk@hms.harvard.edu. |
| 2025-10-22 | O2 Portal | Some users were unable to log in to the O2 Portal, while others could do so. HMS IT identified the root cause of the problem, which was resolved on the morning of Thursday, Oct 23. |
| 2025-10-18 | O2 and related services | The O2 high-performance compute cluster, as well as related services such as storage and HMS-hosted servers and databases, was unavailable due to an unplanned infrastructure outage. |
| 2025-08-12 | O2 HPC Cluster | The Slurm job scheduler was temporarily degraded, which could have affected the ability to submit new jobs. |
| 2025-07-29 | O2 Portal | July 29: We are pleased to announce that the major upgrade of the O2 Portal has been successfully completed. The system is functional, including the new version 4 features. Our IT team is finalizing a few remaining tasks to ensure everything is functioning optimally. Researchers experiencing problems may need to log out and back in, or clear their browser cache. Please let us know if there are specific applications that are not working as expected. Original announcement: To improve performance and provide new user-friendly features, HMS Research Computing will upgrade the O2 Portal software on Tuesday, July 29, 2025, from 7:00 AM to 9:00 AM EDT (UTC-4). During this upgrade, the O2 Portal will be briefly unavailable, so plan your portal-dependent tasks accordingly. Existing applications running in the O2 Portal will continue operating normally and will be accessible again after the upgrade is complete. You can still access O2 jobs, login nodes, and the transfer cluster without interruption through a terminal. This upgrade moves us from version 2 to version 4 of the Open OnDemand software. Version 4 offers a more modern and intuitive interface, quicker file editing, improved interactive application management, enhanced insights into job efficiency, and additional security enhancements. These improvements aim to make your experience smoother and more productive. To learn more, see the detailed Open OnDemand v4.0 Release Notes. |
| 2025-07-22 → 2025-07-23 | O2 HPC Cluster | Tue., July 22: Around 9 PM, O2 started experiencing authentication issues causing terminals to repeatedly ask for a password. Wed., July 23: Authentication issues resolved; normal O2 terminal login function restored. |
| 2025-07-11 | O2 HPC Cluster | Mon., July 14: O2 cluster capacity has been restored. Sat., July 12 (9 AM): O2 cluster capacity has been largely restored. While users may encounter additional wait times for jobs to run because of reduced capacity, fixed nodes are processing jobs normally. Jobs that were running before nodes became unresponsive will need to be resubmitted. HMS IT is working to bring the remaining nodes up as quickly as possible, with additional capacity expected to become available by Monday. Fri., July 11: Around 4 PM this afternoon, many O2 cluster nodes became unresponsive. Researchers may see their submitted jobs not working as expected, Slurm scheduler commands (such as squeue) failing, and interactive sessions closing or freezing. We are working to resolve this issue as quickly as possible and will follow up with any critical updates. |
| April–June 2025: COMPLETED | O2 HPC cluster migration to Red Hat Enterprise Linux (RHEL) 9.x | OVERVIEW: To ensure ongoing security and functionality, we have updated O2 to run under Red Hat Enterprise Linux (RHEL) version 9.x. The O2 cluster's previous Linux operating system, CentOS 7, had reached its end of life and could no longer receive updates or security patches. ACTION REQUIRED: Any software compiled by end users on O2 under the old OS (before June 2025) must be recompiled and then reinstalled in the new RHEL environment, including R, Python, and Conda packages (see the examples after the table). We understand this may cause some inconvenience, but it is essential for modernizing our IT infrastructure. After the update, user groups have slightly different names under RHEL due to a required change that further standardizes O2's authentication. This does not affect how you log in; your O2 username is still your HMS ID. Group names are mostly relevant for those who need to modify the group owner (e.g., using chgrp, as sketched after the table), such as lab data managers. This does not affect your group memberships or your access to data, only what the groups are called. Similarly, most role / application users have slightly different names under RHEL. Role users are not tied to individuals and do things like run background processes on a server; most O2 researchers are not affected by this change. If you use a container on O2 and plan to continue using it after this OS update, please send us its full path so we can pre-install it for you: rchelp@hms.harvard.edu |
| 2025-06-23 → 2025-06-26 | O2 HPC Cluster | Planned outage for the O2 cluster migration to Red Hat Enterprise Linux (RHEL) 9.x. START: Monday, June 23, 2025, at 9:00 AM (UTC-4). END: Thursday, June 26, 2025, at 5:00 PM (UTC-4). |
| 2025-06-17 | O2 Cluster | June 28: Issue was resolved by the update to Red Hat Enterprise Linux. June 18: HMS IT is still investigating the root cause, so Slurm errors may still occur, but job scheduling is running more smoothly overall. June 17: Around 10:30 AM this morning, O2 nodes began to close and Slurm controllers became unresponsive. HMS IT is investigating the cause of these issues. |
| 2025-04-03 | O2 Cluster software: /n/app | A software freeze is in place: HMS Research Computing will no longer update any O2 applications accessible as a "module", and HMS IT will no longer update CentOS 7 software. |
| 2025-03-18 → 2025-03-27 | O2 Cluster | 03-27: A batch of high-I/O jobs contributed to nodes becoming slow to respond, causing Slurm to eventually refuse jobs on these nodes. These jobs have been discontinued, the affected nodes have been restarted, and O2 performance issues have now been resolved. 03-18: A brief storage outage around 9–11 PM on 03/13 put a significant portion of the O2 Cluster in a state where the affected nodes would not accept additional jobs. Our DevOps and Research Computing teams are working on restarting these nodes to resume normal operation. We don't currently have a time estimate for when this will be complete, but reboots may need to run into next week (03/24–03/28). We will update this timeline as more information becomes available. |
| 2025-01-22 → 2025-01-23 | research.files storage | To improve performance and keep our storage systems updated, HMS IT will migrate data on the research.files.med.harvard.edu server to a new storage array. Outage window: Wednesday, January 22 (5:00 PM) to Thursday, January 23 (approx. 9:00–10:00 AM), 2025. On O2, this storage is only accessible from the transfer servers. |
| 2025-01-03 → 2025-01-10 | Full O2 outage | HMS IT is undertaking a project to relocate our data center within the Markley Data Center to optimize our IT infrastructure. This project will relocate existing services, consolidate servers, reduce power consumption, and decommission outdated hardware to improve efficiency, enhance resiliency, and lower costs. Users can bookmark the web page for this project, which will be updated as more information becomes available. O2 will be impacted from Friday, January 3, 5:30 PM EST, to Friday, January 10, 2025. |
| 2024-12-04 | filesystem | HMS IT is moving to a new Standby storage system designed to efficiently manage large amounts of data on Wednesday, December 4, 9:00 AM – 5:00 PM EST. Standby storage will be inaccessible during this period. Be sure to save any work stored in the affected directories before the migration begins, and do not try to access files in these locations during the migration. The files will be available again after the migration is complete. |
| 2024-07-17 | O2 web, filesystem | As part of the HMS Research Data Migration Project, HMS IT migrated the /n/www filesystem to a new storage cluster. START: Wednesday, July 17, 2024, at 9:00 AM (UTC-4). END: Wednesday, July 17, 2024, at 12:00 PM (UTC-4). The maintenance took longer than anticipated but completed successfully by 5:15 PM. The O2 Cluster stayed online during this time because the change did not impact the Slurm job scheduler. Preparatory steps took place on Tuesday, July 16, at 4:00 PM EDT, the day before the outage. |
| 2024-07-03 | O2 | A performance issue affected a number of O2 services. |
| 2024-06-25 → 2024-06-28 | filesystem | As part of the HMS Research Data Migration Project, HMS IT will migrate the /n/groups filesystem to a new storage cluster. START: Tuesday, June 25, 2024, at 5:00 PM (UTC-4). END: Friday, June 28, 2024, at 12:00 noon (UTC-4). The O2 Cluster will be online during this time because this change does not impact the Slurm job scheduler. However, any running jobs which rely on accessing this filesystem will fail once the maintenance begins. |
| 2024-05-29 → 2024-05-31 | filesystems | As part of the HMS Research Data Migration Project, HMS IT will migrate the /n/data2 and /n/no_backup2 filesystems to a new storage cluster. START: Monday, May 29, 2024, at 9:00 AM (UTC-4). END: Wednesday, May 31, 2024, at 5:00 PM (UTC-4). The O2 Cluster will be online during this time because this change does not impact the Slurm job scheduler. However, any running jobs which rely on accessing these filesystems will fail once the maintenance begins. |
| 2024-04-13 → 2024-04-16 | filesystems | As part of the HMS Research Data Migration Project, HMS IT will migrate the /n/data1, /n/cluster, and /n/shared_db filesystems to a new storage cluster. The O2 Cluster will be online during this time because this change does not impact the Slurm job scheduler. However, any running jobs which rely on accessing these filesystems will fail once the maintenance begins. |
| 2024-02-13 → 2024-02-15 | O2 Cluster | Update: After a successful storage migration, pending jobs on O2 were allowed to dispatch as of 10:00 AM, and login services started coming online as of 10:30 AM. Original announcement: To provide more robust and reliable storage, HMS IT will migrate all O2 home folders and the Slurm job scheduler software to a new storage cluster. START: Tuesday, February 13, 2024, at 5:00 PM EST (UTC-5). END: Thursday, February 15, 2024, at 10:00 AM EST (UTC-5). During this time, the O2 Cluster will be offline. Jobs scheduled to run during the outage will be postponed; they will start after the migration is complete. If a job needs to be completed before the migration, schedule it as soon as possible (see the Slurm sketch after the table for checking whether a job can finish in time). |
| 2024-02-07 | O2 Cluster | An issue with the O2 storage environment affected access to O2. |
| 2024-02-05 | O2 Cluster | There was an HMS-wide network outage on the morning of Feb 5, which affected access to the O2 cluster as well as most other HMS services. Please note that O2 jobs running during the network outage may have been affected, depending on the type of job and on the nature of the outage, which is still being determined. |
| 2023-12-08 → 2024-01-16 | O2 scratch storage | To provide more robust and reliable storage, HMS IT has deployed a new scratch storage cluster. The /n/scratch3 filesystem is being retired on Jan 16, 2024. Please update your workflows to use the new scratch filesystem before that date. If you have any questions or concerns, contact Research Computing at rchelp@hms.harvard.edu. |
| 2023-12-06 → 2023-12-07 | O2 Cluster | To enhance your experience with our network-based storage and prepare for future growth, HMS IT will make upgrades during this window. During this time, the O2 Cluster will be offline. Jobs scheduled to run during the outage will be postponed; they will start after the upgrade is complete. If a job needs to be completed before the upgrade, schedule it as soon as possible. If you have any questions or concerns, contact Research Computing at rchelp@hms.harvard.edu. |
| 2023-09-18 → 2023-09-22 | Standby Storage | HMS IT performed a gradual storage server upgrade on the HMS Standby storage server. No impact was expected, but O2 users were asked to avoid large data transfers involving the Standby filesystem (/n/standby) to allow the upgrade to proceed as smoothly as possible. |
| 2023-08-21 | O2 Portal, Group and website Storage | A storage outage affected the availability of several filesystems. If your O2 jobs access any of the affected filesystems, they may fail and need to be re-run after the outage is resolved. You may also have problems cd'ing into or seeing data in certain directories. The data is safe; only access to the data from O2 is affected. This outage may also affect O2 logins and access to the O2 Portal. |
| 2023-08-01 | filesystems | Several storage filesystems serving the O2 cluster and related services were not responding. We temporarily suspended all pending and running jobs. The Storage team investigated and resolved the issue. |
| 2023-07-16 | /n/groups filesystem | Start time: Thursday, July 13 at 7:00 PM. IMPACT: The scheduled migration of the /n/groups filesystem took longer than expected, and due to this delay the filesystem remained unavailable on O2. We will notify you once the storage migration is completed and /n/groups is available in O2. If you have any questions or concerns, contact Research Computing at rchelp@hms.harvard.edu. |
| 2023-07-13 → 2023-07-16 | PLANNED FULL O2 CLUSTER OUTAGE | To increase the efficiency and security of the O2 cluster, HMS DevOps will upgrade the Slurm job scheduler. This upgrade requires the O2 cluster to be offline, so no new jobs will be accepted during the maintenance window. To prevent disruption to your work, ensure all running jobs are complete before the upgrade begins. Certain services related to the O2 cluster will also be affected during the upgrade period. |
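
The April–June 2025 RHEL 9 migration entry above notes that user-compiled software, including Conda environments and R and Python packages, must be rebuilt under the new OS. Below is a minimal sketch of what rebuilding a Conda environment might look like; the environment name myenv and the packages shown are placeholders, not part of any official HMS procedure.

```bash
# Sketch: rebuilding a user Conda environment after the RHEL 9 upgrade.
# "myenv" and the packages mentioned below are placeholders.

# 1. Export the package list from the old environment:
conda env export --name myenv --no-builds > myenv.yml

# 2. Remove the old environment, whose binaries were built under CentOS 7:
conda env remove --name myenv

# 3. Recreate it so everything is installed fresh under RHEL 9:
conda env create --file myenv.yml

# Packages with compiled extensions installed outside Conda also need
# reinstalling, for example:
#   R:      install.packages("data.table")   # from an R session on RHEL 9
#   Python: pip install --user --force-reinstall --no-cache-dir numpy
```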
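
The same entry mentions that group names changed slightly under RHEL and that this mainly matters when modifying group ownership with chgrp. A brief illustration under assumed names (new_lab_group and the data path are hypothetical):

```bash
# Sketch: verifying group names after the rename and updating group ownership.
# "new_lab_group" and "/path/to/shared/data" are hypothetical placeholders.

id                                   # list your group memberships under RHEL
ls -l /path/to/shared/data           # see which group currently owns the files

# Lab data managers can point files at the renamed group, recursively:
chgrp -R new_lab_group /path/to/shared/data
```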
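
Several planned-outage entries (for example, February 2024 and December 2023) advise making sure jobs finish before an outage begins. One way to check this with standard Slurm commands is sketched below; the job ID, script name, and time limit shown are placeholders.

```bash
# Sketch: checking whether jobs can finish before a planned outage starts.
# The job ID 1234567 and the script my_job.sh are placeholders.

# Show your jobs with elapsed time, time limit, and expected end time:
squeue --user="$USER" --format="%.10i %.20j %.10M %.10l %.20e"

# Inspect a single job's TimeLimit and EndTime in detail:
scontrol show job 1234567 | grep -E "TimeLimit|EndTime"

# If a job cannot finish in time, consider resubmitting with a shorter
# time limit so the scheduler can complete it before the outage:
sbatch --time=04:00:00 my_job.sh
```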