O2 Cluster Status
This page shows all service outages for the O2 cluster, including planned maintenance and unplanned events.
ONLINE
Two Factor Authentication:
All O2 cluster logins from outside of the HMS network require two-factor authentication. Please see:
Scheduled Maintenance and Current Outages:
Date | Service | Issue |
---|---|---|
April-June 2025: COMPLETED | O2 HPC cluster migration to Red Hat Enterprise Linux (RHEL) 9.x | OVERVIEW: O2 Cluster Linux version update To ensure ongoing security and functionality, we have updated O2 to run under Red Hat Enterprise Linux (RHEL) version 9.x. The O2 cluster’s previous Linux operating system, CentOS 7, had reached its end of life and was no longer able to receive updates or security patches. ACTION REQUIRED! Any software compiled by end users on O2 under the old OS (before June 2025) will need to be recompiled, and then reinstalled in the new RHEL environment, including R, Python, and Conda packages. We understand this may cause some inconvenience, but it is essential for modernizing our IT infrastructure.
After the update, user groups will have slightly different names under RHEL due to a required change which will further standardize O2’s authentication. This will not affect how you log in. Your O2 username is still your HMS ID. Changes:
Group names are mostly relevant for those who need to modify the group owner (e.g. using chgrp), such as lab data managers. This will not affect your group memberships, nor access to data, but only what the groups are called. After the update, most role / application users will have slightly different names under RHEL due to a required change which will further standardize O2’s authentication. Change:
Role users are not tied to individuals, and do things like run background processes on a server. Most O2 researchers are not affected by this change. If you use a container on O2 and plan to continue to use it after this OS update, please send us its full path so we can pre-install it for you: rchelp@hms.harvard.edu
|
Previous Service Outages:
Date | Service | Issue |
---|---|---|
2025-06-23 -> 2025-06-26 | O2 HPC Cluster | Planned outage for the O2 cluster migration to Red Hat Enterprise Linux (RHEL) 9.x START: Monday, June 23, 2025, at 9:00 AM (UTC-4) END: Thursday, June 26, 2025 at 5:00 PM (UTC-4)
|
2025-06-17 | O2 Cluster | June 28: Issue was resolved with the update to Red Hat Enterprise Linux. June 18: HMS IT is still investigating the root cause, so Slurm errors may still occur, but job scheduling is running more smoothly overall. June 17: Around 10:30 AM, O2 nodes began to close and Slurm controllers became unresponsive. HMS IT is investigating the cause of these issues. |
2025-04-03 | O2 Cluster software: /n/app | A software freeze is in place, meaning that HMS Research Computing will no longer update any O2 applications accessible as a "module", and HMS IT will no longer update CentOS 7 software. |
2025-03-18 → 2025-03-27 | O2 Cluster | 03-18 A brief storage outage around 9-11 PM on 03/13 has put a significant portion of the O2 Cluster in a state where no additional jobs will be accepted by affected nodes. Our DevOps and Research Computing teams are working on restarting these nodes to resume normal operation. We don't currently have a time estimate for when this will be complete. However, reboots may need to be run until next week (03/24-03/28). We will update this timeline as more information becomes available. 03-27 A batch of high I/O jobs contributed to nodes becoming slow to respond, causing Slurm to eventually refuse jobs on these nodes. These jobs have been discontinued, and the affected nodes have been restarted. O2 performance issues have now been resolved. |
2025-01-22 → 2025-01-23 |
| To improve performance and keep our storage systems updated, HMS IT will migrate data on the research.files.med.harvard.edu server to a new storage array. Outage window: Wednesday, January 22 (5pm) to Thursday, January 23 (approx. 9am-10am), 2025.
which is only accessible from the transfer servers ( |
2025-01-03 → 2025-01-10 | Full O2 outage | HMS IT is undertaking a project to relocate our data center within the Markley Data Center to optimize our IT infrastructure. This project will relocate existing services, consolidate servers, reduce power consumption, and decommission outdated hardware to improve efficiency, enhance resiliency, and lower costs. Users can bookmark the web page for this project, which will be updated as more information becomes available. O2 will be impacted from Friday, January 3, 5:30 PM EST, to Friday, January 10, 2025.
During this time:
|
2024-12-04 | filesystem | HMS IT is moving to a new Standby storage system designed to efficiently manage large amounts of data on Wednesday December 4th, 9am – 5pm EST. Standby storage will be inaccessible during this period.
Be sure to save any work stored in these affected directories before the migration begins. Do not try to access files in these locations during the migration. The files will be available again after the migration is complete. |
2024-07-17 | O2 web, filesystem | As part of the HMS Research Data Migration Project, HMS IT will migrate the /n/www filesystem to a new storage cluster: START: Wednesday, July 17, 2024, at 9:00 AM (UTC-4). END: Wednesday, July 17, 2024, at 12:00 PM (UTC-4). The maintenance took longer than anticipated but completed successfully by 5:15 PM. The O2 Cluster will be online during this time because this change does not impact the Slurm job scheduler. During this outage on July 17 (9:00 AM to 12:00 PM):
In preparation, on Tuesday July 16, at 4:00 PM EDT (the day before the outage):
|
2024-07-03 | O2 | A performance issue affected a number of O2 services, including:
|
2024-06-25 - 2024-06-28 | filesystem | As part of the HMS Research Data Migration Project, HMS IT will migrate the /n/groups filesystem to a new storage cluster: START: Tuesday, June 25, 2024, at 5:00 PM (UTC-4). END: Friday, June 28, 2024, at 12:00 PM (UTC-4). The O2 Cluster will be online during this time because this change does not impact the Slurm job scheduler. However, any running jobs which rely on accessing this filesystem will fail once the maintenance begins. |
2024-05-29 - 2024-05-31 | filesystems | As part of the HMS Research Data Migration Project, HMS IT will migrate the /n/data2 and /n/no_backup2 filesystems to a new storage cluster: START: Monday, May 29, 2024, at 9:00 AM (UTC-4). END: Wednesday, May 31, 2024, at 5:00 PM (UTC-4). The O2 Cluster will be online during this time because this change does not impact the Slurm job scheduler. However, any running jobs which rely on accessing these filesystems will fail once the maintenance begins. |
2024-04-13 - 2024-04-16 | filesystems | As part of the HMS Research Data Migration Project, HMS IT will migrate the /n/data1 , /n/cluster , and /n/shared_db filesystems to a new storage cluster:
The O2 Cluster will be online during this time because this change does not impact the Slurm job scheduler. However,
any running jobs which rely on accessing these filesystems will fail once the maintenance begins. |
2024-02-13 - 2024-02-15 | O2 Cluster | After a successful storage migration, pending jobs on O2 were allowed to dispatch as of 10:00 AM, while login services started coming online as of 10:30 AM. To provide more robust and reliable storage, HMS IT will migrate all O2 Home folders and the Slurm job scheduler software to a new storage cluster during the following window: START: Tuesday, February 13, 2024, at 5:00 PM EST (UTC-5). END: Thursday, February 15, 2024, at 10:00 AM EST (UTC-5). During this time, the O2 Cluster will be offline. This means:
Jobs scheduled to run during the outage will be postponed with If a job needs to be completed before the upgrade, schedule it as soon as possible. |
2024-02-07 | O2 Cluster | An issue with the O2 storage environment affected access to O2:
|
2024-02-05 | O2 Cluster | There was an HMS-wide network outage on the morning of Feb 5 which affected access to the O2 cluster as well as most other HMS services. O2 jobs running during the network outage may have been affected, depending on the type of job and on the nature of the network outage, which is still being determined. |
2023-12-08 - 2024-01-16 | O2 scratch storage | To provide more robust and reliable storage, HMS IT has deployed a new storage cluster, designated as The /n/scratch3 filesystem is being retired on Jan 16, 2024 The timeline for this update is:
Please update your workflow to use
If you have any questions or concerns, contact Research Computing at rchelp@hms.harvard.edu |
2023-12-06 - 2023-12-07 | O2 Cluster | To enhance your experience with our network-based storage and prepare for future growth, HMS IT will make upgrades during:
During this time, the O2 Cluster will be offline. This means:
Jobs scheduled to run during the outage will be postponed; they will start after the upgrade is complete. If a job needs to be completed before the upgrade, schedule it as soon as possible. If you have any questions or concerns, contact Research Computing at rchelp@hms.harvard.edu. |
2023-09-18 - 2023-09-22 | Standby Storage | HMS IT performed a gradual storage server upgrade on the HMS Standby storage server. No impact is expected, but O2 users should avoid doing any large data transfers involving the Standby filesystem ( /n/standby ), just to allow the upgrade to proceed as smoothly as possible. |
2023-08-21 | O2 Portal, Group and website Storage | A storage outage affected the availability of the following filesystems:
If your O2 jobs access any of these filesystems, they may fail and need to be re-run after the outage is resolved. You may also have problems cd’ing into or seeing data in certain directories. The data is safe; it’s just the access to the data from O2 that is not working. This outage may also affect O2 logins and access to the O2 Portal. |
2023-08-01 | filesystems | Several storage filesystems serving the O2 cluster and related services were not responding. We temporarily suspended all pending and running jobs. The Storage team investigated and resolved the issue. |
2023-07-16 |
| Start Time: Thursday, July 13 at 7:00 PM IMPACT: Scheduled migration of Due to this delay, the filesystem We will notify you once the storage migration is completed and /n/groups is available in O2. If you have any questions or concerns, contact Research Computing at rchelp@hms.harvard.edu |
2023-07-13 -> 2023-07-16 | PLANNED FULL O2 CLUSTER OUTAGE | To increase the efficiency and security of the O2 cluster, HMS DevOps will upgrade the Slurm job scheduler. Maintenance Window
This upgrade will require the O2 cluster to be offline, and as a result, no new jobs will be accepted during the maintenance window. To prevent disruption to your work, ensure all running jobs are complete before the upgrade begins. Certain services related to the O2 cluster will be affected during the upgrade period. In particular:
However, not all services will be unavailable.
The upgrade is vital to keep our systems current with necessary security and bug fixes, resulting in enhanced performance for users. The process involves a database schema modification, which is time-consuming, hence the need for downtime. If you have any questions or concerns, contact Research Computing at rchelp@hms.harvard.edu |
2023-06-22 | O2 job scheduler | The Slurm job scheduler is currently experiencing high loads and might fail to accept new jobs, often returning the error:
sbatch: error: slurm_persist_conn_open_without_init: failed to open -- The HMS IT DevOps team implemented a fix to remediate this issue |
2023-05-02 | Globus file transfer | We will be performing an upgrade on the Globus File Transfer service, which is used on O2 to share data both internally and with collaborators from outside of HMS. This work will upgrade Globus from version 4 to 5, which will provide improved stability and scalability as well as new user-facing features to be announced. As Globus v4 will be considered end-of-life as of July 31, the upgrade is necessary to maintain support with the vendor. Maintenance Window: Tuesday, May 2, 9AM – 5PM (completed by 3PM). Impact:
We ask that any necessary long-running transfers be completed prior to May 2. Please plan accordingly, contact your external collaborators, and reach out to us as soon as possible at rchelp@hms.harvard.edu if there are any concerns. |
2023-04-16 | O2 network | HMS IT will be performing emergency maintenance on the O2 network to fix a critical issue. This will temporarily reduce O2's compute capacity. Maintenance Window:
Impact:
|
2023-03-22 | O2 network | There is an ongoing issue with networking gear in the data center that is impacting O2 jobs. In particular, many jobs will be unable to access storage and will either fail immediately or just hang until they time out. Some nodes may need to be rebooted, which will cause more jobs to fail. HMS IT is working to address the issue. 1:45pm update: All pending jobs on O2 (submitted today or before) have been paused for dispatch, except for jobs on the "interactive" partition to allow for some compute access. However, you might still experience problems accessing data from within those running interactive jobs. 5:30pm update: The issue was resolved and pending jobs were allowed to resume. |
2023-03-05 | HMS network | From 6am to 9am on Sunday, Mar 5, HMS IT is performing emergency maintenance on the core switches in the data center where O2 is hosted. We do not expect any impact to O2, but there is a possibility that this work may temporarily affect O2 cluster jobs, logins, or access to data. |
2022-11-09 | Globus | The Globus Server software will be upgraded from v4.0.62 to v4.0.63. This change is necessary to update all security certificates to accommodate a Globus change to their certificate authority (CA) of choice. No other software modifications are made with this update. Service Window: 12PM - 12:10PM. The process should only take a few minutes. Impact:
|
2022-11-08 | O2 Portal | The Open OnDemand software that powers the O2 Portal will be upgraded. Service Window:
Impact:
This upgrade will not impact O2 jobs, access to O2 login nodes, or access to the transfer cluster. |
2022-07-29 | O2 cluster | A network outage at HMS affected access to O2 and other HMS services. HMS IT has addressed the core network issues. Any cluster jobs which were running when the outage occurred may need to be resubmitted, depending on the nature of the jobs. |
2022-06-06 | scratch filesystem: /n/scratch3 | 2022-06-27: If you still need to recover any data you had on /n/scratch3 on June 6, you MUST contact HMS Research Computing to request a data restore: rchelp@hms.harvard.edu. All we need is your HMS ID (the one you log into O2 with). The old copy of scratch data will be REMOVED ON JULY 5 to provide space for the new /n/scratch3 to grow. 2022-06-10: PLEASE CONTACT RCHELP@HMS.HARVARD.EDU IF YOU WISH TO ACCESS THE DATA YOU HAD IN /n/scratch3 AS OF JUNE 6.
At some point in the future, we will need to delete the copy of data from June 6 (hundreds of TB) to recover space, so please contact us in the next week if you want that data back. 2022-06-08: Update on the continuing /n/scratch3 outage:
PLEASE EMAIL rchelp@hms.harvard.edu with your login name (HMS ID, like abc123) if you wish to get a copy of all the data in your /n/scratch3 directory. For logistical reasons, we cannot provide self-service retrieval of that data.
We ask for your continued patience as the data is restored and HMS IT fulfills your requests. 2022-06-07: Thank you for your patience while we investigate the sudden unavailability of our O2 scratch storage solution. HMS IT has identified the root cause and has implemented an interim solution for O2 scratch. You should be able to go back to using scratch storage for your O2 jobs now. We are investigating the impact on research jobs that were running when this sudden outage occurred, and will be reaching out to the affected labs to help address it. HMS RC has created new scratch folders for all O2 users under the same location, e.g. /n/scratch3/users/a/abc123, but you will need to recreate directories underneath it. |
2022-04-28 | Slurm job scheduler | O2's job scheduler software (Slurm) will be upgraded from version 21.08.4 to 21.08.7.
This is a minor update and we do not expect any significant impact to O2 users during the upgrade process, but at times Slurm-related commands (e.g. sbatch, srun) may be slow to load or could fail. |
2022-03-24 | O2 login services, general access | An HMS infrastructure issue is affecting access to O2, including login services.
|
2022-03-21 | O2 login services | O2 logins were offline this morning due to a wider HMS authentication issue.
|
2021-12-15 | GPU scratch filesystem: /n/scratch_gpu | We have decided to repurpose the dedicated GPU scratch space storage /n/scratch_gpu. This is because /n/scratch_gpu was only very lightly used by the community, and its 1 PB of storage can be more efficiently put to use increasing capacity on other O2 group filesystems. Schedule:
We're sorry to remove this resource for those who have been making use of it. We encourage everyone to use O2's main scratch space under /n/scratch3 for GPU jobs in addition to regular jobs. |
2021-11-19 → 2021-11-20 (all day for both days) | O2 Cluster and Storage | On November 19 and 20, HMS IT will be performing maintenance on both the storage servers and the O2 cluster for improved stability and performance. A full outage of O2 is required during which it will not be possible to run or submit jobs. Slurm will be upgraded, and compute nodes will be patched and rebooted. The storage server software will be updated. We will configure O2 to not accept any job submissions that overlap into the outage window. For long running jobs, please be sure they complete before November 19. The Maintenance Window is all day for both days:
Details: O2/SLURM
FILE TRANSFER SERVERS / GLOBUS: To allow access for data transfer on filesystems unaffected by the outage, file transfer servers (transfer.rc.hms.harvard.edu) and Globus will remain online. The following filesystems will be offline for the duration of this outage:
The following filesystems are not affected and will remain online:
Web Hosting: Websites hosted by Research Computing (O2 and “Orchestra” hosting) will remain online, except during brief outages when the /n/www filesystem is affected by the storage maintenance:
|
2021-08-26 | O2 cluster storage | A storage outage affected availability of the following filesystems:
If your O2 jobs access any of these filesystems, they may die and need to be re-run after the outage is resolved. You may also have problems cd'ing into or seeing data in certain directories. The data is safe; it's just the access to the data from O2 that is not working. |
2021-06-05 | O2 interactive logins and network | HMS IT will be updating and restarting network switches which serve the O2 cluster network from 9:00 PM EDT to 01:00 AM EDT. IMPACT:
|
2021-05-19 | Orchestra production Database Servers | There will be an emergency maintenance of Orchestra production database services from 9:00 PM EDT to 11:59 PM EDT. This maintenance is for security remediation on the database servers. The database services that will be offline during the maintenance period are listed here:
|
2021-05-18 | Orchestra development and staging Database Servers | There will be an emergency maintenance of Orchestra development and staging database services from 9:00 PM EDT to 11:59 PM EDT. This maintenance is for security remediation on the database servers. The database services that will be offline during the maintenance period are listed here:
|
2021-04-19 | Weekly jobs report / Job priority rewards | The weekly O2 report email and the extra priority reward QoS are currently unavailable due to a problem with the Slurm database queries. |
2021-03-03 | O2 cluster | HMS IT will be performing a storage server upgrade that will impact the O2 Cluster. The new storage, which has all flash drives and more current hardware, will both improve the performance and reliability of O2’s storage and replace aging infrastructure. This is the first stage of a two-phase upgrade to improve O2’s storage. A full outage of the O2 job scheduler is required, as some of the migrating data are used by the Slurm scheduler itself. While it will not be possible to run jobs during the outage, unaffected data will remain accessible via O2’s file transfer servers. We will be configuring O2 to not accept any job submissions that overlap into the outage window. For long running jobs, please be sure they complete before March 3. Maintenance Schedule:
The following services will be offline:
The following services will remain online:
|
2021-02-24 | Slurm scheduler | The Slurm job scheduler on O2 experienced an outage overnight. Resolved at about 11:45am with an upgrade to Slurm to fix a bug in the scheduler. Impact: Job submissions and other Slurm commands (e.g. sbatch, srun, squeue) did not function for several hours. Many jobs did continue to run, although some may have failed. |
2021-02-18 → 2021-02-19 (overnight) | O2 database services | In order to perform an urgent and critical infrastructure migration, the virtual machines hosting several legacy databases must be shut down for a period of time overnight. During the window, the virtual machines will be migrated to a different storage backend. The outage window will be from 8:00 PM on Thursday, 2/18 until 8:00 AM on Friday, 2/19. There will be no other changes to the database servers at this time.
|
2021-02-09 | Slurm scheduler | 1:15pm - 3:15pm (resolved) There is currently a problem with the O2 cluster Slurm job scheduler. Jobs already running on compute nodes are not affected, but you may get errors when trying to submit new jobs. |
2021-01-17 | O2 cluster network | HMS IT will be performing priority network maintenance to correct a bug in affected network switches at the Markley Data Center where O2 is hosted. MAINTENANCE WINDOW:
IMPACT:
We have configured O2 to not submit any new jobs to the affected nodes, so the maintenance should only affect longer jobs which are already running on compute-e and compute-p nodes. |
2020-12-04 | /www and websites hosted by Research Computing, /n/no_backup2 | A storage issue is affecting the availability of the following filesystems on O2:
The /www outage is resulting in most RC-hosted websites being offline. |
2020-11-18 | Slurm scheduler | The Slurm job scheduler on the O2 HPC cluster is currently having a performance issue and Slurm commands (e.g. sbatch, srun, squeue) may be unavailable. Impact:
|
2020-11-14 | /n/files | To improve performance and keep our storage systems updated, HMS IT will migrate data on the research.files.med.harvard.edu server to a new storage array. Outage window: Saturday, November 14, 2020, from 8:00 AM to 8:00 PM
which is only accessible from the transfer servers (transfer.rc.hms.harvard.edu) and transfer compute nodes. |
2020-09-26 | O2 cluster | On Saturday, September 26, 2020, from 6 AM to 1 PM EDT, HMS IT will be completing a strategic network upgrade which will increase the HMS campus internet connectivity from 40 to 100 gigabits per second. This upgrade improves support for data-intensive science, online education, and remote work. The O2 cluster will remain fully operational. However, there is the potential for issues related to O2’s authentication service during the maintenance. This could result in any of the following issues:
Jobs which are already running are expected to continue without any problems. |
2020-09-18 | O2 authentication | Intermittent problems with authentication for O2 login, transfer, and compute nodes. |
2020-08-26 | O2 cluster | HMS IT will be performing minor maintenance on the O2 cluster which is expected to improve the responsiveness of the SLURM job scheduler (see outage notes for 8/9/2020) MAINTENANCE WINDOW:
IMPACT:
|
2020-08-09 | O2 cluster | New jobs are intermittently not starting on the cluster (or the sbatch command returns errors) due to an issue with cluster-storage communication. We believe that currently running jobs are still executing normally. Disk reads and writes may be slower than usual, which can cause other commands to be slow. We will provide details as we get them. |
2020-07-30 | Full O2 cluster | Unplanned Slurm outage due to unbalanced file system allocations on a primary storage cluster. Service was restored at 3pm. |
2020-07-29 → 2020-07-30 | /n/no_backup2 | Scheduled Maintenance window: 2020-07-29 5:00 PM to 2020-07-30 5:00 PM HMS IT will be migrating data from /n/no_backup2 to a newer filesystem. |