This page shows all service outages for the O2 cluster, including planned maintenance and unplanned events.
We also post updates on the HMS RC Twitter page.
Two Factor Authentication: All O2 cluster logins from outside of the HMS network require two-factor authentication. Please see:
Date | Service | Issue |
---|---|---|
2021-11-19 → 2021-11-20 | O2 Cluster and Storage | On November 19 and 20, HMS IT will be performing maintenance on both the storage servers and the O2 cluster. A full outage of O2 is required during which it will not be possible to run or submit jobs. Slurm will be upgraded, and compute nodes will be patched and rebooted. The storage server software will be updated. We will configure O2 to not accept any job submissions that overlap into the outage window. For long running jobs, please be sure they complete before November 19. Maintenance Schedule and Details: O2/SLURM
FILE TRANSFER SERVERS / GLOBUS: To allow access for data transfer on filesystems unaffected by the outage, the file transfer servers (transfer.rc.hms.harvard.edu) and Globus will remain online. The following filesystems will be offline for the duration of this outage:
The following filesystems are not affected and will remain online:
Web Hosting: Websites hosted by Research Computing (O2 and “Orchestra” hosting) will remain online, except during brief outages when the /n/www filesystem is affected by the storage maintenance:
|
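The notices above repeatedly ask that long-running jobs finish before a maintenance window begins, since O2 is configured to reject submissions that would overlap the outage. As a rough sanity check (an illustrative sketch only, not RC tooling; the submission time and wall time below are hypothetical, and GNU date is assumed), you can compare a job's requested wall time against the time remaining before the window opens:

```shell
#!/bin/sh
# Would a job with the requested wall time still be running when the
# outage starts? Times are taken in UTC for reproducibility.
outage_start=$(date -u -d '2021-11-19 00:00' +%s)  # outage begins (date from the notice above)
submit_time=$(date -u -d '2021-11-10 12:00' +%s)   # hypothetical submission time
job_hours=96                                       # e.g. wall time requested via: sbatch -t 96:00:00 ...
job_end=$(( submit_time + job_hours * 3600 ))

if [ "$job_end" -lt "$outage_start" ]; then
    echo "job fits before the outage"
else
    echo "job overlaps the outage window; shorten -t or submit after maintenance"
fi
```

On the cluster side, this kind of cutoff is typically enforced with a Slurm maintenance reservation; a job whose time limit crosses the window either stays pending until the outage ends or is rejected, depending on configuration.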
Date | Service | Issue |
---|---|---|
2021-08-26 | O2 cluster storage | A storage outage affected availability of the following filesystems:
If your O2 jobs access any of these filesystems, they may die and need to be re-run after the outage is resolved. You may also have problems cd'ing into or seeing data in certain directories. The data is safe; it's just the access to the data from O2 that is not working. |
2021-06-05 | O2 interactive logins and network | HMS IT will be updating and restarting network switches which serve the O2 cluster network from 9:00 PM EDT to 01:00 AM EDT. IMPACT:
|
2021-05-19 | Orchestra production Database Servers | There will be an emergency maintenance of Orchestra production database services from 9:00 PM EDT to 11:59 PM EDT. This maintenance is for security remediation on the database servers. The database services that will be offline during the maintenance period are listed here:
|
2021-05-18 | Orchestra development and staging Database Servers | There will be an emergency maintenance of Orchestra development and staging database services from 9:00 PM EDT to 11:59 PM EDT. This maintenance is for security remediation on the database servers. The database services that will be offline during the maintenance period are listed here:
|
2021-04-19 | Weekly jobs report / Job priority rewards | The weekly O2 report email and the extra priority reward QoS are currently unavailable due to a problem with the Slurm database queries. |
2021-03-03 | O2 cluster | HMS IT will be performing a storage server upgrade that will impact the O2 Cluster. The new storage, which has all flash drives and more current hardware, will both improve the performance and reliability of O2’s storage and replace aging infrastructure. This is the first stage of a two-phase upgrade to improve O2’s storage. A full outage of the O2 job scheduler is required, as some of the migrating data are used by the Slurm scheduler itself. While it will not be possible to run jobs during the outage, unaffected data will remain accessible via O2’s file transfer servers. We will be configuring O2 to not accept any job submissions that overlap into the outage window. For long running jobs, please be sure they complete before March 3. Maintenance Schedule:
The following services will be offline:
The following services will remain online:
|
2021-02-24 | Slurm scheduler | The Slurm job scheduler on O2 experienced an outage overnight. Resolved ~11:45am with an upgrade to Slurm to fix a bug in the scheduler. Impact: Job submissions and other Slurm commands (e.g. sbatch, srun, squeue) were not functioning for several hours. Many jobs did continue to run, although some may have failed. |
2021-02-18 → 2021-02-19 (overnight) | O2 database services | In order to perform an urgent and critical infrastructure migration, the virtual machines hosting several legacy databases must be shut down for a period of time overnight. During the window, the virtual machines will be migrated to a different storage backend. The outage window will be from 8:00 PM on Thursday, 2/18 until 8:00 AM on Friday, 2/19. There will be no other changes to the database servers at this time.
|
2021-02-09 | Slurm scheduler | 1:15pm - 3:15pm (resolved) There is currently a problem with the O2 cluster Slurm job scheduler. Jobs already running on compute nodes are not affected, but you may get errors when trying to submit new jobs. |
2021-01-17 | O2 cluster network | HMS IT will be performing priority network maintenance to correct a bug in affected network switches at the Markley Data Center where O2 is hosted. MAINTENANCE WINDOW:
IMPACT:
We have configured O2 to not submit any new jobs to the affected nodes, so the maintenance should only affect longer jobs which are already running on compute-e and compute-p nodes. |
2020-12-04 | /www and websites hosted by Research Computing /n/no_backup2 | A storage issue is affecting the availability of the following filesystems on O2:
The /www outage is resulting in most RC-hosted websites being offline. |
2020-11-18 | Slurm scheduler | The Slurm job scheduler on the O2 HPC cluster is currently having a performance issue and Slurm commands (e.g. sbatch, srun, squeue) may be unavailable. Impact:
|
2020-11-14 | /n/files | To improve performance and keep our storage systems updated, HMS IT will migrate data on the research.files.med.harvard.edu server to a new storage array. Outage window: Saturday, November 14, 2020, from 8:00 AM to 8:00 PM
The /n/files filesystem is only accessible from the transfer servers (transfer.rc.hms.harvard.edu) and transfer compute nodes. |
2020-09-26 | O2 cluster | On Saturday, September 26, 2020, from 6 AM to 1 PM EDT, HMS IT will be completing a strategic network upgrade which will increase the HMS campus internet connectivity from 40 to 100 gigabits per second. This upgrade improves support for data-intensive science, online education, and remote work. The O2 cluster will remain fully operational. However, there is the potential for issues related to O2’s authentication service during the maintenance. This could result in any of the following issues:
Jobs which are already running are expected to continue without any problems. |
2020-09-18 | O2 authentication | Intermittent problems with authentication for O2 login, transfer, and compute nodes. |
2020-08-26 | O2 cluster | HMS IT will be performing minor maintenance on the O2 cluster which is expected to improve the responsiveness of the SLURM job scheduler (see outage notes for 8/9/2020). MAINTENANCE WINDOW:
IMPACT:
|
2020-08-09 | O2 cluster | New jobs are intermittently not starting on the cluster (or the sbatch command has errors) due to an issue with cluster-storage communication. We believe that currently running jobs are still executing normally. Disk read/writes may be slower than usual, which can cause other commands to be slow. We will provide details as we get them. |
2020-07-30 | Full O2 cluster | Unplanned SLURM outage, due to unbalanced file system allocations on a primary storage cluster. Service was restored at 3pm. |
2020-07-29 → 2020-07-30 | /n/no_backup2 | Scheduled Maintenance window: 2020-07-29 5:00 PM to 2020-07-30 5:00 PM HMS IT will be migrating data from /n/no_backup2 to a newer filesystem. |
2020-07-07 | Full O2 cluster outage | Scheduled Maintenance window: All day on July 7:
Actual Maintenance window: 5:00 am - 11:45 pm Once the upgrade is completed on Tuesday evening, all O2 services will become available. O2 will be completely offline to allow for an update to the Linux operating system (to CentOS 7.7) on all cluster systems, as well as an update to the Slurm job scheduler (to version 20.02). These are standard maintenance and security updates. No changes are expected from a usability perspective to O2 or its installed software (e.g. modules). Impact:
Websites hosted by HMS Research Computing will not be affected unless they run jobs on the cluster, since job submissions will be unavailable. |
2020-07-07 | /home data and logins to O2 transfer servers | Scheduled Maintenance window:
The /home filesystem may be unavailable during this window due to planned storage maintenance. While the O2 cluster will also be offline all day on July 7 (see above), logins to the transfer servers at transfer.rc.hms.harvard.edu will still work, so research data will be accessible. However, this separate storage maintenance will result in /home being unavailable during the 7:30 - 10am window, which could disrupt logins. |
2020-07-03 | /www data and websites hosted by Research Computing | Scheduled Maintenance window: 4:00pm - 6:00pm Actual Maintenance window: 4:00pm - 6:30pm HMS IT will be performing maintenance on the /www filesystem which will result in a temporary outage of websites and any cluster jobs which access data under /www Websites hosted outside of Research Computing, such as through |
2020-06-27 | HMS IT will make upgrades to the high-throughput research network that may sometimes block access | Scheduled Maintenance window: 6:00am - 1pm Actual Maintenance window: 6:00am - 12pm HMS IT will make upgrades to the high-throughput research network that may sometimes block access between O2 and all external networks, including the HMS Quad, all Harvard networks, and the internet. Note that the actual outage may end sooner than 1pm depending on the day's progress. Impact:
|
2020-06-26 | /n/scratch2 goes offline | The /n/scratch2 filesystem is being taken offline and retired. Any data left on /n/scratch2 will be LOST and NOT RECOVERABLE. All users of scratch space must switch their workflows to the new filesystem under /n/scratch3/users. More details at: Scratch3 Storage |
2020-06-15 | /n/scratch2 becomes READ-ONLY | The /n/scratch2 filesystem will be made READ-ONLY in preparation for its retirement on June 26. All users of scratch space must switch their workflows to the new filesystem under /n/scratch3/users. More details at: Scratch3 Storage |
2020-05-16 | Network connectivity between O2 and networks outside of the HMS data center. | Scheduled Maintenance window: 5:30am - 1pm Actual Maintenance window: 5:30am - 10:00am A planned upgrade to the HMS interior firewall will result in an outage between O2 and all external networks, including the HMS Quad, all Harvard networks, and the internet. Note that the actual outage may end sooner than 1pm depending on the day's progress. Impact:
|
2020-04-13 | /n/app | Maintenance window: 6:00am - 10:00am The filesystem /n/app, which is used to host scientific software applications on O2, will be migrated onto newer, more performant storage.
|
2020-03-29 | O2 cluster /n/data2 /n/groups | Outage window: 3:30pm - 7:00pm High load on one of the storage servers, known on the cluster as /n/data2 and /n/groups. Impact:
The issue was resolved after the high load processes finished. |
2020-02-27 | O2 Cluster | The O2 job scheduler became unavailable due to an unforeseen bug in the scheduler control process. The problem was resolved with a patch applied to the scheduler software. |
2020-01-12 | O2 Cluster | Maintenance window: 4am - 12pm (noon) Network maintenance being performed in the HMS data center will result in outages of 1-3 minutes on the O2 network. Impact:
This work over Jan 11-12 is being done to increase network performance in the HMS data center. After completion, all HMS systems hosted in the data center (including O2, storage, virtual machine infrastructure) will be running on a 100 Gb network! |
2020-01-11 | Network connectivity between O2 and networks outside of the HMS data center. | Maintenance window: 4am - 8am Network maintenance being performed on the HMS core network will result in outages of < 5 minutes between O2 and all external networks, including the HMS Quad and all Harvard networks. Impact:
This work over Jan 11-12 is being done to increase network performance in the HMS data center. After completion, all HMS systems hosted in the data center (including O2, storage, virtual machine infrastructure) will be running on a 100 Gb network! |
2019-09-02 | /n/scratch2 | Unplanned service degradation for /n/scratch2 filesystem.
Resolved by stopping a misbehaving service on the filesystem. We are working with the vendor to prevent similar issues in the future. |
2019-08-25 | O2 job submissions / queries | The O2 cluster will have planned maintenance during this window:
An update for the /n/scratch2 filesystem requires a service outage for all O2 systems. Cluster services will be restored as soon as possible on Sunday 8/25, although the outage is scheduled for all day, as needed. No user data will be deleted or otherwise changed during the outage. But, as a precaution, please make sure you have copies of any critical data under /n/scratch2 in particular, since that filesystem is not backed up. Cluster jobs will not be able to run during the upgrade, so we have configured Slurm such that:
During the outage, you WILL NOT be able to:
Websites hosted by Research Computing will not be functionally affected, unless they submit jobs to the cluster (only a few websites do this). But, web developers will be unable to login and edit files. |
2019-08-23 → 2019-08-25 | /n/scratch2 | Planned service outage for /n/scratch2 filesystem:
An update for the /n/scratch2 filesystem requires a service outage. Service will be restored as soon as possible on Sunday 8/25, although the outage is scheduled for all day, as needed. During this outage, all other O2 cluster services will be up and running until Sunday morning 8/25 (see below). Please note:
|
2019-08-21 | O2 job submissions / queries | The Slurm job scheduler went offline at approximately 3:30am on 2019-08-21. We are currently working to restore this service.
|
2019-08-17 | O2 logins Slurm job submissions | Scheduled power maintenance at the data center led to an unexpected power outage, causing login nodes and other critical infrastructure services to stop responding. The issue was fixed by restoring power.
|
2019-08-09 | O2 logins | The /home filesystem experienced a service degradation that prevented users from logging in to the O2 cluster and submitting jobs. The issue was fixed by the vendor.
|
2019-07-07 | O2 logins | A network firewall issue during planned maintenance caused O2 cluster logins to fail and new SLURM job submissions to remain pending. Jobs already running on compute nodes should not have been affected.
|
2019-06-30 → 2019-07-01 | network issues | Unplanned service outage for the entire O2 cluster. One of the networking devices failed, causing multiple issues across HMS, including O2 cluster logins and SLURM job submissions. The issue was resolved by replacing the faulty hardware. |
2019-05-24 → 2019-05-25 | /n/scratch2 | Unplanned service degradation for /n/scratch2 filesystem.
Resolved by restarting a service on the filesystem. |
2019-03-{18-22} | /n/scratch2 | Unplanned service degradation. The /n/scratch2 filesystem is currently showing intermittent instability. We are monitoring it closely and will be implementing a number of hardware and software fixes this week to resolve the performance problem.
Implemented hardware and software fixes to resolve the core issue on the scratch2 fileserver. |
2019-03-09 | Slurm Job Scheduler | The Slurm Job Scheduler will have planned maintenance during this window:
Cluster jobs will not be able to run during the upgrade, so we have configured Slurm such that:
During the outage, you WILL still be able to:
During the outage, you WILL NOT be able to:
Websites hosted by Research Computing will not be affected, unless they submit jobs to the cluster (only a few websites do this). |
2019-03-09 | /n/files filesystem | The research.files server will have planned maintenance during this window:
During this window, the directory /n/files will not be available from the O2 file transfer servers and compute nodes. |
2019-02-28 | /n/scratch2 | Unplanned Outage: A performance degradation on /n/scratch2 could cause jobs using /n/scratch2 to fail. Duration: 7:00 AM - 9:00 PM |
2018-12-05 | /n/scratch2 filesystem | The automated process that deletes old files under /n/scratch2 (specifically, files that were last accessed more than 29 days ago), was intentionally disabled by Research Computing for approximately the past month due to an issue on the scratch2 fileserver. So, there are currently files older than 30 days on /n/scratch2 which have not yet been purged as they normally would have been. |
2018-12-03 | O2 logins | All O2 cluster logins from outside of the HMS network will start requiring two-factor authentication. For more details, please see: Two Factor Authentication (2FA) on O2 and Two Factor Authentication FAQ Currently, O2 only requires a password login using your eCommons ID. Due to increased hacking attempts on O2, it is necessary to increase the security of our systems and going to two factor authentication is a big step. HMS users already must use two factor authentication for Harvard Key and HMS VPN logins. O2 logins will work similarly. Two-factor authentication will be required when logging in from:
|
2018-11-28 | MySQL and PostgreSQL Databases TWiki server | A planned maintenance window at: Wednesday, 2018-11-28, 6pm - 7pm for the following services:
Only websites and cluster jobs using these database services were affected. |
2018-11-20 | /n/scratch2 | Intermittent storage issues affected the availability of the /n/scratch2 directories across O2 systems. Duration: 6:00 AM - 6:00 PM |
2018-10-24 | /n/groups /n/data2 | Intermittent storage issues affected the availability of the /n/groups and /n/data2 directories across O2 systems. |
2018-10-10 | authentication service | Instability in O2's authentication service was causing some user accounts to lose group memberships across O2 systems. Services were restored to normal at approximately 10:18am |
2018-10-01 | /n/scratch2 directory | When attempting to write to files under /n/scratch2, you may see erratic behavior such as:
Issue was resolved with a bug fix on the scratch2 storage server. |
2018-09-08 | O2 Login servers | Unplanned outage: a core HMS network outage made the O2 login nodes unreachable. The issue was resolved by the HMS Networking team. Duration: 2:30 PM - 5:30 PM |
2018-08-17 | PostgreSQL (production, staging) MySQL (staging) Request Tracker (RT) | These will be offline for approximately 1 hour starting at 9pm EDT for urgent maintenance. |
2018-08-14 | O2 Cluster and web services | Unplanned outage: a failure in the HMS virtual machine hosting infrastructure caused service outages in Research Computing's web services and, to a lesser extent, on the O2 cluster. The outage did not affect running cluster jobs, though. Duration: 02:20 pm - 06:20 pm |
2018-08-06 | O2 Cluster | Unplanned outage: Cisco networking hardware failed and caused many jobs to fail. The defective hardware has been replaced and everything is stable. Duration: 5:00 am - 8:00 pm |
2018-04-25 → 2018-04-26 | O2 login servers | Two login servers, login03 and login05, required reboots due to resource-intensive end-user processes locking up those systems. |
2018-04-11 | O2 /home cluster | Severe network latency to the /home storage cluster impacted logins and processes trying to access it. Duration: 11:00am - 05:00pm |
2018-04-10 | O2 Cluster | Unplanned outage: networking issues disrupted communication to/from the login nodes. Running/pending jobs were not impacted. |
2018-04-03 | /home filesystem | The fileserver for /home was getting close to maximum capacity and running on older hardware. This planned maintenance involved migrating all /home data to a new fileserver with more capacity. This required a full shutdown of O2's Slurm job scheduler and unmounting /home from all cluster and infrastructure systems. |
2018-03-13 → 2018-03-14 | /n/scratch2 filesystem | A hardware failure on the /n/scratch2 fileserver resulted in /n/scratch2 being non-writable. On 3/14, hardware was replaced and the filesystem repaired, after which service returned to normal. |
Unplanned SLURM outage due to scheduler issues.