O2 Cluster Status

This page shows all service outages for the O2 cluster, including planned maintenance and unplanned events.

OPERATIONAL

Two Factor Authentication:

All O2 cluster logins from outside of the HMS network require two-factor authentication. Please see:

Scheduled Maintenance and Current Outages:

Date

Service

Issue



Previous Service Outages:



Date

Service

Issue

2024-07-17

O2 web, filesystem /n/www

As part of the HMS Research Data Migration Project, HMS IT will migrate the /n/www filesystem to a new storage cluster:

START: Wednesday, July 17, 2024, at 9:00 AM (UTC-4)

END: Wednesday, July 17, 2024, at 12:00 PM (UTC-4). The maintenance took longer than anticipated but completed successfully by 5:15 PM.

The O2 Cluster will be online during this time because this change does not impact the Slurm job scheduler. 

During this outage on July 17 (9:00 AM to 12:00 PM):

  • Websites and services hosted on O2 Web Hosting will be unavailable for the duration of the migration.

  • The genomebrowser-uploads website will be unavailable

  • The /n/www filesystem will be inaccessible

In preparation, on Tuesday, July 16, at 4:00 PM EDT (the day before the outage):

  • O2 compute nodes will not have access to /n/www. Any O2 cluster job relying on /n/www will fail.

  • Until the full outage begins on July 17, however, websites will stay up and /n/www will remain available on O2 login nodes.

2024-07-03

O2

A performance issue affected a number of O2 services, including:

  • O2 Portal

  • Jupyter Notebooks

  • RStudio

  • MATLAB

2024-06-25 -

2024-06-28

filesystem /n/groups

As part of the HMS Research Data Migration Project, HMS IT will migrate the /n/groups filesystem to a new storage cluster:

START: Tuesday, June 25, 2024, at 5:00 PM (UTC-4)

END: Friday, June 28, 2024, at 12:00 PM (UTC-4)

The O2 Cluster will be online during this time because this change does not impact the Slurm job scheduler. 

However, /n/groups will be inaccessible during this period.

** Any running jobs which rely on accessing this filesystem will fail once the maintenance begins.

2024-05-29 -

2024-05-31

filesystems /n/data2 /n/no_backup2

As part of the HMS Research Data Migration Project, HMS IT will migrate the /n/data2 and /n/no_backup2 filesystems to a new storage cluster:

START: Wednesday, May 29, 2024, at 9:00 AM (UTC-4)

END: Friday, May 31, 2024, at 5:00 PM (UTC-4)

The O2 Cluster will be online during this time because this change does not impact the Slurm job scheduler. 

However, /n/data2 and /n/no_backup2 will be inaccessible during this period.

** Any running jobs which rely on accessing these filesystems will fail once the maintenance begins.

2024-04-13 -

2024-04-16

filesystems /n/data1 /n/cluster /n/shared_db

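As part of the HMS Research Data Migration Project, HMS IT will migrate the /n/data1, /n/cluster, and /n/shared_db filesystems to a new storage cluster.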

The O2 Cluster will be online during this time because this change does not impact the Slurm job scheduler. 

However, /n/data1, /n/cluster, and /n/shared_db will be inaccessible during this period. Keep in mind:

  • /n/data1 contains shared folders for some groups that use O2.

  • /n/cluster contains tools including: quota-v2, O2sacct, O2squeue, O2usage, scratch_create_directory.sh

  • /n/shared_db contains public databases for research use

** Any running jobs which rely on accessing these filesystems will fail once the maintenance begins.
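One way to check whether a job script touches these filesystems before the maintenance begins is to scan it for the affected paths. The following is a minimal Python sketch, not an O2 tool: the function name `affected_paths_in` and the sample script are hypothetical, and the path list is copied from this entry.

```python
import re
from typing import List, Sequence

# Filesystems listed as inaccessible in this entry.
AFFECTED = ("/n/data1", "/n/cluster", "/n/shared_db")

def affected_paths_in(script_text: str, filesystems: Sequence[str] = AFFECTED) -> List[str]:
    """Return the affected filesystems that are referenced in a job script."""
    hits = []
    for fs in filesystems:
        # Match the mount point only as a whole path component prefix,
        # so that /n/data1 does not also match /n/data10.
        pattern = r"(?<![\w./-])" + re.escape(fs) + r"(?![\w.-])"
        if re.search(pattern, script_text):
            hits.append(fs)
    return hits

# Hypothetical job script, for illustration only.
script = (
    "#!/bin/bash\n"
    "#SBATCH -p short\n"
    "blastn -db /n/shared_db/blast/nt -query /n/data10/x.fa\n"
)
print(affected_paths_in(script))  # prints ['/n/shared_db']
```

A text scan like this only catches paths written literally in the script. Paths built from variables or reached through symlinks would be missed.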

2024-02-13 -

2024-02-15

O2 Cluster

After a successful storage migration, pending jobs on O2 were allowed to dispatch as of 10AM, while login services started coming online as of 10:30 AM.

To provide more robust and reliable storage, 

During this time, the O2 Cluster will be offline. This means: 

  • No jobs will run during the outage.

  • O2 sign-in services will be unavailable.

  • Jobs submitted to O2 through websites will not run until after the outage.

  • Globus file transfer will be unavailable.

  • Other services reliant on O2 will be affected.

Jobs scheduled to run during the outage will be postponed and will show the pending reason "ReqNodeNotAvail, Reserved for maintenance"; they will start after the upgrade is complete.

If a job needs to be completed before the upgrade, schedule it as soon as possible.
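To see which of your pending jobs are being held back by the maintenance, you can filter the scheduler's pending reasons. The sketch below is a hedged illustration: it assumes one `JOBID|REASON` pair per line, as might be produced by something like `squeue --me --states=PENDING --noheader --format="%i|%r"`, and the exact reason text on O2 may differ from the sample. The function name and the sample data are hypothetical.

```python
from typing import List

def jobs_waiting_on_maintenance(squeue_output: str) -> List[str]:
    """Return IDs of jobs whose pending reason starts with ReqNodeNotAvail.

    Expects one "JOBID|REASON" pair per line (an assumed format, see above).
    """
    held = []
    for line in squeue_output.splitlines():
        job_id, sep, reason = line.strip().partition("|")
        # Skip blank or malformed lines that have no "|" separator.
        if sep and reason.strip().startswith("ReqNodeNotAvail"):
            held.append(job_id.strip())
    return held

# Hypothetical sample output, for illustration only.
sample = "101|Priority\n102|ReqNodeNotAvail, Reserved for maintenance\n103|Resources\n"
print(jobs_waiting_on_maintenance(sample))  # prints ['102']
```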

2024-02-07

O2 Cluster

An issue with the O2 storage environment affected access to O2:

  • O2 logins are unavailable

  • O2 Portal is unavailable

  • O2 transfer cluster is unavailable

  • Currently running jobs may be affected

2024-02-05

O2 Cluster

There was an HMS-wide network outage on the morning of Feb 5 which affected access to the O2 cluster as well as most other HMS services.

O2 jobs running during the network outage may have been affected, depending on the type of job and on the nature of the outage, which is still being determined.

2023-12-08 -

2024-01-16

O2 scratch storage

To provide more robust and reliable storage, HMS IT has deployed a new storage cluster, designated as /n/scratch, to replace the current /n/scratch3.

Please update your workflow to use /n/scratch by January 8, and see our documentation about the changes: https://harvardmed.atlassian.net/wiki/spaces/O2/pages/2652045313

  • HMS IT is not migrating any data from /n/scratch3 to the new /n/scratch.

  • Please copy any required data to the new space by January 16, when /n/scratch3 is retired.  

If you have any questions or concerns, contact Research Computing at rchelp@hms.harvard.edu

2023-12-06 -

2023-12-07

O2 Cluster

To enhance your experience with our network-based storage and prepare for future growth, HMS IT will make upgrades during:

  • start: Wednesday, December 6, 2023, at 8:00 AM

  • end: Thursday, December 7 at 8:00 AM EST (UTC-5).

    • O2 services were restored at approximately 7:00 PM on Wednesday December 6.

During this time, the O2 Cluster will be offline. This means: 

  • No jobs will run during the outage 

  • O2 login services will be unavailable 

  • Jobs submitted to O2 through websites will not run until after the outage 

  • Other services reliant on O2 will be affected  

Jobs scheduled to run during the outage will be postponed; they will start after the upgrade is complete.

If a job needs to be completed before the upgrade, schedule it as soon as possible. 

If you have any questions or concerns, contact Research Computing at rchelp@hms.harvard.edu.

2023-09-18 -

2023-09-22

Standby Storage

HMS IT performed a gradual storage server upgrade on the HMS Standby storage server.

No impact is expected, but O2 users should avoid large data transfers involving the Standby filesystem (/n/standby) so that the upgrade can proceed as smoothly as possible.

2023-08-21

O2 Portal, Group and website Storage

A storage outage affected the availability of the following filesystems:

  • /n/data1

  • /n/data2

  • /n/log

  • /n/no_backup2

  • /n/shared_db

  • /n/www

If your O2 jobs access any of these filesystems, they may fail and need to be re-run after the outage is resolved. You may also have problems cd’ing into or seeing data in certain directories. The data is safe; it’s just the access to the data from O2 that is not working. This outage may also affect O2 logins and access to the O2 Portal.

2023-08-01

filesystems /n/data1 /n/data2 /n/www /n/nobackup /n/shared_db /n/standby

Several storage filesystems serving the O2 cluster and related services were not responding. We temporarily suspended all pending and running jobs.

The Storage team investigated and resolved the issue.

2023-07-16

/n/groups filesystem unavailable

Start Time: Thursday, July 13 at 7:00 PM

IMPACT:

Scheduled migration of /n/groups storage filesystem to the new hardware is taking longer than expected.

Due to this delay, the filesystem /n/groups is not available at this time in O2.
Any new job submitted to O2 that requires the /n/groups filesystem will fail. Please do not submit new jobs requiring access to /n/groups.

We will notify you once the storage migration is completed and /n/groups is available in O2.

If you have any questions or concerns, contact Research Computing at rchelp@hms.harvard.edu

--

Access to /n/groups has been restored on O2 login, compute, and transfer nodes. All files should be in the same state they were in on Thursday when the outage began. IT is taking steps to prevent the type of situation that led to this issue.

2023-07-13 ->

2023-07-16

PLANNED FULL O2 CLUSTER OUTAGE

To increase the efficiency and security of the O2 cluster, HMS DevOps will upgrade the Slurm job scheduler.

Maintenance Window

  • START: Thursday, July 13 at 7:00 PM

  • END: Sunday, July 16 at noon (completed on 07/17/2023 at 1PM)

This upgrade will require the O2 cluster to be offline, and as a result, no new jobs will be accepted during the mentioned period. To prevent disruption to your work, ensure all running jobs are complete before the upgrade commencement time.

Certain services related to the O2 cluster will be affected during the upgrade period. In particular:

  • the O2 login servers at o2.hms.harvard.edu will be offline

  • the O2 Portal will be offline

  • you will not be able to submit or execute jobs, including from websites.

  • the filesystem /n/groups will also be offline

However, not all services will be unavailable.

  • The O2 transfer servers at transfer.rc.hms.harvard.edu will remain operational.

  • any services not relying on the O2 job scheduler will continue functioning as usual during the upgrade period.

The upgrade is vital to keep our systems current with necessary security and bug fixes, resulting in enhanced performance for users. The process involves a database schema modification, which is time-consuming, hence the need for downtime.

If you have any questions or concerns, contact Research Computing at rchelp@hms.harvard.edu

2023-06-22

O2 job scheduler

The Slurm job scheduler is currently experiencing high loads and might fail to accept new jobs, often returning the error:

 

sbatch: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurmdb-prod01.rc.hms.harvard.edu

--

The HMS IT DevOps team implemented a fix to remediate this issue.

2023-05-02

Globus file transfer

We will be performing an upgrade on the Globus File Transfer service, which is used on O2 to share data both internally and with collaborators from outside of HMS. This work will upgrade Globus from version 4 to 5, which will provide improved stability and scalability as well as new user-facing features to be announced. As Globus v4 will be considered end-of-life as of July 31, the upgrade is necessary to maintain support with the vendor.

Maintenance Window: Tuesday, May 2, 9AM – 5PM (completed by 3PM)

Impact:

  • any collections created in HMSRC will be unavailable both internally to HMS users and to external collaborators

  • all data transfers which are still in progress will be canceled

We ask that any necessary long-running transfers be completed prior to May 2. Please plan accordingly, contact your external collaborators, and reach out to us as soon as possible at rchelp@hms.harvard.edu if there are any concerns.

2023-04-16

O2 network

HMS IT will be performing emergency maintenance on the O2 network to fix a critical issue. This will temporarily reduce O2's compute capacity.

Maintenance Window:

  • SUNDAY, 2023-04-16, 06:00 AM, expected to be completed before 01:00 PM

Impact:

  • Long-running jobs on affected compute nodes may be killed

  • Job pending times may increase for the "short" and "medium" job partitions in the days before the maintenance

2023-03-22

O2 network

There is an ongoing issue with networking gear in the data center that is impacting O2 jobs. In particular, many jobs will be unable to access storage and will either fail immediately or just hang until they time out. Some nodes may need to be rebooted, which will cause more jobs to fail. HMS IT is working to address the issue.

1:45pm update: All pending jobs on O2 (submitted today or before) have been paused for dispatch, except for jobs on the "interactive" partition to allow for some compute access. However, you might still experience problems accessing data from within those running interactive jobs.

5:30pm update: The issue was resolved and pending jobs were allowed to resume.

2023-03-05

HMS network

During 6am - 9am on Sunday Mar 5, HMS IT is performing emergency maintenance on the core switches in the data center where O2 is hosted.

Hopefully there will be no impact to O2, but there is a possibility that this work may temporarily affect O2 cluster jobs, logins, or access to data.

2022-11-09

Globus

The Globus Server software will be upgraded from v4.0.62 to v4.0.63. This change is necessary to update all security certificates to accommodate a Globus change to their certificate authority (CA) of choice. No other software modifications are made with this update.

Service Window: 12PM - 12:10PM. The process should only take a few minutes.

Impact:

  • Access to the Globus web UI is not impacted; however, the HMS-RC endpoint will be unavailable while the Globus services are restarted.

  • Any Globus transfers started before the upgrade will pause momentarily, then resume once the endpoint is back online.

  • This upgrade will not impact O2 jobs or access to O2 login nodes or the transfer cluster.

2022-11-08

O2 Portal

The Open OnDemand software that powers the O2 Portal will be upgraded.

Service Window:

  • Tuesday, Nov 8, between 8:30 - 9:30 am. The process should only take a few minutes.

Impact:

  • The O2 Portal interface will not be available during the upgrade.

  • O2 Portal Applications started before the upgrade should remain running and then be accessible again afterwards.

This upgrade will not impact O2 jobs, access to O2 login nodes, or access to the transfer cluster.

2022-07-29

O2 cluster

A network outage at HMS affected access to O2 and other HMS services. HMS IT has addressed the core network issues.

Any cluster jobs which were running when the outage occurred may require being resubmitted, depending on the nature of the jobs.

2022-06-06

scratch filesystem: /n/scratch3

2022-06-27:

If you still need to recover any data you had on /n/scratch3 on June 6, you MUST contact HMS Research Computing to request a data restore: rchelp@hms.harvard.edu. All we need is your HMS ID (the one you log into O2 with).

The old copy of scratch data will be REMOVED ON JULY 5 to provide space for the new /n/scratch3 to grow.


2022-06-10:

PLEASE CONTACT RCHELP@HMS.HARVARD.EDU IF YOU WISH TO ACCESS THE DATA YOU HAD IN /n/scratch3 AS OF JUNE 6.

  • We believe that ALL data that was on the old scratch3 server as of June 6 has been copied to a safe location.

  • We have restored data to individual scratch folders for dozens of users who have requested this.

  • If your data has been restored, please move ("mv") it out of your _RESTORE directory rather than copying it, as space is currently limited.

  • We will not delete old data under /n/scratch3 for at least 30 more days. At some point after this, we will resume the deletion process to recover space.

At some point in the future, we will need to delete the copy of data from June 6 (hundreds of TB) to recover space, so please contact us in the next week if you want that data back.


2022-06-08:

Update on the continuing /n/scratch3 outage:

  • The interim scratch solution is up (on a different storage system) at /n/scratch3 and is being actively used by many researchers on O2.

  • We are gradually recovering the data from the old /n/scratch3. We do not yet have a timeline, but we hope to have all of it recovered within a few days.

  • Our current understanding is that no data were lost, except any new files being created as the outage began.

PLEASE EMAIL rchelp@hms.harvard.edu with your login name (HMS ID, like abc123) if you wish to get a copy of all the data in your /n/scratch3 directory. For logistical reasons, we cannot provide self-service retrieval of that data.

  • Data will not be deleted for 30 days after being restored, no matter when the files were last accessed.

We ask for your continued patience as the data is restored and HMS IT fulfills your requests.


2022-06-07:

Thank you for your patience while we investigate the sudden unavailability of our O2 scratch storage solution. HMS IT has identified the root cause and has implemented an interim solution for O2 scratch. You should now be able to go back to using scratch storage for your O2 jobs. We are investigating the impact on research that was running when the outage occurred, and will reach out to the affected labs to help address it.

HMS RC has created new scratch folders for all O2 users under the same location, e.g. /n/scratch3/users/a/abc123 , but you will need to recreate directories underneath it.
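The example path above suggests that each scratch folder sits under a subdirectory named after the first letter of the HMS ID. Assuming that layout holds for all users (it is inferred from the single example, not confirmed here), the folder location can be sketched in Python as follows. The function name `scratch_dir` is hypothetical.

```python
from pathlib import PurePosixPath

SCRATCH_ROOT = PurePosixPath("/n/scratch3/users")

def scratch_dir(hms_id: str) -> str:
    """Build the assumed scratch folder path for an HMS ID such as 'abc123'."""
    if not hms_id:
        raise ValueError("HMS ID must not be empty")
    # Assumed layout: <root>/<first letter of ID>/<ID>
    return str(SCRATCH_ROOT / hms_id[0] / hms_id)

print(scratch_dir("abc123"))  # prints /n/scratch3/users/a/abc123
```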

2022-04-28

Slurm job scheduler

O2's job scheduler software (Slurm) will be upgraded from version 21.08.4 to 21.08.7.

  • Maintenance window: 05:00AM - 06:00AM

This is a minor update and we do not expect any significant impact to O2 users during the upgrade process, but at times Slurm-related commands (e.g. sbatch, srun) may be slow to load or could fail.

2022-03-24

O2 login services, general access

An HMS infrastructure issue is affecting access to O2, including login services.

  • was resolved by approximately 2:30pm

  • some percentage of O2 cluster jobs were affected by the outage and failed

2022-03-21

O2 login services

O2 logins were offline this morning due to a wider HMS authentication issue.

  • was resolved by approximately 11:00am

2021-12-15

GPU scratch filesystem: /n/scratch_gpu

We have decided to repurpose the dedicated GPU scratch space storage /n/scratch_gpu. This is because /n/scratch_gpu was only very lightly used by the community, and its 1 PB of storage can be more efficiently put to use increasing capacity on other O2 group filesystems.

Schedule:

  • Dec 15:  /n/scratch_gpu will become read-only across O2. Existing data will remain in place.

  • Jan 26:  /n/scratch_gpu will be removed from O2

We're sorry to remove this resource for those who have been making use of it. We encourage everyone to use O2's main scratch space under /n/scratch3 for GPU jobs in addition to regular jobs.

2021-11-19 →

2021-11-20

(all day for both days)

O2 Cluster and Storage

On November 19 and 20, HMS IT will be performing maintenance on both the storage servers and the O2 cluster for improved stability and performance. A full outage of O2 is required during which it will not be possible to run or submit jobs. Slurm will be upgraded, and compute nodes will be patched and rebooted. The storage server software will be updated.

We will configure O2 to not accept any job submissions that overlap into the outage window. For long running jobs, please be sure they complete before November 19.

The Maintenance Window is all day for both days:

  • BEGIN: Nov 19, 12:01AM

  • END: Nov 20, 11:59PM
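Whether a job submission overlaps the outage window comes down to simple date arithmetic: a job that starts at a given time and runs up to its requested time limit must finish before the window begins. The sketch below illustrates that check in Python under the worst-case assumption that the job uses its full time limit. The function name `would_overlap` and the example start times are hypothetical, and the scheduler's actual logic may differ.

```python
from datetime import datetime, timedelta

# BEGIN: Nov 19, 12:01AM (from the maintenance window above)
WINDOW_BEGIN = datetime(2021, 11, 19, 0, 1)

def would_overlap(start: datetime, time_limit: timedelta,
                  window_begin: datetime = WINDOW_BEGIN) -> bool:
    """True if a job starting at `start` could still be running at `window_begin`."""
    return start + time_limit > window_begin

# Hypothetical job starting at 10:00 AM on Nov 18.
start = datetime(2021, 11, 18, 10, 0)
print(would_overlap(start, timedelta(hours=12)))  # prints False (ends 10:00 PM, Nov 18)
print(would_overlap(start, timedelta(hours=24)))  # prints True (ends 10:00 AM, Nov 19)
```

Requesting a shorter time limit is therefore one way to let a job fit in before the outage, provided the job really finishes within that limit.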

Details:

O2/SLURM

  • O2 will be unavailable for the duration of this outage.

  • No jobs can be dispatched or submitted.

  • Login services to o2.hms.harvard.edu will be unavailable.

FILE TRANSFER SERVERS / GLOBUS

To allow access for data transfer on filesystems unaffected by the outage, file transfer servers (transfer.rc.hms.harvard.edu) and Globus will remain online.

The following filesystems will be offline for the duration of this outage:

  • /n/standby, /n/no_backup2, /n/data1, /n/data2, /n/shared_db, /n/www

The following filesystems are not affected and will remain online:

  • /home , /n/groups , /n/scratch3 , /n/scratch_gpu , /n/files

Web Hosting

Websites hosted by Research Computing (O2 and “Orchestra” hosting) will remain online, except during brief outages when the /n/www filesystem is affected by the storage maintenance:

  • Nov 19: There will be a brief outage between 7-8am

  • Nov 20: There will be a brief outage, but the exact time depends on the storage maintenance progress and is hard to predict.

  • Also, any websites which use O2 to run jobs will be unable to do so throughout the outage.

2021-08-26

O2 cluster storage

A storage outage affected availability of the following filesystems:

  • /n/data1

  • /www

  • /n/no_backup2

  • /n/shared_db

If your O2 jobs access any of these filesystems, they may die and need to be re-run after the outage is resolved. You may also have problems cd'ing into or seeing data in certain directories. The data is safe; it's just the access to the data from O2 that is not working.

2021-06-05

O2 interactive logins and network

HMS IT will be updating and restarting network switches which serve the O2 cluster network from 9:00 PM EDT to 01:00 AM EDT.

IMPACT:

  • Interactive logins to O2 login and transfer servers will be disabled, as those systems will require a restart. Any old screen/tmux sessions on these servers will be lost.

  • Network storage mounts on O2 ( /home , lab and scratch storage) may potentially be unstable on individual systems. Any affected mounts will be restored and verified during the maintenance window.

  • New cluster jobs should be prevented from running through the maintenance window. Any jobs which are still running and cannot tolerate periods of hung network storage may be impacted.

2021-05-19

Orchestra production Database Servers

There will be an emergency maintenance of Orchestra production database services from 9:00 PM EDT to 11:59 PM EDT. This maintenance is for security remediation on the database servers. The database services that will be offline during the maintenance period are listed here:

  • mysql.orchestra

  • pgsql.orchestra

  • pgsql96.orchestra

2021-05-18

Orchestra development and staging Database Servers

There will be an emergency maintenance of Orchestra development and staging database services from 9:00 PM EDT to 11:59 PM EDT. This maintenance is for security remediation on the database servers. The database services that will be offline during the maintenance period are listed here:

  • dev.mysql.orchestra

  • dev.pgsql.orchestra

  • dev.pgsql96.orchestra

  • stage.pgsql.orchestra

  • stage.mysql.orchestra

2021-04-19

Weekly jobs report / Job priority rewards

The weekly O2 report email and the extra priority reward QoS are currently unavailable due to a problem with the Slurm database queries.

2021-03-03

O2 cluster

HMS IT will be performing a storage server upgrade that will impact the O2 Cluster. The new storage, which has all flash drives and more current hardware, will both improve the performance and reliability of O2’s storage and replace aging infrastructure. This is the first stage of a two-phase upgrade to improve O2’s storage.

A full outage of the O2 job scheduler is required, as some of the migrating data are used by the Slurm scheduler itself. While it will not be possible to run jobs during the outage, unaffected data will remain accessible via O2’s file transfer servers. We will be configuring O2 to not accept any job submissions that overlap into the outage window. For long running jobs, please be sure they complete before March 3.

Maintenance Schedule:

  • Wed. March 3, 9:00am – 5:00pm

The following services will be offline:

  • Access to all O2 login servers at: o2.hms.harvard.edu

  • O2 job submission and execution, including from websites which are otherwise not affected.

  • The following filesystems:

    • /n/data1

    • /n/shared_db

The following services will remain online:

  • Login access to the transfer servers: transfer.rc.hms.harvard.edu

  • Access to all unaffected filesystems, including:

    • /home

    • /n/groups

    • /n/data2

    • /n/scratch3

    • /n/no_backup2

    • /n/www

2021-02-24

Slurm scheduler

The Slurm job scheduler on O2 experienced an outage overnight. Resolved ~ 11:45am with an upgrade to Slurm to fix a bug in the scheduler.

Impact:

Job submissions and other Slurm commands (e.g. sbatch, srun, squeue) have not been functioning for several hours. Many jobs did continue to run, although some may have failed.

2021-02-18 → 2021-02-19

(overnight)

O2 database services

In order to perform an urgent and critical infrastructure migration, the virtual machines hosting several legacy databases must be shut down for a period of time overnight. During the window, the virtual machines will be migrated to a different storage backend.

The outage window will be from 8:00 PM on Thursday, 2/18 until 8:00 AM on Friday, 2/19.  There will be no other changes to the database servers at this time.


The following database servers will be affected:

mysql.orchestra
stage.mysql.orchestra
dev.mysql.orchestra
crhealth.mysql.orchestra
pgsql96.orchestra
dev.pgsql96.orchestra
pgsql.orchestra
dev.pgsql.orchestra
stage.pgsql.orchestra

2021-02-09

Slurm scheduler

1:15pm - 3:15pm (resolved)

There is currently a problem with the O2 cluster Slurm job scheduler. Jobs already running on compute nodes are not affected, but you may get errors when trying to submit new jobs.

2021-01-17

O2 cluster network

HMS IT will be performing priority network maintenance to correct a bug in affected network switches at the Markley Data Center where O2 is hosted.

MAINTENANCE WINDOW:

  • Sunday, Jan 17, 5:00 am – 10:00 am

IMPACT:

  • Some sets of O2 compute nodes will lose network connectivity for about 30 minutes during the window, when the switches get restarted. Affected nodes have names beginning with either of:

    • compute-e

    • compute-p

  • All remaining compute nodes should not be affected, as they have redundant network hardware which should mitigate the impact.

We have configured O2 to not submit any new jobs to the affected nodes, so the maintenance should only affect longer jobs which are already running on compute-e and compute-p nodes. 
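If you have a list of the nodes your jobs are running on, the prefix rule above is easy to apply programmatically. This is a minimal Python sketch: the function name `is_affected` and the sample node names are hypothetical, and only the two prefixes come from this notice.

```python
# Node name prefixes listed as affected in this notice.
AFFECTED_PREFIXES = ("compute-e", "compute-p")

def is_affected(node_name: str) -> bool:
    """True if the node name begins with one of the affected prefixes."""
    return node_name.startswith(AFFECTED_PREFIXES)

# Hypothetical node names, for illustration only.
nodes = ["compute-e-16-230", "compute-a-16-160", "compute-p-17-45"]
print([n for n in nodes if is_affected(n)])  # prints ['compute-e-16-230', 'compute-p-17-45']
```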

2020-12-04

/www and websites hosted by Research Computing

/n/no_backup2

A storage issue is affecting the availability of the following filesystems on O2:

  • /www

  • /n/no_backup2

The /www outage is resulting in most RC-hosted websites being offline.

2020-11-18

Slurm scheduler

The Slurm job scheduler on the O2 HPC cluster is currently having a performance issue and Slurm commands (e.g. sbatch, srun, squeue) may be unavailable.

Impact:

  • New job submissions may not work

  • Slurm commands may be unresponsive

  • Jobs already running should continue without problems.

2020-11-14

/n/files

To improve performance and keep our storage systems updated, HMS IT will migrate data on the research.files.med.harvard.edu server to a new storage array.

Outage window: Saturday, November 14, 2020, from 8:00 AM to 8:00 PM

  • This will only affect the O2 filesystem /n/files, which is only accessible from the transfer servers (transfer.rc.hms.harvard.edu) and transfer compute nodes.

2020-09-26

O2 cluster

On Saturday, September 26, 2020, from 6 AM to 1 PM EDT, HMS IT will be completing a strategic network upgrade which will increase the HMS campus internet connectivity from 40 to 100 gigabits per second. This upgrade improves support for data-intensive science, online education, and remote work.

The O2 cluster will remain fully operational. However, there is the potential for issues related to O2’s authentication service during the maintenance. This could result in any of the following issues:

  • Difficulty logging into O2

  • Authentication timeouts

  • New job submissions could be slow or may fail

Jobs which are already running are expected to continue without any problems.

2020-09-18

O2 authentication

There are intermittent problems with authentication for O2 login, transfer, and compute nodes.
This issue can result in slow or failed logins to O2, missing group membership, and failed job submissions.

2020-08-26

O2 cluster

HMS IT will be performing minor maintenance on the O2 cluster which is expected to improve the responsiveness of the SLURM job scheduler (see outage notes for 8/9/2020).

MAINTENANCE WINDOW:

  • Wednesday, Aug 26, 8:00 am – 9:00 am

IMPACT: 

  • You will not be able to submit new jobs to O2 during this time.

  • SLURM commands (squeue, sbatch, srun, etc.) will likely fail with a timeout error.

  • Already running jobs will continue to run normally.

2020-08-09

O2 cluster

New jobs are intermittently not starting on the cluster (or the sbatch command has errors) due to an issue with cluster-storage communication. We believe that currently running jobs are still executing normally. Disk read/writes may be slower than usual, which can cause other commands to be slow. We will provide details as we get them.

2020-07-30

Full O2 cluster

Unplanned SLURM outage due to unbalanced file system allocations on a primary storage cluster. Service was restored at 3pm.

2020-07-29 → 2020-07-30

/n/no_backup2

Scheduled Maintenance window: 2020-07-29 5:00 PM to 2020-07-30 5:00 PM

HMS IT will be migrating data from /n/no_backup2 to a newer filesystem.

2020-07-07

Full O2 cluster outage

Scheduled Maintenance window: All day on July 7:

  • Tue July 7, 12:00am - 11:59pm

Actual Maintenance window: 5:00 am - 11:45 pm

Once the upgrade is completed on Tuesday evening, all O2 services will become available.

O2 will be completely offline to allow for an update to the Linux operating system (to CentOS 7.7) on all cluster systems, as well as an update to the Slurm job scheduler (to version 20.02).

These are standard maintenance and security updates. No changes are expected from a usability perspective to O2 or its installed software (e.g. modules).

Impact:

  • Logins to O2 (o2.hms.harvard.edu) will be unavailable.

  • O2's job scheduler will be down. Jobs will not run and new job submissions will not work until after the work is completed.

  • Logins to the transfer servers (transfer.rc.hms.harvard.edu) will still be available to access all data on O2, after the morning's storage maintenance is completed (see the /home entry below).

Websites hosted by HMS Research Computing will not be affected unless they run jobs on the cluster, since job submissions will be unavailable.



2020-07-07

/home data and logins to O2 transfer servers

Scheduled Maintenance window:

  • Tue July 7, 07:30am - 10:00am (service restored at 1:30pm)

The /home filesystem may be unavailable during this window due to planned storage maintenance.

While the O2 cluster will also be offline all day on July 7 (see the full cluster outage entry above), logins to the transfer servers at transfer.rc.hms.harvard.edu will still work, so research data will be accessible.

However, this separate storage maintenance will result in /home being unavailable during the 7:30 - 10am window, which could disrupt logins.

2020-07-03

/www data and websites hosted by Research Computing

Scheduled Maintenance window: 4:00pm - 6:00pm

Actual Maintenance window: 4pm - 6:30pm

HMS IT will be performing maintenance on the /www filesystem, which will result in a temporary outage of websites and of any cluster jobs which access data under /www.

Websites hosted outside of Research Computing, such as through WARP, OpenScholar, or HMS Windows web hosting, will not be affected.

2020-06-27

High-throughput research network

Scheduled Maintenance window: 6:00am - 1pm

Actual Maintenance window: 6:00am - 12pm

HMS IT will make upgrades to the high-throughput research network that may sometimes block access between O2 and all external networks, including the HMS Quad, all Harvard networks, and the internet.

Note that the actual outage may end sooner than 1pm depending on the day's progress.

Impact:

  • Batch jobs which are already running on O2 will continue to run normally, except:

    • jobs which rely on connections to external networks (e.g. to download data) will be affected during the outage.

  • Jobs in a PENDING state will remain pending until after the outage is complete.

  • Interactive jobs and active logins to O2 and transfer servers will be killed.

  • Websites hosted by Research Computing on infrastructure in the data center (which is most of them) will be inaccessible.

2020-06-26

/n/scratch2 goes offline

The /n/scratch2 filesystem is being taken offline and retired.

Any data left on /n/scratch2 will be LOST and NOT RECOVERABLE.

All users of scratch space must switch their workflows to the new filesystem under /n/scratch3/users . More details at: Scratch3 Storage

2020-06-15

/n/scratch2 becomes READ-ONLY

The /n/scratch2 filesystem will be made READ-ONLY in preparation for its retirement on June 26.

All users of scratch space must switch their workflows to the new filesystem under /n/scratch3/users . More details at: Scratch3 Storage

2020-05-16

Network connectivity between O2 and networks outside of the HMS data center.

Scheduled Maintenance window: 5:30am - 1pm

Actual Maintenance window: 5:30am - 10:00am

A planned upgrade to the HMS interior firewall will result in an outage between O2 and all external networks, including the HMS Quad, all Harvard networks, and the internet.

Note that the actual outage may end sooner than 1pm depending on the day's progress.

Impact:

  • Batch jobs which are already running on O2 will continue to run normally, except:

    • jobs which rely on connections to external networks (e.g. to download data) will be affected during the outage.

  • Jobs in a PENDING state will remain pending until after the outage is complete.

  • Interactive jobs and active logins to O2 and transfer servers will be killed.

  • Websites hosted by Research Computing on infrastructure in the data center (which is most of them) will be inaccessible.

2020-04-13

/n/app

Maintenance window: 6:00am - 10:00am

The filesystem /n/app , which is used to host scientific software applications on O2, will be migrated onto newer, more performant storage.

  • HMS DevOps and Research Computing have tested this change in a development environment and do not expect it to affect jobs on O2 unless they are trying to directly access /n/app (e.g. to reload a Module).

  • As a precaution, ALL new jobs submitted during this time window will remain pending until after the work is completed. This includes both batch and interactive jobs.

  • Please plan accordingly. If you are very concerned about the robustness of your job to this change, we encourage you to make sure jobs finish before this time, and then wait to submit new ones until after the change.

2020-03-29

O2 cluster

/n/data2

/n/groups

Maintenance window: 3:30pm - 7pm

High load on one of the storage servers, known on the cluster as /n/data2 and /n/groups.

Impact:

  • Logins to O2 cluster and transfer nodes

  • Intermittent issues with data access.

The issue was resolved after the high-load processes finished.

2020-02-27

O2 Cluster

The O2 job scheduler became unavailable due to an unforeseen bug in the scheduler control process.

The problem was resolved with a patch applied to the scheduler software.

2020-01-12

O2 Cluster

Maintenance window: 4am - 12pm (noon)

Network maintenance being performed in the HMS data center will result in outages of 1-3 minutes on the O2 network.

Impact:

  • To minimize the possibility of job failures, we will pause all jobs on O2 during this maintenance, and resume the jobs after the maintenance is complete.

  • O2 logins should still work, at least intermittently, during the maintenance. Any new jobs submitted during this period will remain pending until after the maintenance is complete.


This work over Jan 11-12 is being done to increase network performance in the HMS data center. After completion, all HMS systems hosted in the data center (including O2, storage, virtual machine infrastructure) will be running on a 100 Gb network!

2020-01-11

Network connectivity between O2 and networks outside of the HMS data center.

Maintenance window: 4am - 8am

Network maintenance being performed on the HMS core network will result in outages of < 5 minutes between O2 and all external networks, including the HMS Quad and all Harvard networks.

Impact:

  • Batch jobs which are already running on O2 will continue to run normally.

  • Interactive jobs will get killed.

  • Jobs which rely on connections to external networks (e.g. to download data) will also be affected during these outages.


This work over Jan 11-12 is being done to increase network performance in the HMS data center. After completion, all HMS systems hosted in the data center (including O2, storage, virtual machine infrastructure) will be running on a 100 Gb network!

2019-09-02

/n/scratch2

Unplanned service degradation for /n/scratch2 filesystem.

  • Date: Monday Sept 2 2019

  • Duration: 5:00AM to 11:30AM

Resolved by stopping a service that was misbehaving on the filesystem. We are working with the vendor to prevent similar issues in the future.

2019-08-25

O2 job submissions / queries

The O2 cluster will have planned maintenance during this window:

  • Begins: Sunday Aug 25 2019, 08:00AM

  • Ends: Sunday Aug 25 2019, 11:59PM

    • Maintenance was completed by 06:00PM on Aug 25

An update for the /n/scratch2 filesystem requires a service outage for all O2 systems. Cluster services will be restored as soon as possible on Sunday 8/25, although the outage is scheduled for the full day in case it is needed.

No user data will be deleted or otherwise changed during the outage. But, as a precaution, please make sure you have copies of any critical data under /n/scratch2 in particular, since that filesystem is not backed up.

Cluster jobs will not be able to run during the upgrade, so we have configured Slurm such that:

  • Any job submitted with a wall time which crosses into the maintenance window will remain pending until the outage is over. 

  • If there are any running jobs on O2 when the outage begins (e.g. long jobs that were started a while ago), they will be paused and Slurm will attempt to restart them after the outage, but we cannot guarantee such jobs will run successfully.
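As a rough illustration of the first rule above, the sketch below checks whether a job's requested wall time would run into the maintenance window. It is a simplification, not the scheduler's actual logic: it assumes the job would start the moment it is submitted, and the function names and example times are hypothetical. Only the 08:00AM window start on Aug 25 comes from this notice.

```python
from datetime import datetime, timedelta


def crosses_window(start_time, wall_time, window_start):
    """Return True if a job starting at start_time with the requested
    wall_time could still be running when the window begins.

    Assumption: the job starts immediately at start_time. A job that would
    end exactly at window_start is treated as not crossing.
    """
    return start_time + wall_time > window_start


def longest_safe_wall_time(start_time, window_start):
    """Largest wall time that still ends before the window begins
    (zero if the window has already started)."""
    return max(window_start - start_time, timedelta(0))


# Window start taken from the notice; the submission time is made up.
window_start = datetime(2019, 8, 25, 8, 0)
submitted = datetime(2019, 8, 24, 20, 0)

print(crosses_window(submitted, timedelta(hours=10), window_start))
print(crosses_window(submitted, timedelta(hours=14), window_start))
print(longest_safe_wall_time(submitted, window_start))
```

In this example a 10-hour request would finish before the window opens, while a 14-hour request would overlap it and so would be expected to stay pending.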

During the outage, you WILL NOT be able to:

  • Log in to O2 login servers or file transfer servers

  • Run any Slurm commands, such as: sbatch, srun, [etc.]

  • Run or start any cluster jobs on O2

Websites hosted by Research Computing will not be functionally affected, unless they submit jobs to the cluster (only a few websites do this). However, web developers will be unable to log in and edit files.

2019-08-23 →

2019-08-25

/n/scratch2

Planned service outage for /n/scratch2 filesystem:

  • Begins: Friday Aug 23 2019, 08:00AM

  • Ends: Sunday Aug 25 2019, 11:59PM

    • Maintenance was completed by 06:00PM on Aug 25

An update for the /n/scratch2 filesystem requires a service outage. Service will be restored as soon as possible on Sunday 8/25, although the outage is scheduled for all day, as needed.

During this outage, all other O2 cluster services will be up and running until Sunday morning 8/25 (see above).



Please note:

  • We will disable the auto-deletion script for old files under /n/scratch2 for a few days after the outage.

  • For jobs requiring /n/scratch2 which may need to run during this outage window, make sure to submit those with the following sbatch option so they will not start running until the maintenance is completed:  --constraint=scratch2
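For anyone submitting such jobs from a script, here is a minimal sketch of assembling an sbatch command line that carries the --constraint=scratch2 option mentioned above. The helper name and the script name my_job.sh are hypothetical; the sketch only builds and prints the command, it does not submit anything.

```python
import shlex


def sbatch_command(script, constraint="scratch2", extra_args=()):
    """Build an sbatch argument list that includes the given constraint.

    script and extra_args are whatever you would normally pass to sbatch;
    the names used here are illustrative only.
    """
    return ["sbatch", f"--constraint={constraint}", *extra_args, script]


# Hypothetical job script name.
cmd = sbatch_command("my_job.sh")
print(shlex.join(cmd))

# To actually submit from Python on a login node, one could pass the list to
# subprocess.run(cmd, check=True); that step is left out of this sketch.
```

Building the command as a list rather than a single string avoids quoting problems if a script path contains spaces.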

2019-08-21

O2 job submissions / queries

The Slurm job scheduler went offline at approximately 3:30am on 2019-08-21. We are currently working to restore this service.

  • 7:30am: The Slurm job scheduler has been restored to service, and O2 job submissions should be operating normally again.

    We are still investigating the root cause of this issue.

2019-08-17

O2 logins

Slurm job submissions

Scheduled power maintenance at the data center led to an unexpected power outage, causing login nodes and other critical infrastructure services to stop responding. The issue was fixed by restoring power.

  • Date: Saturday August 17 2019

  • Duration: 6:30AM to 6:00PM

2019-08-09

O2 logins

The /home filesystem experienced a service degradation that prevented users from logging in to the O2 cluster and submitting jobs. The issue was fixed by the vendor.

  • Date: Friday August 9 2019

  • Duration: 8:00AM to 11:00AM

2019-07-07

O2 logins

A network firewall issue during planned maintenance caused O2 cluster logins to fail and new SLURM job submissions to remain pending. Jobs already running on compute nodes should not have been affected.

  • Date: Sunday July 7 2019

  • Duration: 6:50AM to 8:00AM



2019-06-30 → 2019-07-01

network issues

Unplanned service outage for all of the O2 cluster. One of the networking devices failed and caused multiple issues across HMS, including O2 cluster logins and SLURM job submissions.

  • Date: Sunday June 30 2019

  • Duration: 10:30PM to 3:30AM

The issue was resolved by replacing the faulty hardware.

2019-05-24 → 2019-05-25

/n/scratch2

Unplanned service degradation for /n/scratch2 filesystem.

  • Date: Friday May 24 2019

  • Duration: 10:30PM to 1:00AM

Resolved by restarting a service on the filesystem.

2019-03-18 → 2019-03-22

/n/scratch2

Unplanned service degradation. The /n/scratch2 filesystem is currently showing intermittent instability. We are monitoring it closely and will be implementing a number of hardware and software fixes this week to resolve the performance problem.

  • Duration: 4 days

Implemented hardware and software fixes to resolve the core issue on the scratch2 fileserver.

2019-03-09

Slurm Job Scheduler

The Slurm Job Scheduler will have planned maintenance during this window:

  • Date: Saturday, Mar 9

  • Time: 08:00-19:00

Cluster jobs will not be able to run during the upgrade, so we have configured Slurm such that:

  • Any job submitted with a wall time which crosses into the maintenance window will remain pending until the outage is over. 

  • If there are any running jobs on O2 when the outage begins (e.g. long jobs that were started a while ago), they will be paused and Slurm will attempt to restart them after the outage, but we cannot guarantee such jobs will run successfully.

During the outage, you WILL still be able to:

  • Log in to O2 to access data

  • Copy data to/from the O2 file transfer servers (transfer.rc.hms.harvard.edu) – except to /n/files (due to the storage outage for /n/files also on Mar 9)

During the outage, you WILL NOT be able to:

  • Run any Slurm commands, such as: sbatch, srun, [etc.]

  • Run or start any cluster jobs on O2

Websites hosted by Research Computing will not be affected, unless they submit jobs to the cluster (only a few websites do this).

2019-03-09

/n/files filesystem

The research.files server will have planned maintenance during this window:

  • Date: Saturday, Mar 9

  • Time: 09:00-15:00

During this window, the directory /n/files will not be available from the O2 file transfer servers and compute nodes.

2019-02-28

/n/scratch2

Unplanned Outage: A performance degradation on /n/scratch2 could cause jobs using /n/scratch2 to fail.

Duration: 7:00AM - 9:00PM

2018-12-05

/n/scratch2 filesystem

The automated process that deletes old files under /n/scratch2 (specifically, files that were last accessed more than 29 days ago) was intentionally disabled by Research Computing for approximately the past month due to an issue on the scratch2 fileserver. As a result, there are currently files older than 30 days on /n/scratch2 which have not yet been purged as they normally would have been.

We fixed that fileserver issue and resumed the purging of these old files starting Wed, Dec 5.
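To see which of your own files would fall under an access-time rule like the one described above, a sketch along these lines may help. It is not the purge script used by Research Computing, whose implementation is not described here; it only reports files and deletes nothing. The function name is hypothetical, and the 30-day default is an assumption based on the "older than 30 days" wording in this entry.

```python
import os
import time


def files_not_accessed_since(root, days=30, now=None):
    """Return a sorted list of regular-file paths under root whose last
    access time is more than `days` days before `now`.

    Assumption: access time (st_atime) is the criterion, as the notice
    describes. Files that cannot be inspected are skipped.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.lstat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable
            if atime < cutoff:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    import sys
    # Pass one or more directories to scan on the command line,
    # e.g. your own folder on the scratch filesystem.
    for root in sys.argv[1:]:
        for path in files_not_accessed_since(root):
            print(path)
```

Note that some filesystems are mounted with options that delay or suppress access-time updates, so the reported times may not match what a server-side purge process sees.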

2018-12-03

O2 logins

All O2 cluster logins from outside of the HMS network will start requiring two-factor authentication.

For more details, please see: Two Factor Authentication (2FA) on O2 and Two Factor Authentication FAQ

Currently, O2 only requires a password login using your eCommons ID. Due to increased hacking attempts on O2, it is necessary to increase the security of our systems, and moving to two-factor authentication is a big step.

HMS users already must use two factor authentication for Harvard Key and HMS VPN logins. O2 logins will work similarly.

Two-factor authentication will be required when logging in from:

  • the HMS Public wireless network

  • Other Harvard networks (FAS, etc)

  • Networks at HMS affiliates (hospitals, etc)

  • Any other external network (home, etc), NOT using the HMS VPN

  • an HMS system (even on campus) which has a public-facing IP address (this is mostly for web and other application servers, not your desktop)

2018-11-28

MySQL and PostgreSQL Databases

TWiki server

A planned maintenance window at:

Wednesday, 2018-11-28, 6pm - 7pm

for the following services:

  • PostgreSQL (production and staging database servers)

  • MySQL (production and staging database servers)

  • TWiki (the website: wiki.med.harvard.edu)

Only websites and cluster jobs using these database services were affected.

2018-11-20

/n/scratch2

Intermittent storage issues affected the availability of the /n/scratch2 directories across O2 systems.

Duration: 6:00 AM - 6:00 PM

2018-10-24

/n/groups

/n/data2

Intermittent storage issues affected the availability of the /n/groups and /n/data2 directories across O2 systems.

2018-10-10

authentication service

Instability in O2's authentication service was causing some user accounts to lose group memberships across O2 systems.

Services were restored to normal at approximately 10:18am

2018-10-01

/n/scratch2 directory

When attempting to write to files under /n/scratch2 , you may see errant behavior such as:

  • Files are successfully written, but warning/error messages are generated

  • Files can not be written, with error messages such as "Bad Address"

Issue was resolved with a bug fix on the scratch2 storage server.

2018-09-08

O2 Login servers

Unplanned Outage: a core HMS network outage made the O2 login nodes unreachable. The issue was resolved by the HMS Networking team.

Duration: 2:30 PM - 5:30 PM

2018-08-17

PostgreSQL (production, staging)

MySQL (staging)

Request Tracker (RT)

These will be offline for approximately 1 hour starting at 9pm EDT for urgent maintenance.

2018-08-14

O2 Cluster and web services

Unplanned outage: a failure in the HMS virtual machine hosting infrastructure caused service outages in Research Computing's web services and, to a lesser extent, on the O2 cluster. The outage did not affect running cluster jobs, though.

Duration: 02:20 pm - 06:20 pm

2018-08-06

O2 Cluster

Unplanned outage: Cisco networking hardware failed and caused many jobs to fail. The defective hardware has been replaced and everything is stable.

Duration: 05:00 am - 08:00 pm

2018-04-25 → 2018-04-26

O2 login servers

Two login servers, login03 and login05, required reboots due to resource-intensive end user processes locking up those systems.

2018-04-11

O2 /home cluster

Severe network latency to the /home storage cluster impacted logins and processes trying to access this cluster.

Duration: 11:00am - 05:00pm

2018-04-10

O2 Cluster

Unplanned outage: networking issues disrupted communication to/from the login nodes.  Running/pending jobs were not impacted.

2018-04-03

/home filesystem

The fileserver for /home was getting close to maximum capacity and running on older hardware.



This planned maintenance involved migrating all /home data to a new fileserver with more capacity. This required a full shutdown of O2's Slurm job scheduler and unmounting /home from all cluster and infrastructure systems.

2018-03-13 → 2018-03-14

/n/scratch2 filesystem

A hardware failure on the /n/scratch2 fileserver resulted in /n/scratch2 being non-writable.

On 3/14, hardware was replaced and the filesystem repaired, after which service returned to normal.







"Unplanned SLURM outage due to scheduler issues.