NOTICE: FULL O2 Cluster Outage, January 3 - January 10th

O2 will be completely offline for a planned HMS IT data center relocation from Friday, Jan 3, 6:00 PM, through Friday, Jan 10.

  • On Jan 3 (5:30-6:00 PM): O2 login access will be turned off.
  • On Jan 3 (6:00 PM): O2 systems will start being powered off.

This project will relocate existing services, consolidate servers, reduce power consumption, and decommission outdated hardware to improve efficiency, enhance resiliency, and lower costs.

Specifically:

  • The O2 Cluster will be completely offline, including O2 Portal.
  • All data on O2 will be inaccessible.
  • Any jobs still pending when the outage begins will need to be resubmitted after O2 is back online.
  • Websites on O2 will be completely offline, including all web content.

More details at: https://harvardmed.atlassian.net/l/cp/1BVpyGqm & https://it.hms.harvard.edu/news/upcoming-data-center-relocation

O2 HPC Cluster and Computing Nodes Hardware

What is an HPC Cluster

A typical configuration for a High Performance Computing Cluster contains the following components:

  • Login Nodes: Servers that users connect to remotely and from which they submit jobs to the cluster. No memory- or CPU-intensive process should ever be run on the login nodes. On O2, CPU and memory usage on login nodes is strictly limited, so intensive processes run there will most likely be killed or perform very poorly.


  • Computing Nodes: Servers designed specifically to support memory- and CPU-intensive processes, as well as special resources (GPUs, TiB of memory, etc.). Any job correctly submitted to the cluster is eventually dispatched by the scheduler to the first available compute node.


  • Storage Server: A system of servers storing the data used on the cluster. These are usually accessible from both login and compute nodes.


  • Scheduler: The scheduler's main task is to manage the cluster's computing resources efficiently and to dispatch jobs to computing nodes according to their priorities, while maximizing overall cluster efficiency.
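
The scheduler's role described above can be illustrated with a deliberately simplified sketch. This is a toy model, not O2's actual scheduling policy: the function name dispatch, the job names, the priority values and the "highest priority first, first node with enough free cores" rule are all assumptions made for illustration. Real schedulers also account for memory, time limits, fair share and more.

```python
import heapq


def dispatch(jobs, free_cores):
    """Toy dispatcher (an illustration, not O2's real scheduling policy).

    jobs: list of (name, priority, cores) tuples in submission order.
    free_cores: dict mapping node hostname -> number of free cores.
    Returns (placements, pending): a dict job name -> node, and a list of
    job names that could not be placed on any node.
    """
    free = dict(free_cores)  # work on a copy so the caller's dict is untouched
    # Higher priority first; ties are broken by submission order.
    queue = [(-prio, order, name, cores)
             for order, (name, prio, cores) in enumerate(jobs)]
    heapq.heapify(queue)

    placements, pending = {}, []
    while queue:
        _, _, name, cores = heapq.heappop(queue)
        for node, available in free.items():
            if available >= cores:
                free[node] = available - cores  # reserve the cores
                placements[name] = node
                break
        else:
            pending.append(name)  # no node has enough free cores right now
    return placements, pending


if __name__ == "__main__":
    # Two 32-core nodes; job names and priorities are hypothetical.
    nodes = {"compute-a-16-28": 32, "compute-a-16-29": 32}
    jobs = [("align", 10, 20), ("assemble", 50, 30), ("plot", 10, 4), ("big", 5, 40)]
    placed, waiting = dispatch(jobs, nodes)
    print(placed)
    print(waiting)
```

In this example the 40-core job stays pending because no single node has that many cores, which mirrors why resource requests should match the hardware described on this page.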









O2 Cluster Architecture

O2 currently includes 390 computing nodes, for a total of 12260 cores and ~106TiB of memory.

  • 232 nodes, with hostnames composed of the prefix compute-a-16- or compute-a-17- and the node number, for example compute-a-16-28, compute-a-16-29, ..., compute-a-16-171. Each node has 32 physical compute cores and 256GiB of memory, and is connected to the network with a 10Gb Ethernet card plus a 40Gb InfiniBand card.

  • 69 nodes, with hostnames composed of the prefix compute-e-16- and the node number. Each node has 28 physical compute cores and 256GiB of memory, and is connected to the network with a 10Gb Ethernet card.

  • 17 nodes, with hostnames composed of the prefix compute-f-16- and the node number. Each node has 20 physical compute cores and 188GiB of memory, and is connected to the network with a 10Gb Ethernet card.

  • 11 heterogeneous high-memory nodes, with hostnames composed of the prefix compute-h-16- and the node number; 7 nodes have 750GiB of memory, 1 node has 300GiB and the other node has 1TiB.

  • 27 GPU compute nodes, with hostnames composed of the prefix compute-[g,gc]- and the node number, for a total of 133 GPU cards, including Tesla K80, M40, V100, V100s, RTX 6000 and RTX 8000.

  • 3 transfer nodes, with hostnames composed of the prefix compute-t-16- and the node number. Each node is a VM with 4 cores and 6GiB of memory; these nodes are intended for data transfer to/from the /n/files filesystem.
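
Because the node type is encoded in the hostname prefix, a hostname can be mapped back to its hardware family. The following is a minimal sketch based only on the prefixes and figures listed above; the function name node_family and the short family labels are made up for this example, and compute-[g,gc]- is read here as the two prefixes compute-g- and compute-gc-. Cores and memory are left as None where the list above does not give a single value.

```python
# Prefix -> (family label, physical cores, memory in GiB).
# Labels are hypothetical; None means "varies or not stated on this page".
NODE_PREFIXES = {
    "compute-a-16-": ("a", 32, 256),
    "compute-a-17-": ("a", 32, 256),
    "compute-e-16-": ("e", 28, 256),
    "compute-f-16-": ("f", 20, 188),
    "compute-h-16-": ("highmem", None, None),
    "compute-g-": ("gpu", None, None),
    "compute-gc-": ("gpu", None, None),
    "compute-t-16-": ("transfer", 4, 6),
}


def node_family(hostname):
    """Return (family, cores, mem_gib) for a hostname, or None if unknown."""
    for prefix, info in NODE_PREFIXES.items():
        if hostname.startswith(prefix):
            return info
    return None


if __name__ == "__main__":
    for host in ("compute-a-16-28", "compute-a-16-171", "login01"):
        print(host, "->", node_family(host))
```

A lookup like this can help when reading job logs, for example to check whether a job ran on a 20-core or a 32-core node.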




Detailed Node Hardware Information

Compute-a-[16,17] CPU

vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz
cache size : 40960 KB
flags*  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc  

Compute-e-16 CPU

vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
cache size : 35840 KB
flags*  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc

Compute-f-16 CPU

vendor_id : GenuineIntel
model name :  Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
cache size : 25600 KB
flags*  :  fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt

Compute-h-16 CPU

vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz   or   Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cache size : 20480 KB  or 15360 KB
flags*  :  fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat pln pts dtherm tpr_shadow vnmi flexpriority ept vpid xsaveopt

Compute-[g,gc]- CPU

vendor_id : GenuineIntel
model name :  Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz  or Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz or Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
cache size :  30720 KB or 25600 KB
flags*  :  fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc

* This information might not be relevant to most users, but it can be helpful if you are writing complex compiled code or applications.
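
As one way to use the flags listed above, the sketch below checks whether a node's CPU advertises the instruction-set extensions a compiled binary relies on. It assumes the standard Linux /proc/cpuinfo layout with a "flags" line; the helper names cpu_flags and missing_flags are made up for this example, and the sample text is an abbreviated excerpt of the compute-f-16 entry, not the full flag list.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line found


def missing_flags(cpuinfo_text, required):
    """Return the required flags that the CPU does not advertise, sorted."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))


# Abbreviated sample modeled on the compute-f-16 entry (assumption: shortened).
SAMPLE_F16 = (
    "model name : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz\n"
    "flags : fpu sse sse2 ssse3 sse4_1 sse4_2 avx f16c\n"
)

if __name__ == "__main__":
    # A binary built with AVX2 and FMA would not run on this CPU model.
    print(missing_flags(SAMPLE_F16, ["avx", "avx2", "fma"]))
    # On a real Linux node, read the live information instead:
    # with open("/proc/cpuinfo") as fh:
    #     print(missing_flags(fh.read(), ["avx2"]))
```

According to the listings above, the compute-f-16 and compute-h-16 CPUs do not report avx2 or fma, so code compiled specifically for those extensions on a compute-a node may fail on them.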