
Longwood is the newest High-Performance Compute Cluster at HMS. It is located at the Massachusetts Green High Performance Computing Center.

Specifications

Longwood contains a total of 64 H100 GPUs across 8 DGX nodes, plus 2 Grace Hopper nodes:

  • 8 NVIDIA DGX nodes, each with 8 GPUs and 80GB of VRAM per GPU

  • 2 NVIDIA Grace Hopper nodes, each containing a GPU with approximately 96GB of VRAM and over 400GB of system RAM accessible from the GPU.

This provides a heterogeneous environment with both Intel (DGX) and ARM (Grace Hopper) architectures. Module management is handled by Lmod, allowing easy loading of software suites such as the NVIDIA NeMo deep learning toolkit.
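
For example, you can confirm which architecture a node uses with uname; DGX nodes report x86_64 and Grace Hopper nodes report aarch64:

uname -m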

How to connect

The cluster is currently accessible only via secure shell (SSH) from the HMS network:

  • HMS wired LAN

  • HMS Secure wireless network

  • HMS VPN

Two-factor authentication (DUO) is not required for logins because all connections must originate from an HMS network. Currently, the login server hostname is: login.dgx.rc.hms.harvard.edu

Example login command (for HMS ID: ab123):

ssh ab123@login.dgx.rc.hms.harvard.edu

Filesystems

  • /home

    • Max: 100GiB

  • /n/scratch

    • Individual scratch folders are created by HMS RC when a new user account is made.

    • Max: 25TiB or 2.5 million files

    • Path: /n/scratch/users/<first_hms_id_char>/<hms_id> (expanded in the example after this list)

  • /n/lw_groups
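
For example, for the HMS ID ab123 used above, the scratch folder would be:

/n/scratch/users/a/ab123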

Snapshots

The .snapshot feature is available on Longwood, enabling recovery of data accidentally deleted by users. Daily snapshots are retained for 14 days and weekly snapshots for 60 days.

Note: snapshots are NOT available for data under /n/scratch

For example, if you deleted foo.txt from ~/project1, you can recover it with:

cp ~/project1/.snapshot/<select-a-snapshot-directory>/foo.txt ~/project1 
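
To choose a snapshot directory, you can first list the available snapshots:

ls ~/project1/.snapshot/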

Scheduler

  • Job scheduling is handled by the Slurm workload manager

  • The slurm/23.02.7 module is loaded by default and is required to submit jobs; see the example batch script below
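
As an illustration, here is a minimal batch script for the standard gpu_dgx partition (the time, memory, and output-file values are placeholders to adjust for your workload):

#!/bin/bash
#SBATCH -p gpu_dgx              # standard DGX partition (see Partitions below)
#SBATCH --gres=gpu:1            # request one GPU
#SBATCH -t 0-06:00              # 6-hour run time (maximum is 5 days)
#SBATCH --mem=16G               # placeholder memory request
#SBATCH -o job_%j.out           # %j expands to the job ID

module load dgx                 # Intel/DGX module stack
nvidia-smi                      # example payload: report the allocated GPU

Submit the script with sbatch <script_name>.sh and check its status with squeue -u <hms_id>.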

Software and Tools

  • Several popular tools are available as modules. Use the module -t spider command for a list of all modules.

  • Modules are available in two stacks tailored for each architecture (see the example after this list):

    • Intel: module load dgx

    • ARM: module load grace

  • Modules automatically loaded: DefaultModules and slurm

  • NVIDIA NeMo™ and BioNeMo™ are available on Longwood

  • Users can also install additional custom tools locally

  • Singularity Containers are also supported

    • Containers are located at /n/app/containers/
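
For example, the commands below load the Intel module stack, list all available modules, and run a command inside a container (the image name is a placeholder; check /n/app/containers/ for the actual files):

module load dgx                 # switch to the Intel/DGX module stack
module -t spider                # terse list of all available modules
singularity exec --nv /n/app/containers/<image>.sif nvidia-smi   # --nv exposes the host GPUs inside the container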

Partitions

  • gpu_dgx - the standard partition

  • gpu_grace - targets the Grace Hopper nodes; your software must be compiled for the ARM architecture

  • The time limit (TimeLimit) is up to 5 days for both partitions

At this time it is possible to run a CPU-only job on either gpu_dgx or gpu_grace by submitting without the --gres=gpu: flag.
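
For example, a minimal interactive CPU-only session on the standard partition (the time and memory values are placeholders):

srun -p gpu_dgx -t 0-01:00 --mem=4G --pty bash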

Email List: longwood-cluster-announce

Everyone with a Longwood account is automatically subscribed to the HMS email list longwood-cluster-announce@hms.harvard.edu, which HMS IT uses to communicate Longwood cluster information such as service outages and other notifications. Subscription to this list is compulsory, and cluster users cannot send messages to it. Any questions related to Longwood should be sent to: rchelp@hms.harvard.edu

Questions?

Contact Research Computing for support with software and tools - rchelp@hms.harvard.edu

Contact Research Data Management for support or questions about storage offerings - rdmhelp@hms.harvard.edu
