Getting Started with Longwood Cluster

Longwood is the newest High-Performance Compute Cluster at HMS. It is located at the Massachusetts Green High Performance Computing Center.

Specifications

Longwood contains a total of 64 H100 GPUs, plus 2 Grace Hopper nodes:

  • 8 NVIDIA DGX nodes, each with 8 GPUs and 80GB of VRAM per GPU

  • 2 NVIDIA Grace Hopper nodes, each containing a GPU with approximately 96GB of VRAM and over 400GB of RAM accessible from the GPU.

This provides a heterogeneous environment with both Intel (DGX) and ARM (Grace Hopper) architectures. Modules are managed with Lmod, which makes it easy to load software suites such as the NVIDIA NeMo deep learning toolkit.

How to connect

The cluster is currently accessible only via secure shell (ssh) on the command line, and only from one of the HMS networks:

  • HMS wired LAN

  • HMS Secure wireless network

  • HMS VPN

Two-factor authentication (DUO) is not required for logins because all connections must originate from an HMS network. Currently, the login server hostname is: login.dgx.rc.hms.harvard.edu

Example login command (for HMS ID: ab123):

ssh ab123@login.dgx.rc.hms.harvard.edu

Filesystems

  • /home

    • Max: 100GiB

  • /n/scratch

    • Individual scratch folders are created by HMS RC when making a new user account.

    • Max: 25TiB or 2.5 million files

    • Path: /n/scratch/users/<first_hms_id_char>/<hms_id>

  • /n/lw_groups

 

To transfer data to or from the Longwood cluster, use the server transfer.dgx.rc.hms.harvard.edu. When transferring from O2, you must initiate the transfer from a compute node.
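As a minimal sketch of how the scratch path and the transfer server fit together, the helper functions below build a user's scratch path from an HMS ID and print a copy command that targets it. The function names are hypothetical, and the choice of rsync (and its -av flags) is an assumption; this page does not say which copy tools are installed on the transfer server.

```shell
# Build the per-user scratch path described above:
#   /n/scratch/users/<first_hms_id_char>/<hms_id>
# (printf's %.1s keeps only the first character of the HMS ID.)
scratch_path() {
    printf '/n/scratch/users/%.1s/%s\n' "$1" "$1"
}

# transfer_command HMS_ID LOCAL_DIR
# Prints (does not run) a command that copies LOCAL_DIR into the user's
# scratch folder through the transfer server.
# Assumption: rsync is available on both ends.
transfer_command() {
    printf 'rsync -av %s %s@transfer.dgx.rc.hms.harvard.edu:%s/\n' \
        "$2" "$1" "$(scratch_path "$1")"
}

scratch_path ab123
transfer_command ab123 ./my_data
```

Printing the command rather than running it makes it easy to check the destination path before any data moves.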

Snapshots

.snapshot is a feature available on Longwood that enables recovery of data accidentally deleted by users. Daily snapshots are kept for 14 days and weekly snapshots for 60 days.

Note: snapshots are NOT available for data under /n/scratch

For example, if you deleted foo.txt from ~/project1, you can recover it with:

cp ~/project1/.snapshot/<select-a-snapshot-directory>/foo.txt ~/project1
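The recovery step above can be wrapped in a small helper that refuses to overwrite a file that already exists. This is only a sketch: restore_from_snapshot is a hypothetical name, and the snapshot directory name still has to be chosen by listing the .snapshot folder first.

```shell
# restore_from_snapshot DIR SNAPSHOT FILE
# Copies FILE from DIR/.snapshot/SNAPSHOT back into DIR.
# Fails if the file is missing from the snapshot or already exists in DIR.
restore_from_snapshot() {
    src="$1/.snapshot/$2/$3"
    if [ ! -e "$src" ]; then
        echo "not found in snapshot: $src" >&2
        return 1
    fi
    if [ -e "$1/$3" ]; then
        echo "refusing to overwrite existing file: $1/$3" >&2
        return 1
    fi
    # -p preserves the original timestamps and permissions
    cp -p "$src" "$1/$3"
}

# Usage (the snapshot name is a placeholder, as in the example above):
#   ls "$HOME/project1/.snapshot/"
#   restore_from_snapshot "$HOME/project1" "<select-a-snapshot-directory>" foo.txt
```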

Scheduler

  • Job scheduling is handled by the Slurm workload manager

  • The slurm/23.02.7 module is loaded by default and required to submit jobs

Software and Tools

  • Several popular tools are available as modules. Use the module -t spider command for a list of all modules.

  • Modules installed by the RC team are available in two stacks tailored for each architecture:

    • Intel: module load dgx

    • ARM: module load grace

  • Modules automatically loaded: DefaultModules and slurm

  • NVIDIA NeMo™ and BioNeMo™ are available on Longwood

  • Users can also install additional custom tools locally

  • It is possible to load any module directly from login nodes, but the actual software (under /n/app) is only available on compute nodes

  • Singularity Containers are also supported

    • Containers are located at /n/app/containers/
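Because the two module stacks are tied to CPU architecture, a script that may run on either node type can pick the stack from the output of uname -m. The sketch below assumes the DGX nodes report x86_64 and the Grace Hopper nodes report aarch64, which is typical for Linux on Intel and ARM hardware but is not stated on this page; stack_for_arch is a hypothetical helper name.

```shell
# stack_for_arch ARCH
# Maps a machine architecture string (as printed by `uname -m`) to the
# module stack listed above: dgx for Intel, grace for ARM.
stack_for_arch() {
    case "$1" in
        x86_64)        echo dgx ;;
        aarch64|arm64) echo grace ;;
        *)             echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# Usage inside a job script running on a compute node:
#   module load "$(stack_for_arch "$(uname -m)")"
#   module -t spider    # then list the modules that are available
```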

Partitions

  • gpu_dgx - the standard GPU partition

  • gpu_grace - targets the special Grace Hopper nodes. Software run here must be compiled for ARM

  • gpu_dia - the DIA dedicated GPU partition which takes priority over gpu_dgx

  • cpu - the partition available to run jobs that do not require a GPU card. This partition does not include Grace Hopper nodes. If you require a Grace Hopper node, you can still submit a job to gpu_grace without specifying a GPU resource.

  • TimeLimit is up to 5 days for both partitions

 

If you need to run a CPU-only job, please use the cpu partition.

CPU-only jobs submitted to gpu_dia or gpu_dgx will never start.
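To illustrate how the partition choice and the GPU request go together, the sketch below generates a minimal Slurm batch script. The helper name make_job_script is hypothetical, the CPU, memory, and time values are illustrative only, and the --gres=gpu:N syntax is the generic Slurm form, assumed (not confirmed on this page) to be how GPUs are requested on Longwood.

```shell
# make_job_script PARTITION GPUS
# Prints a minimal Slurm batch script to stdout.
# GPUS=0 omits the GPU request, as needed for the cpu partition
# (or for a CPU-only job on gpu_grace).
make_job_script() {
    echo '#!/bin/bash'
    echo "#SBATCH --partition=$1"
    echo '#SBATCH --time=0-02:00:00'    # days-hours:minutes:seconds
    echo '#SBATCH --cpus-per-task=4'    # illustrative value
    echo '#SBATCH --mem=16G'            # illustrative value
    if [ "$2" -gt 0 ]; then
        echo "#SBATCH --gres=gpu:$2"    # assumed GPU request syntax
    fi
    echo 'hostname'                     # placeholder workload
}

# Usage:
#   make_job_script gpu_dgx 1 > gpu_job.sh && sbatch gpu_job.sh
#   make_job_script cpu 0     > cpu_job.sh && sbatch cpu_job.sh
```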

 

Limits

  • GPU LIMIT
    Each Longwood user can allocate at most 1800 GPU hours at any one time.
    The limit renews continuously. It counts only the user's currently running jobs (not past jobs) and, more specifically, only the walltime remaining on those jobs.

    For example, if the cluster is empty and a user submits 100 single-card jobs with a walltime of 3 days, only 25 of those jobs will be able to start right away as those will allocate the total 1800 GPU hours (24 hr * 3 days * 25 cards).

    However, after only 3 hours of running, the allocated GPU hours for that user would drop to 1725, and another 3-day job (worth an additional 72 GPU hour allocation) could start, and so on.

    If a user has no GPU jobs running, the user has the full 1800 GPU hours available.

    If a user has GPU jobs running, the user has 1800 GPU hours minus the sum of the remaining allocated GPU hours of those running jobs.

    This limit's primary purpose is to prevent a single user from locking most of the cluster for several days.
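The arithmetic behind the limit can be sketched in a few lines of shell. The function names and the COUNT:GPUS:HOURS_LEFT job notation are made up for this illustration, and hours are whole numbers only; the sketch shows the calculation described above, not how the scheduler itself implements it.

```shell
GPU_HOUR_LIMIT=1800

# allocated_gpu_hours COUNT:GPUS:HOURS_LEFT ...
# Sums COUNT x GPUS x HOURS_LEFT over groups of running jobs, where
# HOURS_LEFT is the walltime remaining on each job in the group.
allocated_gpu_hours() {
    total=0
    for job in "$@"; do
        count=${job%%:*}
        rest=${job#*:}
        gpus=${rest%%:*}
        hours_left=${rest#*:}
        total=$((total + count * gpus * hours_left))
    done
    echo "$total"
}

# available_gpu_hours COUNT:GPUS:HOURS_LEFT ...
# GPU hours the user can still allocate to new jobs.
available_gpu_hours() {
    echo $((GPU_HOUR_LIMIT - $(allocated_gpu_hours "$@")))
}

available_gpu_hours            # no running jobs: 1800
available_gpu_hours 25:1:72    # 25 single-card 3-day jobs just started: 0
available_gpu_hours 25:1:69    # the same jobs 3 hours later: 75
```

With 75 GPU hours available after 3 hours, one more single-card 3-day job (72 GPU hours) fits under the limit, matching the example above.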

Email List: longwood-cluster-announce

Everyone with a Longwood account is automatically subscribed to the HMS email list longwood-cluster-announce@hms.harvard.edu, which HMS IT uses to communicate Longwood-related information such as service outages and other notifications. Subscription to this list is compulsory, and cluster users cannot send messages to it. Send any questions related to Longwood to rchelp@hms.harvard.edu

Questions?

Contact Research Computing for support with software and tools - rchelp@hms.harvard.edu

Contact Research Data Management for support or questions about storage offerings - rdmhelp@hms.harvard.edu
