
...

The O2 cluster is a collection of hundreds of computers with thousands of processing cores. We're using the SLURM (Simple Linux Utility for Resource Management) scheduler on O2. SLURM is basically a system for ensuring that the hundreds of users "fairly" share the processors and memory in the cluster.

...

  • host, node, and server are all just fancy words for a computer

  • core: Cluster nodes can run multiple jobs at a time. Currently, most of the nodes on O2 have 32 cores, meaning they can run 32 simple jobs at a time. Some jobs can utilize more than one core.

  • master host: The system that performs overall coordination of the SLURM cluster. In our case we have a primary server named slurm-prod01 and one backup.

  • submission host: When logging in to o2.hms.harvard.edu, you'll actually end up on a "login node" called login01, login02, login03, login04 or login05. You submit your SLURM jobs from there. Do not run any computationally heavy processes on login nodes!

  • execution host: (or "compute node") A system where SLURM jobs will actually run. In O2, compute nodes are named compute-x-yy-zz, where x, yy, and zz indicate the location of the machine in the computer room.

  • partition: Partitions can also be referred to as "queues." When submitting a job, you'll need to specify a partition to submit your job to (a brief example follows this list). If you're not sure which partition to use, see the How to choose a partition in O2 page. We aimed to name the partitions logically, so a partition's name should tell you the type of jobs that can be run through it, what resources those jobs can access, who can submit jobs to it, and so forth.

  • filesystem: From your perspective, just a fancy word for a big disk drive. The different filesystems O2 sees have different characteristics, in terms of the speed of reading/writing data, whether they're backed up, etc. See Filesystems.
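
To make these terms concrete, here is a minimal sketch of submitting work: you connect to a login node, then submit a job to a partition from there. The script name myjob.sh is just a placeholder.

    # Connect to O2; you will land on one of the login nodes (e.g., login01)
    ssh <userid>@o2.hms.harvard.edu

    # From the login node, submit a batch script to a partition (here, short)
    sbatch -p short myjob.sh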

...

HMS Research Computing has detailed the various research data storage offerings here. Based upon where you are in the research data lifecycle, you may need to leverage different storage options. When you're in the process of analyzing your data, you can utilize "Active" storage. When you no longer need to use some data for a period of time (but may need to pull it back later for reference or analysis), you can move this data to "Standby" storage. Please reference the aforementioned HMS RC Storage page for a comparison of the various storage options and to help identify which storage option is right for your project. You can also reach out to Research Data Management for assistance with managing your research data using the available storage options.

...

We can also get more granular about where to store your files, specifically which directories a cluster user can write to on O2:

  • /home: Every cluster user has a home directory with a 100GiB quota. When you log in to the cluster, you will be placed in your home directory. This directory is named like /home/user, where "user" is replaced with your eCommons ID in lowercase.

  • /n/scratch3: Each cluster user can use /n/scratch3 for storage of temporary or intermediary files. The per-user scratch3 quota is 10TiB or 1 million files/directories. Any file on scratch3 that is not accessed for 30 days will be deleted, and there are no backups of scratch3 data. A cluster user must create their scratch3 directory using a provided script.

  • Lab or group directories: These are located under /n/groups, /n/data1, or /n/data2. The quota for a group directory is shared by all group members, and there is no standard quota that applies to every group.

Any of these locations (/home, /n/scratch3, or a group directory) can be used to store data that will be computed against on O2. When you submit a job to the SLURM scheduler, you will need to specify the directory where the data you want to compute on is stored and the directory where you want the output data to be written.
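
As a rough illustration, a batch script typically reads input from one of these locations and writes output to another, for example reading from a group directory and writing intermediate files to scratch3. The paths, lab name, and tool name below are hypothetical; substitute your own group directory and the scratch3 directory you created with the provided script.

    #!/bin/bash
    #SBATCH -p short                 # partition to run in
    #SBATCH -t 0-02:00               # 2-hour runtime limit
    #SBATCH -c 1                     # one core
    #SBATCH --mem 4G                 # 4GiB of memory

    # Hypothetical input/output locations
    INPUT=/n/groups/<yourlab>/project/sample.fastq
    OUTDIR=<your-scratch3-directory>/project

    mkdir -p "$OUTDIR"
    my_analysis_tool "$INPUT" > "$OUTDIR/sample.out"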

...

Partition                                   Specification *
short                                       12 hours
medium                                      5 days
long                                        30 days
interactive                                 2 job limit, 12 hours (default 4GB memory)
mpi                                         5 days limit
priority                                    2 job limit, 30 day limit
transfer                                    4 cores max, 5 days limit, 5 concurrent cores per user
gpu, gpu_requeue, gpu_quad, gpu_mpi_quad    see Using O2 GPU resources
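
The limits above may change over time; you can confirm the current time limit for a partition yourself with sinfo, whose default output includes a TIMELIMIT column. For example:

    # Show node and time-limit information for a few partitions
    sinfo -p short,medium,long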

...

Note:  Any time <userid> is mentioned in this document, it should be replaced with your HMS account, formerly called an eCommons ID (and omit the <>). Likewise, <jobid> should be replaced with an actual job ID, such as 12345. The name of a batch job submission script should be inserted wherever <jobscript> is mentioned.

SLURM command    Sample command syntax                             Meaning
sbatch           sbatch <jobscript>                                Submit a batch (non-interactive) job.
srun             srun --pty -t 0-0:5:0 -p interactive /bin/bash    Start an interactive session for five minutes in the interactive queue, with the default 1 CPU core and 4GB of memory.
squeue           squeue -u <userid>                                View the status of your jobs in the queue. Only non-completed jobs will be shown. We have an easier-to-use alternative command called O2squeue.
scontrol         scontrol show job <jobid>                         Look at a running job in detail. For more information about the job, add the -dd parameter.
scancel          scancel <jobid>                                   Cancel a job. scancel can also be used to kill job arrays or job steps.
scontrol         scontrol hold <jobid>                             Pause a job.
scontrol         scontrol release <jobid>                          Release a held job (allow it to run).
sacct            sacct -j <jobid>                                  Check job accounting data. Running sacct is most useful for completed jobs. We have an easier-to-use alternative command called O2sacct.
sinfo            sinfo                                             See node and partition information. Use the -N parameter to see information per node.
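
To tie these commands together, a typical session might look like the sketch below; myjob.sh is a placeholder script, and job ID 12345 stands in for whatever ID sbatch reports.

    sbatch myjob.sh            # submit the batch script; SLURM prints the new job ID
    squeue -u <userid>         # check whether the job is pending or running
    scontrol show job 12345    # look at the job in detail
    scancel 12345              # cancel the job if it is no longer needed
    sacct -j 12345             # after the job finishes, review its accounting data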

...

We recommend using commands such as O2squeue or O2sacct for job monitoring instead of relying on the email notifications, which do not contain much information.

...

Again, we recommend using commands like O2squeue or O2sacct for job monitoring rather than the SLURM email notifications.

...

You can then run your commands, and logout when you're done. The interactive queue has a limit of 12 hours and a default allocation of 1 CPU core and 4GB of memory.

You can request extra memory or multiple cores (up to 20) in an interactive srun session.
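
For example, to request a four-core, 8GiB interactive session for the full 12 hours (the resource amounts here are arbitrary, up to the limits above), you might run:

    srun --pty -p interactive -t 0-12:00 -c 4 --mem 8G /bin/bash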

...


Example monitoring commands

...

By default, you will not receive an execution report by email when a job you have submitted to SLURM completes. If you would like to receive such notifications, use --mail-type and optionally --mail-user in your job submission. Currently, the SLURM emails contain minimal information (job ID, job name, run time, status, and exit code); this was an intentional design decision by the SLURM developers. At this time, we suggest running sacct/O2sacct queries for more detailed information than the job emails provide.
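
For example, to be notified when a job ends or fails, you could add directives like these to your batch script (the address is a placeholder, and --mail-user is optional as noted above):

    #SBATCH --mail-type=END,FAIL
    #SBATCH --mail-user=<userid>@example.com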

...

We recommend monitoring your jobs using commands on O2 such as squeue or O2squeue, and sacct or O2sacct.

More information for a job can be found by running sacct -j <jobid>, or by using O2sacct <jobid>. The following command can be used to obtain accounting information for a completed job:

...

For much more information on figuring out why jobs don't start, or don't finish, see the separate page on troubleshooting Slurm jobs.

If you are contacting Research Computing because a job did not behave as expected, it's often helpful to include the job report in your request. Just attach it to the email, or paste it into the form.

...

For other available "filename patterns" (what %N and %j are called by the SLURM developers), please reference the SLURM sbatch documentation.
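
For example, the following directives (the file names are just illustrative) write the job's output and error streams to files named after the job ID and the node it ran on:

    #SBATCH -o myjob_%j_%N.out    # %j expands to the job ID, %N to the node name
    #SBATCH -e myjob_%j_%N.err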


Running graphical programs on O2 with X11

...


If you accidentally ask for only one core in the job submission (sbatch -c 1) but try to have your program use multiple cores (tophat -p 8), the extra threads will not actually get additional CPUs. On O2, CPU usage is restricted to the cores you request; this is enforced by the cgroups plugin. You will not receive an explicit error message, but your job will nevertheless be confined to the allocated cores. For this reason, you may observe reduced performance on O2 compared to systems that let you use cores that were not explicitly allocated to you.
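
In other words, the core request and the program's thread count should match. A sketch of a consistent request (the tophat arguments are abbreviated, and the partition and runtime are just examples):

    # Request 8 cores and tell the program to use the same number of threads
    sbatch -c 8 -p short -t 0-06:00 --wrap "tophat -p 8 <tophat arguments>"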

...