
The basic process of running jobs:

  1. You log in via SSH (secure shell) to the host o2.hms.harvard.edu

    1. If you are connecting to O2 from outside of the HMS network, you will be required to use two-factor authentication.

    2. Please reference this wiki page for detailed instructions on how to log in to the cluster.

  2. If necessary, you copy your data to the cluster from your desktop or another location. See the File Transfer page for data transfer instructions, and the example commands after this list. You may also want to copy large data inputs to the scratch3 filesystem for faster processing.

  3. You submit your job - for example, a program to map DNA reads to a genome - specifying how long your job will take to run and which partition to run in. You can modify your job submission in many ways, like requesting a large amount of memory, as described below; a sample batch script follows this list.

  4. Your job sits in a partition (job status PENDING), and when it's your turn to run, SLURM finds a computer that's not too busy.

  5. Your job runs on that computer (also known as a "compute node"), and goes to job status RUNNING. While it's running, you don't interact with the program. If you're running a program that requires user input or pointing and clicking, see Interactive Sessions below.

  6. The job finishes running (job status COMPLETED, or FAILED if it had an error). You get an email when the job is finished if you specified --mail-type (and optionally --mail-user) in your job submission command.

  7. If necessary, copy data back from the scratch filesystem to a backed-up location, or to your desktop.
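
For steps 1 and 2, the commands look roughly like the following sketch. The file name and destination path are placeholders, and the File Transfer page may recommend a different transfer method or host:

  ssh <userid>@o2.hms.harvard.edu                                        # log in; two-factor prompt appears if you are outside the HMS network
  scp reads.fastq <userid>@o2.hms.harvard.edu:/path/to/your/directory/   # copy an input file from your desktop to the cluster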
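
For steps 3 and 6, a minimal batch script (saved as, say, my_job.sh) might look like the sketch below. The partition, runtime, memory, and program are illustrative placeholders, not recommendations:

  #!/bin/bash
  #SBATCH -p short                      # partition to run in
  #SBATCH -t 0-06:00                    # runtime limit in D-HH:MM
  #SBATCH -c 1                          # number of CPU cores
  #SBATCH --mem=8G                      # memory requested
  #SBATCH --mail-type=END,FAIL          # email when the job completes or fails
  #SBATCH --mail-user=you@example.com   # replace with your email address

  # Load any modules your program needs, then run it. my_program and its
  # arguments are placeholders for your actual command, e.g. a read mapper.
  ./my_program --input reads.fastq --output aligned.sam

You would then submit the script with: sbatch my_job.sh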

Definitions

Here's some of the terminology you'll see used in relation to SLURM in this and other documents:

...

Note:  Any time <userid> is mentioned in this document, it should be replaced with your HMS account, formerly called an eCommons ID (and omit the <>). Likewise, <jobid> should be replaced with an actual job ID, such as 12345. The name of a batch job submission script should be inserted wherever <jobscript> is mentioned.

SLURM command, sample command syntax, and meaning:

  • sbatch: sbatch <jobscript>
    Submit a batch (non-interactive) job.

  • srun: srun --pty -t 0-0:5:0 -p interactive /bin/bash
    Start an interactive session for five minutes in the interactive queue, with the default 1 CPU core and 4 GB of memory.

  • squeue: squeue -u <userid>
    View the status of your jobs in the queue. Only non-completed jobs will be shown. We have an easier-to-use alternative command called O2squeue.

  • scontrol: scontrol show job <jobid>
    Look at a running job in detail. For more information about the job, add the -dd parameter.

  • scancel: scancel <jobid>
    Cancel a job. scancel can also be used to kill job arrays or job steps.

  • scontrol: scontrol hold <jobid>
    Pause a job.

  • scontrol: scontrol release <jobid>
    Release a held job (allow it to run).

  • sacct: sacct -j <jobid>
    Check job accounting data. Running sacct is most useful for completed jobs. We have an easier-to-use alternative command called O2sacct.

  • sinfo: sinfo
    See node and partition information. Use the -N parameter to see information per node.
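
Putting these commands together, a typical monitoring sequence looks like the following sketch; my_job.sh and the job ID 12345 are placeholders:

  sbatch my_job.sh            # submit; SLURM prints the job ID it assigned
  squeue -u <userid>          # is the job PENDING or RUNNING?
  scontrol show job 12345     # inspect the job's requested resources (add -dd for more detail)
  sacct -j 12345              # after it finishes, check its accounting data
  scancel 12345               # or cancel it, if it was submitted by mistake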

...

Workflow, best queue to use, and time limit:

  • 1 or 2 jobs at a time: priority (1 month)

  • More than 2 short jobs: short (12 hours)

  • More than 2 medium jobs: medium (5 days)

  • More than 2 long jobs: long (1 month)

  • MPI parallel job using multiple nodes (with sbatch -a and mpirun.sh): mpi (5 days)

  • GPU job: gpu, gpu_quad, gpu_mpi_quad, gpu_requeue (5 days and 1 day)

  • Interactive work (e.g. MATLAB graphical interface, editing/testing code) rather than a batch job: interactive (12 hours)
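
To target a queue from the list above, pass it with -p and request a runtime under its limit with -t. As a sketch, with placeholder script name and times:

  sbatch -p short -t 12:00:00 my_job.sh              # batch job in the short partition, up to its 12-hour limit
  srun --pty -p interactive -t 0-12:00 /bin/bash     # interactive shell in the interactive partition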

...

...