...
The basic process of running jobs:
You log in via SSH (secure shell) to the host o2.hms.harvard.edu. If you are connecting to O2 from outside of the HMS network, you will be required to use two-factor authentication. Please reference this wiki page for detailed instructions on how to log in to the cluster.
If necessary, you copy your data to the cluster from your desktop or another location. See the File Transfer page for data transfer instructions. You may also want to copy large data inputs to the scratch3 filesystem for faster processing.
You submit your job - for example, a program to map DNA reads to a genome - specifying how long your job will take to run and what partition to run in. You can modify your job submission in many ways, like requesting a large amount of memory, as described below.
Your job sits in a partition (job status PENDING), and when it's your turn to run, SLURM finds a computer that's not too busy.
Your job runs on that computer (also known as a "compute node"), and goes to job status RUNNING. While it's running, you don't interact with the program. If you're running a program that requires user input or pointing and clicking, see Interactive Sessions below.
The job finishes running (job status COMPLETED, or FAILED if it had an error). You get an email when the job is finished if you specified --mail-type (and optionally --mail-user) in your job submission command.
If necessary, you might want to copy data back from the scratch filesystem to a backed-up location, or to your desktop.
Definitions
Here's some of the terminology that you'll find used in relation to SLURM in this and other documents:
...
Note: Any time <userid> is mentioned in this document, it should be replaced with your HMS account, formerly called an eCommons ID (and omit the <>). Likewise, <jobid> should be replaced with an actual job ID, such as 12345. The name of a batch job submission script should be inserted wherever <jobscript> is mentioned.
SLURM command | Sample command syntax | Meaning |
---|---|---|
sbatch | | Submit a batch (non-interactive) job. |
srun | | Start an interactive session for five minutes in the interactive queue with default 1 CPU core and 4GB of memory |
squeue | | View status of your jobs in the queue. Only non-completed jobs will be shown. We have an easier-to-use alternative command called O2squeue. |
scontrol | | Look at a running job in detail. For more information about the job, add the |
scancel | | Cancel a job. |
scontrol | | Pause a job |
scontrol | | Release a held job (allow it to run) |
sacct | | Check job accounting data. Running We have an easier-to-use alternative command called O2sacct. |
sinfo | | See node and partition information. Use the |
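
The sample syntax column is empty in this copy of the table, so the sketch below lists standard SLURM invocations that match each row's description. They are not necessarily the exact samples the page originally showed. The values given to userid, jobid, and jobscript are illustrative, and the show helper is a hypothetical function that prints each command instead of running it, so the block is safe to try on any machine.

```shell
#!/bin/bash
# Placeholders from the note above, with illustrative values.
userid="${USER:-myuser}"   # your HMS account
jobid=12345                # sample job ID used in this document
jobscript="myjob.sh"       # hypothetical batch script name

# Hypothetical helper: print the command rather than executing it.
show() { printf '%s\n' "$*"; }

# Submit a batch (non-interactive) job.
show sbatch "$jobscript"
# Start a five-minute interactive session in the interactive partition.
show srun --pty -p interactive -t 0-00:05 /bin/bash
# View the status of your non-completed jobs.
show squeue -u "$userid"
# Look at a job in detail.
show scontrol show job "$jobid"
# Cancel a job.
show scancel "$jobid"
# Hold a pending job so it does not start, then release it again.
show scontrol hold "$jobid"
show scontrol release "$jobid"
# Check accounting data for a job.
show sacct -j "$jobid"
# See node and partition information.
show sinfo
```

Dropping the show prefix from a line turns it into the real command on a system where SLURM is installed.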
...
Workflow | Best queue to use | Time limit |
---|---|---|
1 or 2 jobs at a time | | 1 month |
> 2 short jobs | | 12 hours |
> 2 medium jobs | medium | 5 days |
> 2 long jobs | | 1 month |
MPI parallel job using multiple nodes (with sbatch | | 5 days |
| 5 days and 1 day | |
Interactive work (e.g. MATLAB graphical interface, editing/testing code) rather than a batch job | | 12 hours |
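
The time limits in the "> 2 jobs" rows can be turned into a quick check of which row a job falls under. This is a minimal sketch under stated assumptions: job_class is a hypothetical helper, it takes a whole number of hours, and "1 month" is treated as 30 days (720 hours). The words it prints are the workflow labels from the table. Only medium is confirmed as a partition name in this copy of the document.

```shell
#!/bin/bash
# job_class HOURS: print the table row (short, medium, long) that a job
# of the given length in whole hours falls under.
# Limits from the table: 12 hours, 5 days (120 hours), 1 month.
# Assumption: 1 month is counted as 30 days (720 hours).
job_class() {
    local hours=$1
    if [ "$hours" -le 12 ]; then
        echo "short"
    elif [ "$hours" -le 120 ]; then
        echo "medium"
    elif [ "$hours" -le 720 ]; then
        echo "long"
    else
        # Longer than any time limit listed in the table.
        echo "job_class: ${hours}h exceeds the longest listed limit" >&2
        return 1
    fi
}

job_class 8      # prints: short
job_class 48     # prints: medium
job_class 200    # prints: long
```

Requesting a time limit close to the real runtime, rather than the maximum for the row, is generally worthwhile because SLURM can schedule shorter requests into smaller gaps.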
...
groups (/n/groups)
log (/n/log - for web hosting users)
files (/n/files - accessible through transfer partition, but you must request access to run jobs in this partition)
scratch3 (/n/scratch3 - for storage of temporary or intermediary files)
...