...
Note: Any time <userid> is mentioned in this document, replace it with your HMS ID (formerly eCommons ID) and omit the <>. Likewise, replace <jobid> with an actual job ID, such as 12345. Insert the name of a batch job submission script wherever <jobscript> is mentioned.
SLURM command | Sample command syntax | Meaning |
---|---|---|
sbatch | | Submit a batch (non-interactive) job. |
srun | | Start an interactive session for five minutes in the interactive queue with the default 1 CPU core and 4 GB of memory. |
squeue | | View the status of your jobs in the queue. Only non-completed jobs will be shown. We have an easier-to-use alternative command called O2squeue. |
scontrol | | Look at a running job in detail. For more information about the job, add the |
scancel | | Cancel a job. |
scontrol | | Pause a job. |
scontrol | | Release a held job (allow it to run). |
sacct | | Check job accounting data. We have an easier-to-use alternative command called O2_jobs_report. |
sinfo | | See node and partition information. Use the |
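As a quick reference, the commands in the table are typically invoked along the following lines. This is a sketch rather than an authoritative listing: the job ID, user ID, and script name are hypothetical placeholders, and the partition name `interactive` is assumed from the description of the interactive queue. The `run` helper only prints each command, so nothing is submitted or cancelled by accident.

```shell
#!/bin/bash
# Hypothetical placeholder values; substitute your own.
userid=abc123
jobid=12345
jobscript=myjob.sh

# Print a command instead of executing it.
# Replace the body with "$@" to run the commands for real.
run() { printf '%s\n' "$*"; }

run sbatch "$jobscript"                              # submit a batch job
run srun --pty -p interactive -t 0-00:05 /bin/bash   # 5-minute interactive session (partition name assumed)
run squeue -u "$userid"                              # your non-completed jobs
run scontrol show job "$jobid"                       # details for a running job
run scancel "$jobid"                                 # cancel a job
run scontrol hold "$jobid"                           # hold a job so it does not start
run scontrol release "$jobid"                        # release a held job
run sacct -j "$jobid"                                # accounting data for a job
run sinfo                                            # node and partition information
```

The time value `0-00:05` uses Slurm's days-hours:minutes form, which here means five minutes.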
...
We recommend using commands such as O2squeue or O2_jobs_report for job monitoring instead of relying on the email notifications, which do not contain much information.
...
Again, we recommend using commands like O2squeue or O2_jobs_report for job monitoring instead of the SLURM email notifications.
...
If we save this script as srun_in_sbatch.sh, it can be submitted with sbatch srun_in_sbatch.sh. After the job completes, you can see the job statistics (which will be broken down by numbered job steps) by running sacct -j <jobid> or by using O2_jobs_report -j <jobid>.
Job Arrays
Job arrays can be leveraged to quickly submit a number of similar jobs. For example, you can use job arrays to start multiple instances of the same program on different input files, or with different input parameters. A job array is technically one job, but with multiple tasks.
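A minimal sketch of a job array script follows. It assumes ten hypothetical input files named `input_1.txt` through `input_10.txt` and a placeholder program called `my_program`; the time and memory values are also illustrative, and any site-specific options (such as a partition) are left out. Slurm sets `SLURM_ARRAY_TASK_ID` to a different index in each task, which the script uses to pick its input.

```shell
#!/bin/bash
#SBATCH --array=1-10            # ten tasks, indices 1 through 10
#SBATCH -t 0-00:10              # time limit per task (assumed value)
#SBATCH --mem=1G                # memory per task (assumed value)
#SBATCH -o array_%A_%a.out      # %A = array job ID, %a = task index

# Slurm sets SLURM_ARRAY_TASK_ID in each task; default to 1 so the
# script can also be tried outside of Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Map a task index to its input file (hypothetical naming scheme).
input_for_task() { echo "input_${1}.txt"; }

input_file="$(input_for_task "$task_id")"
echo "Task ${task_id} processing ${input_file}"
# my_program "$input_file"      # placeholder for the real command
```

If this were saved as, say, `array_example.sh`, a single `sbatch array_example.sh` would queue all ten tasks.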
...
squeue - by default, squeue will show information about all users' jobs. Use -u <userid> to get information just about yours. An easier-to-use alternative command to squeue is called O2squeue.
scontrol - most scontrol options can't be invoked by regular users, but scontrol show job <jobid> is a useful command that gives detailed job information. This command only works for currently running jobs.
sstat - shows status information for currently running jobs. Many fields can be requested using the --format parameter. Reference the job status fields in the sstat documentation for more information.
sacct - reports accounting information for jobs and job steps. This works for both running and completed jobs, but it is most useful for completed jobs. Many fields can be requested using the --format parameter. Check the job accounting fields in the sacct documentation for more information. An easier-to-use alternative command to sacct is called O2_jobs_report.
Example monitoring commands
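As an illustration of the `--format` parameter, the sketch below assembles an `sstat` and an `sacct` command with explicit field lists. The job ID is a hypothetical placeholder, and the chosen fields are just one reasonable selection of standard Slurm field names. The commands are built as strings and printed so they can be inspected before being run on the cluster.

```shell
#!/bin/bash
jobid=12345      # hypothetical job ID

# Field lists passed to --format.
sstat_fields="JobID,AveCPU,MaxRSS,MaxVMSize"
sacct_fields="JobID,JobName,Partition,State,Elapsed,ReqMem,MaxRSS,ExitCode"

# For a running batch job, the batch script itself is the step "<jobid>.batch".
sstat_cmd="sstat -j ${jobid}.batch --format=${sstat_fields}"

# sacct works for running and completed jobs.
sacct_cmd="sacct -j ${jobid} --format=${sacct_fields}"

printf '%s\n' "$sstat_cmd" "$sacct_cmd"
```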
...
By default, you will not receive an execution report by e-mail when a job you have submitted to Slurm completes. If you would like to receive such notifications, please use --mail-type and optionally --mail-user in your job submission. Currently, the SLURM emails contain minimal information (jobid, job name, run time, status, and exit code); this was an intentional design decision by the SLURM developers. At this time, we suggest running sacct or O2_jobs_report queries for more detailed information than the job emails provide.
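For example, a job script might request notifications as sketched below. The e-mail address is a hypothetical placeholder and the time limit is an assumed value; `END` and `FAIL` are two of the event types `--mail-type` accepts.

```shell
#!/bin/bash
#SBATCH --mail-type=END,FAIL            # e-mail when the job ends or fails
#SBATCH --mail-user=user@example.com    # hypothetical address
#SBATCH -t 0-00:05                      # assumed time limit

# SLURM_JOB_ID is set by Slurm inside a job; fall back to a label elsewhere.
job_label="${SLURM_JOB_ID:-not-in-slurm}"
echo "Running job ${job_label}"
```

The same options can also be given on the command line, as in `sbatch --mail-type=END,FAIL <jobscript>`.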
...
We recommend monitoring your jobs using commands on O2 such as squeue or O2squeue, and sacct or O2_jobs_report.
More information for a job can be found by running sacct -j <jobid>, or by using O2_jobs_report -j <jobid>.
The following command can be used to obtain accounting information for a completed job:
...
Similarly, if your job used too much memory, you will receive an error like: Job <jobid> exceeded memory limit <memorylimit>, being killed. For this job, sacct or O2_jobs_report will report a MaxRSS larger than ReqMem and an OUT_OF_MEMORY job status. You will need to rerun the job, requesting more memory.
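The MaxRSS and ReqMem comparison can be sketched in shell as follows. The helper names `to_kb` and `exceeded_request` and the sample values are invented for illustration. The sketch assumes memory values carry a K, M, or G suffix, and it ignores the per-node versus per-core distinction that older Slurm versions encode as a trailing `n` or `c` on ReqMem.

```shell
#!/bin/bash
# Convert a Slurm memory string such as "4G", "4Gn", or "5242880K" to kilobytes.
to_kb() {
    local value="${1%[nc]}"            # strip a trailing per-node/per-core marker
    local number="${value%[KMG]}"      # numeric part
    local unit="${value#"$number"}"    # unit suffix
    local factor
    case "$unit" in
        K) factor=1 ;;
        M) factor=1024 ;;
        G) factor=1048576 ;;
        *) echo "unrecognized memory value: $1" >&2; return 1 ;;
    esac
    awk -v n="$number" -v f="$factor" 'BEGIN { printf "%.0f\n", n * f }'
}

# Succeeds (exit status 0) when peak usage exceeded the request.
exceeded_request() {
    [ "$(to_kb "$1")" -gt "$(to_kb "$2")" ]
}

# Invented sample values; on the cluster they could come from something like:
#   sacct -j <jobid> --format=JobID,State,ReqMem,MaxRSS
maxrss="5242880K"
reqmem="4G"

if exceeded_request "$maxrss" "$reqmem"; then
    echo "MaxRSS ${maxrss} exceeds ReqMem ${reqmem}; rerun with more memory, e.g. sbatch --mem=8G <jobscript>"
fi
```

How much extra memory to request is a judgment call; the `--mem=8G` in the message is only an example value.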
...