Errors in submitting jobs
...
Unfortunately, O2 job emails are not very detailed, but they do report the exit status of your job (e.g. COMPLETED or FAILED). It can be difficult to interpret how a job ran from the limited information O2 job emails contain. We suggest using the O2_jobs_report command or examining output and error files (specified through the -o and -e parameters in your job submission) instead.
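For instance, a minimal sketch of a submission script that captures output and error files (the partition, time limit, and program name here are placeholders, not recommendations):

```
#!/bin/bash
#SBATCH -p short            # partition to run in (placeholder)
#SBATCH -t 0-01:00          # walltime limit (1 hour)
#SBATCH -o myjob_%j.out     # file for STDOUT; %j expands to the job ID
#SBATCH -e myjob_%j.err     # file for STDERR

./my_analysis               # hypothetical program to run
```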
To look up information for a completed job:
O2_jobs_report -j <jobid>
See the page on O2_jobs_report for more details on using this command.
...
Slurm allows you to have "job steps", which are tasks that are part of a job (See the official Slurm Quick Start Guide for more information). By default, there will be one job step per job. Depending on which memory limit your job exceeds (job limit or step limit), you will see one of the above messages.
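As a minimal sketch (the commands are placeholders), each srun invocation inside a batch script becomes its own job step, which is accounted for separately:

```
#!/bin/bash
#SBATCH -t 0-00:10          # walltime limit (10 minutes)
#SBATCH --mem=1G            # memory for the whole job

srun hostname               # job step 0
srun sleep 30               # job step 1
```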
In sacct or O2_jobs_report, jobs that use too much memory will have OUT_OF_MEMORY status. You can compare the reported memory usage (MaxRSS) from sacct/O2_jobs_report to the amount of memory you requested; you may notice that the reported MaxRSS is smaller than what you asked for. The memory accounting mechanism can miss a quick increase in memory usage, as it polls only at certain intervals. The mechanism that enforces memory limits (the Cgroups plugin) is much more sensitive to memory spikes, and can kill a job before the scheduler sees that too much memory was used. See the example error message below.
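To make that comparison for a completed job, a minimal sketch using standard sacct format fields (ReqMem is the requested memory, MaxRSS the peak usage the poller observed):
sacct -j <jobid> --format=JobID,State,ReqMem,MaxRSS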
...
This is a reporting function in Slurm, active as of March 2019. If you see this message, there are a couple of things to keep in mind. The first is that you are experiencing the same problem as above, with error: Exceeded job memory limit or the equivalent. The second is that, even though you are experiencing the same error, the reported memory usage in the sacct or O2_jobs_report information for that job may not necessarily exceed your memory request. This is because the mechanism that reports the oom-kill event operates at the kernel level (cgroups) and "checks" utilization far more frequently (near-constantly) than Slurm does (approximately every 30 seconds). When cgroups detects that a process has exceeded its allocation (even for a moment), the above message is sent to Slurm and the job is killed. The job is then given the OUT_OF_MEMORY state. However, cgroups does not pass Slurm the actual usage amount, which is why sacct or O2_jobs_report may not be accurate if the job is killed by cgroups in between scheduler polling periods.
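If your job is killed this way, the usual fix is simply to resubmit with a larger memory request; as a sketch (the 8G value is an arbitrary illustration, not a recommendation):
sbatch --mem=8G myjob.sh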
...
Your job will report different states before, during, and after execution. The most common ones are listed below, but this is not an exhaustive list. See Job State Codes in the squeue manual or the corresponding section in the sacct manual for more detail.
| Job State Long form | Job State Short form | Meaning |
|---|---|---|
| CANCELLED | CA | Job was killed, either by the user who submitted it, a system administrator, or by the Cgroups plugin (for using more resources than requested). |
| COMPLETED | CD | Job has ended in a zero exit status, and all processes from the job are no longer running. |
| COMPLETING | CG | This status differs from COMPLETED because some processes may still be running from this job. |
| FAILED | F | Job did not complete successfully, and ended in a non-zero exit status. |
| NODE_FAIL | NF | The node or nodes that the job was running on had a problem. |
| OUT_OF_MEMORY | OOM | The job tried to use more memory than was requested through the scheduler. |
| PENDING | PD | Job is queued, so it is not yet running. Look at the Jobs that never start section for details on why jobs can remain in the pending state. |
| RUNNING | R | Job has been allocated resources, and is currently running. |
| TIMEOUT | TO | Job exited because it reached its walltime limit. |
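To see these states for yourself, a quick sketch: squeue shows your pending and running jobs, while sacct also covers finished ones:
squeue -u $USER
sacct -j <jobid> --format=JobID,State,ExitCode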
Slurm Job Reasons
If your job is pending, the squeue command will show a "reason" why it is unable to run. Some of these reasons are detailed below, but please reference Job Reason Codes in the squeue manual for more detail.
| Job Reason | Meaning |
|---|---|
| QOSMaxMemoryPerUser | The job cannot run because you are currently using the maximum amount of memory allowed overall per user (12 TiB). A similar reason will be seen if you have hit the maximum number of cores allowed to be used at one time per user (1500 cores). |
| AssocGrpCPURunMinutesLimit | The lab might have reached its allocatable CPU-hour limit; this can happen if a few users in the lab have submitted thousands of medium-length jobs (2-5 days) or hundreds of longer jobs. |
| Dependency | The job can't start until a job dependency finishes. |
| JobHeldAdmin | The job will stay pending, as it has been held by an administrator. |
| JobHeldUser | The job will stay pending, as it has been held by the user. |
| NodeDown | A node that the job requires is in the "down" state, meaning that the node can't currently be used. |
| Priority | Your job has lower priority than others in the queue. The jobs with higher priority must dispatch first. |
| QOSMaxJobsPerUserLimit | This job is unable to run because you have submitted more jobs of a certain type than are allowed to run at one time (e.g. more than 2 jobs in the interactive partition, or more than 2 jobs in the priority partition). The "QOS" refers to "quality of service", through which these concurrent-job limits are enforced. |
| ReqNodeNotAvail | A node that the job requests cannot currently accept jobs. |
| ReqNodeNotAvail, Reserved for maintenance | This job reason is most commonly seen when there is an upcoming reservation for a maintenance window. Reservations are used to ensure that the required resources are available during a specific time frame. RC uses reservations to reserve all the nodes in the cluster during times when maintenance will be done, so no user jobs will be affected. |
| Resources | The required resources for running this job are not yet available. |
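To check the reason for a specific pending job, a sketch using squeue output format codes (%i is the job ID, %T the state, %r the reason):
squeue -j <jobid> -o "%i %T %r"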