...
Errors in submitting jobs
No O2/Slurm account
You may be able to log in to the O2 cluster with your eCommons credentials (e.g. if you already had an account with our previous cluster, Orchestra), but you will not be able to submit jobs if you do not have an O2 account. If this is the case, when you try to submit jobs, you'll see errors like:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
See this section on the Using Slurm Basic page for how to verify whether you have a Slurm account. To request an O2 account, please use the O2 Account Request Form. (Please note that this form requires an eCommons ID to access.)
Scheduler is busy
Sometimes, when the cluster is very heavily utilized, the Slurm controller will be busy and unable to accept your job submission request. If this is happening, your job submission commands will hang, and you may receive an error message like:
Unable to allocate resources: Socket timed out on send/recv operation
If this happens, wait a few minutes and try submitting your job again. Please contact us if the error persists.
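The "wait and try again" advice can be scripted. The following is a minimal sketch, not an official O2 tool: `retry` is a hypothetical helper name, and the attempt count and delay are arbitrary choices. Because a timed-out request may still have reached the scheduler, it may be worth checking `squeue` before resubmitting so that the job is not queued twice.

```shell
# Run a command until it succeeds, up to a maximum number of attempts.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_n=1
    until "$@"; do
        if [ "$retry_n" -ge "$retry_max" ]; then
            echo "giving up after $retry_n attempt(s)" >&2
            return 1
        fi
        echo "attempt $retry_n failed; waiting ${retry_delay}s" >&2
        retry_n=$((retry_n + 1))
        sleep "$retry_delay"
    done
}

# Example (myjob.sh is a hypothetical script name):
# try up to 5 times, waiting 3 minutes between attempts.
# retry 5 180 sbatch myjob.sh
```

If every attempt fails, the function returns a non-zero status, which is a reasonable point to contact Research Computing.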
Missing -t
...
If you ask for more memory than any node has, your job will not be submitted. You will need to reduce the amount of memory requested in your job submission (currently, it should be less than 250GB).
If you need to access nodes with high memory (>250GB), please contact us and we can grant you access to the associated highmem partition.
Incorrect job dependency syntax
...
The way you have specified a job dependency is incorrect. See this wiki page section on job dependencies for more information. Correct the syntax of the command, and resubmit the job.
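As an illustration of a well-formed dependency, here is a sketch using standard Slurm options (`--parsable` and `--dependency=afterok:<jobid>`). The helper names `parse_jobid` and `submit_chain`, and the script names `step1.sh` and `step2.sh`, are hypothetical. The wiki section on job dependencies remains the reference for what O2 supports.

```shell
# `sbatch --parsable` prints "<jobid>" or "<jobid>;<cluster>".
# Keep only the job ID part.
parse_jobid() {
    printf '%s\n' "${1%%;*}"
}

# Submit two jobs; the second starts only if the first finishes successfully.
# step1.sh and step2.sh are hypothetical script names.
submit_chain() {
    first=$(sbatch --parsable step1.sh) || return 1
    sbatch --dependency=afterok:"$(parse_jobid "$first")" step2.sh
}
```

Capturing the job ID from the first submission avoids typing it by hand, which is a common source of dependency syntax errors.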
...
If your job has a lower priority than other jobs in the queue, it will stay pending. In squeue, such jobs will have Priority reported as the REASON. You can also run sprio to see the factors that make up a job's scheduling priority. By default, sprio will show all jobs in the queue; you can limit this by running sprio -u <userid> (substitute <userid> with your eCommons ID). For more information on how Slurm scheduling and job priority work on O2, please see the Job Priority wiki page.
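To see at a glance why your pending jobs are waiting, you can tally the REASON column. The sketch below assumes input produced by `squeue -h -u <userid> -t PENDING -o "%i %r"` (job ID followed by reason, no header). `count_pending_reasons` is a hypothetical helper name.

```shell
# Read "<jobid> <reason>" lines on stdin and print "<count> <reason>" per reason.
count_pending_reasons() {
    awk '{
        sub(/^[^ ]+ +/, "")      # drop the job ID, keep the full reason text
        count[$0]++
    }
    END {
        for (r in count) print count[r], r
    }' | sort -k2
}

# Example (substitute <userid> with your eCommons ID):
# squeue -h -u <userid> -t PENDING -o "%i %r" | count_pending_reasons
```

A large count under Priority suggests looking at sprio next, while Resources simply means the requested resources are not free yet.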
Jobs that start running and then exit
...
- You underestimated how much work the program needed to do
- The job is running very inefficiently because you did not tell the program to use all the resources you requested. For example, you might give a 1-hour time limit for a 4-core job (sbatch -t 1:00:00 -c 4). You're planning to do about 4 hours of work in one "wall clock" hour, since 4 cores will be working on the job simultaneously. If you run the program with the wrong parameters (like forgetting to tell bowtie to use 4 threads), the program ends up running on just one core. After an hour, the program will only be about one quarter done.
- The job is running slowly because you forgot to request the correct number of cores in the job submission. For example, you requested only 1 core (sbatch -c 1), but directed your program to use more than one (telling Trimmomatic to use 6 threads). On O2, a job's CPU usage is strictly confined to the requested resources using the Slurm Cgroups plugin. If you observe a performance drop on O2 as compared to another cluster, it may be because your job was using multiple cores on the other cluster even if you did not explicitly request them. You won't get an error on O2 if you try to use more cores than you have allocated. However, your job will run slower than expected because, although it looks like it is using multiple cores, it is restricted to the one you requested.
- In other cases, another person's job on the same computer as yours might be taking up more resources than it's supposed to, or the storage system might be overloaded, making your job run slowly.
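One way to keep the requested cores and the program's thread count in agreement is to read the count from the environment instead of typing it twice. The sketch below relies on `SLURM_CPUS_PER_TASK`, which Slurm sets when `-c` is given. The `thread_count` helper and the commented bowtie command line are illustrative assumptions, not a recommended pipeline.

```shell
#!/bin/bash
#SBATCH -c 4            # request 4 cores for this job
#SBATCH -t 1:00:00      # 1 hour, written as hours:minutes:seconds

# Slurm sets SLURM_CPUS_PER_TASK when -c is given; fall back to 1 otherwise.
thread_count() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

echo "Using $(thread_count) thread(s)"

# Pass the same number to the program so the request and the usage agree.
# Hypothetical command line, shown for illustration only:
# bowtie -p "$(thread_count)" my_index reads.fq > out.sam
```

With this pattern, changing the `-c` value in one place updates both the allocation and the number of threads the program is told to use.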
...
In sacct, jobs that use too much memory will have CANCELLED, FAILED, or OUT_OF_MEMORY status. You can compare the reported memory usage (MaxRSS) from sacct to the amount of memory you requested; you may notice that the reported MaxRSS is smaller than what you asked for. The memory accounting mechanism can miss a quick increase in memory usage, as it polls only at certain intervals. The mechanism that enforces memory limits (the Cgroups plugin) is much more sensitive to memory spikes, and can kill a job before the scheduler sees that too much memory was used (see the example error message below).
...
(Note for advanced users: MaxVMSize and MaxRSS measure slightly different types of memory. In some cases, jobs will be killed because the MaxRSS goes above the requested memory. But these values are usually within 20% or so of each other. Contact Research Computing if you have detailed questions about this.)
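To compare requested and reported memory, you can pull both from sacct, for example with `sacct -j <jobid> --format=JobID,State,ReqMem,MaxRSS,MaxVMSize`, and normalize the units. The helpers below are a rough sketch: `to_mib` and `near_limit` are hypothetical names, the 90% threshold is arbitrary, and treating an unsuffixed value as bytes is an assumption. Because accounting can miss short spikes, a job may still be killed even when this check reports plenty of headroom.

```shell
# Convert a Slurm memory value such as 2048K, 512M or 4G to whole MiB.
to_mib() {
    printf '%s\n' "$1" | awk '{
        v = $1
        sub(/[nc]$/, "", v)          # some Slurm versions append n or c to ReqMem
        unit = substr(v, length(v), 1)
        num = v + 0                  # awk keeps the leading numeric part
        if      (unit == "K") mib = num / 1024
        else if (unit == "M") mib = num
        else if (unit == "G") mib = num * 1024
        else if (unit == "T") mib = num * 1024 * 1024
        else                  mib = num / 1048576   # assumption: no suffix means bytes
        printf "%.0f\n", mib
    }'
}

# Succeed if reported peak usage is at least 90% of the request.
# Usage: near_limit <ReqMem> <MaxRSS>
near_limit() {
    req=$(to_mib "$1")
    used=$(to_mib "$2")
    [ $((used * 10)) -ge $((req * 9)) ]
}

# Example: near_limit 4G 3900M && echo "consider requesting more memory"
```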
Alternatively, you may begin seeing something like the following:
Detected (n) oom-kill event(s) in step (jobid).batch
This is a new reporting function in Slurm, active as of March 2019. If you see this message, there are a couple of things to keep in mind. The first is that you are experiencing the same problem as above, with error: Exceeded job memory limit or the equivalent. The second is that, even though you are experiencing the same error, the sacct information for that job may not report usage that exceeds your memory request. This is because the mechanism that reports the oom-kill event operates at the kernel level (cgroups) and "checks" utilization far more frequently (near-constantly) than Slurm does (approximately every 30 seconds). When cgroups detects that a process has exceeded its allocation (even for a moment), the above message is sent to Slurm, and the job is killed. The job is then given the OUT_OF_MEMORY state. However, cgroups does not pass Slurm the actual usage amount, which is why sacct may not be accurate if the job is killed by cgroups in between scheduler polling periods.
If you are using job steps in your submission, you'll be able to identify exactly which step triggered the event. The solution is still the same as above if you see this error: resubmit the job (step) with increased memory requirements, and (eventually) the problem should go away. If you continue to receive this error and you think it is incorrect, please contact Research Computing, but do keep in mind that programs can have unpredictable memory usage patterns.
...
Sometimes the problem is from the program itself, and has nothing to do with Slurm. Usually, you will get an error message in the file(s) you specified with the -o and/or -e parameters. You can also look at the accounting information (through sacct) for your job. If you are unable to figure out why the program exited, you can contact Research Computing. Depending on the program and error, we might or might not be able to diagnose and/or fix the problem.
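When looking at the accounting information, the ExitCode field from sacct is a useful first clue. Slurm reports it as `<exit status>:<signal>`. The sketch below interprets that pair; `describe_exit` is a hypothetical helper name, and its wording is only a rough guide.

```shell
# Interpret a sacct ExitCode value of the form "<exit status>:<signal>".
describe_exit() {
    code=${1%%:*}
    signal=${1##*:}
    if [ "$signal" != "0" ]; then
        echo "terminated by signal $signal"
    elif [ "$code" != "0" ]; then
        echo "program exited with status $code"
    else
        echo "exited normally"
    fi
}

# Example (substitute <jobid> with your job ID):
# describe_exit "$(sacct -j <jobid> -n -X --format=ExitCode | tr -d ' ')"
```

A non-zero exit status with no signal usually points at the program itself, so the -o and -e files are the next place to look.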
...
Job State Long form | Job State Short form | Meaning |
---|---|---|
CANCELLED | CA | Job was killed, either by the user who submitted it, a system administrator, or by the Cgroups plugin (for using more resources than requested). |
COMPLETED | CD | Job has ended with a zero exit status, and all processes from the job are no longer running. |
COMPLETING | CG | Job is in the process of finishing. This status differs from COMPLETED because some processes from the job may still be running. |
FAILED | F | Job did not complete successfully, and ended in a non-zero exit status. |
NODE_FAIL | NF | The node or nodes that the job was running on had a problem. |
OUT_OF_MEMORY | OOM | The job tried to use more memory than was requested through the scheduler. |
PENDING | PD | Job is queued, so it is not yet running. Look at the Jobs that never start section for details on why jobs can remain in pending state. |
RUNNING | R | Job has been allocated resources, and is currently running. |
TIMEOUT | TO | Job exited because it reached its walltime limit. |
...
Job Reason | Meaning |
---|---|
| The job cannot run because you are currently using the maximum amount of memory allowed overall per user (12TB). A similar reason will be seen if you have hit the maximum number of cores allowed to be used at one time per user (1500 cores). |
Dependency | The job can't start until a job dependency finishes. |
| The job will stay pending, as it has been held by an administrator. |
JobHeldUser | The job will stay pending, as it has been held by the user. |
NodeDown | A node that the job requires is in "down" state, meaning that the node can't be currently used. |
Priority | Your job has lower priority than others in the queue. The jobs with higher priority must dispatch first. |
| This job is unable to run because you have submitted more jobs of a certain type (e.g. >2 jobs in the interactive partition, or >2 jobs in the priority partition) than are allowed to run at one time. "QOS" refers to "quality of service", through which the number of concurrent jobs is limited. For example, you will see this reason if you try to have more than two jobs running at one time in the interactive partition. |
ReqNodeNotAvail | A node that the job requested cannot currently accept jobs. ReqNodeNotAvail is a generic reason; in the simplest case, it means that the node is fully in use and is unable to run any more jobs. In other scenarios, this reason can indicate that there is a problem with the node. |
ReqNodeNotAvail, UnavailableNodes: | This job reason is most commonly seen when there is an upcoming reservation for a maintenance window. Reservations are used to ensure that the required resources are available during a specific time frame. RC uses reservations to reserve all the nodes in the cluster during times when maintenance will be done, so no user jobs will be affected. |
Resources | The required resources for running this job are not yet available. |
...