...

  • You underestimated how much work the program needed to do
  • You thought you were requesting more time than you did. A common pitfall is requesting time in minutes with -t when you meant to request hours. For example, -t 10 means 10 minutes, not 10 hours. The valid Slurm time limit formats are "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes", and "days-hours:minutes:seconds".
  • The job is running very inefficiently because you did not tell the program to use all the resources you requested. For example, you might give a 1-hour time limit for a 4-core job (sbatch -t 1:00:00 -c 4). You're planning to do about 4 hours of work in one "wall clock" hour, since 4 cores will be working on the job simultaneously. If you run the program with the wrong parameters (like forgetting to tell bowtie to use 4 threads), the program ends up running on just one core. After an hour, the program will only be about one quarter done (see the sketch after this list).
  • The job is running slowly because you forgot to request the correct number of cores in the job submission. For example, you requested only 1 core (sbatch -c 1), but directed your program to use more than one (telling Trimmomatic to use 6 threads). On O2, a job's CPU usage is strictly confined to the requested resources using the Slurm Cgroups plugin. If you observe a performance drop on O2 compared to another cluster, that may be because your job was using multiple cores on the other cluster even though you did not explicitly request them. You won't get an error on O2 if you try to use more cores than you have allocated. However, your job will run slower than expected, because although it looks like it is using multiple cores, it is restricted to the ones you requested.
  • In other cases, another person's job on the same computer as yours might be taking up more resources than it's supposed to, or the storage system might be overloaded, making your job run slowly.
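As a rough sketch (the script name, index, and file names below are placeholders, not O2-specific recommendations), a submission that avoids the first two pitfalls spells out the time limit in full and tells the program to use the same number of cores that was requested:

    #!/bin/bash
    #SBATCH -t 1:00:00   # 1 hour written as hours:minutes:seconds; "-t 1:00" would mean 1 minute
    #SBATCH -c 4         # request 4 cores
    # hg19_index and reads.fq are placeholder inputs; bowtie's -p 4 matches the -c 4 request above
    bowtie -p 4 hg19_index reads.fq > alignments.txt

Submitted with sbatch align.sh (a hypothetical file name), this keeps the requested cores and the program's thread count in agreement; the same idea applies to Trimmomatic's -threads option.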

...

Slurm allows you to have "job steps", which are tasks that are part of a job (See the official Slurm Quick Start Guide for more information). By default, there will be one job step per job. Depending on which memory limit your job exceeds (job limit or step limit), you will see one of the above messages.
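As a brief illustration (the script and command names are placeholders), each srun invocation inside a batch script becomes its own job step, which sacct or O2sacct then reports separately (for example as JOBID.batch, JOBID.0, JOBID.1):

    #!/bin/bash
    #SBATCH -t 0:30:00
    #SBATCH -c 1
    srun ./preprocess.sh   # job step 0
    srun ./analyze.sh      # job step 1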

In sacct or O2sacct, jobs that use too much memory will have CANCELLED, FAILED, or OUT_OF_MEMORY status. You can compare the reported memory usage (MaxRSS) from sacct/O2sacct to the amount of memory you requested; you may notice that the reported MaxRSS is smaller than what you asked for. The memory accounting mechanism can miss a quick increase in memory usage, because it polls only at certain intervals. The mechanism that enforces memory limits (the Cgroups plugin) is much more sensitive to memory spikes and can kill a job before the scheduler sees that too much memory was used. See the error message example below.
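To make that comparison for a completed job, a query along these lines (the job ID here is just an example) lists the final state alongside the requested and measured memory:

    sacct -j 1234567 --format=JobID,State,ReqMem,MaxRSS,MaxVMSize,Elapsed

As noted above, a MaxRSS value below the request does not rule out an out-of-memory kill, since the accounting poll can miss a brief spike.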

...

(Note for advanced users: MaxVMSize and MaxRSS measure slightly different types of memory. In some cases, jobs will be killed because the MaxRSS goes above the requested memory. But these values are usually within 20% or so of each other. Contact Research Computing if you have detailed questions about this.)

...

This is a reporting function that has been active in Slurm since March 2019. If you see this message, there are a couple of things to keep in mind. The first is that you are experiencing the same problem as above, with error: Exceeded job memory limit or the equivalent. The second is that even though you are experiencing the same error, the sacct or O2sacct information for that job may not necessarily report usage that exceeds your memory request. This is because the mechanism that reports the oom-kill event is at the kernel level (cgroups) and "checks" utilization far more frequently (near-constantly) than Slurm does (approximately every 30 seconds). When cgroups detects that a process has exceeded its allocation (even for a moment), the above message is sent to Slurm and the job is killed. The job is then given the OUT_OF_MEMORY state. However, cgroups does not pass Slurm the actual usage amount, which is why sacct or O2sacct may not be accurate if the job is killed by cgroups in between scheduler polling periods.
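In practice (a sketch rather than site-specific guidance), the job's final state is therefore a better indicator than the reported MaxRSS; if the state is OUT_OF_MEMORY, resubmitting with a larger memory request is the usual next step, for example:

    sacct -j 1234567 --format=JobID,State,MaxRSS   # State will show OUT_OF_MEMORY
    sbatch --mem=8G job.sh                         # placeholder script and memory value; choose a request comfortably above the previous one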

...