NOTICE: FULL O2 Cluster Outage, January 3 - January 10

O2 will be completely offline for a planned HMS IT data center relocation from Friday, Jan 3, at 6:00 PM through Friday, Jan 10.

  • On Jan 3 (5:30-6:00 PM): O2 login access will be turned off.
  • On Jan 3 (6:00 PM): O2 systems will start being powered off.

This project will relocate existing services, consolidate servers, reduce power consumption, and decommission outdated hardware to improve efficiency, enhance resiliency, and lower costs.

Specifically:

  • The O2 Cluster will be completely offline, including O2 Portal.
  • All data on O2 will be inaccessible.
  • Any jobs still pending when the outage begins will need to be resubmitted after O2 is back online.
  • Websites on O2 will be completely offline, including all web content.

More details at: https://harvardmed.atlassian.net/l/cp/1BVpyGqm & https://it.hms.harvard.edu/news/upcoming-data-center-relocation

Troubleshooting Slurm Jobs

Errors in submitting jobs

No O2/Slurm account

You may be able to log in to the O2 cluster with your HMS account credentials (formerly called an HMS eCommons ID) if you had an account on our previous cluster, Orchestra; however, you will not be able to submit jobs if you do not have an O2 account. In that case, when you try to submit jobs, you'll see errors like:

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

To request an O2 account, please use the O2 Account Request Form. Please note that this form requires an HMS account to access. If you want to register for an HMS account or need to manage your existing account, please go to https://harvardmed.service-now.com/stat?id=account.

Scheduler is busy

Sometimes, when the cluster is very heavily utilized, the Slurm controller will be busy and unable to accept your job submission request. If this is happening, your job submission commands will hang, and you may receive an error message like:

Unable to allocate resources: Socket timed out on send/recv operation

If this happens, wait a few minutes and try submitting your job again. Please contact us if the error persists.

Missing -t

Job not submitted: please specify runtime limit with -t.

Requested time limit is invalid (missing or exceeds some limit)

See Using Slurm Basic. All job submission commands need an estimate of how long the job will run, specified with the -t parameter. Your job will be killed once it hits this limit, so you may want to slightly overestimate.
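
For example, a submission with a runtime limit might look like the following sketch (the partition, runtime, and script name my_script.sh are placeholders):

sbatch -p short -t 6:00:00 my_script.sh    # requests a 6-hour runtime limit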

Missing -p 

Job not submitted: please specify partition with -p.

Unable to allocate resources: Invalid partition name specified

See Using Slurm Basic. All job submission commands need to include a partition to run the job in, specified with the -p parameter. For more details on which partition to choose, reference How to choose a partition in O2.
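
As a sketch (runtime, memory, and script name are placeholders), a batch job and a typical interactive session request might look like:

sbatch -p medium -t 2-00:00 my_script.sh
srun --pty -p interactive -t 0-12:00 --mem=8G bash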

Runtime is too short for the specified partition

Job not submitted: please submit jobs that are less than 12 hours long to the short (or priority) partition.

or 

Job not submitted: please submit jobs that are less than 5 days long to the medium (or priority) partition.

If you submit a job to the wrong partition (based on the runtime of your job), you will encounter these errors. See How to choose a partition in O2 for details on the maximum runtime limit of each partition.

For example, if you submit a job to the medium or the long partition, but the runtime limit you specified using -t is less than the maximum run time of the short partition (12 hours), your job will not start. You should use the short partition instead of medium or long.

You will also see an error message if you submit a job to the long partition when the runtime is shorter than the maximum runtime limit of the medium partition (5 days). In this case, you should submit the job to the medium partition instead.
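
For example (script names are placeholders), matching the requested runtime to the partition avoids these errors:

sbatch -p short -t 6:00:00 my_script.sh     # under 12 hours: short
sbatch -p medium -t 2-00:00 my_script.sh    # between 12 hours and 5 days: medium
sbatch -p long -t 7-00:00 my_script.sh      # over 5 days: long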

Wrong partition for batch job

Batch job not submitted: not permitted on this partition: interactive

Batch job submission failed: Invalid partition name specified

Jobs that are not interactive, or batch jobs, should not be submitted to the interactive partition. Submit your batch job to another partition, such as short, medium, long, priority, or mpi (if applicable), instead.

Too many CPUs requested

Job not submitted: jobs may not be submitted with greater than 20 cores to short partition.

Currently, the maximum number of cores that can be requested in any partition (besides mpi) is 20; this "jobs may not be submitted with greater than 20 cores" error can occur for jobs submitted to any of the following queues: short, medium, long, priority and interactive. To avoid this error, simply reduce the number of cores requested with the -n parameter in your job submission.
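
For instance (script name is a placeholder), a request that stays at or below the 20-core limit using the -n parameter mentioned above:

sbatch -p short -t 12:00:00 -n 20 my_script.sh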

Too much memory requested

Memory specification can not be satisfied

Batch job submission failed: Requested node configuration is not available

If you ask for more memory than any node has, your job will not be submitted. You will need to reduce the amount of memory requested in your job submission (currently, requests should be less than 250GiB).
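
For example (values and script name are placeholders), a request that stays below the per-node limit:

sbatch -p medium -t 1-00:00 --mem=200G my_script.sh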

If you need to access nodes with high memory (>250GiB), please contact us and we can grant you access to the associated highmem partition.

Incorrect job dependency syntax

Job dependency problem

The way you have specified a job dependency is incorrect. See this wiki page section on job dependencies for more information. Correct the syntax of the command, and resubmit the job.
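
As a sketch (the job ID and script name are placeholders), a job that should start only after job 12345 finishes successfully can be submitted with:

sbatch --dependency=afterok:12345 -p short -t 1:00:00 downstream_step.sh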

Jobs that never start

Requested resources not available

If your jobs requested many cores or a large amount of memory, they may not start running very quickly. You can run squeue -u <userid> -t PD (substitute <userid> with your HMS account ID) to see the REASON why your jobs are not running. If the REASON seen in squeue is Resources, then the resources you requested in the job submission are not yet available. You may also see the ReqNodeNotAvail job reason if you requested that your job be run on a node that is not available (other jobs are running there, or the node is offline).
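
For example (the format string is only one possibility), the following shows the job ID, partition, name, state, and REASON for your pending jobs:

squeue -u <userid> -t PD -o "%.10i %.9P %.20j %.2t %.30R"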

Upcoming maintenance window / reservation

Prior to a planned outage, Research Computing will create a reservation covering all O2 nodes during that time so maintenance can occur. If a pending job's requested runtime would overlap the maintenance window (that is, it could not finish before the maintenance begins), it will stay pending; the job will dispatch after the maintenance period is over. In squeue, such jobs that cannot run due to an upcoming maintenance window will show the ReqNodeNotAvail, UnavailableNodes: reason.
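
If you want to check for an upcoming maintenance reservation yourself, a standard Slurm command such as the following lists any active or scheduled reservations:

scontrol show reservation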

Dependency has not been satisfied

You set a dependency for your job with the --dependency parameter that has not been met yet. For example, your job will not start until another job completes. In squeue, the REASON why this job is pending is reported as Dependency. 

Dependency will never be satisfied

You set a dependency for your job with the --dependency parameter that will never be met. For example, your job will not start until another job successfully completes; if that previous job failed, the current job will never run. In squeue, the REASON why this job is pending is reported as DependencyNeverSatisfied. Our configuration normally removes jobs whose dependencies can never be fulfilled, but if that does not happen, you will see jobs with invalid dependencies in squeue and will need to cancel them manually.
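
If you do need to remove such a job yourself, it can be cancelled with (job ID is a placeholder):

scancel <jobid>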

Priority

If your job has a lower priority than other jobs in the queue, it will stay pending. In squeue, such jobs will have Priority reported as the REASON. You can also run sprio to see the factors that make up a job's scheduling priority; by default, sprio will show all jobs in the queue, but you can limit this by running sprio -u <userid> (substitute <userid> with your HMS account ID). For more information on how Slurm scheduling and job priority work on O2, please see the Job Priority wiki page.

Jobs that start running and then exit

Jobs can exit for a number of reasons, such as an error in the code, exceeding a time or resource limit you specified, or a problem with the node the job was running on.

Unfortunately, O2 job emails are not very detailed, but they do report the exit status of your job (e.g. COMPLETED or FAILED). It can be difficult to interpret how a job ran with the limited information O2 job emails contain. We suggest using the O2_jobs_report command or examining output and error files (specified through the -o and -e parameters in your job submission) instead. 

To look up information for a completed job:

O2_jobs_report -j <jobid>

See the page on O2_jobs_report for more details on using this command.

Exceeded run time

If your job runs beyond the "wall clock" time limit you requested with -t in your job submission command, then it will be killed. This can occur if:

  • You underestimated how much work the program needed to do

  • You thought you were requesting more time than you did. A common pitfall is requesting time in minutes with -t, when you thought you were requesting in hours. For example, -t 10 means 10 minutes, not 10 hours. The valid time limit formats for Slurm are "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". 

  • The job is running very inefficiently because you did not tell the program to use all the resources you requested. For example, you might give a 1-hour time limit for a 4-core job (sbatch -t 1:00:00 -c 4). You're planning to do about 4 hours of work in one "wall clock" hour, since 4 cores will be working on the job simultaneously. If you run the program with the wrong parameters (like forgetting to tell bowtie to use 4 threads), the program ends up running on just one core. After an hour, the program will only be about one quarter done.

  • The job is running slowly because you forgot to request the correct number of cores in the job submission. For example, you requested only 1 core (sbatch -c 1), but directed your program to use more than one (e.g. telling Trimmomatic to use 6 threads). On O2, a job's CPU usage is strictly confined to the requested resources using the Slurm Cgroups plugin. If you observe a performance decrease on O2 as compared to another cluster, that may be because your job was using multiple cores on the other cluster even if you did not explicitly request them. You won't get an error on O2 if you try to use more cores than you have allocated. However, your job will run slower than expected, because although it looks like it is using multiple cores, it is restricted to the one you requested.

  • In other cases, another person's job on the same computer as yours might be taking up more resources than it's supposed to, or the storage system might be overloaded, making your job run slowly.

The TIMEOUT state will be reported for such jobs in sacct. You may see an error like this:

slurmstepd: error: *** JOB 1384071 ON compute-a-16-87 CANCELLED AT 2017-05-22T16:50:20 DUE TO TIME LIMIT ***

Resubmit your job with an increased time limit requested with the -t parameter.
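
For example (script name is a placeholder), each of the following requests a 12-hour limit, using different valid time formats:

sbatch -p short -t 720 my_script.sh         # 720 minutes
sbatch -p short -t 12:00:00 my_script.sh    # hours:minutes:seconds
sbatch -p short -t 0-12 my_script.sh        # days-hours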

Exceeded requested memory

If your job uses more memory than you requested using the --mem-per-cpu or --mem parameters in your job submission, it will be killed. You may see errors like this:

slurmstepd: error: Exceeded job memory limit

or 

slurmstepd: error: Exceeded step memory limit at some point.

Slurm allows you to have "job steps", which are tasks that are part of a job (See the official Slurm Quick Start Guide for more information). By default, there will be one job step per job. Depending on which memory limit your job exceeds (job limit or step limit), you will see one of the above messages.

In sacct or O2_jobs_report, jobs that use too much memory will have OUT_OF_MEMORY status. You can compare the reported memory usage (MaxRSS) from sacct/O2_jobs_report to the amount of memory you requested; you may notice that the reported MaxRSS is smaller than what you asked for. The memory accounting mechanism can miss a quick increase in memory usage, as it polls only at certain intervals. The mechanism that enforces memory limits (the Cgroups plugin) is much more sensitive to memory spikes, and can kill a job before the scheduler sees that too much memory was used. See the oom-kill error message example below.

To avoid this error, resubmit your job with an increased amount of memory requested with the --mem-per-cpu or --mem parameters.
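
For example (values and script name are placeholders), note that --mem requests total memory per node (the job total for a single-node job), while --mem-per-cpu is multiplied by the number of requested cores:

sbatch -p short -t 6:00:00 -c 1 --mem=16G my_script.sh            # 16GiB total
sbatch -p short -t 6:00:00 -c 4 --mem-per-cpu=8G my_script.sh     # 4 x 8GiB = 32GiB total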

(Note for advanced users: MaxVMSize and MaxRSS measure slightly different types of memory. In some cases, jobs will be killed because the MaxRSS goes above the requested memory. But these values are usually within 20% or so of each other. Contact Research Computing if you have detailed questions about this.)

Alternatively, you may begin seeing something like the following:

Detected (n) oom-kill event(s) in step (jobid).batch

This is a reporting function in Slurm that has been active since March 2019. If you see this message, there are a couple of things to keep in mind. The first is that you are experiencing the same problem as above, with error: Exceeded job memory limit or the equivalent. The second is that even though you are experiencing the same error, the reported memory usage in the sacct or O2_jobs_report information for that job may not necessarily exceed your memory request. This is because the mechanism that reports the oom-kill event operates at the kernel level (cgroups) and "checks" utilization far more frequently (near-constantly) than Slurm does (approximately every 30 seconds). When cgroups detects that a process has exceeded its allocation (even for a moment), the above message is sent to Slurm, and the job is killed. The job is then given the OUT_OF_MEMORY state. However, cgroups does not pass Slurm the actual usage amount, which is why sacct or O2_jobs_report may not be accurate if the job is killed by cgroups in between scheduler polling periods.

If you are using job steps in your submission, you'll be able to identify exactly which step triggered the event. The solution is still the same as above if you see this error: resubmit the job (step) with increased memory requirements, and (eventually) the problem should go away. If you continue to receive this error and you think it is incorrect, please contact Research Computing, but do keep in mind that programs can have unpredictable memory usage patterns.

Errors in the code

Sometimes the problem is from the program itself, and has nothing to do with SLURM. Usually, you will get an error message in the file(s) you wrote with the -o and/or -e parameters. You can also look at the accounting information (through sacct) for your job. If you are unable to figure out why the program exited, you can contact Research Computing. Depending on the program and error, we might or might not be able to diagnose and/or fix the problem.
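
For example (the job ID is a placeholder, and the field list is just one reasonable choice), a sacct query that pulls the accounting fields most useful for debugging might look like:

sacct -j <jobid> --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS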

Jobs fail with the message: Unable to satisfy CPU bind request

The current version of Slurm (23.02.3) does not allow running sbatch jobs that contain srun, mpirun, or mpiexec commands such as

sbatch -p partition -t DD-HH:MM --wrap="srun <your command>"

or an equivalent sbatch script, if those jobs are submitted from within an interactive srun job. In this case, a conflict between Slurm environment variables would cause the job to fail with an error like:

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x001A800.
srun: error: Task launch for StepId=12345.0 failed on node compute-e-16-182: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted

To prevent those errors, you could remove the srun command or submit the sbatch+srun jobs from a login node instead of an interactive job.
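
For example (partition, runtime, and command are placeholders), dropping the inner srun from the submission shown above avoids the conflict:

sbatch -p short -t 0-01:00 --wrap="<your command>"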

Slurm Job States

Your job will report different states before, during, and after execution. The most common ones are seen below, but this is not an exhaustive list. Look at Job State Codes in the squeue manual or this section in the sacct manual for more detail.

Job State (short form): Meaning

  • CANCELLED (CA): Job was killed, either by the user who submitted it, a system administrator, or by the Cgroups plugin (for using more resources than requested).
  • COMPLETED (CD): Job has ended with a zero exit status, and all processes from the job are no longer running.
  • COMPLETING (CG): This status differs from COMPLETED because some processes may still be running from this job.
  • FAILED (F): Job did not complete successfully, and ended with a non-zero exit status.
  • NODE_FAIL (NF): The node or nodes that the job was running on had a problem.
  • OUT_OF_MEMORY (OOM): The job tried to use more memory than was requested through the scheduler.
  • PENDING (PD): Job is queued, so it is not yet running. Look at the Jobs that never start section for details on why jobs can remain in the pending state.
  • RUNNING (R): Job has been allocated resources, and is currently running.
  • TIMEOUT (TO): Job exited because it reached its walltime limit.

Slurm Job Reasons

If your job is pending, the squeue command will show a "reason" why it is unable to run. Some of these reasons are detailed below, but please reference Job Reason Codes in the squeue manual for more detail. 

Job Reason: Meaning

  • AssocGrpMemLimit: The job cannot run because you are currently using the maximum amount of memory allowed overall per user (12TiB). A similar reason will be seen if you have hit the maximum number of cores allowed to be in use at one time per user (1500 cores).
  • AssocGrpCPURunMinutesLimit: The lab might have reached its allocatable CPU-hour limit; this can happen if a few users in the lab have thousands of medium jobs (2-5 days) or hundreds of longer jobs allocated.
  • Dependency: The job can't start until a job dependency finishes.
  • JobHeldAdmin: The job will stay pending, as it has been held by an administrator.
  • JobHeldUser: The job will stay pending, as it has been held by the user.
  • NodeDown: A node that the job requires is in the "down" state, meaning that the node can't currently be used.
  • Priority: Your job has lower priority than others in the queue. The jobs with higher priority must dispatch first.
  • QOSMaxJobsPerUserLimit: This job is unable to run because you have submitted more jobs of a certain type (e.g. more than 2 jobs in the interactive partition, or more than 2 jobs in the priority partition) than are allowed to run at one time. The "QOS" refers to "quality of service", through which the number of concurrent jobs is limited. For example, you will see this reason if you try to have more than two jobs running at one time in the interactive partition.
  • ReqNodeNotAvail: A node that the job requested cannot currently accept jobs. ReqNodeNotAvail is a generic reason; in the simplest case, it means that the node is fully in use and is unable to run any more jobs. In other scenarios, this reason can indicate that there is a problem with the node.
  • ReqNodeNotAvail, UnavailableNodes: This job reason is most commonly seen when there is an upcoming reservation for a maintenance window. Reservations are used to ensure that the required resources are available during a specific time frame. RC uses reservations to reserve all the nodes in the cluster during times when maintenance will be done, so no user jobs will be affected.
  • Resources: The required resources for running this job are not yet available.