
...

You may be able to log in to the O2 cluster with your HMS account credentials (formerly called an HMS eCommons ID); however, you will not be able to submit jobs if you do not have an O2 account. If this is the case, when you try to submit jobs, you'll see errors like:

...

To request an O2 account, please use the O2 Account Request Form. (Please note that this form requires an HMS account to access.) If you want to register for an HMS account or need to manage your existing account, please go to https://harvardmed.service-now.com/stat?id=account.

Scheduler is busy

Sometimes when the cluster is very heavily utilized, the Slurm controller will be busy and unable to accept your job submission request. If this is happening, you will notice that your job submission commands hang, or you may receive an error message like:

...

If you ask for more memory than any node has, your job will not be submitted. You will need to reduce the amount of memory requested in your job submission (currently, this should be less than 250GiB).
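For example, a batch submission that stays under that limit might look like the following (a sketch only; the partition, time limit, memory value, and script name are placeholders to adjust for your job):

sbatch -p short -t 12:00:00 --mem=64G my_script.sh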

If you need to access nodes with high memory (>250GiB), please contact us and we can grant you access to the associated highmem partition.

...

Jobs can exit for a number of reasons, such as an error in the code, exceeding a time or resource limit you specified, or a problem with the node the job was running on.

Unfortunately, O2 job emails are not very detailed, but they do report the exit status of your job (e.g. COMPLETED or FAILED). It can be difficult to interpret how a job ran with the limited information O2 job emails contain. We suggest using the O2sacct command or examining output and error files (specified through the -o and -e parameters in your job submission) instead.
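For example, a submission that writes separate output and error files could look like this (a sketch; %j expands to the job ID, and the file names and script name are placeholders):

sbatch -o myjob_%j.out -e myjob_%j.err my_script.sh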

To look up information for a completed job:

O2sacct -j <jobid>

By default, sacct reports only a few fields, so you may want to use the --format parameter for additional fields. See the page on O2sacct for more details on using this command.
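For example, to report a few commonly useful fields for a completed job (a sketch using standard sacct field names; the job ID is a placeholder):

sacct -j 12345678 --format=JobID,JobName,State,Elapsed,ReqMem,MaxRSS,ExitCode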

Exceeded run time

If your job runs beyond the "wall clock" time limit you requested with -t in your job submission command, then it will be killed. This can occur if:

...

In sacct, jobs that use too much memory will have CANCELLED, FAILED, or OUT_OF_MEMORY status. You can compare the reported memory usage (MaxRSS) from sacct to the amount of memory you requested; you may notice that the reported MaxRSS is smaller than what you asked for. The memory accounting mechanism can miss a quick increase in memory usage, as it polls only at certain intervals. The mechanism that enforces memory limits (the Cgroups plugin) is much more sensitive to memory spikes, and can kill a job before the scheduler sees that too much memory was used (see the error message example below).

To avoid this error, resubmit your job with an increased amount of memory requested with the --mem-per-cpu or --mem parameters.
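For example, if a job was killed after exceeding a 4GiB request, you might resubmit with more headroom using either parameter (a sketch; the values and script name are placeholders):

sbatch --mem=8G my_script.sh
sbatch -c 4 --mem-per-cpu=2G my_script.sh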

...

Job Reason and Meaning:

AssocGrpMemLimit
The job cannot run because you are currently using the maximum amount of memory allowed overall per user (12TiB). A similar reason will be seen if you have hit the maximum number of cores allowed to be used at one time per user (1500 cores).

Dependency
The job can't start until a job dependency finishes.

JobHeldAdmin
The job will stay pending, as it has been held by an administrator.

JobHeldUser
The job will stay pending, as it has been held by the user.

NodeDown
A node that the job requires is in "down" state, meaning that the node can't currently be used.

Priority
Your job has lower priority than others in the queue. The jobs with higher priority must dispatch first.

QOSMaxJobsPerUserLimit
This job is unable to run because you have submitted more jobs of a certain type than are allowed to run at one time (e.g. more than 2 jobs in the interactive partition, or more than 2 jobs in the priority partition). The "QOS" refers to "quality of service", through which the number of concurrent jobs is limited.

ReqNodeNotAvail
A node that the job requested cannot currently accept jobs. ReqNodeNotAvail is a generic reason; in the simplest case, it means that the node is fully in use and is unable to run any more jobs. In other scenarios, this reason can indicate that there is a problem with the node.

ReqNodeNotAvail, UnavailableNodes:
This job reason is most commonly seen when there is an upcoming reservation for a maintenance window. Reservations are used to ensure that the required resources are available during a specific time frame. RC uses reservations to reserve all the nodes in the cluster during times when maintenance will be done, so that no user jobs will be affected.

Resources
The required resources for running this job are not yet available.
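To see which of these reasons applies to a pending job, you can print the reason column with squeue (a sketch using standard squeue format options):

squeue -u $USER -t PENDING -o "%.12i %.10P %.8T %r"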

...