A job's priority in O2 is influenced mostly by two parameters: the user's recent cluster utilization and the job's pending time. The more resources a user consumes (CPU, memory, GPU, etc.), the longer additional jobs might pend. The longer a job pends, the higher its priority becomes. The paragraphs below contain more details about the scheduler algorithm.
Job priority calculation
SLURM computes the overall priority of each job based on six factors: job age, user fairshare, job size, partition, QOS, TRES. The six factors can have values between 0 and 1 and are calculated as described below:
Age = The value is based on the job's pending time (since it became eligible), normalized against the PriorityMaxAge parameter. PriorityMaxAge is currently set to 7-00:00:00 (7 days).
JobSize = The job size factor correlates with the number of nodes or CPUs the job has requested: the larger the job, the closer the JobSize factor is to 1. Currently the contribution from this factor is negligible.
Partition = The value is calculated as the ratio between the priority of the partition requested by the job and the maximum partition priority. Currently the maximum partition priority is set to 14 (for the partition interactive).
QOS = The Quality of Service factor is calculated as the ratio between the job's QOS priority and the maximum QOS priority. By default each job is submitted with the QOS "normal", which has a priority value of zero.
TRES = Not currently active; should always be zero.
FairShare = This value is proportional to the ratio between the resources available to each user and the amount of resources that has been consumed by the user submitting the job; see below for details.
Each of these factors is then scaled by a custom multiplier to obtain the overall JobPriority value, according to the formula:
where the multipliers are currently set to the following values:
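As a rough illustration of how the six factors and their multipliers combine, the sketch below assumes the usual Slurm multifactor form, in which the job priority is the sum of each factor times its multiplier. The multiplier values in `HYPOTHETICAL_WEIGHTS` and the example factor values are placeholders chosen for illustration only; they are not the values configured on O2.

```python
# Sketch of a weighted priority sum. All weights below are HYPOTHETICAL
# placeholders and do not reflect the multipliers configured on O2.
HYPOTHETICAL_WEIGHTS = {
    "age": 1000,
    "fairshare": 10000,
    "jobsize": 10,
    "partition": 500,
    "qos": 2000,
    "tres": 0,
}


def job_priority(factors, weights=HYPOTHETICAL_WEIGHTS):
    """Return the sum of multiplier * factor; each factor must be in [0, 1]."""
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"factor {name!r} must be between 0 and 1")
    return sum(weights[name] * factors.get(name, 0.0) for name in weights)


# Made-up factor values for a single pending job.
example_factors = {
    "age": 0.5,
    "fairshare": 0.75,
    "jobsize": 0.0,
    "partition": 0.5,
    "qos": 0.0,
    "tres": 0.0,
}
print(job_priority(example_factors))
```

Because every factor is bounded by 1, the multiplier of a factor is also the largest number of priority points that factor can contribute.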
S is the normalized number of shares made available to each user. In our current setup all users get the same number of raw shares.
U is the normalized usage. It is calculated as U = Uh / Rh, where Uh is the user's historical usage subject to the half-life decay and Rh is the total historical usage across the cluster, also subject to the half-life decay.
The decay periods are based on the PriorityDecayHalfLife time interval, currently set to 6:00:00 (6 hours).
Currently, usage is calculated as: Allocated_Ncpus * elapsed_seconds + Allocated_Mem_GiB * 0.0625 * elapsed_seconds + Allocated_NGPUs * 5 * elapsed_seconds
d is the FairShareDampeningFactor. This is used to reduce the impact of resource consumption on the fairshare value and to account for the ratio of active users to total users. The value is currently set to 10 and is changed dynamically as needed.
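The usage formula above can be written as a small function. This is only a sketch of the arithmetic stated in the text; the function and argument names are illustrative, and the example job is made up.

```python
def raw_usage(ncpus, mem_gib, ngpus, elapsed_seconds):
    """Usage charged for one job, following the formula given in the text."""
    cpu_term = ncpus * elapsed_seconds
    mem_term = mem_gib * 0.0625 * elapsed_seconds  # 16 GiB is charged like 1 CPU
    gpu_term = ngpus * 5 * elapsed_seconds         # 1 GPU is charged like 5 CPUs
    return cpu_term + mem_term + gpu_term


def normalized_usage(user_usage, cluster_usage):
    """U = Uh / Rh, using already-decayed user and cluster usage totals."""
    if cluster_usage <= 0:
        return 0.0
    return user_usage / cluster_usage


# Made-up job: 4 CPUs, 16 GiB of memory, 1 GPU, running for one hour.
print(raw_usage(ncpus=4, mem_gib=16, ngpus=1, elapsed_seconds=3600))
```

Note that the formula charges for allocated resources, so memory or GPUs that are requested but left idle still count toward usage.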
The initial fairshare value (with zero normalized usage) for each user is equal to 1; if a user is consuming exactly their share of the available resources, their fairshare value will be 0.5.
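A minimal sketch of how S, U and d could combine, assuming the classic Slurm fair-share form F = 2^(-(U/S)/d). This form is an assumption taken from the general Slurm documentation and is not confirmed by this page; the function name and the example values are illustrative.

```python
def fairshare_factor(norm_usage, norm_shares, dampening=1.0):
    """Assumed classic Slurm form: F = 2 ** (-(U / S) / d)."""
    if norm_shares <= 0 or dampening <= 0:
        raise ValueError("shares and dampening factor must be positive")
    return 2.0 ** (-(norm_usage / norm_shares) / dampening)


# Made-up user holding 1% of the shares.
print(fairshare_factor(0.0, 0.01))         # no recent usage
print(fairshare_factor(0.01, 0.01))        # usage equal to the share, d = 1
print(fairshare_factor(0.01, 0.01, 10.0))  # same usage with a larger d
```

Under this assumed form, zero usage gives 1 and usage equal to the share gives 0.5 when d = 1; larger values of d soften the penalty for the same usage.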
It takes approximately 48 hours for a fully depleted fairshare to return from 0 to 1, assuming no additional usage is being accumulated by the user during those ~48 hours.
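The ~48 hour figure can be related to the 6-hour half-life: 48 hours is eight half-lives. The sketch below assumes plain exponential decay with no new usage accumulated; the function name is illustrative.

```python
HALF_LIFE_HOURS = 6.0  # PriorityDecayHalfLife, as stated in the text


def decayed_usage(usage, hours_idle, half_life_hours=HALF_LIFE_HOURS):
    """Usage remaining after `hours_idle` hours without any new usage."""
    return usage * 0.5 ** (hours_idle / half_life_hours)


# Fraction of the original usage that is still counted after each interval.
for hours in (6, 12, 24, 48):
    print(hours, decayed_usage(1.0, hours))
```

After 48 hours only 1/256 (about 0.4%) of the original usage is still counted, which is why the fairshare value is effectively back to 1 by then.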
Two useful commands for viewing the priority of pending jobs and fairshare values are sprio and sshare.
The scheduler first tries to dispatch jobs in the partition interactive, then jobs in the partition priority, and finally jobs submitted to all remaining partitions. As a consequence, interactive and priority jobs will most likely be dispatched first, even if they have a lower overall priority than jobs pending on other partitions (short, medium, long, mpi, etc.).
Low priority jobs might be dispatched before high priority jobs only if doing so does not impact the expected start time of the high priority jobs and if the resources required by the low priority jobs are free and idle.
How to manage priority of your own jobs
By default the scheduler will try to dispatch jobs with the same priority values based on their jobid numbers, so the first step to control priority of your own jobs is to submit them in the same order you would want them to be dispatched.
Once jobs have been submitted, and are still pending, there are two commands that can be used to modify their relative priority:
scontrol top <jobid>
This command increases the priority of job <jobid> to match the maximum priority among all of the user's jobs and subtracts that increase from all of the user's other jobs in equal decrements.
For example consider the case with the following six pending jobs:
and in this case the total priority of the user is “conserved”.
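The redistribution described above can be modeled in a few lines. This is a simplified sketch of the behavior as described in the text, not Slurm's actual implementation (which may round or clamp values); the job IDs and priorities are hypothetical.

```python
def scontrol_top_model(priorities, job_id):
    """Simplified model: raise `job_id` to the user's current maximum priority
    and spread the same amount, in equal decrements, over the other jobs."""
    others = [j for j in priorities if j != job_id]
    if not others:
        return dict(priorities)
    top = max(priorities.values())
    boost = top - priorities[job_id]
    decrement = boost / len(others)
    return {
        j: (top if j == job_id else p - decrement)
        for j, p in priorities.items()
    }


# Six hypothetical pending jobs belonging to one user.
before = {101: 1000, 102: 990, 103: 980, 104: 970, 105: 960, 106: 950}
after = scontrol_top_model(before, 106)
print(after)
print(sum(before.values()), sum(after.values()))
```

In this model job 106 gains 50 points and each of the other five jobs loses 10, so the sum of the user's priorities is unchanged.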
scontrol update jobid=<jobid(s)> nice=<+value>
This command can be applied to multiple jobs at the same time, and it subtracts the desired number of priority points from the given jobs. Note that a positive nice value reduces the job priority (negative values cannot be applied).
and in this case the total priority of the user is not conserved. However, the priority of jobs across the cluster usually varies by thousands of points, so changing it with small “nice” values should have a negligible impact on the jobs' pending time.
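For comparison, a nice value can be modeled as a plain subtraction applied only to the selected jobs. This is a simplified sketch of the behavior described in the text; the job IDs and priorities are hypothetical.

```python
def apply_nice_model(priorities, job_ids, nice):
    """Simplified model: subtract `nice` points from each selected job."""
    if nice < 0:
        raise ValueError("negative nice values cannot be applied")
    return {
        j: (p - nice if j in job_ids else p)
        for j, p in priorities.items()
    }


# Three hypothetical pending jobs; two of them are given nice=10.
before_nice = {201: 5000, 202: 5000, 203: 5000}
after_nice = apply_nice_model(before_nice, {202, 203}, 10)
print(after_nice)
print(sum(before_nice.values()) - sum(after_nice.values()))
```

Here the user's total priority drops by 20 points, since nothing is given back to the job that was left untouched.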
There is a limit on the total CPU-hours that can be reserved by a single lab at any given time. This limit was introduced to prevent a single lab from locking down a large portion of the cluster for extended periods of time. This limit will become active only if multiple users in the same lab are allocating a large portion of the O2 cluster resources. This can happen, for example, if a few users have thousands of multi-day or hundreds of multi-week running jobs. When this limit becomes active, the remaining pending jobs will display the message AssocGrpCPURunMinute.
Note: The "gpu" partition has additional limits that might trigger the above message. For more details about the "gpu" partition, please refer to the "Using O2 GPU resources" wiki page.
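To give a feel for how such a limit is reached, the sketch below assumes it works like a Slurm group CPU-run-minutes limit: allocated CPUs multiplied by the remaining time limit, summed over the lab's running jobs. The limit value and the jobs are hypothetical; the actual O2 limit is not stated here.

```python
MINUTES_PER_DAY = 24 * 60


def reserved_cpu_minutes(running_jobs):
    """Sum of allocated CPUs * remaining minutes over (cpus, minutes) pairs."""
    return sum(cpus * remaining for cpus, remaining in running_jobs)


def would_exceed(running_jobs, new_job, limit_cpu_minutes):
    """True if starting `new_job` would push the lab over the limit."""
    return reserved_cpu_minutes(running_jobs + [new_job]) > limit_cpu_minutes


# Hypothetical lab: 1000 single-core jobs, each with 3 days left to run.
lab_jobs = [(1, 3 * MINUTES_PER_DAY)] * 1000
HYPOTHETICAL_LIMIT = 5_000_000  # CPU-minutes, placeholder value

print(reserved_cpu_minutes(lab_jobs))
print(would_exceed(lab_jobs, (20, 30 * MINUTES_PER_DAY), HYPOTHETICAL_LIMIT))
```

Under this model the reserved total shrinks as running jobs approach their time limits, so pending jobs held back by the limit become eligible again over time. Requesting shorter wall times also lowers the amount each job reserves.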
Job priority rewards
Many users have been submitting jobs requesting more memory (--mem) and/or run time (-t) than they need. This leads to longer pending times for the submitting users as well as other cluster users.
In order to improve this situation, we are introducing a system where priority points (technically called “quality of service” or QOS) will be assigned, on a weekly basis, to users that have been submitting jobs requesting reasonably accurate resources.
We understand that it is often not possible to predict exactly the memory and/or run time required by each job, but many users are requesting more than 10X the amount of memory and wall time actually needed. Our goal is for O2 users to test the workflows they will be using heavily, check the resources actually consumed by their jobs, and adjust future job submissions accordingly.
In order to see how accurately you are requesting resources you can:
check the report you receive every Monday, which contains information about your overall usage for the previous 7 days.
To see whether this reward QOS has been granted to any of your currently pending jobs, you can run the command "sprio -l -u $USER". A non-zero value in the QOS column indicates that reward points were granted.
Below are a few examples of workflows where the resource requests should or shouldn’t be changed.
Ex 1 (Good):
Your jobs’ runtime varies between 1 and 4 hours and you are requesting a wall-time of 5 hours for all your jobs.
There is no need for further optimization. You are using only a little extra time, and it may be hard to predict which jobs will be the slower ones.
Ex 2 (Bad):
You submit a thousand jobs with a runtime of < 5 min, but 10 of those jobs run for 8 hours. You are requesting 8 hours for all your jobs.
You should request ~10 minutes wall time for all the jobs, and resubmit the small number of jobs that run for too long. Your pend times will be substantially shorter, so the overall time to run all the jobs will likely be shorter. (Of course, if you can predict which jobs will take a long time, you can submit those ten separately. But it may be difficult to predict that.)
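The difference in Ex 2 can be quantified by adding up the wall time requested under each strategy. This is a back-of-the-envelope sketch using the numbers from the example; it assumes the ten long jobs are first submitted with the short limit and then resubmitted with 8 hours.

```python
def requested_hours(n_jobs, walltime_hours):
    """Total wall time requested across `n_jobs` identical requests."""
    return n_jobs * walltime_hours


# Original strategy: every job asks for 8 hours.
original = requested_hours(1000, 8)

# Revised strategy: every job asks for 10 minutes, then the 10 long jobs
# are resubmitted with an 8-hour limit.
revised = requested_hours(1000, 10 / 60) + requested_hours(10, 8)

print(original, round(revised, 1), round(original / revised, 1))
```

The revised strategy requests roughly 250 hours instead of 8000, about a thirtyfold reduction, even though ten jobs are submitted twice.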
Ex 3 (Good):
Your jobs’ memory consumption varies uniformly between 1 and 10 GiB and you are requesting 11 GiB.
No need to optimize.
Ex 4 (Bad):
You submit a thousand jobs. The vast majority of those jobs use 1-3 GiB of memory but 10 jobs use ~10 GiB. You are requesting 12 GiB of memory for all your jobs.
If you reduce the requested memory for all jobs, and resubmit the limited number of jobs that might fail (exceeding memory limits), the pend times for all jobs – and therefore the total time – will likely be less. We can work with you to modify your workflows to easily identify and rerun the failed jobs.
Ex 5 (Good):
You run several single-core jobs without explicitly requesting any memory allocation (i.e., your script has no “#SBATCH --mem” line). The scheduler allocates 1 GiB of memory for each job by default, but they use only ~100 MiB (about 0.1 GiB) of memory each.
There is no need to optimize further. By default only 1GiB is allocated for each core, so you’re not wasting too much.
Ex 6 (Bad):
You run many 10-core jobs, without explicitly requesting any memory allocation. The jobs are using only a total of ~100MiB of memory each. Unlike the previous example, 10GiB of memory will be allocated for each job (1GiB per core). You should explicitly request a smaller amount of memory, to reduce your pending times and other users’ pending times.
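The contrast between Ex 5 and Ex 6 comes down to the default of 1 GiB per core. The sketch below works through that arithmetic; the function names are illustrative, ~100 MiB is approximated as 0.1 GiB, and the explicit 1 GiB request in the last line is a hypothetical choice.

```python
DEFAULT_MEM_PER_CORE_GIB = 1.0  # default allocation stated in the text


def allocated_mem_gib(cores, requested_mem_gib=None):
    """Memory allocated to a job: the explicit request, or 1 GiB per core."""
    if requested_mem_gib is not None:
        return requested_mem_gib
    return cores * DEFAULT_MEM_PER_CORE_GIB


def mem_utilization(used_gib, cores, requested_mem_gib=None):
    """Fraction of the allocated memory that the job actually uses."""
    return used_gib / allocated_mem_gib(cores, requested_mem_gib)


print(mem_utilization(0.1, cores=1))                         # Ex 5
print(mem_utilization(0.1, cores=10))                        # Ex 6
print(mem_utilization(0.1, cores=10, requested_mem_gib=1.0)) # Ex 6, explicit request
```

With the default, the 10-core job uses about 1% of its 10 GiB allocation; an explicit 1 GiB request brings it back to the roughly 10% of Ex 5 while freeing 9 GiB per job for other users.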
Ex 7 (Bad):
Your jobs run for 10-30 minutes, but you just submit them with the short partition default time limit of 12 hours.
If you are running more than a few jobs, please reduce the requested time to an hour or so. Your jobs will probably pend for a shorter time, and you will help the scheduler run more efficiently.
Note: Reward QOS are not compatible with other custom priority QOS that might have been granted for special situations, and the two cannot be combined.