Table of Contents

...

A job's priority in O2 is determined mostly by two parameters: the user's recent cluster utilization and the job's pending time. The more resources (CPU, memory, GPU, etc.) a user consumes, the longer that user's additional jobs might pend. The longer jobs pend, the higher their priority becomes. The paragraphs below describe the scheduler algorithm in more detail.


Changes to Job Priority (as of Nov 2022)

Beginning in November 2022, HMS-RC started to lower the job priority of O2 Users who consistently request substantially more cluster memory (RAM) than their jobs actually use.

The goal of this initiative is to help researchers get more science done.

HMS-RC will not penalize jobs that are only slightly inefficient, small batches of jobs, or interactive sessions. Inefficient jobs hurt O2's productivity for everyone, increasing pending times both for the Users submitting those jobs and for others waiting in the queue, so improving job efficiency helps all Users.

A User who significantly over-requests memory will first receive a warning notice; if no attempt is made to improve RAM efficiency, the User’s priority will be lowered and we will notify the User again.

All notifications are sent to the email address listed in the User's .forward file ($HOME/.forward).

This priority reduction is temporary and will be automatically removed once the User’s efficiency improves.

To learn more about job efficiency and how to check your RAM utilization, please see our wiki page Optimizing O2 jobs.

Job priority calculation

SLURM computes the overall priority of each job based on six factors: job age, user fairshare, job size, partition, QOS, and TRES. Each factor takes a value between 0 and 1 and is calculated as described below:

Age = The value is based on the job's pending time (since it became eligible), normalized against the PriorityMaxAge parameter; PriorityMaxAge is currently set to 7-00:00:00 (7 days).

JobSize = The job size factor correlates with the number of nodes or CPUs the job has requested; the larger the job, the closer the factor is to 1. Currently the contribution from this factor is negligible.

Partition = The value is the ratio between the priority of the partition requested by the job and the maximum partition priority. The maximum partition priority is currently 14 (for the interactive partition).

QOS = The Quality of Service factor is the ratio between the job's QOS priority and the maximum QOS priority. By default each job is submitted with the QOS "normal", which has a priority value of zero.

TRES = Not currently active; should always be zero.

FairShare = This value reflects the ratio between the resources available to each user and the amount of resources already consumed by the user submitting the job; see below for details.


Each of these factors is then multiplied by a custom weight to obtain the overall JobPriority value, according to the formula:

JobPriority = Age*PriorityWeightAge +
              Fairshare*PriorityWeightFairShare +
              JobSize*PriorityWeightJobSize +
              Partition*PriorityWeightPartition +
              QOS*PriorityWeightQOS +
              TRES*PriorityWeightTRES


where the weights are currently set to the following values:

PriorityWeightAge = 200000
PriorityWeightFairShare = 1000000
PriorityWeightJobSize = 10000
PriorityWeightPartition = 400000
PriorityWeightQOS = 2000000
PriorityWeightTRES = (null)
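
For illustration only, the sketch below combines the six normalized factors with the weights listed above; the factor values in the example are hypothetical and are not taken from the scheduler.

Code Block
# Illustrative sketch of the weighted-sum formula above. The weights are the
# ones listed on this page; the factor values below are hypothetical examples
# (each factor is normalized between 0 and 1).
WEIGHTS = {
    "age": 200000,
    "fairshare": 1000000,
    "jobsize": 10000,
    "partition": 400000,
    "qos": 2000000,
    "tres": 0,            # PriorityWeightTRES is (null), i.e. not active
}

def job_priority(factors):
    """Weighted sum of the normalized priority factors."""
    return int(sum(factors[name] * WEIGHTS[name] for name in WEIGHTS))

# Hypothetical job: pending for half of PriorityMaxAge, modest fairshare,
# submitted to a partition with priority 2 (out of max 14), default QOS.
example_factors = {
    "age": 0.5,
    "fairshare": 0.004,
    "jobsize": 0.0,
    "partition": 2 / 14,
    "qos": 0.0,
    "tres": 0.0,
}

print(job_priority(example_factors))   # 161142

In the sprio example output further below, the AGE, FAIRSHARE, JOBSIZE, PARTITION and QOS columns are these already-weighted contributions, and they add up to the PRIORITY value.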

FairShare Calculation

Each user's fairshare is currently calculated as

Code Block
F = 2**(-U/(S*d))

where: 

S is the normalized number of shares available to each user. In our current setup all users receive the same number of raw shares.

U is the normalized usage, calculated as U = Uh / Rh, where Uh is the user's historical usage subject to the half-life decay and Rh is the total historical usage across the cluster, also subject to the half-life decay.

Uh and Rh are calculated as

Uh = U_current_period + (0.5 * U_last_period) + ((0.5**2) * U_period-2) + ...

Rh = R_current_period + (0.5 * R_last_period) + ((0.5**2) * R_period-2) + ...

and the periods are based on the PriorityDecayHalfLife time interval, currently set to 6:00:00 (6 hours).  
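
For illustration only, the sketch below evaluates this decaying sum for a hypothetical list of per-period usage values (the numbers are invented).

Code Block
# Hypothetical sketch of the half-life decay described above: each 6-hour
# period further in the past contributes half as much as the one before it.
def decayed_usage(per_period_usage):
    """per_period_usage[0] is the current period, [1] the previous one, etc."""
    return sum(u * 0.5 ** i for i, u in enumerate(per_period_usage))

# Example: 1000 "usage units" consumed in each of the last four periods.
print(decayed_usage([1000, 1000, 1000, 1000]))   # 1000 + 500 + 250 + 125 = 1875.0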

Currently, usage is calculated as: Allocated_NCPUs * elapsed_seconds + Allocated_Mem_GiB * 0.0625 * elapsed_seconds + Allocated_NGPUs * 5 * elapsed_seconds
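
In other words, every allocated CPU core counts as 1 usage unit per second, every GiB of allocated memory as 0.0625 units, and every GPU as 5 units. A minimal sketch, applied to a hypothetical job:

Code Block
# Sketch of the usage formula above applied to one hypothetical job:
# CPUs count 1x per second, memory 0.0625x per GiB, GPUs 5x.
def job_usage(ncpus, mem_gib, ngpus, elapsed_seconds):
    return (ncpus + mem_gib * 0.0625 + ngpus * 5) * elapsed_seconds

# Hypothetical job: 4 cores, 16 GiB of memory, 1 GPU, running for 1 hour.
print(job_usage(ncpus=4, mem_gib=16, ngpus=1, elapsed_seconds=3600))   # 36000.0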

d is the FairShareDampeningFactor. This is used to reduce the impact of resource consumption on the fairshare value and to account for the ratio of active users to total users. The value is currently set to 10 and may be changed dynamically as needed.

The initial fairshare value (with zero normalized usage) for each user is 1; a user consuming exactly their share of the available resources has a fairshare value of 0.5.

It takes approximately 48 hours for a fully depleted fairshare to recover from 0 back to 1, assuming the user accumulates no additional usage during those ~48 hours.
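
Putting the pieces together, the sketch below evaluates F = 2**(-U/(S*d)) with the dampening factor d = 10 described above; the share and usage figures are hypothetical. It also shows how the 6-hour half-life drives the ~48-hour recovery mentioned above (eight half-lives shrink usage to less than 0.4% of its original value).

Code Block
# Sketch of the fairshare formula with the current dampening factor d = 10.
# The normalized share and usage values below are hypothetical.
def fairshare(U, S, d=10):
    """U: normalized usage, S: normalized share, d: dampening factor."""
    return 2 ** (-U / (S * d))

S = 0.000787                           # example normalized share (see sshare output below)
print(fairshare(0.0, S))               # 1.0    -> no recent usage
print(fairshare(S, S))                 # ~0.933 -> usage equal to the share (0.5 if d were 1)
print(fairshare(100 * S, S))           # ~0.001 -> heavy recent usage, depleted fairshare
print(fairshare(100 * S * 0.5**8, S))  # ~0.973 -> same usage after 48 idle hours of decay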


Two useful commands for seeing the priority of pending jobs and your fairshare are sprio and sshare:




Code Block
languagetext
login02:~ sprio -l
          JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS       NICE   TRES
        6445056    uid13      13061       5000       4061          0       4000          0          0
        6445068    uid13      10775       5000       4061          0       1714          0          0
        6445078    uid13      10204       5000       4061          0       1143          0          0
        6445083    uid13      10204       5000       4061          0       1143          0          0
        6586939    uid45       6583       4812         57          0       1714          0          0
        6586940    uid45       6583       4812         57          0       1714          0          0
        6586941    uid45       6583       4812         57          0       1714          0          0
        6586942    uid45       6583       4812         57          0       1714          0          0
        6586943    uid45       6583       4812         57          0       1714          0          0
        6586944    uid45       6583       4812         57          0       1714          0          0
        6586945    uid32       6583       4812         57          0       1714          0          0
        6586946    uid32       6583       4812         57          0       1714          0          0
        6586947    uid32       6583       4812         57          0       1714          0          0
        6586948    uid32       6583       4812         57          0       1714          0          0

login02:~ sshare -u $USER -U
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
rccg                      rp189          1    0.000787         320      0.000002   0.999832



Partition Priority Tiers

The scheduler first tries to dispatch jobs in the interactive partition, then jobs in the priority partition, and finally jobs submitted to all remaining partitions. As a consequence, interactive and priority jobs will most likely be dispatched first, even if they have a lower overall priority than jobs pending in other partitions (short, medium, long, mpi, etc.).
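
As a simplified illustration of this tiering (not SLURM's actual implementation), pending jobs can be thought of as being ordered first by partition tier and only then by priority; the job data below is hypothetical.

Code Block
# Simplified illustration of partition tiers (not SLURM's actual code):
# jobs are considered tier by tier, and the priority value only breaks
# ties within the same tier.
PARTITION_TIER = {"interactive": 0, "priority": 1}     # every other partition: tier 2

jobs = [
    {"id": 101, "partition": "short",       "priority": 12000},
    {"id": 102, "partition": "priority",    "priority": 6500},
    {"id": 103, "partition": "interactive", "priority": 5000},
]

dispatch_order = sorted(jobs, key=lambda j: (PARTITION_TIER.get(j["partition"], 2),
                                             -j["priority"]))
print([j["id"] for j in dispatch_order])   # [103, 102, 101]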

Backfill scheduling

Low-priority jobs might be dispatched before high-priority jobs only if doing so does not delay the expected start time of the high-priority jobs and if the resources required by the low-priority jobs are free and idle.
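
A minimal sketch of that rule for a single candidate job is shown below (a simplification, not SLURM's actual backfill algorithm; all numbers are hypothetical):

Code Block
# Simplified backfill check (not SLURM's actual implementation): a lower-priority
# job may start now only if it fits in currently idle resources and is guaranteed
# to finish before the higher-priority job's expected start time.
def can_backfill(job_cores, job_time_limit_s, idle_cores, top_job_expected_start_s, now_s=0):
    fits_in_idle_resources = job_cores <= idle_cores
    finishes_before_top_job = now_s + job_time_limit_s <= top_job_expected_start_s
    return fits_in_idle_resources and finishes_before_top_job

# A 2-core job with a 1-hour time limit can use 4 idle cores while a large
# high-priority job is not expected to start for another 2 hours:
print(can_backfill(job_cores=2, job_time_limit_s=3600, idle_cores=4,
                   top_job_expected_start_s=7200))   # True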


How to manage priority of your own jobs

By default the scheduler dispatches jobs with the same priority value in the order of their job ID numbers, so the first step in controlling the priority of your own jobs is to submit them in the order you want them to be dispatched.

Once jobs have been submitted and are still pending, two commands can be used to modify their relative priority:

scontrol top <jobid>


This command increases the priority of job <jobid> to match the maximum priority of all of the user's jobs and subtracts that priority from the other jobs in equal decrements.


For example consider the case with the following six pending jobs:


Code Block
JobID   Priority
1            100
2            100
3            100
4            100
5            100
6             90

...

Code Block
JobID   Priority
6             90
1             89
2             89
3             89
4             89
5             89

...


and in this case the total priority of the user's jobs is not conserved. However, job priorities across the cluster usually differ by thousands of points, so changing them with small “nice” values should have a negligible impact on the jobs' pending time.


There is a limit on the total CPU-hours that can be reserved by a single lab at any given time. This limit was introduced to prevent a single lab from locking down a large portion of the cluster for extended periods of time. The limit becomes active only if multiple users in the same lab are allocating a large portion of the O2 cluster's resources; this can happen, for example, if a few users have thousands of multi-day or hundreds of multi-week running jobs. When this limit becomes active, the remaining pending jobs will display the message AssocGrpCPURunMinute.

Note: The "gpu" partition has additional limits that might trigger the above message, for more details about the "gpu" partition please refer to the "Using O2 GPU resources" wiki page.


Job priority rewards


Many users have been submitting jobs that request more memory (--mem) and/or run time (-t) than they need. This leads to longer pending times for the submitting users as well as for other cluster users.

...

  • check the report you receive every Monday, which contains information about your overall usage for the previous 7 days.

  • get more detailed info at any time by running the command “O2_jobs_report” directly from the O2 command line. Use “O2_jobs_report -h” to get help, or check our wiki page about getting information about current and past jobs.

...

If you are running more than a few jobs, please reduce the requested time to an hour or so. Your jobs will probably pend for a shorter time, and you will help the scheduler run more efficiently.


Note: Reward QOS are not compatible with, and cannot be added to, other custom priority QOS that might have been granted for special situations.