Get information about current and past jobs

1 O2squeue
- 1.1 State
- 1.2 Nodelist (Reason)
- 1.3 Eligible Time
- 1.4 Start Time
- 1.5 TRES
2 O2_jobs_report
3 O2 Resource Utilization

RC created two simplified commands, O2squeue and O2_jobs_report, based on the Slurm commands squeue and sacct. You can use them to gather information about your active (pending or running) jobs and your past jobs.

As usual, feel free to contact rchelp@hms.harvard.edu with any questions about the reports from these commands.

O2squeue

This command is based on the Slurm command squeue and returns information about your jobs that are currently pending or running. For example:

login05:~ O2squeue
JOBID     PARTITION     STATE       TIME_LIMIT     TIME           NODELIST(REASON)         ELIGIBLE_TIME         START_TIME            TRES_ALLOC
21801263  interactive   RUNNING     12:00:00       2:09:52        compute-a-16-160         2020-11-09T11:35:49   2020-11-09T11:36:19   cpu=1,mem=2G,node=1,billing=1

O2squeue accepts the argument R or PD to list only running or only pending jobs, respectively.
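As a minimal sketch, the filter can be wrapped in a small shell function that rejects anything other than R or PD before calling O2squeue. The function name my_jobs is hypothetical and is not an O2 command.

```shell
# my_jobs is a hypothetical helper, not part of O2.
# Usage: my_jobs        (all active jobs)
#        my_jobs R      (running jobs only)
#        my_jobs PD     (pending jobs only)
my_jobs() {
  case "${1:-}" in
    ""|R|PD) ;;
    *) echo "usage: my_jobs [R|PD]" >&2; return 2 ;;
  esac
  # Pass the filter through only when one was given.
  O2squeue ${1:+"$1"}
}
```

On an O2 login node, my_jobs PD should then behave like O2squeue PD.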

State

The STATE field will normally be either PENDING or RUNNING. For other job codes, see the “JOB STATE CODES” section of the Slurm squeue page.

Nodelist (Reason)

When a job is running, NODELIST(REASON) lists the node (or nodes, for a parallel MPI job) that the job is running on.

When a job is pending, NODELIST(REASON) describes the reason why the job is pending. The reasons are explained below. Many of them apply only if you submitted the job with a special QOS (Quality of Service), a job dependency, or a reservation.

BadConstraints: The job's constraints cannot be satisfied.

Dependency: This job is waiting for a dependent job to complete. (See the --dependency option on the Slurm sbatch page.)

InvalidQOS: The job's QOS is invalid.

JobHeldAdmin: The job is held (forced to pend) by a system administrator.

JobHeldUser: The job is held by the user.

None: The job has not been evaluated yet by the scheduler. (This can happen if the scheduler is working through a huge batch of submitted jobs.)

Priority: One or more higher priority jobs exist for this partition or advanced reservation. Over time, your job's priority will gradually increase, so your job should eventually run.

QOSJobLimit: The job's QOS has reached its maximum job count.

QOSResourceLimit: The job's QOS has reached some resource limit.

QOSTimeLimit: The job's QOS has reached its time limit.

ReqNodeNotAvail: Some node explicitly required by the job submission is not currently available.

Reservation: The job is waiting for its advanced reservation to become available.

Resources: The job is waiting for resources to become available.
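If you have many pending jobs, it can help to tally them by reason. The sketch below assumes that O2squeue output is whitespace-separated, that the reason is the sixth column, and that it is wrapped in parentheses with no spaces inside (as in standard squeue output, for example "(Priority)"). The function name count_pending_reasons is hypothetical.

```shell
# Hypothetical helper: tally pending jobs by reason.
# Reads O2squeue-style output on stdin.
# Assumes column 3 is STATE and column 6 is NODELIST(REASON).
count_pending_reasons() {
  awk '$3 == "PENDING" {
         reason = $6
         gsub(/[()]/, "", reason)   # "(Priority)" -> "Priority"
         count[reason]++
       }
       END { for (r in count) print r, count[r] }' | sort
}

# On an O2 login node:
#   O2squeue PD | count_pending_reasons
```

A reason that contains spaces would be split across columns, so treat the tally as approximate in that case.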

Eligible Time

The ELIGIBLE_TIME field indicates when a job becomes eligible to be dispatched. This is usually the submit time, unless the job cannot be dispatched immediately, for example because of a job dependency or because requested resources are unavailable (such as specific nodes undergoing a planned outage).

Start Time

For running jobs, START_TIME indicates the time when the job was dispatched. For pending jobs, it indicates the expected start time. Note that the expected start time is only calculated for the first few pending jobs of each user, and it is, in general, an upper bound value, assuming that all jobs will run for their maximum time.

TRES

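TRES_ALLOC lists the trackable resources (TRES) allocated to the job, such as CPU cores, memory, and nodes. These correspond to the resources requested with flags like -c (or --cpus-per-task) and --mem in the job submission command or sbatch script. The requested wall time (-t or --time) is shown separately in the TIME_LIMIT column.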
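The TRES_ALLOC value is a comma-separated list of key=value pairs, so a single resource can be pulled out with standard tools. This is a small sketch; the function name tres_get is hypothetical.

```shell
# Hypothetical helper: print the value of one key from a TRES string.
# $1: TRES string such as "cpu=1,mem=2G,node=1,billing=1"
# $2: key to look up (cpu, mem, node, ...)
tres_get() {
  printf '%s\n' "$1" | tr ',' '\n' | awk -F= -v k="$2" '$1 == k { print $2 }'
}

tres_get "cpu=1,mem=2G,node=1,billing=1" mem   # prints 2G
```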

O2_jobs_report

O2_jobs_report is based on the Slurm command sacct and can be used to query the Slurm database for information on your past jobs.

The command reports information including CPU (compute time), memory (RAM), and wall time efficiency. It can print information for each individual job or an overall summary report. You can also select specific dates, job IDs, job names, partitions, and job states.

The RC team checks job efficiency only for jobs that are marked as COMPLETED by the Slurm scheduler. To query only COMPLETED jobs with O2_jobs_report, add the flag --state=COMPLETED.

By default, the tool will only show jobs starting from midnight of the previous day. So if you run the command at 9 am on Thursday, you’ll get the data for 24 hours on Wednesday PLUS the first 9 hours of Thursday.

Use --start or --lastdays to specify a custom time range if you are looking for older jobs.

You can customize your query using the flags described below. To see all the available options from the O2 shell, you can run O2_jobs_report -h

-j JOBID, --jobid JOBID
                          The specific jobid numbers; can be multiple comma-separated jobids with no spaces, for example --jobid=123,456,78

-s START, --start START
                          The desired start date for the query; the date must be entered using the format YYYY-MM-DD. This flag is not compatible with --lastdays

--lastdays LASTDAYS      
                          Query jobs from the previous LASTDAYS days. This is equivalent to using --start=YYYY-MM-DD with the desired start date. 
                          This flag is not compatible with --start

-e END, --end END       
                          Specify an end date for the query; the default end date is tomorrow.

--account ACCOUNT       
                          Specify your entire Slurm account. The report will include jobs from every user in your Lab.

--jobname JOBNAME       
                          Specify a Slurm job name for the query (that you submitted with sbatch -J); this flag can be used with a comma-separated list of jobnames with no spaces

--state STATE         
                          Specify a list of job states describing how jobs ended. 
                          Possible options are CANCELLED, COMPLETED, FAILED, NODE_FAIL, OUT_OF_MEMORY, PREEMPTED, and TIMEOUT. 
                          This flag can be used with a comma-separated list of job states with no spaces, for example --state=COMPLETED,FAILED

-p PARTITION, --partition PARTITION
                          Jobs submitted to specific partitions; you can specify multiple comma-separated partitions with no spaces.

--report              
                          Print a summary report instead of detailed information for each job

--verbose             
                          Use this flag to display the verbose information for each job directly as it is returned from the Slurm sacct command
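The flags above can be combined in one query. As a rough sketch, the wrapper below checks that both dates use the YYYY-MM-DD format and then asks for COMPLETED jobs in that range. The function names is_iso_date and completed_jobs_between are hypothetical, and the --end=DATE spelling is assumed to work the same way as the --start=YYYY-MM-DD form shown above.

```shell
# Hypothetical helper: succeed if $1 looks like YYYY-MM-DD.
# This checks the shape only, not whether the date exists.
is_iso_date() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: list COMPLETED jobs between two dates.
# $1: start date, $2: end date (both YYYY-MM-DD)
completed_jobs_between() {
  if ! is_iso_date "$1" || ! is_iso_date "$2"; then
    echo "dates must be in YYYY-MM-DD format" >&2
    return 2
  fi
  O2_jobs_report --start="$1" --end="$2" --state=COMPLETED
}

# On an O2 login node:
#   completed_jobs_between 2023-03-16 2023-03-18
```

Because --start and --lastdays are not compatible, the wrapper uses only --start.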

By default, O2_jobs_report will report information for each job, for example:

login04:~ O2_jobs_report

JOBID        USER     ACCOUNT      PARTITION       STATE           STARTTIME       WALLTIME(hr)   nCPU,RAM(GB),nGPU    PENDINGTIME(hr)    CPU_EFF(%) RAM_EFF(%) WALLTIME_EFF(%)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
4992748      rc       rccg         transfer        COMPLETED       2023-03-16      24.0           1,2.0,0              0.0                57.1       0.0        0.1
5013407      rc       rccg         transfer        COMPLETED       2023-03-16      24.0           1,2.0,0              0.01               0.0        0.0        0.0
5045324      rc       rccg         transfer        COMPLETED       2023-03-16      24.0           1,2.0,0              0.02               0.0        0.0        0.0
5077214      rc       rccg         transfer        COMPLETED       2023-03-17      24.0           1,2.0,0              0.01               33.7       0.0        0.1
5100444      rc       rccg         transfer        COMPLETED       2023-03-17      24.0           1,2.0,0              0.02               0.0        0.0        0.0
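The per-job output is easy to filter with awk, for example to spot jobs with low CPU efficiency. The sketch below assumes the twelve whitespace-separated columns shown above, with CPU_EFF(%) in column 10. The function name low_cpu_eff is hypothetical.

```shell
# Hypothetical helper: print job ID and CPU efficiency for jobs
# whose CPU_EFF(%) is below a threshold.
# $1: threshold in percent; reads O2_jobs_report output on stdin.
low_cpu_eff() {
  awk -v t="$1" '
    # Keep only data rows: the first field starts with a digit
    # and the row has all 12 columns.
    $1 ~ /^[0-9]/ && NF == 12 && ($10 + 0) < (t + 0) { print $1, $10 }
  '
}

# On an O2 login node:
#   O2_jobs_report --state=COMPLETED | low_cpu_eff 50
```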

However, it is possible to see a summary report by using the flag --report, for example:

login04:~ O2_jobs_report --report

JOBS STATES COUNT FROM 2023-03-16 TO 2023-03-18
==========  ===========
  USERNAME    COMPLETED
==========  ===========
        rc            5
==========  ===========

JOBS PARTITIONS COUNT FROM 2023-03-16 TO 2023-03-18
==========  ==========
  USERNAME    transfer
==========  ==========
        rc           5
==========  ==========

JOBS STATISTICS FROM 2023-03-16 TO 2023-03-18
======  ============  =========================  ===========================  ======================  ==================  ============================  ===================
  User    Total Jobs    Median Pending Time(hr)    Average Allocated RAM(GB)    Average Used RAM(GB)    Max Used RAM(GB)    Jobs Using > 1/2 Alloc RAM    RAM Efficiency(%)
======  ============  =========================  ===========================  ======================  ==================  ============================  ===================
    rc             5                       0.01                            2                       0                   0                             0                 0.03
======  ============  =========================  ===========================  ======================  ==================  ============================  ===================

======  =======================  ===================  =====================  ========================  ===========================
  User    Average Allocated CPU    CPU Efficiency(%)    Average Runtime(hr)    WallTime Efficiency(%)    Jobs Using > 1/2 WallTime
======  =======================  ===================  =====================  ========================  ===========================
    rc                        1                 34.4                   0.01                         0                            0
======  =======================  ===================  =====================  ========================  ===========================

O2 Resource Utilization

We created a simplified script called O2usage, which runs sreport in the background and requires minimal arguments.

The script can be run from anywhere on O2 and requires two inputs: the start and end dates of the desired time interval. Both dates must be in the format YYYY-MM-DD.

For example:

login05:~ O2usage 2021-03-15 2021-04-15
 
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2021-03-15 - 2021-04-15
Usage reported in Hours, memory is in MiB hours
--------------------------------------------------------------------------------
 
Cluster|Account|User|Name|Resource|Usage
o2|rccg|rp189|Potami|cpu|315
o2|rccg|rp189|Potami|mem|423467
o2|rccg|rp189|Potami|gres/gpu|12

The usage is reported in CPU hours, MiB hours, and GPU hours.

If the User executing the query is the PI responsible for the Slurm Account (Lab), then the O2usage script will report the utilization for the entire Lab.
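Because the O2usage data rows are pipe-delimited, a single resource can be totaled with awk. This sketch assumes the six-field layout shown above (Cluster|Account|User|Name|Resource|Usage) and sums across all users in the output. The function name usage_for is hypothetical.

```shell
# Hypothetical helper: total the Usage column for one resource.
# $1: resource name (cpu, mem, or gres/gpu)
# Reads O2usage output on stdin; the banner and header lines are
# skipped because they do not have the requested resource in field 5.
usage_for() {
  awk -F'|' -v r="$1" 'NF == 6 && $5 == r { sum += $6 } END { print sum + 0 }'
}

# On an O2 login node:
#   O2usage 2021-03-15 2021-04-15 | usage_for cpu
```

For a PI, whose report covers the entire lab, this gives the lab-wide total for that resource.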
