Get information about current and past jobs



RC created two simplified commands, O2squeue and O2_jobs_report, based on the Slurm commands squeue and sacct. You can use them to gather information about your active (pending or running) jobs and your past jobs.

As usual, feel free to contact rchelp@hms.harvard.edu with any questions about the reports from these commands.

O2squeue

This command is based on the Slurm command squeue and returns information about your jobs that are currently pending or running. For example:

login05:~ O2squeue
JOBID     PARTITION    STATE    TIME_LIMIT  TIME     NODELIST(REASON)  ELIGIBLE_TIME        START_TIME           TRES_ALLOC
21801263  interactive  RUNNING  12:00:00    2:09:52  compute-a-16-160  2020-11-09T11:35:49  2020-11-09T11:36:19  cpu=1,mem=2G,node=1,billing=1

 

O2squeue accepts the argument R or PD to list only running or only pending jobs, respectively.
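If you want to post-process O2squeue output in a script, a minimal Python sketch like the one below may help. It assumes the output is whitespace-separated with one header line, as in the example above, and that no column value contains spaces; the function name parse_o2squeue is hypothetical and not part of O2squeue.

```python
# Sample text copied from the O2squeue example above; real output may differ.
SAMPLE = """\
JOBID PARTITION STATE TIME_LIMIT TIME NODELIST(REASON) ELIGIBLE_TIME START_TIME TRES_ALLOC
21801263 interactive RUNNING 12:00:00 2:09:52 compute-a-16-160 2020-11-09T11:35:49 2020-11-09T11:36:19 cpu=1,mem=2G,node=1,billing=1
"""


def parse_o2squeue(text):
    """Turn whitespace-separated O2squeue output into a list of dicts.

    Assumption: the first non-empty line is the header and no field
    contains spaces.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


jobs = parse_o2squeue(SAMPLE)
running = [job["JOBID"] for job in jobs if job["STATE"] == "RUNNING"]
print(running)
```

In practice you would replace SAMPLE with text captured from the real command.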


State

The STATE field will normally be either PENDING or RUNNING.  For other job codes, see the “JOB STATE CODES” section of the Slurm squeue page.

Nodelist (Reason)

When a job is running, NODELIST(REASON) lists the node (or nodes, for a parallel MPI job) that the job is running on.

When a job is pending, NODELIST(REASON) describes the reason why the job is pending. The reasons are explained below. Some of the most common reasons are in bold. Many of the other reasons will only apply if you submitted the job with a special QOS (Quality of Service) or job dependency or reservation.

BadConstraints: The job's constraints cannot be satisfied.

Dependency: This job is waiting for a dependent job to complete. (See the --dependency option on the Slurm sbatch page.)

InvalidQOS: The job's QOS is invalid.

JobHeldAdmin: The job is held (forced to pend) by a system administrator.

JobHeldUser: The job is held by the user.

None: The job has not been evaluated yet by the scheduler. (This can happen if the scheduler is working through a huge batch of submitted jobs.)

Priority: One or more higher priority jobs exist for this partition or advanced reservation. Over time, your job's priority will gradually increase, so your job should eventually run.

QOSJobLimit: The job's QOS has reached its maximum job count.

QOSResourceLimit: The job's QOS has reached some resource limit.

QOSTimeLimit: The job's QOS has reached its time limit.

ReqNodeNotAvail: Some node explicitly required by the job submission is not currently available.

Reservation: The job is waiting for its advanced reservation to become available.

Resources: The job is waiting for resources to become available.
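As an illustration of how a script might use this list, the sketch below maps a few pending reasons to short paraphrases of the explanations above. The table is deliberately partial, the helper name explain_reason is hypothetical, and the handling of surrounding parentheses is an assumption about how the reason may be displayed.

```python
# Short paraphrases of a few reasons from the list above (not exhaustive).
REASONS = {
    "Dependency": "waiting for a dependent job to complete",
    "JobHeldUser": "held by the user",
    "Priority": "higher priority jobs are ahead in this partition",
    "Resources": "waiting for resources to become available",
}


def explain_reason(field):
    """Explain a pending NODELIST(REASON) value.

    Surrounding parentheses are stripped if present (assumption).
    """
    reason = field.strip().strip("()")
    return REASONS.get(reason, "see the list of reasons for " + reason)


print(explain_reason("(Priority)"))
print(explain_reason("QOSJobLimit"))
```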

Eligible Time

The field ELIGIBLE_TIME indicates the time when a job becomes eligible to be dispatched. This is usually the submit time, unless the job cannot be dispatched immediately, for example because of job dependencies or because requested resources are unavailable (such as specific nodes undergoing a planned outage).

Start Time

For running jobs, START_TIME indicates the time when the job was dispatched. For pending jobs, it indicates the expected start time. Note that the expected start time is only calculated for the first few pending jobs of each user, and it is, in general, an upper bound value, assuming that all jobs will run for their maximum time.
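For a running job, the difference between START_TIME and ELIGIBLE_TIME gives the time the job spent waiting once it was eligible. The sketch below computes it for the timestamps in the O2squeue example, assuming both fields use the YYYY-MM-DDTHH:MM:SS format shown there; the function name is hypothetical, and values that are not timestamps are not handled.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def seconds_between(eligible_time, start_time):
    """Seconds elapsed from ELIGIBLE_TIME to START_TIME."""
    eligible = datetime.strptime(eligible_time, TIME_FORMAT)
    started = datetime.strptime(start_time, TIME_FORMAT)
    return (started - eligible).total_seconds()


# Timestamps taken from the O2squeue example output.
wait = seconds_between("2020-11-09T11:35:49", "2020-11-09T11:36:19")
print(wait)  # 30.0
```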

TRES

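TRES_ALLOC lists the trackable resources (TRES) allocated to the job, such as CPUs, memory, and nodes. These are the resources requested with flags like -c (or --cpus-per-task) and --mem in the job submission command or sbatch script.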
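The TRES_ALLOC value in the O2squeue example is a comma-separated list of name=value pairs. A minimal sketch for splitting it into a dictionary follows; the helper name parse_tres is hypothetical, and values are kept as strings because units such as G are not converted.

```python
def parse_tres(tres):
    """Split a TRES string such as 'cpu=1,mem=2G' into a dict of strings."""
    return dict(item.split("=", 1) for item in tres.split(",") if item)


# Value taken from the O2squeue example output.
alloc = parse_tres("cpu=1,mem=2G,node=1,billing=1")
print(alloc["cpu"], alloc["mem"])
```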

 



O2_jobs_report

O2_jobs_report is based on the Slurm command sacct and can be used to query the Slurm database for information on your past jobs.

The command reports information including CPU (compute time), Memory (RAM), and WallTime efficiency. It can print information for each individual job or an overall summary report. It is also possible to select specific dates, job IDs, job names, partitions, and job states.

The RC team checks job efficiency only for jobs that are marked as COMPLETED by the Slurm scheduler. To query only COMPLETED jobs with O2_jobs_report, add the flag --state=COMPLETED.

By default, the tool will only show jobs starting from midnight of the previous day. So if you run the command at 9 am on Thursday, you’ll get the data for 24 hours on Wednesday PLUS the first 9 hours of Thursday.

Use --start or --lastdays to specify a custom time range if you are looking for older jobs.
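To make the default window concrete, the sketch below reproduces the arithmetic described above: the window starts at midnight at the beginning of the previous day. It only illustrates the date calculation and is not the tool's actual implementation; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def default_window_start(now):
    """Midnight at the start of the day before `now`."""
    midnight_today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight_today - timedelta(days=1)


now = datetime(2023, 3, 16, 9, 0)  # a Thursday at 9 am
start = default_window_start(now)
hours_covered = (now - start).total_seconds() / 3600
print(start, hours_covered)  # 2023-03-15 00:00:00 33.0
```

The 33 hours correspond to the 24 hours of Wednesday plus the first 9 hours of Thursday.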

You can customize your query using the flags described below. To see all the available options from the O2 shell, you can run O2_jobs_report -h

-j JOBID, --jobid JOBID The specific jobid numbers; can be multiple comma-separated jobids with no spaces, for example --jobid=123,456,78
-s START, --start START The desired start date for the query; the date must be entered using the format YYYY-MM-DD. This flag is not compatible with --lastdays
--lastdays LASTDAYS Query jobs from the previous LASTDAYS days. This is equivalent to using --start=YYYY-MM-DD with the desired start date. This flag is not compatible with --start
-e END, --end END Specify an end date for the query; the default end date is tomorrow.
--account ACCOUNT Specify your entire Slurm account. The report will include jobs from every user in your Lab.
--jobname JOBNAME Specify a Slurm job name for the query (that you submitted with sbatch -J); this flag can be used with a comma-separated list of jobnames with no spaces
--state STATE Specify a list of job states describing how jobs ended. Possible options are CANCELLED, COMPLETED, FAILED, NODE_FAIL, OUT_OF_MEMORY, PREEMPTED, and TIMEOUT. This flag can be used with a comma-separated list of job states with no spaces, for example --state=COMPLETED,FAILED
-p PARTITION, --partition PARTITION Jobs submitted to specific partitions; you can specify multiple comma-separated partitions with no spaces.
--report Print a summary report instead of detailed information for each job
--verbose Use this flag to display the verbose information for each job directly as it is returned from the Slurm sacct command
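If you call O2_jobs_report from a script, a small helper can enforce the rules stated above: --start and --lastdays are mutually exclusive, and list values are joined with commas and no spaces. The sketch below only builds the argument list using flags from the table; the function name and its parameters are hypothetical, and passing each value as a separate token follows the "--flag VALUE" form in the help text.

```python
def build_jobs_report_args(start=None, lastdays=None, states=(), jobids=(), report=False):
    """Build an O2_jobs_report argument list from the documented flags."""
    if start is not None and lastdays is not None:
        raise ValueError("--start and --lastdays cannot be combined")
    args = ["O2_jobs_report"]
    if start is not None:
        args += ["--start", start]  # expected format: YYYY-MM-DD
    if lastdays is not None:
        args += ["--lastdays", str(lastdays)]
    if states:
        args += ["--state", ",".join(states)]  # comma-separated, no spaces
    if jobids:
        args += ["--jobid", ",".join(str(j) for j in jobids)]
    if report:
        args.append("--report")
    return args


print(" ".join(build_jobs_report_args(lastdays=7, states=["COMPLETED", "FAILED"], report=True)))
```

The resulting list could be passed to subprocess.run on an O2 login node.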

 

By default, O2_jobs_report will report information for each job, for example:

login04:~ O2_jobs_report
JOBID    USER  ACCOUNT  PARTITION  STATE      STARTTIME   WALLTIME(hr)  nCPU,RAM(GB),nGPU  PENDINGTIME(hr)  CPU_EFF(%)  RAM_EFF(%)  WALLTIME_EFF(%)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
4992748  rc    rccg     transfer   COMPLETED  2023-03-16  24.0          1,2.0,0            0.0              57.1        0.0         0.1
5013407  rc    rccg     transfer   COMPLETED  2023-03-16  24.0          1,2.0,0            0.01             0.0         0.0         0.0
5045324  rc    rccg     transfer   COMPLETED  2023-03-16  24.0          1,2.0,0            0.02             0.0         0.0         0.0
5077214  rc    rccg     transfer   COMPLETED  2023-03-17  24.0          1,2.0,0            0.01             33.7        0.0         0.1
5100444  rc    rccg     transfer   COMPLETED  2023-03-17  24.0          1,2.0,0            0.02             0.0         0.0         0.0
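The per-job rows can be filtered in a script to spot jobs with low efficiency. The sketch below uses the five data rows from the example above and assumes twelve whitespace-separated columns in the order shown in the header; the shortened column names, the function name, and the 10% threshold are arbitrary choices for illustration.

```python
# Data rows copied from the O2_jobs_report example above.
SAMPLE_ROWS = """\
4992748 rc rccg transfer COMPLETED 2023-03-16 24.0 1,2.0,0 0.0 57.1 0.0 0.1
5013407 rc rccg transfer COMPLETED 2023-03-16 24.0 1,2.0,0 0.01 0.0 0.0 0.0
5045324 rc rccg transfer COMPLETED 2023-03-16 24.0 1,2.0,0 0.02 0.0 0.0 0.0
5077214 rc rccg transfer COMPLETED 2023-03-17 24.0 1,2.0,0 0.01 33.7 0.0 0.1
5100444 rc rccg transfer COMPLETED 2023-03-17 24.0 1,2.0,0 0.02 0.0 0.0 0.0
"""

# Shortened names for the twelve columns, in header order (assumption).
COLUMNS = [
    "JOBID", "USER", "ACCOUNT", "PARTITION", "STATE", "STARTTIME",
    "WALLTIME_HR", "RESOURCES", "PENDING_HR", "CPU_EFF", "RAM_EFF", "WALLTIME_EFF",
]


def low_cpu_jobs(text, threshold=10.0):
    """Return the JOBIDs whose CPU efficiency (%) is below `threshold`."""
    jobs = [dict(zip(COLUMNS, line.split())) for line in text.splitlines() if line.strip()]
    return [job["JOBID"] for job in jobs if float(job["CPU_EFF"]) < threshold]


print(low_cpu_jobs(SAMPLE_ROWS))
```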

 

However, it is possible to see a summary report by using the flag --report, for example:

login04:~ O2_jobs_report --report

JOBS STATES COUNT FROM 2023-03-16 TO 2023-03-18
==========  ===========
USERNAME    COMPLETED
==========  ===========
rc          5
==========  ===========

JOBS PARTITIONS COUNT FROM 2023-03-16 TO 2023-03-18
==========  ==========
USERNAME    transfer
==========  ==========
rc          5
==========  ==========

JOBS STATISTICS FROM 2023-03-16 TO 2023-03-18
======  ============  =========================  ===========================  ======================  ==================  ============================  ===================
User    Total Jobs    Median Pending Time(hr)    Average Allocated RAM(GB)    Average Used RAM(GB)    Max Used RAM(GB)    Jobs Using > 1/2 Alloc RAM    RAM Efficiency(%)
======  ============  =========================  ===========================  ======================  ==================  ============================  ===================
rc      5             0.01                       2                            0                       0                   0                             0.03
======  ============  =========================  ===========================  ======================  ==================  ============================  ===================

======  =======================  ===================  =====================  ========================  ===========================
User    Average Allocated CPU    CPU Efficiency(%)    Average Runtime(hr)    WallTime Efficiency(%)    Jobs Using > 1/2 WallTime
======  =======================  ===================  =====================  ========================  ===========================
rc      1                        34.4                 0.01                   0                         0
======  =======================  ===================  =====================  ========================  ===========================

O2 Resource Utilization

We created a simplified script called O2usage, which runs sreport in the background and requires minimal arguments.

The script can be executed from anywhere in O2 and requires two inputs: the starting and ending dates for the desired time interval. Both dates must be in the format YYYY-MM-DD.
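When calling O2usage from a script, it can be useful to check the two dates before running the command. The sketch below validates the YYYY-MM-DD format and the date order, then returns the argument list. The function name is hypothetical, and rejecting an end date earlier than the start date is an assumption about sensible input rather than documented O2usage behavior.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"


def o2usage_args(start, end):
    """Validate two YYYY-MM-DD dates and build the O2usage argument list."""
    start_date = datetime.strptime(start, DATE_FORMAT)  # raises ValueError if malformed
    end_date = datetime.strptime(end, DATE_FORMAT)
    if end_date < start_date:
        raise ValueError("end date is earlier than start date")
    # Re-format so the dates are always zero-padded.
    return ["O2usage", start_date.strftime(DATE_FORMAT), end_date.strftime(DATE_FORMAT)]


print(" ".join(o2usage_args("2021-03-15", "2021-04-15")))
```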

For example:

login05:~ O2usage 2021-03-15 2021-04-15
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2021-03-15 - 2021-04-15
Usage reported in Hours, memory is in MiB hours
--------------------------------------------------------------------------------
Cluster|Account|User|Name|Resource|Usage
o2|rccg|rp189|Potami|cpu|315
o2|rccg|rp189|Potami|mem|423467
o2|rccg|rp189|Potami|gres/gpu|12

The usage is reported in CPU hours, MiB hours, and GPU hours. 
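The pipe-delimited rows in the O2usage example are easy to read with Python's csv module. The sketch below parses the sample rows into a dictionary keyed by resource and converts the memory figure from MiB hours to GiB hours. It assumes the report covers a single user, as in the example; for a Lab-wide report the rows would need to be grouped by user as well. The function name is hypothetical.

```python
import csv
import io

# Header and data rows copied from the O2usage example above.
SAMPLE = """\
Cluster|Account|User|Name|Resource|Usage
o2|rccg|rp189|Potami|cpu|315
o2|rccg|rp189|Potami|mem|423467
o2|rccg|rp189|Potami|gres/gpu|12
"""


def usage_by_resource(text):
    """Map each Resource to its integer Usage (single-user report assumed)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return {row["Resource"]: int(row["Usage"]) for row in reader}


usage = usage_by_resource(SAMPLE)
print(usage["cpu"], "CPU hours")
print(round(usage["mem"] / 1024, 1), "GiB hours of memory")
print(usage["gres/gpu"], "GPU hours")
```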

If the User executing the query is the PI responsible for the Slurm Account (Lab), then the O2usage script will report the utilization for the entire Lab.