This page shows you how to run a regular bash script as a pipeline. The runAsPipeline script, accessible through the rcbio/1.3.3 module, converts an input bash script into a pipeline that submits jobs to the Slurm scheduler for you.

The pipeline runner submits each step as a Slurm job, handles the dependencies between steps, and keeps per-step logs, flag files, and email notifications, all of which are demonstrated below.

Please read below for an example.

Log on to O2

If you need help connecting to O2, please review the Using Slurm Basic and the How to Login to O2 wiki pages.

From Windows, download and install MobaXterm for Windows: https://mobaxterm.mobatek.net/, then connect to o2.hms.harvard.edu and make sure the port is set to the default value of 22.

From a Mac Terminal, use the ssh command, inserting your HMS Account instead of user123:

ssh user123@o2.hms.harvard.edu

Start an interactive job, and create a working directory

# if you have multiple slurm accounts, you'll have to add in -A or --account=
srun --pty -p interactive -t 0-12:0:0 --mem 2000MB -c 1 /bin/bash
mkdir ~/testRunBashScriptAsSlurmPipeline  
cd ~/testRunBashScriptAsSlurmPipeline

Load the pipeline related modules

# This will setup the path and environment variables for the pipeline
module load rcbio/1.3.3

Build some testing data in the current folder

echo -e "John Paul\nMike Smith\nNick Will\nJulia Johnson\nTom Jones"  > universityA.txt
cp universityA.txt universityB.txt

Take a look at the example files

# this command shows the content of file universityA.txt
cat universityA.txt

# Below is the content of universityA.txt
John Paul
Mike Smith
Nick Will
Julia Johnson
Tom Jones

# this command shows the content of file universityB.txt
cat universityB.txt

# below is the content of universityB.txt
John Paul
Mike Smith
Nick Will
Julia Johnson
Tom Jones

The original bash script

# Use the cp command to make a copy of the script 'bashScriptV1.sh'
cp /n/app/rcbio/1.3.3/bin/bashScriptV1.sh .

# Use cat command to see the content of bashScriptV1.sh
cat bashScriptV1.sh

#!/bin/sh
for i in A B; do            

    u=university$i.txt   

    grep -H John $u >>  John.txt; grep -H Mike $u >>  Mike.txt       
  
    grep -H Nick $u >>  Nick.txt; grep -H Julia $u >>  Julia.txt

done

cat John.txt Mike.txt Nick.txt Julia.txt > all.txt


How does this bash script work?

The script loops over the two university text files (the for loop above). For each file it searches for John and Mike (the first grep line), and then for Nick and Julia (the second grep line). After the loop finishes, the results are merged into a single text file (the final cat command). This means the merge step has to wait until the two earlier search steps are finished. However, the runAsPipeline workflow builder can't read this script directly. We will need to create a modified bash script that adds annotations explicitly telling the workflow builder the order in which the jobs need to run, among other things.
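
The dependency the merge step needs is what runAsPipeline automates for you. As a sketch of what it builds under the hood, the snippet below assembles Slurm's --dependency=afterok option from a list of job IDs; the IDs are made up for illustration, and sbatch is echoed rather than executed, since Slurm is only available on the cluster.

```shell
#!/bin/bash
# Sketch (not rcbio itself): build the dependency option the merge step
# would need if the four search jobs were submitted by hand.
# Job IDs are made up; sbatch is echoed, not run.
ids=(1001 1002 1003 1004)                  # IDs of the four grep jobs
dep="afterok$(printf ':%s' "${ids[@]}")"   # afterok:1001:1002:1003:1004
echo "sbatch -p short -t 10:0 --dependency=$dep merge.sh"
```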

The modified bash script

cp /n/app/rcbio/1.3.3/bin/bashScriptV2.sh .
cat bashScriptV2.sh

# Below is the content of bashScriptV2.sh
#!/bin/sh
for i in A B; do            
    
    u=university$i.txt    
            
    #@1,0,find1,u,sbatch -p short -c 1 -t 50:0
    grep -H John $u >>  John.txt; grep -H Mike $u >>  Mike.txt        
     
    #@2,0,find2,u,sbatch -p short -c 1 -t 50:0
    grep -H Nick $u >>  Nick.txt; grep -H Julia $u >>  Julia.txt
                   
done

#@3,1.2,merge     
cat John.txt Mike.txt Nick.txt Julia.txt > all.txt


Notice the step annotations added to the script. Their format is #@stepID,dependIDs,stepName,reference,sbatchOptions. The reference field is optional; when given, the pipeline runner copies the named data (file or folder) to the local /tmp folder on the compute node to speed up the software. sbatchOptions is also optional; when it is missing, the pipeline runner uses the default sbatch command given on the command line (see below).

Here are two more examples:

#@4,1.3,map,,sbatch -p short -c 1 -t 50:0   means step 4 depends on steps 1 and 3, the step is named map, there is no reference data to copy, and the step is submitted with sbatch -p short -c 1 -t 50:0

#@3,1.2,align,db1.db2   means step 3 depends on steps 1 and 2, the step is named align, $db1 and $db2 are reference data to be copied to /tmp, and the step is submitted with the default sbatch command (see below).
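
To see how a marker decomposes into its fields, you can split it on commas in plain bash. This is only an illustration of the annotation format, not how rcbio parses it internally:

```shell
#!/bin/bash
# Split a step annotation into its five fields. The last variable in
# read keeps all remaining text, so the sbatch options survive intact.
marker='#@4,1.3,map,,sbatch -p short -c 1 -t 50:0'
IFS=',' read -r stepID dependIDs stepName reference sbatchOptions <<< "${marker#\#@}"
echo "stepID=$stepID dependIDs=$dependIDs stepName=$stepName"
echo "reference=${reference:-none}"
echo "sbatchOptions=${sbatchOptions:-default sbatch command}"
```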

Test run the modified bash script as a pipeline

runAsPipeline bashScriptV2.sh "sbatch -p short -t 10:0 -c 1" useTmp

This command will generate a new bash script of the form slurmPipeLine.checksum.sh in the flag folder. The checksum portion of the filename is an MD5 hash of the script contents; including it in the filename lets the runner detect when the script contents have been updated.
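
The idea is easy to reproduce with md5sum (standard coreutils on Linux): the hash, and therefore the generated filename, changes as soon as the script content changes. demo.sh below is a throwaway file, not part of the pipeline:

```shell
#!/bin/bash
# The converted script's name embeds a hash of the input script, so any
# edit to the input produces a different name and forces a re-conversion.
echo 'grep -H John universityA.txt' > demo.sh
h1=$(md5sum demo.sh | cut -d' ' -f1)
echo 'grep -H Mike universityA.txt' > demo.sh   # edit the script
h2=$(md5sum demo.sh | cut -d' ' -f1)
[ "$h1" != "$h2" ] && echo "content changed: slurmPipeLine.$h2.sh would be regenerated"
rm demo.sh
```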

This runAsPipeline command performs a test run of the script, meaning it does not really submit jobs. It only shows fake job IDs like 1234 for each step. If you were to append run at the end of the command, the pipeline would actually be submitted to the Slurm scheduler.

Ideally, with useTmp, software that reads a database or reference should run faster from local /tmp disk space than from network storage. For this small query the difference is negligible, and using local /tmp may even be slower. If you don't need /tmp, you can use noTmp.

With useTmp, the pipeline runner copies the referenced data to /tmp, and all file paths in the commands are automatically updated to point at the copies in /tmp.
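
A minimal local sketch of the idea, where mktemp stands in for the compute node's /tmp and plain cp stands in for the runner's rsyncToTmp helper:

```shell
#!/bin/bash
# Stage a reference file on local scratch, then run the command
# against the local copy instead of the networked original.
scratch=$(mktemp -d)                 # stands in for the node's /tmp
echo "John Paul" > universityDemo.txt
cp universityDemo.txt "$scratch/"    # the runner uses rsyncToTmp here
grep -H John "$scratch/universityDemo.txt"
rm -r "$scratch" universityDemo.txt
```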

Sample output from the test run

Note that steps 1 and 2 used -t 50:0, while step 3 used the default -t 10:0. The default walltime limit was set in the runAsPipeline command, and the walltime for steps 1 and 2 was set in the bashScriptV2.sh script.

runAsPipeline bashScriptV2.sh "sbatch -p short -t 10:0 -c 1" useTmp

Fri Sep 24 09:46:15 EDT 2021
Running: /n/app/rcbio/1.3.3/bin/runAsPipeline bashScriptV2.sh sbatch -p short -t 10:0 -c 1 useTmp

Currently Loaded Modules:
  1) rcbio/1.3.3

converting bashScriptV2.sh to flag/slurmPipeLine.a855454a70b2198fa5b2643bb1d41762.sh

find loop start: for i in A B; do

find job marker:
#@1,0,find1,u,sbatch -p short -c 1 -t 50:0
sbatch options: sbatch -p short -c 1 -t 50:0

find job:
grep -H John $u >>  John.txt; grep -H Mike $u >>  Mike.txt

find job marker:
#@2,0,find2,u,sbatch -p short -c 1 -t 50:0
sbatch options: sbatch -p short -c 1 -t 50:0

find job:
grep -H Nick $u >>  Nick.txt; grep -H Julia $u >>  Julia.txt
find loop end: done

find job marker:
#@3,1.2,merge

find job:
cat John.txt Mike.txt Nick.txt Julia.txt > all.txt
flag/slurmPipeLine.a855454a70b2198fa5b2643bb1d41762.sh bashScriptV2.sh is ready to run. Starting to run ...
Running flag/slurmPipeLine.a855454a70b2198fa5b2643bb1d41762.sh bashScriptV2.sh

Currently Loaded Modules:
  1) rcbio/1.3.3

---------------------------------------------------------

step: 1, depends on: 0, job name: find1, flag: find1.A reference: .u
depend on no job
sbatch -p short -c 1 -t 50:0 --requeue --nodes=1  -J 1.0.find1.A -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.sh
# This is testing, so no job is submitted. In real run it should submit job such as: Submitted batch job 1349

step: 2, depends on: 0, job name: find2, flag: find2.A reference: .u
depend on no job
sbatch -p short -c 1 -t 50:0 --requeue --nodes=1  -J 2.0.find2.A -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.A.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.A.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.A.sh
# This is testing, so no job is submitted. In real run it should submit job such as: Submitted batch job 1560

step: 1, depends on: 0, job name: find1, flag: find1.B reference: .u
depend on no job
sbatch -p short -c 1 -t 50:0 --requeue --nodes=1  -J 1.0.find1.B -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.sh
# This is testing, so no job is submitted. In real run it should submit job such as: Submitted batch job 1766

step: 2, depends on: 0, job name: find2, flag: find2.B reference: .u
depend on no job
sbatch -p short -c 1 -t 50:0 --requeue --nodes=1  -J 2.0.find2.B -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.B.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.B.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.B.sh
# This is testing, so no job is submitted. In real run it should submit job such as: Submitted batch job 1970

step: 3, depends on: 1.2, job name: merge , flag: merge reference:
depend on multiple jobs
sbatch -p short -t 10:0 -c 1 --requeue --nodes=1 --dependency=afterok:1349:1766:1560:1970 -J 3.1.2.merge -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.sh
# This is testing, so no job is submitted. In real run it should submit job such as: Submitted batch job 2172

All submitted jobs:
job_id       depend_on              job_flag
1349        null                  1.0.find1.A
1560        null                  2.0.find2.A
1766        null                  1.0.find1.B
1970        null                  2.0.find2.B
2172        ..1349.1766..1560.1970  3.1.2.merge
---------------------------------------------------------
Note: This is just a test run, so no job is actually submitted. In real run it should submit jobs and report as above.

Run the modified bash script as a pipeline

Thus far in the example, we have not actually submitted any jobs to the scheduler. To submit the pipeline, you will need to append the run parameter to the command. If run is not specified, test mode is used, which does not submit jobs and shows the placeholder 1234 for job IDs in the command's output.

runAsPipeline bashScriptV2.sh "sbatch -p short -t 10:0 -c 1" useTmp run

# Below is the output
Fri Sep 24 09:48:12 EDT 2021
Running: /n/app/rcbio/1.3.3/bin/runAsPipeline bashScriptV2.sh sbatch -p short -t 10:0 -c 1 useTmp run

Currently Loaded Modules:
  1) rcbio/1.3.3

converting bashScriptV2.sh to flag/slurmPipeLine.a855454a70b2198fa5b2643bb1d41762.run.sh

find loop start: for i in A B; do

find job marker:
#@1,0,find1,u,sbatch -p short -c 1 -t 50:0
sbatch options: sbatch -p short -c 1 -t 50:0

find job:
grep -H John $u >>  John.txt; grep -H Mike $u >>  Mike.txt

find job marker:
#@2,0,find2,u,sbatch -p short -c 1 -t 50:0
sbatch options: sbatch -p short -c 1 -t 50:0

find job:
grep -H Nick $u >>  Nick.txt; grep -H Julia $u >>  Julia.txt
find loop end: done

find job marker:
#@3,1.2,merge

find job:
cat John.txt Mike.txt Nick.txt Julia.txt > all.txt
flag/slurmPipeLine.a855454a70b2198fa5b2643bb1d41762.run.sh bashScriptV2.sh is ready to run. Starting to run ...
Running flag/slurmPipeLine.a855454a70b2198fa5b2643bb1d41762.run.sh bashScriptV2.sh

Currently Loaded Modules:
  1) rcbio/1.3.3

Could not find any jobs to cancel.
---------------------------------------------------------

step: 1, depends on: 0, job name: find1, flag: find1.A reference: .u
depend on no job
sbatch -p short -c 1 -t 50:0 --requeue --nodes=1  -J 1.0.find1.A -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.sh
# Submitted batch job 41208893

step: 2, depends on: 0, job name: find2, flag: find2.A reference: .u
depend on no job
sbatch -p short -c 1 -t 50:0 --requeue --nodes=1  -J 2.0.find2.A -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.A.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.A.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.A.sh
# Submitted batch job 41208894

step: 1, depends on: 0, job name: find1, flag: find1.B reference: .u
depend on no job
sbatch -p short -c 1 -t 50:0 --requeue --nodes=1  -J 1.0.find1.B -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.sh
# Submitted batch job 41208895

step: 2, depends on: 0, job name: find2, flag: find2.B reference: .u
depend on no job
sbatch -p short -c 1 -t 50:0 --requeue --nodes=1  -J 2.0.find2.B -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.B.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.B.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.B.sh
# Submitted batch job 41208898

step: 3, depends on: 1.2, job name: merge , flag: merge reference:
depend on multiple jobs
sbatch -p short -t 10:0 -c 1 --requeue --nodes=1 --dependency=afterok:41208893:41208895:41208894:41208898 -J 3.1.2.merge -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.sh
# Submitted batch job 41208899

All submitted jobs:
job_id       depend_on              job_flag
41208893    null                  1.0.find1.A
41208894    null                  2.0.find2.A
41208895    null                  1.0.find1.B
41208898    null                  2.0.find2.B
41208899    ..41208893.41208895..41208894.41208898  3.1.2.merge
---------------------------------------------------------

Monitoring the jobs

You can use the command:

O2squeue -u $USER

To see the job status (running, pending, etc.). You also get two emails for each step, one at the start of the step, one at the end of the step.

Successful job email

Email subject: Success: job id:41208893 name:1.0.find1.A

Email content:

Job script content:
#!/bin/bash
#Commands:
trap "{ cleanup.sh /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A; }" EXIT
touch /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.start
srun -n 1 bash -e -c "{ set -e; rsyncToTmp  /tmp/rcbio/universityA.txt; grep -H John /tmp/rcbio/universityA.txt >>  John.txt; grep -H Mike /tmp/rcbio/universityA.txt >>  Mike.txt        ; } && touch /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.success || touch /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.failed"

#sbatch command:
#sbatch -p short -c 1 -t 50:0 --requeue --nodes=1  -J 1.0.find1.A -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.sh

# Submitted batch job 41208893
Job output:
Working to copy: /tmp/rcbio/universityA.txt, waiting lock...
Got lock: /tmp/-tmp-rcbio-universityA.txt. Copying data to: /tmp/rcbio/universityA.txt
Copying is done for /tmp/rcbio/universityA.txt
Job done. Summary:
       JobID              Submit               Start                 End      State  Partition              ReqTRES  Timelimit    CPUTime     MaxRSS                       NodeList
------------ ------------------- ------------------- ------------------- ---------- ---------- -------------------- ---------- ---------- ---------- ------------------------------
41208893     2021-09-24T09:48:13 2021-09-24T09:48:24             Unknown    RUNNING      short billing=1,cpu=1,mem+   00:50:00   00:00:10                          compute-e-16-180
41208893.ba+ 2021-09-24T09:48:24 2021-09-24T09:48:24             Unknown    RUNNING                                              00:00:10                          compute-e-16-180
41208893.ex+ 2021-09-24T09:48:24 2021-09-24T09:48:24             Unknown    RUNNING                                              00:00:10                          compute-e-16-180
41208893.0   2021-09-24T09:48:29 2021-09-24T09:48:29 2021-09-24T09:48:29  COMPLETED                                              00:00:00          0               compute-e-16-180
*Notice in the sacct report above: the main job still shows RUNNING at the time sacct is called, but the user task (job step .0) is already COMPLETED.

The key elements to check are the time and memory used.

Check job logs

You can use the command:

ls -l flag

This command lists all the logs created by the pipeline runner: *.sh files are the Slurm scripts for each step, *.out files are the output files for each step, a *.success file means that step's job finished successfully, and a *.failed file means that step's job failed.
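
The flag files lend themselves to a quick status summary. The loop below is a hypothetical helper, not part of rcbio; it creates a throwaway flagdemo folder so it can run anywhere:

```shell
#!/bin/bash
# Hypothetical summary of a flag folder: one line per step flag,
# derived from the *.success / *.failed files the runner creates.
mkdir -p flagdemo
touch flagdemo/1.0.find1.A.success flagdemo/2.0.find2.A.failed
for f in flagdemo/*.success flagdemo/*.failed; do
    [ -e "$f" ] || continue                 # skip unmatched globs
    step=$(basename "$f"); step=${step%.*}  # strip .success/.failed
    case $f in
        *.success) echo "$step: OK" ;;
        *.failed)  echo "$step: FAILED" ;;
    esac
done
rm -r flagdemo
```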


Cancel all jobs

You can use the command to cancel running and pending jobs:

cancelAllJobs flag/alljobs.jid
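
Presumably cancelAllJobs reads the job IDs recorded in flag/alljobs.jid and cancels each one. The sketch below illustrates that idea with a made-up jid file: one ID per line is an assumption about the format, and scancel is echoed rather than executed, since Slurm is only available on the cluster.

```shell
#!/bin/bash
# Sketch: cancel every job ID listed in a jid file.
# One-ID-per-line format is an assumption; scancel is echoed, not run.
printf '41208893\n41208894\n' > demo.jid
while read -r jid; do
    [ -n "$jid" ] && echo "scancel $jid"
done < demo.jid
rm demo.jid
```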

What happens if there is some error?

You can re-run the pipeline command in the same folder. Here we will delete an input file to see what happens.

# We are intentionally removing an input file to see a "failed job" email message
rm universityB.txt
runAsPipeline bashScriptV2.sh "sbatch -p short -t 10:0 -c 1" useTmp run

# Here is the output
Fri Sep 24 10:00:36 EDT 2021
Running: /n/app/rcbio/1.3.3/bin/runAsPipeline bashScriptV2.sh sbatch -p short -t 10:0 -c 1 useTmp run

Currently Loaded Modules:
  1) rcbio/1.3.3

This is a re-run with the same command and script is not changed, no need to convert the script. Using the old one: flag/slurmPipeLine.a855454a70b2198fa5b2643bb1d41762.run.sh
Running flag/slurmPipeLine.a855454a70b2198fa5b2643bb1d41762.run.sh bashScriptV2.sh

Currently Loaded Modules:
  1) rcbio/1.3.3

Could not find any jobs to cancel.
---------------------------------------------------------

step: 1, depends on: 0, job name: find1, flag: find1.A reference: .u
depend on no job
1.0.find1.A was done before, do you want to re-run it?
y:        To re-run this job, press y, then enter key.
ystep:    To re-run all jobs for step 3: hisatCount, type yall, then press enter key.
yall:     To re-run all jobs, type yallall, then press enter key.
enter:    To not re-run this job, directly press enter key.
nstep:    To not re-run all successful jobs for step 3: hisatCount, type nall, then press enter key.
nall:     To not re-run all successful jobs, type nallall, then press enter key.

# type enter here to not re-run

step: 2, depends on: 0, job name: find2, flag: find2.A reference: .u
depend on no job
2.0.find2.A was done before, do you want to re-run it?
y:        To re-run this job, press y, then enter key.
ystep:    To re-run all jobs for step 3: hisatCount, type yall, then press enter key.
yall:     To re-run all jobs, type yallall, then press enter key.
enter:    To not re-run this job, directly press enter key.
nstep:    To not re-run all successful jobs for step 3: hisatCount, type nall, then press enter key.
nall:     To not re-run all successful jobs, type nallall, then press enter key.

# type enter here to not re-run

job 2.0.find2.A is not submitted

step: 1, depends on: 0, job name: find1, flag: find1.B reference: .u
depend on no job
1.0.find1.B was done before, do you want to re-run it?
y:        To re-run this job, press y, then enter key.
ystep:    To re-run all jobs for step 3: hisatCount, type yall, then press enter key.
yall:     To re-run all jobs, type yallall, then press enter key.
enter:    To not re-run this job, directly press enter key.
nstep:    To not re-run all successful jobs for step 3: hisatCount, type nall, then press enter key.
nall:     To not re-run all successful jobs, type nallall, then press enter key.

# type 'y' and enter here to re-run

Will re-run the down stream steps even if they are done before.
sbatch -p short -c 1 -t 50:0 --requeue --nodes=1  -J 1.0.find1.B -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.sh
# Submitted batch job 41209197

step: 2, depends on: 0, job name: find2, flag: find2.B reference: .u
depend on no job
2.0.find2.B was done before, do you want to re-run it?
y:        To re-run this job, press y, then enter key.
ystep:    To re-run all jobs for step 3: hisatCount, type yall, then press enter key.
yall:     To re-run all jobs, type yallall, then press enter key.
enter:    To not re-run this job, directly press enter key.
nstep:    To not re-run all successful jobs for step 3: hisatCount, type nall, then press enter key.
nall:     To not re-run all successful jobs, type nallall, then press enter key.

# type enter here to not re-run

job 2.0.find2.B is not submitted

step: 3, depends on: 1.2, job name: merge , flag: merge reference:
depend on other jobs
sbatch -p short -t 10:0 -c 1 --requeue --nodes=1 --dependency=afterok:41209197 -J 3.1.2.merge -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.sh
# Submitted batch job 41209210

# Notice above: rcbio did not ask whether to re-run step 3; it re-ran it directly, because its upstream step was re-run.

All submitted jobs:
job_id       depend_on              job_flag
41209197    null                  1.0.find1.B
41209210    ..41209197.           3.1.2.merge
---------------------------------------------------------

This command checks whether the earlier run has finished. If it has not, the script asks whether to kill the running jobs, and then asks whether to re-run the steps that already finished successfully. Type 'y' then Enter to re-run a step; press Enter alone to skip it.

Failed job email

Email subject: Failed: job id:41209197 name:1.0.find1.B

Email content:
Job script content:
#!/bin/bash
#Commands:
trap "{ cleanup.sh /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B; }" EXIT
touch /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.start
srun -n 1 bash -e -c "{ set -e; rsyncToTmp  /tmp/rcbio/universityB.txt; grep -H John /tmp/rcbio/universityB.txt >>  John.txt; grep -H Mike /tmp/rcbio/universityB.txt >>  Mike.txt        ; } && touch /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.success || touch /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.failed"

#sbatch command:
#sbatch -p short -c 1 -t 50:0 --requeue --nodes=1  -J 1.0.find1.B -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.sh

# Submitted batch job 41209197
Job output:
Working to copy: /tmp/rcbio/universityB.txt, waiting lock...
Reference file or folder not exist: /universityB.txt
grep: /tmp/rcbio/universityB.txt: No such file or directory
grep: /tmp/rcbio/universityB.txt: No such file or directory
Job done. Summary:
       JobID              Submit               Start                 End      State  Partition              ReqTRES  Timelimit    CPUTime     MaxRSS                       NodeList
------------ ------------------- ------------------- ------------------- ---------- ---------- -------------------- ---------- ---------- ---------- ------------------------------
41209197     2021-09-24T10:02:43 2021-09-24T10:03:09             Unknown    RUNNING      short billing=1,cpu=1,mem+   00:50:00   00:00:09                          compute-e-16-180
41209197.ba+ 2021-09-24T10:03:09 2021-09-24T10:03:09             Unknown    RUNNING                                              00:00:09                          compute-e-16-180
41209197.ex+ 2021-09-24T10:03:09 2021-09-24T10:03:09             Unknown    RUNNING                                              00:00:09                          compute-e-16-180
41209197.0   2021-09-24T10:03:13 2021-09-24T10:03:13 2021-09-24T10:03:13  COMPLETED                                              00:00:00          0               compute-e-16-180
*Notice in the sacct report above: the main job still shows RUNNING at the time sacct is called, but the user task (job step .0) is already COMPLETED.

Fix the error and re-run the pipeline

You can rerun this command in the same folder

cp universityA.txt universityB.txt
runAsPipeline bashScriptV2.sh "sbatch -p short -t 10:0 -c 1" useTmp run

This command automatically checks whether the earlier run has finished. If it has not, the script asks the user whether to kill the running jobs, and then whether to re-run the steps that already finished successfully. Type 'y' then Enter to re-run a step; press Enter alone to skip it.

Notice that step 3 runs by default here, without prompting the user, because its upstream steps were re-run.

What happens if we add more input data and re-run the pipeline?

You can add new input data, update the script, and re-run in the same folder:

cp universityA.txt universityC.txt
cp bashScriptV2.sh bashScriptV3.sh 
nano bashScriptV3.sh  
# change
for i in A B; do
to: 
for i in A B C; do

# save the file and run:
runAsPipeline bashScriptV3.sh "sbatch -p short -t 10:0 -c 1" useTmp run

# Here is the output:
Fri Sep 24 10:56:16 EDT 2021
Running: /n/app/rcbio/1.3.3/bin/runAsPipeline bashScriptV3.sh sbatch -p short -t 10:0 -c 1 useTmp run

Currently Loaded Modules:
  1) rcbio/1.3.3

converting bashScriptV3.sh to flag/slurmPipeLine.b72e7f91da30d312a2c85d0735896f79.run.sh

find loop start: for i in A B C; do

find job marker:
#@1,0,find1,u,sbatch -p short -c 1 -t 50:0
sbatch options: sbatch -p short -c 1 -t 50:0

find job:
grep -H John $u >>  John.txt; grep -H Mike $u >>  Mike.txt

find job marker:
#@2,0,find2,u,sbatch -p short -c 1 -t 50:0
sbatch options: sbatch -p short -c 1 -t 50:0

find job:
grep -H Nick $u >>  Nick.txt; grep -H Julia $u >>  Julia.txt
find loop end: done

find job marker:
#@3,1.2,merge

find job:
cat John.txt Mike.txt Nick.txt Julia.txt > all.txt
flag/slurmPipeLine.b72e7f91da30d312a2c85d0735896f79.run.sh bashScriptV3.sh is ready to run. Starting to run ...
Running flag/slurmPipeLine.b72e7f91da30d312a2c85d0735896f79.run.sh bashScriptV3.sh

Currently Loaded Modules:
  1) rcbio/1.3.3


Could not find any jobs to cancel.
---------------------------------------------------------

step: 1, depends on: 0, job name: find1, flag: find1.A reference: .u
depend on no job
1.0.find1.A was done before, do you want to re-run it?
y:        To re-run this job, press y, then enter key.
ystep:    To re-run all jobs for step 3: hisatCount, type yall, then press enter key.
yall:     To re-run all jobs, type yallall, then press enter key.
enter:    To not re-run this job, directly press enter key.
nstep:    To not re-run all successful jobs for step 3: hisatCount, type nall, then press enter key.
nall:     To not re-run all successful jobs, type nallall, then press enter key.

# type enter here to not re-run

job 1.0.find1.A is not submitted

step: 2, depends on: 0, job name: find2, flag: find2.A reference: .u
depend on no job
2.0.find2.A was done before, do you want to re-run it?
y:        To re-run this job, press y, then enter key.
ystep:    To re-run all jobs for step 3: hisatCount, type yall, then press enter key.
yall:     To re-run all jobs, type yallall, then press enter key.
enter:    To not re-run this job, directly press enter key.
nstep:    To not re-run all successful jobs for step 3: hisatCount, type nall, then press enter key.
nall:     To not re-run all successful jobs, type nallall, then press enter key.

# type enter here to not re-run

job 2.0.find2.A is not submitted

step: 1, depends on: 0, job name: find1, flag: find1.B reference: .u
depend on no job
1.0.find1.B was done before, do you want to re-run it?
y:        To re-run this job, press y, then enter key.
ystep:    To re-run all jobs for step 3: hisatCount, type yall, then press enter key.
yall:     To re-run all jobs, type yallall, then press enter key.
enter:    To not re-run this job, directly press enter key.
nstep:    To not re-run all successful jobs for step 3: hisatCount, type nall, then press enter key.
nall:     To not re-run all successful jobs, type nallall, then press enter key.

# type enter here to not re-run

job 1.0.find1.B is not submitted

step: 2, depends on: 0, job name: find2, flag: find2.B reference: .u
depend on no job
2.0.find2.B was done before, do you want to re-run it?
y:        To re-run this job, press y, then enter key.
ystep:    To re-run all jobs for step 3: hisatCount, type yall, then press enter key.
yall:     To re-run all jobs, type yallall, then press enter key.
enter:    To not re-run this job, directly press enter key.
nstep:    To not re-run all successful jobs for step 3: hisatCount, type nall, then press enter key.
nall:     To not re-run all successful jobs, type nallall, then press enter key.

# type enter here to not re-run

job 2.0.find2.B is not submitted

step: 1, depends on: 0, job name: find1, flag: find1.C reference: .u
depend on no job
sbatch -p short -c 1 -t 50:0 --requeue --nodes=1  -J 1.0.find1.C -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.C.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.C.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.C.sh
# Submitted batch job 41211380

step: 2, depends on: 0, job name: find2, flag: find2.C reference: .u
depend on no job
sbatch -p short -c 1 -t 50:0 --requeue --nodes=1  -J 2.0.find2.C -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.C.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.C.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.C.sh
# Submitted batch job 41211381

step: 3, depends on: 1.2, job name: merge , flag: merge reference:
depend on multiple jobs
sbatch -p short -t 10:0 -c 1 --requeue --nodes=1 --dependency=afterok:41211380:41211381 -J 3.1.2.merge -o /home/ld32/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.out -e /home/ld32/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.out /home/ld32/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.sh
# Submitted batch job 41211382

All submitted jobs:
job_id       depend_on              job_flag
41211380    null                  1.0.find1.C
41211381    null                  2.0.find2.C
41211382    ..41211380..41211381  3.1.2.merge
---------------------------------------------------------

This command checks whether the earlier run has finished, and prompts the user about killing any still-running jobs. It then asks the user whether to re-run any steps that already finished successfully. Type 'y' then Enter to re-run a step; press Enter alone to skip it.

For the new data, rcbio submits two new jobs. Step 3 also still runs automatically.

Re-run a single job manually

# /working/directory is a placeholder, replace it with your actual working directory path
cd /working/directory
# all/related/modules is a placeholder, replace it with the actual other modules/versions you need
module load rcbio/1.3.3 all/related/modules

# submit job with proper partition, time, number of cores and memory
sbatch --requeue --mail-type=ALL -p short -t 2:0:0 -c 2 --mem 2G /working/directory/flag/stepID.loopID.stepName.sh

Or:
runSingleJob "module load bowtie/1.2.2; bowtie -x /n/groups/shared_databases/bowtie_indexes/hg19 -p 2 -1 read1.fq -2 read2.fq --sam > out.bam" "sbatch -p short -t 1:0:0 -c 2 --mem 8G"

For details about the second option: Get more informative slurm email notification and logs through rcbio/1.3 

To run your own script as a Slurm pipeline

If you have a bash script with multiple steps and you wish to run it as a Slurm pipeline, annotate each step with a #@stepID,dependIDs,stepName,reference,sbatchOptions marker as shown above, then run the annotated script through runAsPipeline.

How does the runAsPipeline RCBio pipeline runner work?

In case you wonder how it works, here is a simple example.

For each step per loop, the pipeline runner creates a file that looks like the one below. (Here it is named flag.sh): 

#!/bin/bash 
srun -n 1 bash -c "{ echo I am running...; hostname; otherCommands; } && touch flag.success" 
sleep 5 
export SLURM_TIME_FORMAT=relative 
echo Job done. Summary: 
sacct --format=JobID,Submit,Start,End,State,Partition,ReqTRES%30,CPUTime,MaxRSS,NodeList%30 --units=M -j $SLURM_JOBID 
sendJobFinishEmail.sh flag 
[ -f flag.success ] && exit 0 || exit 1 

Your analysis commands are wrapped in an srun so we can monitor whether they completed successfully. If your commands worked (meaning they exited with status 0), the success file is created. Next, sacct is run to get stats for the job step, and a job completion email is sent with sendJobFinishEmail.sh. The sendJobFinishEmail.sh script is available in /n/app/rcbio/1.3.3/bin/ if you are interested in its contents.
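
The wrap-and-touch core of that generated script can be exercised locally, without srun or sacct. This sketch keeps just the success-flag logic:

```shell
#!/bin/bash
# Local analog of a generated step script: run the payload, create a
# .success flag only if it exits 0, then report based on the flag.
flag=demo.step
rm -f "$flag.success"
bash -c "{ echo I am running...; hostname; }" && touch "$flag.success"
if [ -f "$flag.success" ]; then
    echo "step succeeded"
else
    echo "step failed"
fi
rm -f "$flag.success"
```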

Then the job script will be submitted with: 

sbatch -p short -t 10:0 -o flag.out -e flag.out flag.sh

Let us know if you have any questions by emailing rchelp@hms.harvard.edu. Please include your working folder and the commands used in your email. Any comments and suggestions are welcome!

We have additional example ready-to-run workflows available, which may be of interest to you.