Run Bash Script As Slurm Pipeline through rcbio/1.0

This page shows you how to run a regular bash script as a pipeline. The runAsPipeline script, accessible through the rcbio/1.0 module, converts an input bash script to a pipeline that easily submits jobs to the Slurm scheduler for you.

Features of the new pipeline:

Submit each step as a cluster job using sbatch.
Automatically arrange dependencies among jobs.
Email notifications are sent when each job fails or succeeds.
If a job fails, all of its downstream jobs are automatically killed.
When re-running the pipeline on the same data folder, the user is asked whether to kill any unfinished jobs.
When re-running the pipeline on the same data folder, the user is asked whether to re-run each step that already finished successfully.

Please read below for an example.

Log on to O2

If you need help connecting to O2, please review the Using Slurm Basic wiki page.

From Windows, use the graphical PuTTY program to connect to o2.hms.harvard.edu and make sure the port is set to the default value of 22.

From a Mac Terminal, use the ssh command, inserting your eCommons ID instead of user123:

ssh user123@o2.hms.harvard.edu

Start interactive job, and create working folder

For example, user abc123 would start an interactive session and then create and enter a working directory like this:

srun --pty -p interactive -t 0-12:0:0 --mem 2000MB -n 1 /bin/bash
mkdir /n/scratch3/users/a/abc123/testRunBashScriptAsSlurmPipeline  
cd /n/scratch3/users/a/abc123/testRunBashScriptAsSlurmPipeline
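The folder path above can also be built from the user name instead of being typed by hand. This is only a minimal sketch: it assumes that the layout seen in the example (/n/scratch3/users/, then the first letter of the user name, then the user name) also applies to your account, and scratch_dir is a hypothetical helper name, not part of rcbio.

```shell
#!/bin/sh
# Hypothetical helper: build the working-folder path for a user name,
# ASSUMING the /n/scratch3/users/<first letter>/<user> layout shown above.
scratch_dir() {
    sd_user=$1
    sd_first=$(printf '%.1s' "$sd_user")    # first character of the user name
    echo "/n/scratch3/users/$sd_first/$sd_user/testRunBashScriptAsSlurmPipeline"
}

# Print the path for the current user (falls back to the example user).
# On O2 you could then run:
#   mkdir -p "$(scratch_dir "$USER")" && cd "$(scratch_dir "$USER")"
scratch_dir "${USER:-abc123}"
```

Check the printed path against your actual scratch location before creating anything.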

Load the pipeline related modules

# This will set up the path and environment variables for the pipeline
module load rcbio/1.0

Build some testing data in the current folder

echo -e "John Paul\nMike Smith\nNick Will\nJulia Johnson\nTom Jones"  > universityA.txt
cp universityA.txt universityB.txt

Take a look at the example files

# this command shows the content of file universityA.txt
cat universityA.txt

# Below is the content of universityA.txt
John Paul
Mike Smith
Nick Will
Julia Johnson
Tom Jones

# this command shows the content of file universityB.txt
cat universityB.txt

# below is the content of universityB.txt
John Paul
Mike Smith
Nick Will
Julia Johnson
Tom Jones

The original bash script

cp /n/app/rcbio/1.0/bin/bash_script_v1.sh .
cat bash_script_v1.sh

# Below is the content of bash_script_v1.sh

#!/bin/sh
for i in A B; do            

    u=university$i.txt   

    grep -H John $u >>  John.txt; grep -H Mike $u >>  Mike.txt       
  
    grep -H Nick $u >>  Nick.txt; grep -H Julia $u >>  Julia.txt
done

cat John.txt Mike.txt Nick.txt Julia.txt > all.txt

How does this bash script work?

The for loop goes through the two university text files. For each file, the first grep line searches for John and Mike, and the second grep line searches for Nick and Julia. After the loop has finished, the cat command merges the results into a single text file, all.txt. This means that the merge step has to wait until both search steps are finished for both files. However, the runAsPipeline workflow builder can't read this script directly. We will need to create a modified bash script that adds parts that explicitly tell the workflow builder the order in which the jobs need to run, among other things.
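To see the ordering constraint in practice, the logic of the original script can be run directly in a throwaway directory, without Slurm. The sketch below repeats the test data and the commands from this page; only the use of mktemp for a scratch directory is an addition.

```shell
#!/bin/sh
# Run the logic of bash_script_v1.sh locally in a temporary directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same test data as on this page
printf 'John Paul\nMike Smith\nNick Will\nJulia Johnson\nTom Jones\n' > universityA.txt
cp universityA.txt universityB.txt

for i in A B; do
    u=university$i.txt
    # First search: John and Mike
    grep -H John "$u" >> John.txt; grep -H Mike "$u" >> Mike.txt
    # Second search: Nick and Julia
    grep -H Nick "$u" >> Nick.txt; grep -H Julia "$u" >> Julia.txt
done

# Merge: only meaningful once every search above has finished
cat John.txt Mike.txt Nick.txt Julia.txt > all.txt
cat all.txt
```

Note that grep John also matches "Julia Johnson", so John.txt receives two lines per input file. If the merge ran before the searches were complete, all.txt would be missing lines, which is why the merge has to depend on both search steps.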

The modified bash script

cp /n/app/rcbio/1.0/bin/bash_script_v2.sh .
cat bash_script_v2.sh

# Below is the content of bash_script_v2.sh
#!/bin/sh

#loopStart,i	
for i in A B; do            
    u=university$i.txt    
            
    #@1,0,find1,u,sbatch -p short -n 1 -t 50:0
    grep -H John $u >>  John.txt; grep -H Mike $u >>  Mike.txt        
     
    #@2,0,find2,u,sbatch -p short -n 1 -t 50:0
    grep -H Nick $u >>  Nick.txt; grep -H Julia $u >>  Julia.txt

#loopEnd                     
done

#@3,1.2,merge     
cat John.txt Mike.txt Nick.txt Julia.txt > all.txt

Notice that there are a few things added to the script here:

Before the loop starts, #loopStart,i was added. Here i is the looping variable, which will be recognized by the pipeline runner.
Just before the done that closes the loop, #loopEnd was added. This will also be recognized by the pipeline runner.
Step 1 is denoted by #@1,0,find1,u,sbatch -p short -n 1 -t 50:0, which means this is step 1, it depends on no other step, it is named find1, and file $u needs to be copied to the /tmp directory. The sbatch part tells the pipeline runner which sbatch command to use for this step.
Step 2 is denoted by #@2,0,find2,u,sbatch -p short -n 1 -t 50:0, which means this is step 2, it depends on no other step, it is named find2, and file $u needs to be copied to the /tmp directory. The sbatch part tells the pipeline runner which sbatch command to use for this step.
Step 3 is denoted by #@3,1.2,merge, which means this is step 3, it depends on step 1 and step 2, and it is named merge. Notice that there is no sbatch part here, so the pipeline runner will use the default sbatch command (see below).

Notice that the format of a step annotation is #@stepID,dependIDs,stepName,reference,sbatchOptions. The reference is optional; it allows the pipeline runner to copy data (a file or folder) to the local /tmp folder on the computing node to speed up the software. sbatchOptions is also optional; when it is missing, the pipeline runner uses the default sbatch command given on the command line (see below).

Here are two more examples:

#@4,1.3,map,,sbatch -p short -n 1 -t 50:0 means that step 4 depends on step 1 and step 3, is named map, has no reference data to copy, and is submitted with sbatch -p short -n 1 -t 50:0.

#@3,1.2,align,db1.db2 means that step 3 depends on step 1 and step 2, is named align, has $db1 and $db2 as reference data to be copied to /tmp, and is submitted with the default sbatch command (see below).
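The annotation format can be made concrete by splitting the two examples above into their fields. This is an illustrative sketch only, not the parser that runAsPipeline uses; the function name parse_step and the output layout are hypothetical.

```shell
#!/bin/sh
# Illustrative parser for one step annotation of the form
#   #@stepID,dependIDs,stepName,reference,sbatchOptions
# (NOT the runAsPipeline implementation).
parse_step() {
    body=${1#'#@'}                     # drop the leading "#@"
    # The last variable (opts) receives the rest of the line.
    IFS=, read -r id deps name ref opts <<EOF
$body
EOF
    # Dots separate multiple dependency IDs and multiple reference variables.
    deps=$(echo "$deps" | tr '.' ' ')
    ref=$(echo "$ref" | tr '.' ' ')
    printf 'step=%s depends=%s name=%s reference=%s sbatch=%s\n' \
        "$id" "$deps" "$name" "$ref" "$opts"
}

parse_step '#@4,1.3,map,,sbatch -p short -n 1 -t 50:0'
parse_step '#@3,1.2,align,db1.db2'
```

An empty reference or sbatch field simply comes out empty, which corresponds to "nothing to copy" and "use the default sbatch command" in the descriptions above.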

Test run the modified bash script as a pipeline

runAsPipeline bash_script_v2.sh "sbatch -p short -t 10:0 -n 1" useTmp

This command generates a new bash script named slurmPipeLine.201801100946.sh in the flag folder (201801100946 is the timestamp at which runAsPipeline was invoked). It then test-runs the script: no jobs are actually submitted, and each step is given the placeholder job ID 123. If you append run to the end of the command, the pipeline is actually submitted to the Slurm scheduler.

Ideally, with useTmp, the software runs faster because it reads the database/reference from local /tmp disk space instead of network storage. For this small query the difference is small, and using local /tmp may even be slower. If you don't need /tmp, you can use noTmp.

Sample output from the test run

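Note that in this sample output only step 2 used -t 50:0, and all other steps used the default -t 10:0. The default walltime limit was set in the runAsPipeline command, and the walltime parameter for step 2 was set in the bash_script_v2.sh script. (The job marker for step 1 appears in this output as #@1,0,find1,u, without sbatch options of its own, which is why step 1 also ran with the default.)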

runAsPipeline bash_script_v2.sh "sbatch -p short -t 10:0 -n 1" useTmp

# Below is the output: 
converting bash_script_v2.sh to flag/slurmPipeLine.201801161424.sh

find loopStart: #loopStart,i	

find job marker:
#@1,0,find1,u:     

find job:
grep -H John $u >>  John.txt; grep -H Mike $u >>  Mike.txt        

find job marker:
#@2,0,find2,u,sbatch -p short -n 1 -t 50:0
sbatch options: sbatch -p short -n 1 -t 50:0

find job:
grep -H Nick $u >>  Nick.txt; grep -H Julia $u >>  Julia.txt
find loopend: #loopEnd                     

find job marker:
#@3,1.2,merge:           

find job:
cat John.txt Mike.txt Nick.txt Julia.txt > all.txt
flag/slurmPipeLine.201801161424.sh bash_script_v2.sh is ready to run. Starting to run ...
Running flag/slurmPipeLine.201801161424.sh bash_script_v2.sh
---------------------------------------------------------

step: 1, depends on: 0, job name: find1, flag: find1.A reference: .u
depend on no job
sbatch -p short -t 10:0 -n 1 --nodes=1  -J 1.0.find1.A -o /n/scratch2/kmk34/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.out -e /n/scratch2/kmk34/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.out /n/scratch2/kmk34/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.sh 
# Submitted batch job 123

step: 2, depends on: 0, job name: find2, flag: find2.A reference: .u
depend on no job
sbatch -p short -n 1 -t 50:0 --nodes=1  -J 2.0.find2.A -o /n/scratch2/kmk34/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.A.out -e /n/scratch2/kmk34/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.A.out /n/scratch2/kmk34/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.A.sh 
# Submitted batch job 123

step: 1, depends on: 0, job name: find1, flag: find1.B reference: .u
depend on no job
sbatch -p short -t 10:0 -n 1 --nodes=1  -J 1.0.find1.B -o /n/scratch2/kmk34/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.out -e /n/scratch2/kmk34/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.out /n/scratch2/kmk34/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.sh 
# Submitted batch job 123

step: 2, depends on: 0, job name: find2, flag: find2.B reference: .u
depend on no job
sbatch -p short -n 1 -t 50:0 --nodes=1  -J 2.0.find2.B -o /n/scratch2/kmk34/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.B.out -e /n/scratch2/kmk34/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.B.out /n/scratch2/kmk34/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.B.sh 
# Submitted batch job 123

step: 3, depends on: 1.2, job name: merge, flag: merge reference:
depend on multiple jobs
sbatch -p short -t 10:0 -n 1 --nodes=1 --dependency=afterok:123:123:123:123 -J 3.1.2.merge -o /n/scratch2/kmk34/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.out -e /n/scratch2/kmk34/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.out /n/scratch2/kmk34/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.sh 
# Submitted batch job 123

all submitted jobs:
job_id       depend_on              job_flag  
123         null                  1.0.find1.A
123         null                  2.0.find2.A
123         null                  1.0.find1.B
123         null                  2.0.find2.B
123         ..123.123..123.123    3.1.2.merge
---------------------------------------------------------

Run the modified bash script as a pipeline

Thus far in the example, we have not actually submitted any jobs to the scheduler. To submit the pipeline, you will need to append the run parameter to the command. If run is not specified, test mode is used, which does not submit jobs and shows the placeholder 123 for job IDs in the command's output.

runAsPipeline bash_script_v2.sh "sbatch -p short -t 10:0 -n 1" useTmp run

# Below is the output
converting bash_script_v2.sh to flag/slurmPipeLine.201801101002.run.sh

find loopStart: #loopStart,i

find job marker:
#@1,0,find1,u:

find job:
grep -H John $u >> John.txt; grep -H Mike $u >> Mike.txt

find job marker:
#@2,0,find2,u,sbatch -p short -n 1 -t 50:0
sbatch options: sbatch -p short -n 1 -t 50:0

find job:
grep -H Nick $u >> Nick.txt; grep -H Julia $u >> Julia.txt
find loopend: #loopEnd

find job marker:
#@3,1.2,merge:

find job:
cat John.txt Mike.txt Nick.txt Julia.txt > all.txt
flag/slurmPipeLine.201801101002.run.sh is ready to run. Starting to run ...
Running flag/slurmPipeLine.201801101002.run.sh
---------------------------------------------------------

step: 1, depends on: 0, job name: find1, flag: find1.A reference: .u
depend on no job
sbatch -p short -t 10:0 -n 1 --kill-on-invalid-dep=yes --nodes=1 -J 1.0.find1.A -o /n/scratch2/mfk8/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.out -e /n/scratch2/mfk8/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.out /n/scratch2/mfk8/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.A.sh
# Submitted batch job 8091045

step: 2, depends on: 0, job name: find2, flag: find2.A reference: .u
depend on no job
sbatch -p short -n 1 -t 50:0 --kill-on-invalid-dep=yes --nodes=1 -J 2.0.find2.A -o /n/scratch2/mfk8/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.A.out -e /n/scratch2/mfk8/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.A.out /n/scratch2/mfk8/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.A.sh
# Submitted batch job 8091046

step: 1, depends on: 0, job name: find1, flag: find1.B reference: .u
depend on no job
sbatch -p short -t 10:0 -n 1 --kill-on-invalid-dep=yes --nodes=1 -J 1.0.find1.B -o /n/scratch2/mfk8/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.out -e /n/scratch2/mfk8/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.out /n/scratch2/mfk8/testRunBashScriptAsSlurmPipeline/flag/1.0.find1.B.sh
# Submitted batch job 8091047

step: 2, depends on: 0, job name: find2, flag: find2.B reference: .u
depend on no job
sbatch -p short -n 1 -t 50:0 --kill-on-invalid-dep=yes --nodes=1 -J 2.0.find2.B -o /n/scratch2/mfk8/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.B.out -e /n/scratch2/mfk8/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.B.out /n/scratch2/mfk8/testRunBashScriptAsSlurmPipeline/flag/2.0.find2.B.sh
# Submitted batch job 8091048

step: 3, depends on: 1.2, job name: merge, flag: merge reference:
depend on multiple jobs
sbatch -p short -t 10:0 -n 1 --kill-on-invalid-dep=yes --nodes=1 --dependency=afterok:8091045:8091047:8091046:8091048 -J 3.1.2.merge -o /n/scratch2/mfk8/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.out -e /n/scratch2/mfk8/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.out /n/scratch2/mfk8/testRunBashScriptAsSlurmPipeline/flag/3.1.2.merge.sh
# Submitted batch job 8091049
all submitted jobs:
job_id depend_on job_flag
8091045 null 1.0.find1.A
8091046 null 2.0.find2.A
8091047 null 1.0.find1.B
8091048 null 2.0.find2.B
8091049 ..8091045.8091047..8091046.8091048 3.1.2.merge
---------------------------------------------------------
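In the run above, the merge job is submitted with --dependency=afterok:8091045:8091047:8091046:8091048, so it starts only after all four search jobs have finished successfully. As a minimal sketch of how such an option can be assembled by hand (this is not the code runAsPipeline uses; build_dependency, step1.sh, and merge.sh are hypothetical names):

```shell
#!/bin/sh
# Hypothetical helper: join job IDs into an sbatch afterok dependency option.
build_dependency() {
    dep="--dependency=afterok"
    for job_id in "$@"; do
        dep="$dep:$job_id"
    done
    echo "$dep"
}

# Job IDs taken from the output above
build_dependency 8091045 8091047 8091046 8091048

# With real jobs, the IDs could be captured at submission time, for example:
#   id1=$(sbatch --parsable -p short -t 10:0 -n 1 step1.sh)
#   sbatch -p short -t 10:0 -n 1 $(build_dependency "$id1") merge.sh
```

The afterok condition means that the dependent job runs only if every listed job exits successfully, which matches the pipeline feature that downstream jobs do not run when an upstream job fails.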

Monitoring the jobs

To see the job status (running, pending, etc.), use the command:

squeue -u $USER

You also get two emails for each step, one at the start of the step and one at the end of the step.

Check job logs

You can use the command:

ls -l flag

This command lists all the logs created by the pipeline runner. *.sh files are the Slurm scripts for each step, *.out files are the output files for each step, *.success files mean that a step finished successfully, and *.failed files mean that a step failed.
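For a quick overview, the flag files can be turned into a per-step status list. The sketch below is a hypothetical helper, not part of rcbio; it assumes that the flag files carry the same base names as the job flags (for example 1.0.find1.A.success), and it demonstrates itself on a mock folder so that it can be tried anywhere.

```shell
#!/bin/sh
# Hypothetical helper: list steps by the flag files found in a folder.
# ASSUMES files are named <job flag>.success or <job flag>.failed.
summarize_flags() {
    sf_dir=${1:-flag}
    for f in "$sf_dir"/*.success; do
        [ -e "$f" ] || continue            # the glob matched nothing
        echo "OK     $(basename "$f" .success)"
    done
    for f in "$sf_dir"/*.failed; do
        [ -e "$f" ] || continue
        echo "FAILED $(basename "$f" .failed)"
    done
}

# Demo on a mock folder; in a real working folder you would run:
#   summarize_flags flag
demo=$(mktemp -d)
touch "$demo/1.0.find1.A.success" "$demo/2.0.find2.A.failed"
summarize_flags "$demo"
```

For a failed step, the matching *.out file in the same folder is the place to look for the error message.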

Re-run the pipeline

You can re-run this command in the same folder:

runAsPipeline bash_script_v2.sh "sbatch -p short -t 10:0 -n 1" useTmp run

This command checks whether the earlier run has finished. If it has not, you are asked whether to kill the running jobs. You are then asked whether to re-run the steps that finished successfully. Type 'y' to re-run a step, or just press the Enter key to skip it.
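The prompt behaviour described above (a typed y means yes, a bare Enter means no) follows a common shell pattern. The following is a minimal sketch of that pattern, not the code inside runAsPipeline; the function name confirm and the prompt wording are hypothetical, and how the real tool treats other answers is not covered here.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the answer is a literal "y".
# An empty answer (just pressing Enter) counts as "no".
confirm() {
    printf '%s (y/N): ' "$1" >&2
    read -r answer
    [ "$answer" = "y" ]
}

# Demo with a piped answer instead of keyboard input
if echo "y" | confirm "Step find1.A finished earlier. Re-run it?"; then
    echo "re-running"
else
    echo "skipping"
fi
```

Making "no" the default is the cautious choice here, because pressing Enter repeatedly can never trigger an unwanted re-run of a finished step.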

To run your own script as a Slurm pipeline

If you have a bash script with multiple steps and you wish to run it as a Slurm pipeline, modify your script by adding the notation that marks the start and end of any loops and the start of any step that you want to submit as an sbatch job. Then you can use runAsPipeline with your modified bash script, as detailed above.

How does it work

In case you wonder how it works, here is a simple example to explain:

For each step in each loop iteration, the pipeline runner creates a file that looks like this (here it is named flag.sh):

#!/bin/bash 
srun -n 1 bash -c "{ echo I am running...; hostname; otherCommands; } && touch flag.success" 
sleep 5 
export SLURM_TIME_FORMAT=relative 
echo Job done. Summary: 
sacct --format=JobID,Submit,Start,End,State,Partition,ReqTRES%30,CPUTime,MaxRSS,NodeList%30 --units=M -j $SLURM_JOBID 
sendJobFinishEmail.sh flag 
[ -f flag.success ] && exit 0 || exit 1

Then submit with:

sbatch -p short -t 10:0 -o flag.out -e flag.out flag.sh

sendJobFinishEmail.sh is in /n/app/rcbio/1.0/bin.
There is a bug in the script. Please change:
[ -f $flag.failed ]
to:
[ ! -f $flag.success ]
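The success-flag pattern used in flag.sh above can be tried locally without Slurm. The sketch below keeps the same shape (a command group followed by && touch of the flag file, then an exit status derived from the flag) but leaves out srun, sacct, and the email step; run_step is a hypothetical name, and uname -n stands in for hostname.

```shell
#!/bin/sh
# Sketch of the success-flag pattern: the flag file is created only if every
# command in the group succeeds, and the result is read back from the flag.
run_step() {
    rs_flag=$1
    rs_cmd=$2
    rm -f "$rs_flag.success"
    sh -c "{ $rs_cmd; } && touch '$rs_flag.success'"
    [ -f "$rs_flag.success" ]          # status 0 only if the flag exists
}

cd "$(mktemp -d)" || exit 1
run_step good "echo I am running...; uname -n" && echo "good: success"
run_step bad  "echo I am running...; false"    || echo "bad: failed"
```

Testing for the absence of the .success file, as in the corrected check above, also catches a step that was killed before it could write any flag at all.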

Let us know if you have any questions. Please include your working folder and the commands you used in your email. Any comments and suggestions are welcome!