This page shows you how to run a regular bash script as a pipeline. The runAsPipeline
script, accessible through the rcbio/1.2
module, converts an input bash script into a pipeline that submits jobs to the Slurm scheduler for you.
Features of the new pipeline:
- Submits each step as a cluster job using sbatch.
- Automatically arranges dependencies among jobs.
- Sends email notifications when each job fails or succeeds.
- If a job fails, all of its downstream jobs are automatically killed.
- When re-running the pipeline on the same data folder, if there are any unfinished jobs, the user is asked whether to kill them.
- When re-running the pipeline on the same data folder, the user is asked to confirm whether to re-run any step that finished successfully earlier.
Please read below for an example.
...
If you need help connecting to O2, please review the Using Slurm Basic and the How to Login to O2 wiki pages.
From Windows, use the graphical PuTTY program to connect to o2.hms.harvard.edu
and make sure the port is set to the default value of 22.
From a Mac Terminal, use the ssh
command, inserting your eCommons ID instead of user123:
```
ssh user123@o2.hms.harvard.edu
```
Start interactive job, and create working folder
For example, for user abc123, the working directory will be
```
srun --pty -p interactive -t 0-12:0:0 --mem 2000MB -c 1 /bin/bash
mkdir /n/scratch3/users/a/abc123/testRunBashScriptAsSlurmPipeline
cd /n/scratch3/users/a/abc123/testRunBashScriptAsSlurmPipeline
```
...
Load the pipeline related modules
```
# This will set up the path and environment variables for the pipeline
module load rcbio/1.2
```
Build some testing data in the current folder
```
echo -e "John Paul\nMike Smith\nNick Will\nJulia Johnson\nTom Jones" > universityA.txt
cp universityA.txt universityB.txt
```
Take a look at the example files
```
# this command shows the content of file universityA.txt
cat universityA.txt

# Below is the content of universityA.txt
John Paul
Mike Smith
Nick Will
Julia Johnson
Tom Jones
```
```
# this command shows the content of file universityB.txt
cat universityB.txt

# Below is the content of universityB.txt
John Paul
Mike Smith
Nick Will
Julia Johnson
Tom Jones
```
The original bash script
```
cp /n/app/rcbio/1.2/bin/bashScriptV1.sh .

# Below is the content of bashScriptV1.sh
cat bashScriptV1.sh

#!/bin/sh
for i in A B; do
    u=university$i.txt
    grep -H John $u >> John.txt;  grep -H Mike $u >> Mike.txt
    grep -H Nick $u >> Nick.txt;  grep -H Julia $u >> Julia.txt
done
cat John.txt Mike.txt Nick.txt Julia.txt > all.txt
```
...
The modified bash script
```
cp /n/app/rcbio/1.2/bin/bashScriptV2.sh .
cat bashScriptV2.sh

# Below is the content of bashScriptV2.sh
#!/bin/sh
for i in A B; do
    u=university$i.txt

    #@1,0,find1,u,sbatch -p short -c 1 -t 50:0
    grep -H John $u >> John.txt;  grep -H Mike $u >> Mike.txt

    #@2,0,find2,u,sbatch -p short -c 1 -t 50:0
    grep -H Nick $u >> Nick.txt;  grep -H Julia $u >> Julia.txt
done

#@3,1.2,merge
cat John.txt Mike.txt Nick.txt Julia.txt > all.txt
```
...
Notice that a few things have been added to the script:
- Step 1 is denoted by #@1,0,find1,u,sbatch -p short -c 1 -t 50:0, which means: this is step 1, it depends on no other step, it is named find1, and the file $u needs to be copied to the /tmp directory on the compute node. The trailing sbatch command tells the pipeline runner how to submit this step.
- Step 2 is denoted by #@2,0,find2,u,sbatch -p short -c 1 -t 50:0, which similarly means: this is step 2, it depends on no other step, it is named find2, and the file $u needs to be copied to the /tmp directory. Again, the trailing sbatch command tells the pipeline runner how to submit this step.
- Step 3 is denoted by #@3,1.2,merge, which means: this is step 3, it depends on step 1 and step 2, and it is named merge. Notice that there is no sbatch command here, so the pipeline runner will use the default sbatch command given on the command line (see below).
The format of a step annotation is #@stepID,dependIDs,stepName,reference,sbatchOptions. The reference field is optional; it allows the pipeline runner to copy data (a file or folder) to the local /tmp folder on the compute node to speed up the software. sbatchOptions is also optional; when it is missing, the pipeline runner will use the default sbatch command given on the command line (see below).
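To make the comma-separated annotation fields concrete, here is a small illustrative parse of one annotation in plain shell. This is not part of the pipeline runner itself; the variable names are just for demonstration:

```shell
# Split a step annotation into its named fields (illustrative only).
annotation='#@1,0,find1,u,sbatch -p short -c 1 -t 50:0'
fields=${annotation#"#@"}      # drop the leading #@ marker
oldIFS=$IFS; IFS=','
set -- $fields                 # split on commas into $1..$5
IFS=$oldIFS
stepID=$1; dependIDs=$2; stepName=$3; reference=$4; sbatchOptions=$5
echo "step=$stepID depends=$dependIDs name=$stepName ref=$reference"
# prints: step=1 depends=0 name=find1 ref=u
echo "sbatch options: $sbatchOptions"
# prints: sbatch options: sbatch -p short -c 1 -t 50:0
```

Note that the sbatchOptions field may itself contain spaces, which is why splitting happens on commas rather than whitespace.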
...
Test run the modified bash script as a pipeline
```
runAsPipeline bashScriptV2.sh "sbatch -p short -t 10:0 -c 1" useTmp
```
...
You can use the command:
```
squeue -u $USER
```
This shows the job status (running, pending, etc.). You also get two emails for each step: one at the start of the step and one at the end.
...
You can use the command:
```
ls -l flag
```
This command lists all the logs created by the pipeline runner. The *.sh files are the Slurm scripts for each step, the *.out files are the output files for each step, a *.success file means that step's job finished successfully, and a *.failed file means that step's job failed.
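As a quick sketch of how you might use these flag files, the loop below tallies finished and failed steps. The flag file names here are made up for the demo and follow the stepID.loopID.stepName pattern described later:

```shell
# Create a mock flag directory in a scratch area with example flag files.
cd "$(mktemp -d)"
mkdir flag
touch flag/1.0.find1.success flag/2.0.find2.success flag/3.0.merge.failed
# Count the outcome flags by listing them and tallying the lines.
succeeded=$(ls flag/*.success 2>/dev/null | wc -l)
failed=$(ls flag/*.failed 2>/dev/null | wc -l)
echo "succeeded: $succeeded failed: $failed"
```

With the mock files above this reports two succeeded steps and one failed step.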
...
You can use this command to cancel all running and pending jobs:
```
cancelAllJobs flag/alljobs.jid
```
...
You can re-run this command in the same folder:
```
runAsPipeline bashScriptV2.sh "sbatch -p short -t 10:0 -c 1" useTmp run
```
...
Re-run a single job manually
```
cd proper/directory
module load rcbio/1.2    # and all other related modules

# submit job with proper partition, time, number of cores and memory
sbatch --requeue --mail-type=ALL -p short -t 2:0:0 -c 2 --mem 2G /working/directory/flag/stepID.loopID.stepName.sh
```
Or:
```
runSingleJob "module load bowtie/1.2.2; bowtie -x /n/groups/shared_databases/bowtie_indexes/hg19 -p 2 -1 read1.fq -2 read2.fq --sam > out.bam" "sbatch -p short -t 1:0:0 -c 2 --mem 8G"
```
For details about the second option, see: Get more informative Slurm email notification and logs through rcbio/1.2
To run your own script as Slurm pipeline
If you have a bash script with multiple steps and you wish to run it as a Slurm pipeline, modify your script to add the annotations that mark the start and end of any loops, and the start of any step that you want to submit as an sbatch
job. Then you can use runAsPipeline
with your modified bash script, as detailed above.
...
For each step in each loop, the pipeline runner creates a script that looks like this (here it is named flag.sh):
```
#!/bin/bash
srun -n 1 bash -c "{ echo I am running...; hostname; otherCommands; } && touch flag.success"
sleep 5
export SLURM_TIME_FORMAT=relative
echo Job done. Summary:
sacct --format=JobID,Submit,Start,End,State,Partition,ReqTRES%30,CPUTime,MaxRSS,NodeList%30 --units=M -j $SLURM_JOBID
sendJobFinishEmail.sh flag
[ -f flag.success ] && exit 0 || exit 1
```
Then submit with:
```
sbatch -p short -t 10:0 -o flag.out -e flag.out flag.sh
```
sendJobFinishEmail.sh is in /n/app/rcbio/1.2/bin/
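The success-flag pattern in the generated script (touch a .success file only if the wrapped command succeeds, then turn the file's presence into the job's exit status) can be demonstrated on its own; the file name here is hypothetical:

```shell
# Run a command; the flag file is created only if the command exits 0.
cd "$(mktemp -d)"
{ echo "I am running..."; true; } && touch demo.success
# A later check converts the flag's presence back into an exit status,
# which is what lets Slurm dependencies see whether the step succeeded.
if [ -f demo.success ]; then status=0; else status=1; fi
echo "exit status would be: $status"
```

Replacing `true` with a failing command would leave demo.success absent, so the step would report failure and its downstream jobs would be killed.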
Let us know if you have any questions by emailing rchelp@hms.harvard.edu. Please include your working folder and the commands used in your email. Any comments and suggestions are welcome!
We have additional example ready-to-run workflows available, which may be of interest to you.