Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

...

...

...

...

...

...

...

...

...

...

...

Table of Contents

This page shows you how to run GATK4 using our recently installed Singularity GATK4 container. The runAsPipeline script, accessible through the rcbio/1.0 module, converts the bash script into a pipeline that easily submits jobs to the Slurm scheduler for you.

Features of this pipeline:

  • Given a sample sheet, generate folder structure for data processing

  • Submit each step as a cluster job using sbatch.

  • Automatically arrange dependencies among jobs.

  • Email notifications are sent when each job fails or succeeds.

  • If a job fails, all its downstream jobs automatically are killed.

  • When re-running the pipeline on the same data folder, if there are any unfinished jobs, the user is asked to kill them or not.

  • When re-running the pipeline on the same data folder, the user is asked to confirm to re-run or not if a step was done successfully earlier.

The workflows are downloaded from: https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels and modified to work on O2 slurm cluster.

Notice the original workflow uses reference and annotation files listed in this file: 

https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels/blob/master/gatk4-rna-germline-variant-calling.inputs.json 

We download the genome reference and all annotation files from: 

https://console.cloud.google.com/storage/browser/genomics-public-data/references/Homo_sapiens_assembly19_1000genomes_decoy/ except for the gtf file, which is downloaded from here: https://console.cloud.google.com/storage/browser/gatk-test-data/intervals?project=broad-dsde-outreach

We then modified the json file to this one: 

Code Block
/n/shared_db/singularity/hmsrc-gatk/scripts/gatk4-rna-germline-variant-calling.inputs.template.json
Code Block
Please modified the file to use your own reference and annotation files as you like.

Jump start

Here are the commands to test out the workflow using example data. The whole run need a few hours if the cluster is not busy. 

true
Code Block
linenumbers
ssh user123@o2.hms.harvard.edu

# set up screen software: https://wiki.rc.hms.harvard.edu/pages/viewpage.action?pageId=20676715
cp /n/shared_db/misc/rcbio/data/screenrc.template.txt ~/.screenrc

screen

srun --pty -p interactive -t 0-12:0:0 --mem 16000MB -n 2 /bin/bash

mkdir /n/scratch3/users/${USER:0:1}/$USER/testGATK4
cd /n/scratch3/users/${USER:0:1}/$USER/testGATK4

module load gcc/6.2.0 python/2.7.12 java/jdk-1.8u112 star/2.5.4a

export PATH=/n/shared_db/singularity/hmsrc-gatk/bin:/home/ld32/rcbioDev/bin:/opt/singularity/bin:$PATH

# setup database. Only need run this once. It will setup database in home, so make sure you have at least 5G free space at home.
setupDB.sh

cp /n/shared_db/singularity/hmsrc-gatk/scripts/* .

buildSampleFoldersFromSampleSheet.py sampleSheet.xlsx

runAsPipeline fastqToBamGermline.sh "sbatch -p short --mem 4G -t 1:0:0 -n 1" noTmp run 2>&1 | tee output.log

# check email or use this command to see the workflow progress
squeue -u $USER -o "%.18i %.9P %.28j %.8u %.8T %.10M %.9l %.6D %R %S"

# After all jobs finish run, run this command to start database, and keep it running in background
runDB.sh &     

for json in unmappedBams/group*/*.json; do
	echo working on $json
    [ -f $json.done ] && continue
	java -XX:+UseSerialGC -Dconfig.file=your.conf -jar /n/shared_db/singularity/hmsrc-gatk/cromwell-43.jar run gatk4-rna-germline-variant-calling.wdl -i $json 2>&1 | tee -a $json.log && touch $json.done
done

# rerun the above command if anything does not work

# find all .vcf files and create sym links in final folder. If there are dulplicate files: 
mkdir finalVCFs
ln -s  $PWD/cromwell-executions/RNAseq/*/call-VariantFiltration/execution/*.variant_filtered.vcf.gz finalVCFs/ 2>/dev/null

#merge all CVFs to single VCF file
vcfFiles=`ls finalVCFs/*.variant_filtered.vcf.gz`
module load bcftools/1.9
bcftools merge -o final.vcf --force-samples $vcfFiles

# Stop database
killall runDB.sh

...

If you need help connecting to O2, please review the How to login to O2 wiki page.


From Windows, use MobaXterm or PuTTY to connect to o2.hms.harvard.edu and make sure the port is set to the default value of 22.

...

From a Mac Terminal, use the ssh command, inserting your eCommons ID instead of user123:

Code Block
linenumberstrue
ssh user123@o2.hms.harvard.edu

# set up screen software: https://wiki.rc.hms.harvard.edu/pages/viewpage.action?pageId=20676715
cp /n/shared_db/misc/rcbio/data/screenrc.template.txt ~/.screenrc

screen  # start screen session. For detail: https://wiki.rc.hms.harvard.edu/pages/viewpage.action?pageId=20676715


Start interactive job, and create working folder

Code Block
linenumberstrue
srun --pty -p interactive -t 0-12:0:0 --mem 16000MB -n 2 /bin/bash

mkdir /n/scratch3/users/${USER:0:1}/$USER/testGATK4
cd /n/scratch3/users/${USER:0:1}/$USER/testGATK4


Load the pipeline related modules

Code Block
linenumberstrue
# This will setup the path and environmental variables for the pipeline
module load gcc/6.2.0 python/2.7.12 java/jdk-1.8u112 star/2.5.4a
export PATH=/n/shared_db/singularity/hmsrc-gatk/bin:/home/ld32/rcbioDev/bin:$PATH

# setup database. Only need run this once. It will setup database in home, so make sure you have at least 5G free space at home.
setupDB.sh


Build some testing data in the current folder

Code Block
linenumberstrue
cp /n/shared_db/singularity/hmsrc-gatk/scripts/* .
buildSampleFoldersFromSampleSheet.py sampleSheet.xlsx

...


Take a look at the example files

linenumbers
Code Block
true
# this command shows the content of newly created folders
ls -l group*/*

# Below is out of the ls command
group1/tumor1:

total 74K
lrwxrwxrwx 1 ld32 rccg 120 Jul 19 08:43 lib1_lane1_1.fq -> /n/groups/shared_databases/gatk/ftp.broadinstitute.org/bundle/2.8/b37/3.3-0-in/group1/patient1/ERR166302_ch14_16_1.fastq
lrwxrwxrwx 1 ld32 rccg 120 Jul 19 08:43 lib1_lane1_2.fq -> /n/groups/shared_databases/gatk/ftp.broadinstitute.org/bundle/2.8/b37/3.3-0-in/group1/patient1/ERR166302_ch14_16_2.fastq
lrwxrwxrwx 1 ld32 rccg 120 Jul 19 08:43 lib1_lane2_1.fq -> /n/groups/shared_databases/gatk/ftp.broadinstitute.org/bundle/2.8/b37/3.3-0-in/group1/patient1/ERR166303_ch14_16_1.fastq
lrwxrwxrwx 1 ld32 rccg 120 Jul 19 08:43 lib1_lane2_2.fq -> /n/groups/shared_databases/gatk/ftp.broadinstitute.org/bundle/2.8/b37/3.3-0-in/group1/patient1/ERR166303_ch14_16_2.fastq

group2/normal1:
total 8.0K
lrwxrwxrwx 1 ld32 rccg 121 Jul 19 08:43 lib2_lane1_1.fq -> /n/groups/shared_databases/gatk/ftp.broadinstitute.org/bundle/2.8/b37/3.3-0-in/group2/control1//ERR166316_ch14_16_1.fastq
lrwxrwxrwx 1 ld32 rccg 121 Jul 19 08:43 lib2_lane1_2.fq -> /n/groups/shared_databases/gatk/ftp.broadinstitute.org/bundle/2.8/b37/3.3-0-in/group2/control1//ERR166316_ch14_16_2.fastq
lrwxrwxrwx 1 ld32 rccg 121 Jul 19 08:43 lib2_lane2_1.fq -> /n/groups/shared_databases/gatk/ftp.broadinstitute.org/bundle/2.8/b37/3.3-0-in/group2/control1//ERR166317_ch14_16_1.fastq
lrwxrwxrwx 1 ld32 rccg 121 Jul 19 08:43 lib2_lane2_2.fq -> /n/groups/shared_databases/gatk/ftp.broadinstitute.org/bundle/2.8/b37/3.3-0-in/group2/control1//ERR166317_ch14_16_1.fastq


The original bash script

Code Block
linenumberstrue
less fastqToBamGermline.sh

# Here is the content of fastqToBamGermline.sh
#!/bin/sh

module load gcc/6.2.0 python/2.7.12 java/jdk-1.8u112 samtools/1.9 bwa/0.7.17

[ -d group2 ] || { echo group2 is not found. You need at least two groups to run this pipeline; exit 1; }

pwdhere=`pwd`

for group in `ls -d group*/|sed 's|[/]||g'`; do
    echo working on group:  $group
    cd $pwdhere/$group
    inputsams=""
	for sample in `ls -d */|sed 's|[/]||g'`; do
        rm $pwdhere/unmappedBams/$group/$sample.inputBamList.txt 2>/dev/null
        mkdir -p $pwdhere/unmappedBams/$group/$sample
        echo working on sample: $sample
        
for r1 in `ls $sample/*_1.fastq $sample/*_1.fq 2>/dev/null`; do
            readgroup=${r1#*/}
            readgroup=${readgroup%_*}
            r2=${r1%_*}_2${r1##*_1}
            echo working on readgroup: $readgroup
                        [[ -f $r2 ]] || { echo -e "\n\n!!!Warning: read2 file '$r2' not exist, ignore this warning if you are working with single-end data\n\n"; r2=""; }

            #@1,0,fastqToSam,,sbatch -p short -t 0-4 --mem 4G
            java -XX:+UseSerialGC -Xmx8G -jar /n/app/picard/2.8.0/bin/picard-2.8.0.jar FastqToSam \
               	FASTQ=$r1 \
                FASTQ2=$r2 \
                OUTPUT=$pwdhere/unmappedBams/$group/$sample/$readgroup.bam \
                READ_GROUP_NAME=${readgroup##*_} \
                SAMPLE_NAME=$sample \
                LIBRARY_NAME=${readgroup%%_*} \
                PLATFORM_UNIT=H0164ALXX140820.2 \
                PLATFORM=illumina \
                SEQUENCING_CENTER=BI \
                RUN_DATE=2014-08-20T00:00:00-0400
                
			inputsams="$inputsams INPUT=$pwdhere/unmappedBams/$group/$sample/$readgroup.bam"	
        done
        sed "s/inputBamFileNamePath/${pwdhere//\//\\/}\/unmappedBams\/$group\/$sample.bam/" $pwdhere/gatk4-rna-germline-variant-calling.inputs.template.json  > $pwdhere/unmappedBams/$group/$sample.json
        #@2,1,merge,,sbatch -p short -n 1 -t 0-4:0 --mem 12G
        java -Xmx12g -jar $PICARD/picard-2.8.0.jar MergeSamFiles VALIDATION_STRINGENCY=SILENT OUTPUT=$pwdhere/unmappedBams/$group/$sample.bam  $inputsams SORT_ORDER=coordinate && samtools index $pwdhere/unmappedBams/$group/$sample.bam
    done
done

...

These comments are recgnized by our pipeline runner and the command following them are submitted as slurm jobs: 

Code Block
#@1,0,fastqToSam,,sbatch -p short -t 0-4 --mem 4G     # means this is step 1, depends on nothing, and use 4 hours and 4G memory
Code Block
Test run the modified bash script as a pipeline
linenumbers
Code Block
true
runAsPipeline fastqToBamGermline.sh "sbatch -p short -t 10:0 -n 1" useTmp

...

You can use the command:

Code Block
linenumberstrue
squeue -u $USER -o "%.18i %.9P %.28j %.8u %.8T %.10M %.9l %.6D %R %S"

...

You can use the command:

true
Code Block
linenumbers
ls -l flag

This command list all the logs created by the pipeline runner. *.sh files are the slurm scripts for eash step, *.out files are output files for each step, *.success files means job successfully finished for each step and *.failed means job failed for each steps.

...

You can rerun this command in the same folder

Code Block
linenumberstrue
runAsPipeline fastqToBamGermline.sh   "sbatch -p short --mem 4G -t 1:0:0 -n 1" noTmp run 2>&1 | tee output.log

This command will check if the earlier run is finished or not. If not, ask user to kill the running jobs or not, then ask user to rerun the successfully finished steps or not. Click 'y', it will rerun, directly press 'enter' key, it will not rerun. 

Call variance:

true
Code Block
linenumbers
# start database and let it run in background (& allow the command keep running in background) 
runDB.sh &
for json in unmappedBams/group*/*.json; do
	echo working on $json
    [ -f $json.done ] && continue
	java -XX:+UseSerialGC -Dconfig.file=your.conf -jar /n/shared_db/singularity/hmsrc-gatk/cromwell-43.jar run gatk4-rna-germline-variant-calling.wdl -i $json 2>&1 | tee -a $json.log && touch $json.done
done

# rerun the above command if anything does not work

# find all .vcf files and create sym links in final folder. If there are dulplicate files: 
mkdir finalVCFs
ln -s  $PWD/cromwell-executions/RNAseq/*/call-VariantFiltration/execution/*.variant_filtered.vcf.gz finalVCFs/ 2>/dev/null

#merge all CVFs to single VCF file
vcfFiles=`ls finalVCFs/*.variant_filtered.vcf.gz`
module load bcftools/1.9
bcftools merge -o final.vcf --force-samples $vcfFiles

# Stop database
killall runDB.sh

...