This page shows you how to run GATK4 using our recently installed Singularity GATK4 container. The runAsPipeline script, accessible through the rcbio/1.0 module, converts a bash script into a pipeline and submits its jobs to the Slurm scheduler for you.

...

The workflows were downloaded from https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels and modified to work on the O2 Slurm cluster.

Note that the original workflow uses the reference and annotation files listed in this file:

https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels/blob/master/gatk4-rna-germline-variant-calling.inputs.json

We downloaded the genome reference and all annotation files, except for the GTF file, from:

https://console.cloud.google.com/storage/browser/genomics-public-data/references/Homo_sapiens_assembly19_1000genomes_decoy/

The GTF file was downloaded from:

https://console.cloud.google.com/storage/browser/gatk-test-data/intervals?project=broad-dsde-outreach

We then modified the JSON file to produce this one:

Code Block
/n/shared_db/singularity/hmsrc-gatk/scripts/gatk4-rna-germline-variant-calling.inputs.template.json

...

Code Block

ssh user123@o2.hms.harvard.edu

# set up screen software: https://wiki.rc.hms.harvard.edu/pages/viewpage.action?pageId=20676715
cp /n/shared_db/misc/rcbio/data/screenrc.template.txt ~/.screenrc

screen

srun --pty -p interactive -t 0-12:0:0 --mem 16000MB -n 2 /bin/bash

mkdir /n/scratch/users/${USER:0:1}/$USER/testGATK4
cd /n/scratch/users/${USER:0:1}/$USER/testGATK4

module load gcc/6.2.0 python/2.7.12 java/jdk-1.8u112 star/2.5.4a

export PATH=/n/shared_db/singularity/hmsrc-gatk/bin:/home/ld32/rcbioDev/bin:/opt/singularity/bin:$PATH

# Set up the database. You only need to run this once. It sets up the database in your home directory, so make sure you have at least 5 GB of free space there.
setupDB.sh

cp /n/shared_db/singularity/hmsrc-gatk/scripts/* .

buildSampleFoldersFromSampleSheet.py sampleSheet.xlsx

runAsPipeline fastqToBamGermline.sh "sbatch -p short --mem 4G -t 1:0:0 -n 1" noTmp run 2>&1 | tee output.log

# check email or use this command to see the workflow progress
squeue -u $USER -o "%.18i %.9P %.28j %.8u %.8T %.10M %.9l %.6D %R %S"

# After all jobs have finished, run this command to start the database and keep it running in the background
runDB.sh &

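for json in unmappedBams/group*/*.json; do
    echo "working on $json"
    # skip inputs that already completed successfully
    [ -f "$json.done" ] && continue
    java -XX:+UseSerialGC -Dconfig.file=your.conf -jar /n/shared_db/singularity/hmsrc-gatk/cromwell-43.jar run gatk4-rna-germline-variant-calling.wdl -i "$json" 2>&1 | tee -a "$json.log"
    # mark the input as done only if java (not tee) exited successfully
    [ "${PIPESTATUS[0]}" -eq 0 ] && touch "$json.done"
done

# Rerun the loop above if any run fails; inputs that already have a .done marker are skipped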

# Find all filtered .vcf.gz files and create symlinks in the finalVCFs folder. Errors from duplicate file names are suppressed by 2>/dev/null.
mkdir finalVCFs
ln -s  $PWD/cromwell-executions/RNAseq/*/call-VariantFiltration/execution/*.variant_filtered.vcf.gz finalVCFs/ 2>/dev/null

# Merge all VCFs into a single VCF file
vcfFiles=$(ls finalVCFs/*.variant_filtered.vcf.gz)
module load bcftools/1.9
bcftools merge -o final.vcf --force-samples $vcfFiles

# Stop database
killall runDB.sh
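
The Cromwell loop above leaves a `.done` marker next to each input JSON it has processed, so you can check what is still outstanding before rerunning it. The following is a minimal sketch, assuming the `unmappedBams/group*/*.json` layout used in the loop; `list_pending_inputs` is a hypothetical helper name and is not part of the GATK container or the rcbio module.

```shell
# Hypothetical helper: print every input JSON under the given folder
# that does not yet have a matching .done marker.
list_pending_inputs() {
    for json in "$1"/group*/*.json; do
        [ -e "$json" ] || continue          # the glob matched nothing
        [ -f "$json.done" ] || echo "$json" # no marker yet: still pending
    done
}
```

From the working folder, `list_pending_inputs unmappedBams` prints one pending input per line and prints nothing once every input has a marker.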

...

From Windows, use MobaXterm or PuTTY to connect to o2.hms.harvard.edu and make sure the port is set to the default value of 22.

...

From a Mac Terminal, use the ssh command, inserting your eCommons HMS ID instead of user123:

Code Block

ssh user123@o2.hms.harvard.edu

# set up screen software: https://wiki.rc.hms.harvard.edu/pages/viewpage.action?pageId=20676715
cp /n/shared_db/misc/rcbio/data/screenrc.template.txt ~/.screenrc

screen  # start a screen session. For details: https://wiki.rc.hms.harvard.edu/pages/viewpage.action?pageId=20676715

...

Code Block
srun --pty -p interactive -t 0-12:0:0 --mem 16000MB -n 2 /bin/bash
mkdir /n/scratch/users/${USER:0:1}/$USER/testGATK4
cd /n/scratch/users/${USER:0:1}/$USER/testGATK4

...

Code Block

# This sets up the path and environment variables for the pipeline
module load gcc/6.2.0 python/2.7.12 java/jdk-1.8u112 star/2.5.4a rcbio/1.3.3
export PATH=/n/shared_db/singularity/hmsrc-gatk/bin:/home/ld32/rcbioDev/bin:$PATH

# Set up the database. You only need to run this once. It sets up the database in your home directory, so make sure you have at least 5 GB of free space there.
setupDB.sh
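
Because the database is created in your home directory, it can help to check the free space before running setupDB.sh. The following is a rough sketch, assuming that `df` reflects the space actually available to you; if your home directory is subject to a per-user quota, the real limit may be lower than what `df` reports. `free_gb` is a hypothetical helper name.

```shell
# Hypothetical helper: print the free space, in whole GB, of the
# filesystem that holds the given directory.
free_gb() {
    # df -Pk prints one POSIX-format line per filesystem;
    # column 4 is the available space in 1K blocks.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

required_gb=5
available_gb=$(free_gb "$HOME")
if [ "${available_gb:-0}" -lt "$required_gb" ]; then
    echo "Only ${available_gb:-0} GB free in $HOME; setupDB.sh needs at least ${required_gb} GB" >&2
else
    echo "${available_gb} GB free in $HOME; enough to run setupDB.sh"
fi
```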

...

Note that only step 2 used -t 50:0, and all other steps used the default -t 10:0. The default walltime limit was set in the runAsPipeline command, and the walltime parameter for step 2 was set in bash_script_v2.sh.
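
Slurm reads a `-t` value as minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, or days-hours:minutes:seconds. So -t 50:0 means 50 minutes, -t 10:0 means 10 minutes, and -t 0-12:0:0 means 12 hours. The sketch below converts such a value to seconds to make the comparison explicit; `slurm_time_to_seconds` is a hypothetical helper written for illustration, not a Slurm or rcbio command.

```shell
# Hypothetical helper: convert a Slurm walltime string to seconds.
slurm_time_to_seconds() {
    echo "$1" | awk '
    {
        days = 0; rest = $0
        dash = index(rest, "-")
        if (dash > 0) {                 # split off the leading "days-" part
            days = substr(rest, 1, dash - 1)
            rest = substr(rest, dash + 1)
        }
        n = split(rest, f, ":")
        if (dash > 0) {                 # D-H, D-H:M or D-H:M:S
            h = f[1]; m = (n >= 2) ? f[2] : 0; s = (n >= 3) ? f[3] : 0
        } else if (n == 1) {            # M
            h = 0; m = f[1]; s = 0
        } else if (n == 2) {            # M:S
            h = 0; m = f[1]; s = f[2]
        } else {                        # H:M:S
            h = f[1]; m = f[2]; s = f[3]
        }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 50:0      # walltime used by step 2
slurm_time_to_seconds 10:0      # default walltime for the other steps
```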

Run the pipeline

Thus far in the example, we have not actually submitted any jobs to the scheduler. To submit the pipeline, you need to append the run parameter to the command. If run is not specified, test mode is used, which does not submit jobs and shows a placeholder of 123 for the job IDs in the command's output.
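
The difference between the two modes can be sketched as a small wrapper around the runAsPipeline command used on this page. This is an illustration only: `run_pipeline` is a hypothetical function name, and the script name and sbatch options are copied from the GATK4 example earlier on this page, so adjust them to your own pipeline.

```shell
# Hypothetical wrapper: call with no argument for test mode,
# or with "run" to actually submit the jobs.
run_pipeline() {
    mode=${1:-}
    if ! command -v runAsPipeline >/dev/null 2>&1; then
        echo "runAsPipeline not found; load the rcbio module first" >&2
        return 127
    fi
    if [ "$mode" = "run" ]; then
        # run mode: jobs are submitted to the scheduler
        runAsPipeline fastqToBamGermline.sh "sbatch -p short --mem 4G -t 1:0:0 -n 1" noTmp run
    else
        # test mode: nothing is submitted
        runAsPipeline fastqToBamGermline.sh "sbatch -p short --mem 4G -t 1:0:0 -n 1" noTmp
    fi
}
```

Calling `run_pipeline` first lets you review the planned jobs, and `run_pipeline run` then submits them.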

...
