...

The workflows were downloaded from https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels and modified to run on the O2 Slurm cluster.

Note that the original workflow uses the reference and annotation files listed in this file:

https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels/blob/master/gatk4-rna-germline-variant-calling.inputs.json 

We downloaded the genome reference and all annotation files from:

https://console.cloud.google.com/storage/browser/genomics-public-data/references/Homo_sapiens_assembly19_1000genomes_decoy/

The one exception is the GTF file, which we downloaded from: https://console.cloud.google.com/storage/browser/gatk-test-data/intervals?project=broad-dsde-outreach

We then modified the JSON file to produce this template:

Code Block
/n/shared_db/singularity/hmsrc-gatk/scripts/gatk4-rna-germline-variant-calling.inputs.template.json

...

Code Block
ssh user123@o2.hms.harvard.edu

# set up screen software: https://wiki.rc.hms.harvard.edu/pages/viewpage.action?pageId=20676715
cp /n/shared_db/misc/rcbio/data/screenrc.template.txt ~/.screenrc

screen

srun --pty -p interactive -t 0-12:0:0 --mem 16000MB -n 2 /bin/bash

mkdir -p /n/scratch3/users/${USER:0:1}/$USER/testGATK4
cd /n/scratch3/users/${USER:0:1}/$USER/testGATK4

module load gcc/6.2.0 python/2.7.12 java/jdk-1.8u112 star/2.5.4a

export PATH=/n/shared_db/singularity/hmsrc-gatk/bin:/home/ld32/rcbioDev/bin:/opt/singularity/bin:$PATH

# Set up the database. This only needs to be run once. It creates the database in your home directory, so make sure you have at least 5 GB of free space there.
setupDB.sh

cp /n/shared_db/singularity/hmsrc-gatk/scripts/* .

buildSampleFoldersFromSampleSheet.py sampleSheet.xlsx

runAsPipeline fastqToBamGermline.sh "sbatch -p short --mem 4G -t 1:0:0 -n 1" noTmp run 2>&1 | tee output.log

# check email or use this command to see the workflow progress
squeue -u $USER -o "%.18i %.9P %.28j %.8u %.8T %.10M %.9l %.6D %R %S"

# After all jobs have finished, start the database and keep it running in the background
runDB.sh &

# With pipefail, a failed java run is not masked by tee, so .done is only created on success
set -o pipefail
for json in unmappedBams/group*/*.json; do
    echo "working on $json"
    [ -f "$json.done" ] && continue
    java -XX:+UseSerialGC -Dconfig.file=your.conf -jar /n/shared_db/singularity/hmsrc-gatk/cromwell-43.jar run gatk4-rna-germline-variant-calling.wdl -i "$json" 2>&1 | tee -a "$json.log" && touch "$json.done"
done

# Rerun the loop above if any workflow fails; inputs that already have a .done file are skipped

# Find all .vcf files and create symlinks in the finalVCFs folder. If there are duplicate files: 
mkdir finalVCFs
ln -s  $PWD/cromwell-executions/RNAseq/*/call-VariantFiltration/execution/*.variant_filtered.vcf.gz finalVCFs/ 2>/dev/null

# Merge all VCFs into a single VCF file
module load bcftools/1.9
bcftools merge -o final.vcf --force-samples finalVCFs/*.variant_filtered.vcf.gz

# Stop database
killall runDB.sh
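
The Cromwell loop above skips any input that already has a `.done` marker file, so it can be rerun safely. Before rerunning, it can help to see which inputs are still outstanding. The following is a minimal sketch, not part of the shared scripts: the function name `list_pending` is hypothetical, and it assumes the same `unmappedBams/group*/*.json` layout and `.done` naming used in the loop.

```shell
# List Cromwell input JSON files that have no ".done" marker yet.
# Usage: list_pending [base_dir]   (base_dir defaults to unmappedBams)
# NOTE: list_pending is a hypothetical helper, not one of the shared scripts.
list_pending() {
    local base="${1:-unmappedBams}"
    local json
    for json in "$base"/group*/*.json; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$json" ] || continue
        # Print only the inputs whose marker file is missing.
        [ -f "$json.done" ] || printf '%s\n' "$json"
    done
}

# Example: count the inputs that still need to run.
# list_pending unmappedBams | wc -l
```

If the function prints nothing, every input has a marker and the loop has nothing left to do.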

...

From Windows, use MobaXterm or PuTTY to connect to o2.hms.harvard.edu and make sure the port is set to the default value of 22.

...

Note that only step 2 used -t 50:0; all other steps used the default -t 10:0. The default walltime limit was set in the runAsPipeline command, and the walltime for step 2 was set in bash_script_v2.sh.

Run the pipeline

Thus far in the example, we have not actually submitted any jobs to the scheduler. To submit the pipeline, append the run parameter to the command. If run is not specified, test mode is used: no jobs are submitted, and the command's output shows a placeholder of 123 for the job IDs.
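
The difference between the two modes is only the trailing `run` argument. The sketch below prints both forms of the command rather than executing them. The helper name `pipeline_cmd` is hypothetical. The script name and the 10-minute default walltime come from the text above, while the partition, core count, and `noTmp` argument are assumptions borrowed from the earlier runAsPipeline call and may differ in your setup.

```shell
# Print the runAsPipeline command for test mode (default) or run mode.
# Usage: pipeline_cmd [run]
# NOTE: pipeline_cmd is a hypothetical helper used only for illustration.
pipeline_cmd() {
    local mode="${1:-}"
    # Assumed sbatch options: -p short and -n 1; noTmp is also an assumption.
    local cmd='runAsPipeline bash_script_v2.sh "sbatch -p short -t 10:0 -n 1" noTmp'
    if [ "$mode" = "run" ]; then
        # Appending "run" is what submits the jobs to the scheduler.
        cmd="$cmd run"
    fi
    printf '%s\n' "$cmd"
}

pipeline_cmd        # test mode: nothing is submitted
pipeline_cmd run    # run mode: jobs are submitted
```

Running the test-mode form first is a cheap way to review the generated job commands before committing cluster time.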

...