This page shows you how to run GATK4 using our recently installed Singularity GATK4 container. The runAsPipeline script, accessible through the rcbio/1.0 module, converts the bash script into a pipeline that easily submits jobs to the Slurm scheduler for you.
...
The workflows were downloaded from https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels and modified to work on the O2 Slurm cluster.
Notice the original workflow uses reference and annotation files listed in this file:
We downloaded the genome reference and all annotation files from:
https://console.cloud.google.com/storage/browser/genomics-public-data/references/Homo_sapiens_assembly19_1000genomes_decoy/ except for the GTF file, which was downloaded from: https://console.cloud.google.com/storage/browser/gatk-test-data/intervals?project=broad-dsde-outreach
We then modified the JSON inputs file, producing this one:
```
/n/shared_db/singularity/hmsrc-gatk/scripts/gatk4-rna-germline-variant-calling.inputs.template.json
```
...
```bash
ssh user123@o2.hms.harvard.edu

# Set up screen software: https://wiki.rc.hms.harvard.edu/pages/viewpage.action?pageId=20676715
cp /n/shared_db/misc/rcbio/data/screenrc.template.txt ~/.screenrc
screen

srun --pty -p interactive -t 0-12:0:0 --mem 16000MB -n 2 /bin/bash
mkdir /n/scratch/users/${USER:0:1}/$USER/testGATK4
cd /n/scratch/users/${USER:0:1}/$USER/testGATK4

module load gcc/6.2.0 python/2.7.12 java/jdk-1.8u112 star/2.5.4a
export PATH=/n/shared_db/singularity/hmsrc-gatk/bin:/home/ld32/rcbioDev/bin:/opt/singularity/bin:$PATH

# Set up the database. You only need to run this once. It sets up the database
# in your home directory, so make sure you have at least 5G of free space there.
setupDB.sh

cp /n/shared_db/singularity/hmsrc-gatk/scripts/* .
buildSampleFoldersFromSampleSheet.py sampleSheet.xlsx
runAsPipeline fastqToBamGermline.sh "sbatch -p short --mem 4G -t 1:0:0 -n 1" noTmp run 2>&1 | tee output.log

# Check email or use this command to see the workflow progress
squeue -u $USER -o "%.18i %.9P %.28j %.8u %.8T %.10M %.9l %.6D %R %S"

# After all jobs finish, start the database and keep it running in the background
runDB.sh &

# With pipefail, a failed java run makes the whole pipeline fail,
# so the .done marker is only created after a successful run
set -o pipefail
for json in unmappedBams/group*/*.json; do
    echo working on $json
    [ -f $json.done ] && continue
    java -XX:+UseSerialGC -Dconfig.file=your.conf -jar /n/shared_db/singularity/hmsrc-gatk/cromwell-43.jar run gatk4-rna-germline-variant-calling.wdl -i $json 2>&1 | tee -a $json.log && touch $json.done
done
# Rerun the above loop if anything does not work

# Find all .vcf files and create symlinks in the finalVCFs folder
# (errors, e.g. from duplicate file names, are discarded)
mkdir finalVCFs
ln -s $PWD/cromwell-executions/RNAseq/*/call-VariantFiltration/execution/*.variant_filtered.vcf.gz finalVCFs/ 2>/dev/null

# Merge all VCFs into a single VCF file
vcfFiles=`ls finalVCFs/*.variant_filtered.vcf.gz`
module load bcftools/1.9
bcftools merge -o final.vcf --force-samples $vcfFiles

# Stop the database
killall runDB.sh
```
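The Cromwell loop above uses one .done marker file per input so that it can be rerun safely. The sketch below isolates that pattern. It is only an illustration: run_step and process_all are hypothetical names, run_step stands in for the java/Cromwell command, and temporary files stand in for the real JSON inputs. It assumes pipefail is enabled, since otherwise the exit status seen by && would be that of tee rather than of the command being logged.

```bash
# Sketch of the ".done marker" pattern (hypothetical helper names).
set -o pipefail   # "cmd | tee" now fails when cmd fails

run_step() {
    # Stand-in for the real command: inputs with "bad" in the name fail.
    case "$1" in
        *bad*) echo "failed on $1"; return 1 ;;
        *)     echo "processed $1" ;;
    esac
}

process_all() {
    local dir=$1 json
    for json in "$dir"/*.json; do
        [ -f "$json.done" ] && continue   # finished in an earlier pass
        run_step "$json" 2>&1 | tee -a "$json.log" >/dev/null && touch "$json.done"
    done
    return 0
}

workdir=$(mktemp -d)
touch "$workdir/good.json" "$workdir/bad.json"
process_all "$workdir"
process_all "$workdir"   # second pass skips good.json and retries bad.json
ls "$workdir"
```

After both passes, only good.json has a .done marker, and bad.json.log contains one entry per attempt, so a later rerun picks up exactly the inputs that have not yet succeeded.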
...
From Windows, use MobaXterm or PuTTY to connect to o2.hms.harvard.edu and make sure the port is set to the default value of 22.
...
From a Mac Terminal, use the ssh command, inserting your eCommons HMS ID instead of user123:
```bash
ssh user123@o2.hms.harvard.edu

# Set up screen software: https://wiki.rc.hms.harvard.edu/pages/viewpage.action?pageId=20676715
cp /n/shared_db/misc/rcbio/data/screenrc.template.txt ~/.screenrc

# Start a screen session. For details: https://wiki.rc.hms.harvard.edu/pages/viewpage.action?pageId=20676715
screen
```
...
```bash
srun --pty -p interactive -t 0-12:0:0 --mem 16000MB -n 2 /bin/bash
mkdir /n/scratch/users/${USER:0:1}/$USER/testGATK4
cd /n/scratch/users/${USER:0:1}/$USER/testGATK4
```
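The working directory path is built with the bash substring expansion ${USER:0:1}, which takes the first character of the user name so that per-user folders are grouped by initial. A minimal sketch of how that expansion behaves is shown here. The helper user_workdir is hypothetical, and a temporary directory is used as the base so the example can run anywhere.

```bash
# Sketch: how ${USER:0:1} shards per-user directories by first letter.
# user_workdir and the base directory are placeholders for illustration.
user_workdir() {
    local scratch_base=$1 user=$2 project=$3
    # ${user:0:1} is bash substring expansion: offset 0, length 1.
    printf '%s/users/%s/%s/%s\n' "$scratch_base" "${user:0:1}" "$user" "$project"
}

demo_base=$(mktemp -d)
workdir=$(user_workdir "$demo_base" user123 testGATK4)
mkdir -p "$workdir"   # -p also creates any missing parent directories
echo "$workdir"
```

For user123 the middle path component is u, so the directory ends in users/u/user123/testGATK4.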
...
```bash
# Set up the path and environment variables for the pipeline
module load gcc/6.2.0 python/2.7.12 java/jdk-1.8u112 star/2.5.4a rcbio/1.3.3
export PATH=/n/shared_db/singularity/hmsrc-gatk/bin:/home/ld32/rcbioDev/bin:$PATH

# Set up the database. You only need to run this once. It sets up the database
# in your home directory, so make sure you have at least 5G of free space there.
setupDB.sh
```
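Because the database setup needs about 5G in your home directory, it can help to check the available space first. The following is a rough sketch using df. The helper has_free_gb is hypothetical and not part of rcbio, and df reports free space on the filesystem, which may differ from a per-user quota, so treat the result as a first check only.

```bash
# Sketch: check free space before running setupDB.sh.
# has_free_gb is a hypothetical helper, not part of rcbio.
has_free_gb() {
    local dir=$1 need_gb=$2 avail_kb
    # df -P prints one POSIX-format line per filesystem; -k uses 1024-byte blocks.
    # Column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

if has_free_gb "${HOME:-/}" 5; then
    echo "enough free space for the database"
else
    echo "less than 5G free; free up space before running setupDB.sh"
fi
```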
...
Note that only step 2 used -t 50:0, and all other steps used the default -t 10:0. The default walltime limit was set in the runAsPipeline command, and the walltime parameter for step 2 was set in bash_script_v2.sh.
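Slurm reads a two-field value such as 50:0 as minutes:seconds, a three-field value such as 1:0:0 as hours:minutes:seconds, and a value with a dash such as 0-12:0:0 as days-hours:minutes:seconds. The sketch below converts these strings to seconds to make the differences concrete. The function slurm_time_to_seconds is a hypothetical helper written for this page, not a Slurm or rcbio command.

```bash
# Sketch: convert a Slurm --time string to seconds (hypothetical helper).
slurm_time_to_seconds() {
    local spec=$1 days=0 hours=0 minutes=0 seconds=0 a b c
    if [[ $spec == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days=${spec%%-*}
        IFS=: read -r a b c <<< "${spec#*-}"
        hours=$a; minutes=${b:-0}; seconds=${c:-0}
    else
        IFS=: read -r a b c <<< "$spec"
        if [ -n "$c" ]; then
            # hours:minutes:seconds
            hours=$a; minutes=$b; seconds=$c
        else
            # minutes[:seconds]
            minutes=$a; seconds=${b:-0}
        fi
    fi
    # 10# forces base 10 so values like 08 are not read as octal
    echo $(( 10#$days * 86400 + 10#$hours * 3600 + 10#$minutes * 60 + 10#$seconds ))
}

slurm_time_to_seconds 10:0       # default walltime in the example
slurm_time_to_seconds 50:0       # walltime for step 2
slurm_time_to_seconds 0-12:0:0   # interactive session request
```

So -t 10:0 requests 10 minutes and -t 50:0 requests 50 minutes, not 10 or 50 hours.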
Run the pipeline
Thus far in the example, we have not actually submitted any jobs to the scheduler. To submit the pipeline, you will need to append the run parameter to the command. If run is not specified, test mode will be used, which does not submit jobs and gives the placeholder of 123 for jobids in the command's output.
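The difference between test mode and run mode can be pictured with a small wrapper. This is only a sketch of the idea described above: submit_job is a hypothetical function and does not reflect how runAsPipeline is actually implemented. In run mode it calls sbatch; in any other mode it prints the command it would have run and returns the placeholder job ID 123.

```bash
# Sketch of a "test unless told to run" switch (hypothetical helper,
# not runAsPipeline's actual code).
submit_job() {
    local mode=$1; shift
    if [ "$mode" = run ]; then
        # --parsable makes sbatch print only the job ID
        # (plus the cluster name, if there is one)
        sbatch --parsable "$@"
    else
        echo "test mode, would run: sbatch $*" >&2
        echo 123   # placeholder job ID
    fi
}

jobid=$(submit_job test -p short -t 10:0 --wrap "echo hello")
echo "step 1 job ID: $jobid"
```

Because the placeholder is returned on standard output just like a real job ID, the rest of a script (for example, dependency flags built from earlier job IDs) can be exercised without submitting anything.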
...