|
...
The workflows were downloaded from https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels and modified to work on the O2 Slurm cluster.
Notice the original workflow uses reference and annotation files listed in this file:
We downloaded the genome reference and all annotation files from:
https://console.cloud.google.com/storage/browser/genomics-public-data/references/Homo_sapiens_assembly19_1000genomes_decoy/ except for the GTF file, which was downloaded from: https://console.cloud.google.com/storage/browser/gatk-test-data/intervals?project=broad-dsde-outreach
We then modified the JSON file to this one:
Code Block |
---|
/n/shared_db/singularity/hmsrc-gatk/scripts/gatk4-rna-germline-variant-calling.inputs.template.json |
...
Code Block |
---|
ssh user123@o2.hms.harvard.edu
# set up screen software: https://wiki.rc.hms.harvard.edu/pages/viewpage.action?pageId=20676715
cp /n/shared_db/misc/rcbio/data/screenrc.template.txt ~/.screenrc
screen
srun --pty -p interactive -t 0-12:0:0 --mem 16000MB -n 2 /bin/bash
mkdir /n/scratch3/users/${USER:0:1}/$USER/testGATK4
cd /n/scratch3/users/${USER:0:1}/$USER/testGATK4
module load gcc/6.2.0 python/2.7.12 java/jdk-1.8u112 star/2.5.4a
export PATH=/n/shared_db/singularity/hmsrc-gatk/bin:/home/ld32/rcbioDev/bin:/opt/singularity/bin:$PATH
# Set up the database. This only needs to be run once. It sets up the database in your home directory, so make sure you have at least 5G of free space there.
setupDB.sh
cp /n/shared_db/singularity/hmsrc-gatk/scripts/* .
buildSampleFoldersFromSampleSheet.py sampleSheet.xlsx
runAsPipeline fastqToBamGermline.sh "sbatch -p short --mem 4G -t 1:0:0 -n 1" noTmp run 2>&1 | tee output.log
# check email or use this command to see the workflow progress
squeue -u $USER -o "%.18i %.9P %.28j %.8u %.8T %.10M %.9l %.6D %R %S"
# After all jobs have finished, run this command to start the database, and keep it running in the background
runDB.sh &
# pipefail makes the java | tee pipeline fail when java fails, so .done is only created on success
set -o pipefail
for json in unmappedBams/group*/*.json; do
echo working on $json
[ -f $json.done ] && continue
java -XX:+UseSerialGC -Dconfig.file=your.conf -jar /n/shared_db/singularity/hmsrc-gatk/cromwell-43.jar run gatk4-rna-germline-variant-calling.wdl -i $json 2>&1 | tee -a $json.log && touch $json.done
done
# rerun the above loop if anything fails; inputs that already have a .done file are skipped
# Find all filtered .vcf.gz files and create symlinks in the finalVCFs folder. If there are duplicate file names, ln reports an error for them, which 2>/dev/null hides:
mkdir finalVCFs
ln -s $PWD/cromwell-executions/RNAseq/*/call-VariantFiltration/execution/*.variant_filtered.vcf.gz finalVCFs/ 2>/dev/null
# Merge all VCFs into a single VCF file
vcfFiles=`ls finalVCFs/*.variant_filtered.vcf.gz`
module load bcftools/1.9
bcftools merge -o final.vcf --force-samples $vcfFiles
# Stop database
killall runDB.sh |
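Because the Cromwell loop above records progress with one `.done` file per input JSON, it can help to see which inputs are still pending before rerunning it. The following is a minimal sketch, assuming the `unmappedBams/group*/*.json` layout used above; the function name `list_pending` is hypothetical and not part of the pipeline scripts.

```shell
#!/bin/sh
# Sketch: print every Cromwell input JSON that has no matching .done marker.
# Assumes the layout from the loop above: <top>/group*/<name>.json, with
# <name>.json.done created after a successful run.
# list_pending is a hypothetical helper, not one of the pipeline scripts.
list_pending() {
    # $1: top-level folder that holds the group*/ subfolders
    for json in "$1"/group*/*.json; do
        [ -f "$json" ] || continue            # the glob matched nothing
        [ -f "$json.done" ] || echo "$json"   # no marker yet: still pending
    done
}

# Prints nothing when every input has finished (or the folder does not exist).
list_pending unmappedBams
```

If the list is empty, every input has a marker and the VCF collection step can proceed; otherwise the matching `.log` file next to each listed JSON is the first place to look.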
...
From Windows, use MobaXterm or PuTTY to connect to o2.hms.harvard.edu and make sure the port is set to the default value of 22.
...
Note that only step 2 used -t 50:0, and all other steps used the default -t 10:0. The default walltime limit was set in the runAsPipeline command, and the walltime parameter for step 2 was set in bash_script_v2.sh.
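For reference, Slurm reads a two-field value such as `-t 50:0` as minutes:seconds, so step 2 was given 50 minutes and the default is 10 minutes. The sketch below converts the time formats Slurm accepts (minutes, minutes:seconds, hours:minutes:seconds, and the days-hours variants) to seconds. The function name `slurm_time_to_seconds` is hypothetical, and the sketch assumes plain decimal fields without leading zeros such as `08`.

```shell
#!/bin/sh
# Sketch: convert a Slurm walltime string to seconds.
# slurm_time_to_seconds is a hypothetical helper for illustration only.
# Assumption: fields are plain decimal numbers without leading zeros like 08.
slurm_time_to_seconds() {
    t=$1 days=0 h=0 m=0 s=0
    case $t in
        *-*)
            # days-hours[:minutes[:seconds]]
            days=${t%%-*}
            rest=${t#*-}
            old_ifs=$IFS; IFS=:; set -- $rest; IFS=$old_ifs
            h=${1:-0} m=${2:-0} s=${3:-0}
            ;;
        *)
            old_ifs=$IFS; IFS=:; set -- $t; IFS=$old_ifs
            case $# in
                1) m=$1 ;;               # minutes
                2) m=$1 s=$2 ;;          # minutes:seconds
                3) h=$1 m=$2 s=$3 ;;     # hours:minutes:seconds
            esac
            ;;
    esac
    echo $(( days * 86400 + h * 3600 + m * 60 + s ))
}

slurm_time_to_seconds 50:0      # prints 3000  (50 minutes, step 2)
slurm_time_to_seconds 10:0      # prints 600   (10 minutes, the default)
slurm_time_to_seconds 0-12:0:0  # prints 43200 (12 hours)
```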
Run the pipeline
Thus far in the example, we have not actually submitted any jobs to the scheduler. To submit the pipeline, you will need to append the run parameter to the command. If run is not specified, test mode will be used, which does not submit jobs and gives the placeholder of 123 for jobids in the command's output.
...