GATK/3.7 with Queue/3.7 (Debug phase)

 

Pipeline Features:

  • Automatically submits each step as a job and automatically sets up job dependencies.

  • Sends email updates about job status.

  • If a job fails or is killed, all downstream jobs are terminated automatically.

  • Keeps all flag files and job scripts in the folder 'flag' (see the example listing after this list).

  • When re-running the pipeline on the same data set, users have the option to re-run or skip steps that finished successfully earlier.

  • Uses the GATK bundle as reference, which is stored in: /n/groups/shared_databases/gatk/ftp.broadinstitute.org/bundle/2.8/b37
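
For example, after the alignment step later on this page runs, the flag folder contains a job script, a job log, and a status flag for each step. The entries below are taken from the sample output in steps 6 and 7:

ls flag/

# flag/1.0.bwa.group1.patient1.patient1_ERR166302_ch14_16_1.fastq.sh     <- job script
# flag/1.0.bwa.group1.patient1.patient1_ERR166302_ch14_16_1.fastq.out    <- job log
# flag/2.1.merge.group1.patient1.success                                 <- success flag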

This page shows you how to use a script written by RC to run a standard human DNA-Seq analysis.

The RNA-Seq workflow is similar: use the STAR aligner in step 6, and use HaplotypeCaller with the options -splitNCigarReads, -dontUseSoftClippedBases and -stand_call_conf 20.0 in the Queue command (steps 10, 11 and 12).

We use a subset (two breast cancer and two normal control samples, restricted to chromosome 14 and chromosome 16) of a published breast cancer data set. Here is the link to the original paper: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0055681

Cut and paste the commands below onto the Orchestra command line to:

  • run bwa (or STAR for RNA-Seq) to map .fastq reads to the human genome reference for each data set, then use Picard tools to merge all data from the same sample into one .bam file

  • run GATK Queue to identify indel regions, realign them, mark duplicates and recalibrate quality scores on each sample, and call mutations on all samples together

  • run GATK to perform variant score recalibration and filter mutations according to the soft threshold, or the hard threshold if the soft threshold fails

  • run ANNOVAR to annotate mutations and generate a report

If you have any questions, please contact us via http://rc.hms.harvard.edu/support.html. Include your working folder and the commands used in your email. We are happy to work with users of all levels of experience. Comments and suggestions are also welcome.



Note: The commands are written in bold black. The text that gets printed to the screen is in red/brown. Comments have # at the beginning of the line. (When copying/pasting commands, you can include lines starting with #. They will be ignored by Linux.)

#1 Log on to O2

# If you need help connecting to O2, please review the Orchestra introduction for new users. That page also includes information about obtaining an account on the cluster if you do not already have one.

# From Windows, use the graphical PuTTY program to connect to orchestra.med.harvard.edu and make sure the port is set to the default value of 22

# From a Mac Terminal, use the ssh command, inserting your eCommons ID instead of user123:

ssh user123@orchestra.med.harvard.edu

#2 Create a new folder and go to it

mkdir -p /n/scratch/users/${USER:0:1}/$USER/gatk_test

cd  /n/scratch/users/${USER:0:1}/$USER/gatk_test
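
# In case you're curious, ${USER:0:1} is plain bash substring expansion: it takes the first letter of your username. You can preview the path before creating it; for a hypothetical user "ab123" this prints /n/scratch/users/a/ab123/gatk_test:

echo /n/scratch/users/${USER:0:1}/$USER/gatk_test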

#3 Load the pipeline related modules

# This will set up the path and environment variables for the pipeline, including bwa, Picard tools, GATK, samtools and ANNOVAR.

module load rcbio/1.3.4

#4 Copy the testing data to current folder

cp -Rs /n/groups/shared_databases/gatk/ftp.broadinstitute.org/bundle/2.8/b37/3.3-0-in/* .

cp /home/ld32/rcbioDev/bin/align-for-gatk.sh  /home/ld32/rcbioDev/bin/gatkQueueScript.scala /home/ld32/rcbioDev/bin/filter-variants-v0.1.sh .

# Make sure to include the period (which means "current folder") at the end of the above command.

#5 Take a look at the files

ls -lLh group*/*/

group1/patient1/:
total 1.7G
-rw-rw-r-- 1 ld32 shared_databases 339M Apr 6 2015 ERR166302_ch14_16_1.fastq
-rw-rw-r-- 1 ld32 shared_databases 339M Apr 6 2015 ERR166302_ch14_16_2.fastq
-rw-rw-r-- 1 ld32 shared_databases 350M Apr 6 2015 ERR166303_ch14_16_1.fastq
-rw-rw-r-- 1 ld32 shared_databases 350M Apr 6 2015 ERR166303_ch14_16_2.fastq

group1/patient2/:
total 1.7G
-rw-rw-r-- 1 ld32 shared_databases 366M Apr 6 2015 ERR166304_ch14_16_1.fastq
-rw-rw-r-- 1 ld32 shared_databases 366M Apr 6 2015 ERR166304_ch14_16_2.fastq
-rw-rw-r-- 1 ld32 shared_databases 337M Apr 6 2015 ERR166305_ch14_16_1.fastq
-rw-rw-r-- 1 ld32 shared_databases 337M Apr 6 2015 ERR166305_ch14_16_2.fastq

group2/control1/:
total 1.6G
-rw-rw-r-- 1 ld32 shared_databases 347M Apr 6 2015 ERR166316_ch14_16_1.fastq
-rw-rw-r-- 1 ld32 shared_databases 317M Apr 6 2015 ERR166316_ch14_16_2.fastq
-rw-rw-r-- 1 ld32 shared_databases 352M Apr 6 2015 ERR166317_ch14_16_1.fastq
-rw-rw-r-- 1 ld32 shared_databases 321M Apr 6 2015 ERR166317_ch14_16_2.fastq

## Note: These are subsets of the breast cancer and normal control samples. To keep the data set small for quick testing, they only contain reads mappable to chromosomes 14 and 16.

How the data files should be set up

Folders: there are 2 folders, named group1 and group2. Each folder is for one experimental condition, such as patient or control.

Files: Each group# folder has 2 samples: group1 has patient1 and patient2, and group2 has control1 and control2. Each sample has two data sets, and each data set has two .fastq files: one for read1 and the other for read2.

For your own data, you should organize it the same way.

Create one condition folder for each condition, and copy or link the fastq files into the correct group folder (see the sketch after the file-format notes below).

File format:

All read files should be in fastq format; the file extension can be .fastq or .fq.

The read1 file should be named xxx_1.fastq or xxx_1.fq, and the corresponding read2 file should be named xxx_2.fastq or xxx_2.fq.
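
Here is a minimal sketch of how you might lay out your own data. The paths and names below are placeholders, but the folder structure and the _1/_2 naming follow the rules above:

mkdir -p group1/patient1 group2/control1

# Link (or copy) your read1/read2 fastq files into the matching sample folder.
# /path/to/your/runA_R1.fastq etc. stand in for your own files.
ln -s /path/to/your/runA_R1.fastq group1/patient1/runA_1.fastq
ln -s /path/to/your/runA_R2.fastq group1/patient1/runA_2.fastq
ln -s /path/to/your/runB_R1.fastq group2/control1/runB_1.fastq
ln -s /path/to/your/runB_R2.fastq group2/control1/runB_2.fastq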

#6 Run bwa to map .fastq reads to the human genome reference for each data set, then use Picard tools to merge all data from the same sample into one .bam file

runAsPipeline align-for-gatk.sh "sbatch -p short --mem 4G -t 1:0:0 -n 1" noTmp run 2>&1 | tee align.log

# In case you're curious, the sbatch options given on the command line are only defaults; the #@ job markers inside align-for-gatk.sh override them per step, so the bwa jobs run on the short partition with 4 processors and a 12-hour runtime limit (see the sample output below). The tee command saves the log and also prints it on the screen.
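
# The marker lines in align-for-gatk.sh look like the two below (copied from the sample output that follows). The fields appear to be: step number, the step it depends on, a job name, an optional fourth field, and the sbatch options for that step:

#@1,0,bwa,index,sbatch -p short -t 0-12 -c 4 --mem 32G
#@2,1,merge,,sbatch -p short -t 0-4 --mem 8G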

Sample output



Mon Jun 24 11:34:28 EDT 2019
Runing: /home/ld32/rcbioDev/bin/runAsPipeline align-for-gatk.sh sbatch -p short --mem 4G -t 1:0:0 -n 1 noTmp run
module list:

Currently Loaded Modules:
1) gcc/6.2.0 2) python/2.7.12



converting align-for-gatk.sh to flag/slurmPipeLine.201906241134.run.sh

find loop start: for group in `ls -d group*/|sed 's|[/]||g'`; do

find loop start: for sample in `ls -d */|sed 's|[/]||g'`; do

find loop start: for r1 in `ls $sample/*_1.fastq $sample/*_1.fq 2>/dev/null`; do

find job marker:
#@1,0,bwa,index,sbatch -p short -t 0-12 -c 4 --mem 32G
sbatch options: sbatch -p short -t 0-12 -c 4 --mem 32G

find job:
bwa mem -M -t 4 -R "@RG\tID:$sample.$readgroupz\tPL:Illumina\tSM:$sample" $index $r1 $r2 > $sample/$readgroup.sam
find loop end: done

find job marker:
#@2,1,merge,,sbatch -p short -t 0-4 --mem 8G
sbatch options: sbatch -p short -t 0-4 --mem 8G

find job:
java -Xmx12g -jar $PICARD/picard-2.8.0.jar MergeSamFiles VALIDATION_STRINGENCY=SILENT OUTPUT=$sample.bam $inputsams SORT_ORDER=coordinate && samtools index $sample.bam
find loop end: done
find loop end: done
flag/slurmPipeLine.201906241134.run.sh align-for-gatk.sh is ready to run. Starting to run ...
Running flag/slurmPipeLine.201906241134.run.sh align-for-gatk.sh
module list:

Currently Loaded Modules:
1) gcc/6.2.0 2) python/2.7.12

working on group: group1
working on sample: patient1
working on readgroup: ERR166302_ch14_16

step: 1, depends on: 0, job name: bwa, flag: bwa.group1.patient1.patient1/ERR166302_ch14_16_1.fastq
depend on no job
1.0.bwa.group1.patient1.patient1_ERR166302_ch14_16_1.fastq was done before, do you want to re-run for (y/enter)?:rm /n/groups/rccg/ld32/gatkO21/flag/2.1.merge.group1.patient1.success
Will re-run the down stream steps even if they are done before.
sbatch -p short -t 0-12 -c 4 --mem 32G --mail-type=FAIL --nodes=1 -J 1.0.bwa.group1.patient1.patient1_ERR166302_ch14_16_1.fastq -o /n/groups/rccg/ld32/gatkO21/flag/1.0.bwa.group1.patient1.patient1_ERR166302_ch14_16_1.fastq.out -e /n/groups/rccg/ld32/gatkO21/flag/1.0.bwa.group1.patient1.patient1_ERR166302_ch14_16_1.fastq.out /n/groups/rccg/ld32/gatkO21/flag/1.0.bwa.group1.patient1.patient1_ERR166302_ch14_16_1.fastq.sh
# Submitted batch job 24084
working on readgroup: ERR166303_ch14_16

step: 1, depends on: 0, job name: bwa, flag: bwa.group1.patient1.patient1/ERR166303_ch14_16_1.fastq
depend on no job
1.0.bwa.group1.patient1.patient1_ERR166303_ch14_16_1.fastq was done before, do you want to re-run for (y/enter)?:rm /n/groups/rccg/ld32/gatkO21/flag/2.1.merge.group1.patient1.success
Will re-run the down stream steps even if they are done before.
sbatch -p short -t 0-12 -c 4 --mem 32G --mail-type=FAIL --nodes=1 -J 1.0.bwa.group1.patient1.patient1_ERR166303_ch14_16_1.fastq -o /n/groups/rccg/ld32/gatkO21/flag/1.0.bwa.group1.patient1.patient1_ERR166303_ch14_16_1.fastq.out -e /n/groups/rccg/ld32/gatkO21/flag/1.0.bwa.group1.patient1.patient1_ERR166303_ch14_16_1.fastq.out /n/groups/rccg/ld32/gatkO21/flag/1.0.bwa.group1.patient1.patient1_ERR166303_ch14_16_1.fastq.sh
# Submitted batch job 24085

step: 2, depends on: 1, job name: merge, flag: merge.group1.patient1
depend on multiple jobs
sbatch -p short -t 0-4 --mem 8G --mail-type=FAIL --nodes=1 --dependency=afterok:24084:24085 -J 2.1.merge.group1.patient1 -o /n/groups/rccg/ld32/gatkO21/flag/2.1.merge.group1.patient1.out -e /n/groups/rccg/ld32/gatkO21/flag/2.1.merge.group1.patient1.out /n/groups/rccg/ld32/gatkO21/flag/2.1.merge.group1.patient1.sh
# Submitted batch job 24086

....

step: 2, depends on: 1, job name: merge, flag: merge.group2.control2
depend on multiple jobs
sbatch -p short -t 0-4 --mem 8G --mail-type=FAIL --nodes=1 --dependency=afterok:24093:24094 -J 2.1.merge.group2.control2 -o /n/groups/rccg/ld32/gatkO21/flag/2.1.merge.group2.control2.out -e /n/groups/rccg/ld32/gatkO21/flag/2.1.merge.group2.control2.out /n/groups/rccg/ld32/gatkO21/flag/2.1.merge.group2.control2.sh
# Submitted batch job 24095

All submitted jobs:
job_id depend_on job_flag
24084 null 1.0.bwa.group1.patient1.patient1_ERR166302_ch14_16_1.fastq
24085 null 1.0.bwa.group1.patient1.patient1_ERR166303_ch14_16_1.fastq
24086 .24084.24085 2.1.merge.group1.patient1
24087 null 1.0.bwa.group1.patient2.patient2_ERR166304_ch14_16_1.fastq
24088 null 1.0.bwa.group1.patient2.patient2_ERR166305_ch14_16_1.fastq
24089 .24087.24088 2.1.merge.group1.patient2
24090 null 1.0.bwa.group2.control1.control1_ERR166316_ch14_16_1.fastq
24091 null 1.0.bwa.group2.control1.control1_ERR166317_ch14_16_1.fastq
24092 .24090.24091 2.1.merge.group2.control1
24093 null 1.0.bwa.group2.control2.control2_ERR166318_ch14_16_1.fastq
24094 null 1.0.bwa.group2.control2.control2_ERR166319_ch14_16_1.fastq
24095 .24093.24094 2.1.merge.group2.control2



#7 Wait for the jobs to finish

O2sacct

# O2sacct tells you the status of each job that hasn't finished. If the status is something other than PENDING, RUNNING, or (briefly) COMPLETED, see the O2 job troubleshooting documentation.

# The jobs should finish in four to five hours unless the cluster is extremely busy. Check your email from time to time to see the progress. You can also look for the output files from each program:

ls flag/*start flag/*success flag/*fail
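
# A quick sketch for checking progress from the flag files (the *success and *fail suffixes follow the naming shown above):

# Count the steps that have finished successfully so far:
ls flag/*success 2>/dev/null | wc -l

# List any failed steps (no output means nothing has failed yet):
ls flag/*fail 2>/dev/null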

#8 If you need to stop the jobs...

checkJobsSlurm flag/alljobs.jid

the following jobs are not finished yet:

# You are asked if you want to kill them. Answering 'y' and then Enter will kill all the jobs.

#9 Re-run the pipeline on the same data set (if some steps failed).

# If, for any reason, some steps fail, you can re-run the pipeline.

# When you re-run the command, it checks whether any jobs are still unfinished. If so, it asks if you want to kill them.

# If you agree to kill them, the pipeline software checks the flag files in the flag folder to see which steps finished successfully earlier.

# For each step that finished successfully earlier, the software asks whether you want to re-run it. If you do not type 'y' (for example, you just press Enter), the step is not re-run. This can save you a lot of time.
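
# To re-run, repeat the same command from step 6; the pipeline then walks you through the kill and re-run prompts described above:

runAsPipeline align-for-gatk.sh "sbatch -p short --mem 4G -t 1:0:0 -n 1" noTmp run 2>&1 | tee align.log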

#10 Run GATK Queue to identify indel regions, realign them, mark duplicates and recalibrate quality scores on each sample, then call mutations on all samples together

# Set up some parameters:

bams=`cat bam.list`

email=`cat ~/.forward`

genotyper=UnifiedGenotyper

# or, if you want to use HaplotypeCaller

genotyper=HaplotypeCaller

# or both UnifiedGenotyper and HaplotypeCaller

genotyper=UnifiedGenotyperandHaplotypeCaller
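
# Optional sanity check before launching Queue (these lines only print the values; nothing is submitted):

echo "bams:      $bams"
echo "email:     $email"
echo "genotyper: $genotyper"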



# Start with a dry run: this command only prints out the steps that will be run; it does not actually run any steps.

module load gatk/3.7 queue/3.7 slurm-drmaa/1.1.4

export GATKDATA=/n/groups/shared_databases/gatk/ftp.broadinstitute.org/bundle/2.8/b37

java -jar $QUEUE/Queue.jar -noGCOpt -retry 3 -keepIntermediates -statusTo $email -S gatkQueueScript.scala $bams -R $GATKDATA/human_g1k_v37.fasta -D $GATKDATA/dbsnp_138.b37.vcf -genotyper $genotyper -jobRunner Drmaa    -jobNative   "-t 2:0:0 -n 4 -p short --nodes=1 --mem-per-cpu=8000"  -startFromScratch  2>&1 | tee $genotyper.log

# Notice: the tee command writes the output to the log file and also prints it on the screen.

# You should see output that looks like:

....

If you do not see the expected output, correct the error and re-run the command.

If everything looks good, re-run the command with the additional option -run, wrapped in an sbatch job as shown below. This will actually submit jobs to the cluster.

sbatch -p short -t 12:0:0 --mem 8000 --wrap "java -jar $QUEUE/Queue.jar -noGCOpt -retry 3 -keepIntermediates -statusTo $email -S gatkQueueScript.scala $bams -R $GATKDATA/human_g1k_v37.fasta -D $GATKDATA/dbsnp_138.b37.vcf -genotyper $genotyper -jobRunner Drmaa    -jobNative   \"-t 2:0:0 -n 4 -p short --nodes=1 --mem-per-cpu=8000\"  -startFromScratch -run 2>&1 | tee $genotyper.log"

# The command submits hundreds of jobs to the cluster, and status updates are sent to your email.

#11 Wait for the jobs to finish

squeue -u $USER

# squeue tells you the status of each job that hasn't finished. The common states are PD (pending) and R (running); if you see something else, check the O2 job troubleshooting documentation.

# The jobs should finish in four to five hours unless the cluster is extremely busy. Check your email from time to time to see the progress.

# You can also follow the log file with:

tail -f $genotyper.log   # Ctrl + C to quit

#12 Re-run the pipeline on the same data set (if some steps failed).

# If, for any reason, some steps fail, you can correct the error according to the error message in the log and re-run the pipeline. Notice that -startFromScratch has been removed, so steps that already finished are not redone. For a dry run:

java -jar $QUEUE/Queue.jar -noGCOpt -retry 3 -keepIntermediates -statusTo $email -S gatkQueueScript.scala $bams -R $GATKDATA/human_g1k_v37.fasta -D $GATKDATA/dbsnp_138.b37.vcf -genotyper $genotyper -jobRunner Drmaa    -jobNative  "-t 2:0:0 -n 4 -p short --nodes=1 --mem-per-cpu=8000"  2>&1 | tee $genotyper.log1

# Then the real run:

sbatch -p short -t 12:0:0 --mem 8000 --wrap "java -jar $QUEUE/Queue.jar -noGCOpt -retry 3 -keepIntermediates -statusTo $email -S gatkQueueScript.scala $bams -R $GATKDATA/human_g1k_v37.fasta -D $GATKDATA/dbsnp_138.b37.vcf -genotyper $genotyper -jobRunner Drmaa    -jobNative   \"-t 2:0:0 -n 4 -p short --nodes=1 --mem-per-cpu=8000\"  -run 2>&1 | tee $genotyper.log1"

#13 Filter the candidate variants:

sbatch -p short  -t 2:0:0 --mem 8G  --wrap "filter-variants-v0.1.sh $genotyper.variant.vcf &> filter.log"

# Note: the command tries to build Gaussian mixture models for SNPs and indels and apply them to the variants. The model-construction step may fail; if it does, the command falls back to hard filtering. After the job finishes, check the log to see whether the Gaussian mixture model succeeded. You can also check the output files: if you see the file $genotyper.variant.vcf.snp.indel.recalibrated.vcf, the Gaussian mixture model succeeded for both SNPs and indels. If, on the other hand, you only see $genotyper.variant.vcf.snp.recalibrated.vcf, building the Gaussian mixture model failed. If you see neither of them, please use the hard-filtered file: $genotyper.variant.vcf.hard_filtered.vcf
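
# A small helper sketch that picks whichever filtered VCF was actually produced, following the file names described above (run it after the filtering job has finished):

if [ -s $genotyper.variant.vcf.snp.indel.recalibrated.vcf ]; then
    filtered=$genotyper.variant.vcf.snp.indel.recalibrated.vcf    # recalibration worked for SNPs and indels
elif [ -s $genotyper.variant.vcf.snp.recalibrated.vcf ]; then
    filtered=$genotyper.variant.vcf.snp.recalibrated.vcf          # only the SNP-recalibrated file exists
else
    filtered=$genotyper.variant.vcf.hard_filtered.vcf             # fall back to the hard-filtered file
fi
echo "Will annotate: $filtered"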

# For RNA-Seq, we can only do hard-filtering:

sbatch -p short  -t 2:0:0 --mem 8G  --wrap "filter-variants-v0.1.sh $genotyper.variant.vcf hard-filter &> filter.log"

#14 Annotate variants:

module load annovar/20170601

sbatch -p short  -t 2:0:0 --mem 8G  --wrap  "table_annovar.pl $genotyper.variant.vcf.snp.recalibrated.vcf /n/groups/shared_databases/annovar/20150322/ -buildver hg19 -out $genotyper.variant.vcf.snp.recalibrated.vcf -remove -protocol refGene,cytoBand,genomicSuperDups,esp6500siv2_all,1000g2014oct_all,1000g2014oct_afr,1000g2014oct_eas,1000g2014oct_eur,snp138,ljb26_all -operation g,r,r,f,f,f,f,f,f,f -nastring . -vcfinput"

# Or, if you need to use the hard-filtered result:

sbatch -p short  -t 2:0:0 --mem 8G  --wrap "table_annovar.pl $genotyper.variant.vcf.hard_filtered.vcf /n/groups/shared_databases/annovar/20150322/ -buildver hg19 -out $genotyper.variant.vcf.hard_filtered.vcf -remove -protocol refGene,cytoBand,genomicSuperDups,esp6500siv2_all,1000g2014oct_all,1000g2014oct_afr,1000g2014oct_eas,1000g2014oct_eur,snp138,ljb26_all -operation g,r,r,f,f,f,f,f,f,f -nastring . -vcfinput"
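
# After the annotation job finishes, the annotated results carry the hg19_multianno suffix (the .vcf is used in step 15 below; table_annovar.pl typically also writes a tab-delimited .txt table). A quick way to list them:

ls -lh *hg19_multianno*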

#15 Check output:

# I have put all the output in the folder below. Your results should look similar:

ls  /n/groups/shared_databases/gatk/ftp.broadinstitute.org/bundle/2.8/b37//3.3-0-out/

# Two mutations identified by the original researchers are rs144567652 on chr14 and rs147321998 on chr16, which appear in this table: http://journals.plos.org/plosone/article/asset?unique&id=info:doi/10.1371/journal.pone.0055681.s005.

# Notice the original paper used hg18 as the reference, so the coordinates differ from our current analysis.

# You can find them with these commands:

grep rs144567652 $genotyper.variant.vcf.snp.indel.recalibrated.vcf.hg19_multianno.vcf

and

grep rs140899350 $genotyper.variant.vcf.snp.indel.recalibrated.vcf.hg19_multianno.vcf

# For details about vcf files, please visit: http://vcftools.sourceforge.net/specs.html
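
# A quick sketch for peeking at the annotated VCF without scrolling through the long header (the file name follows the grep commands above). It shows the column header line and the first few variant records:

grep -v "^##" $genotyper.variant.vcf.snp.indel.recalibrated.vcf.hg19_multianno.vcf | head -5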



###### DO NOT RUN THE STEPS BELOW ######





#16 RNA-Seq:

In step 6, run:



# For RNA-Seq data, you should use the STAR aligner:

align-for-gatk-using-mcore-4cpu-star.sh 2>&1 | tee align.log



In step 10:

## Notice: for RNA-Seq data, please use HaplotypeCaller with the options -splitNCigarReads, -dontUseSoftClippedBases and -stand_call_conf 20.0, like this:

genotyper=HaplotypeCaller

java -jar $QUEUE/Queue.jar -noGCOpt -retry 3 -keepIntermediates -statusTo $email -S $RCBIO/resource/gatkQueueScript.scala $bams -R $GATKDATA/human_g1k_v37.fasta -D $GATKDATA/dbsnp_138.b37.vcf -genotyper $genotyper -splitNCigarReads  -dontUseSoftClippedBases -stand_call_conf 20.0 -jobRunner Drmaa    -jobNative   "-t 2:0:0 -n 4 -p short --nodes=1 --mem-per-cpu=8000"  -startFromScratch



Let us know if you have any questions. Please include your working folder and the commands used in your email. Any comments and suggestions are welcome!