spp (1.15.2) for Chip-Seq

SPP is a R package especially designed for the analysis of Chip-Seq data. The package was developed by Peter Park's group from Harvard Medical School. Here is the the original nature paper:http://www.nature.com/nbt/journal/v26/n12/full/nbt.1508.html

Here is the original tutorial created by Peter Kharchenk from the same group: http://compbio.med.harvard.edu/Supplements/ChIP-seq/tutorial.html

Based paper and tutorial above, we created a guide to use the package on O2 . Please notice: part of this guild was from the material mentioned above.

We will calculate smoothed read enrichment profile and identify sites that significantly enriched compared to control. IGB or IGV can be used to browse the data.

We will also use the biomaRt package to download annotation and identify genes within 2kb of the transcription start site (TSS).

Note: You can copy and paste all the text to your Linux command line to run. Anything with "#" is comment, and will be IGNORED by Linux or R. I have run the pipeline on testing data. All the output files can be found in /n/groups/shared_databases/rcbio/spp/output

1 Log on to O2

If you need help connecting to O2, please review the How to login to O2 wiki page.

From Windows, use MobaXterm (preferred) or PuTTY to connect to o2.hms.harvard.edu and make sure the port is set to the default value of 22.

From a Mac Terminal, use the ssh command, inserting your HMS ID instead of user123:

ssh user123@o2.hms.harvard.edu

2 Start interactive job with 4G memory and 2 CPU cores , and create working folder

srun --pty -p interactive -t 0-12:0:0 --mem 4G -n 2 /bin/bash mkdir /n/scratch/users/${USER:0:1}/$USER/sppTest && cd /n/scratch/users/${USER:0:1}/$USER/sppTest

3 Load bowtie2 module and run mapping. (note: I already run mapping. Copy them if you don't want to re-run bowtie2: /n/groups/shared_databases/rcbio/spp/input)

# This will setup the path and environmental variables for the pipeline module load gcc/6.2.0 bowtie2/2.2.9 samtools/0.1.19 read_file=/n/groups/shared_databases/rcbio/spp/input/SRR1002328_1.fastq bowtie2 -p 2 -x /n/groups/shared_databases/bowtie2_indexes/d_melanogaster_fb5_22 -U $read_file | samtools view -bS - > SRR1002328.bam read_file=/n/groups/shared_databases/rcbio/spp/input/SRR1002329_1.fastq bowtie2 -p 2 -x /n/groups/shared_databases/bowtie2_indexes/d_melanogaster_fb5_22 -U $read_file | samtools view -bS - > SRR1002329.bam

4 Start R and load spp and bioMaRt packages

module load gcc/6.2.0 R/3.4.1 R .libPaths("/n/shared_db/misc/rcbio/rlib/3.4.1") library(spp) library(biomaRt)

5 Read in bowtie alignments

chip.data <- read.bam.tags("SRR1002329.bam") input.data <- read.bam.tags("SRR1002328.bam") #

6 Identify binding characteristics

# This command will uses cross-correlation profile to calculate binding peak separation distance, # and assess whether inclusion of tags with non-perfect alignment quality improves the cross-correlation peak. # If you would like to accept all aligned tags, specify accept.all.tags=T argument to save time. # srange gives the possible range for the size of the protected region # It should be higher than tag length; making the upper boundary too high will increase calculation time # bin - bin tags within the specified number of base pairs to speed up calculation # increasing bin size decreases the accuracy of the determined parameters binding.characteristics <- get.binding.characteristics(chip.data,srange=c(50,500),bin=5, accept.all.tags = T) # Print out binding peak separation distance print(paste("binding peak separation distance =",binding.characteristics$peak$x)) # Plot cross-correlation profile, you can use pdf reader to review the plot once it is created. pdf(file="chip.crosscorrelation.pdf",width=5,height=5) par(mar = c(3.5,3.5,1.0,0.5), mgp = c(2,0.65,0), cex = 0.8) plot(binding.characteristics$cross.correlation,type='l',xlab="strand shift",ylab="cross-correlation") abline(v=binding.characteristics$peak$x,lty=2,col=2) dev.off() #

7 Select informative tags based on the binding characteristics

# The following calls will select tags with acceptable alignment quality: # The chip.data and input.data will be a simple list of tag coordinate vectors. The analysis of binding characteristics described above could have been skipped by accepting # all of the tag alignments after reading the Eland files (i.e. chip.data.info <- chip.data$tags). # Since version 1.2, the steps can also be skipped by simply not passing the binding.characteristics argument to the select.informative.tags call. chip.data.info <- select.informative.tags(chip.data,binding.characteristics) input.data.info <- select.informative.tags(input.data,binding.characteristics) #

8 Remove singular positions with very high tag counts

# The method calls below will scan along the chromosomes calculating local density of region (can be specified using window.size parameter, default is 200bp), # removing or restricting singular positions with extremely high tag count relative to the neighborhood. # restrict or remove singular positions with very high tag counts chip.data.qua <- remove.local.tag.anomalies(chip.data.info) input.data.qua <- remove.local.tag.anomalies(input.data.info) #

9 Calculate genome-wide tag density and tag enrichment/depletion profiles

# This will calculate genome-wide smoothed tag density (subtracting re-scaled input). # Note that the tags are shifted by half of the peak separation distance tag.shift <- round(binding.characteristics$peak$x/2) smoothed.density <- get.smoothed.tag.density(chip.data.qua,control.tags=input.data.qua,bandwidth=200,step=100,tag.shift=tag.shift) writewig(smoothed.density,"chip.smoothed.density.wig","chip smoothed, background-subtracted tag density") # The WIG file can be read with genome browsers, such as IGB. # The following calls will scan ChIP? and signal tag density to estimate lower bounds of tag enrichment (and upper bound of tag depletion if it is significant) along the genome. # The resulting profile gives conservative statistical estimates of log2 fold-enrichment ratios along the genome. # The example below uses a window of 500bp (and background windows of 1, 5, 25 and 50 times that size) and a confidence interval corresponding to 1%. # alpha specifies significance level at which confidence intervals will be estimated enrichment.estimates <- get.conservative.fold.enrichment.profile(chip.data.qua,input.data.qua,fws=500,step=100,alpha=0.01) writewig(enrichment.estimates,"chip.enrichment.estimates.wig","chip conservative fold-enrichment/depletion estimates shown on log2 scale") # The WIG file can be read with genome browsers, such as IGB. # Broad regions of enrichment for a specified scale can be quickly identified using the following command: broad.clusters <- get.broad.enrichment.clusters(chip.data.qua, input.data.qua, window.size=1e3, z.thr=3,tag.shift=round(binding.characteristics$peak$x/2)) # Write out in broadPeak format write.broadpeak.info(broad.clusters,"chip.broadPeak") # We'll also write a bed file that contains broad regions of enrichment on chromosome X for browsing in IGB. write.table(cbind(rep("X", length(broad.clusters$X$s)), broad.clusters$X$s, broad.clusters$X$e), file="chip_enrich_broad_chrX.bed", quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t") #

10 Detecting point binding positions

# The example below will use WTD method to call binding positions, using FDR of 5% and a window size estimated by the binding.characteristics: # binding detection parameters # desired FDR (5%). Alternatively, an E-value can be supplied to the method calls below instead of the fdr parameter fdr <- 5e-2 # The binding.characteristics contains the optimized half-size for binding detection window detection.window.halfsize <- binding.characteristics$whs # Determine binding positions using wtd method bp <- find.binding.positions(signal.data=chip.data.qua,control.data=input.data.qua,fdr=fdr,whs=detection.window.halfsize) print(paste("detected",sum(unlist(lapply(bp$npl,function(d) length(d$x)))),"peaks")) # Output detected binding positions output.binding.results(bp,"chip.binding.positions.txt") # We'll also write a graph file (similar to wig) that contains the binding sites and their associated 10log10(FDR) for browsing in IGB. write.table(cbind(bp$npl$X$x, -10*log(bp$npl$X$fdr)), file="binding_sites_fdr_chrX.gr", quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t") # Upload this bed file into IGB by going to File ! Open and browse the results.

11 Comparing Binding Sites to Annotations Using the biomaRt package

#Note: you should make sure that ensembl has the same version of reference as you used in bowtie aligner. # We will download the ENSEMBLE fruit fly genome annotations and generate a list of ENSEMBLE gene information on chromosome X including start position, end position, strand and description ensembl = useMart("ensembl", dataset = "dmelanogaster_gene_ensembl") genes.chrX = getBM(attributes = c("chromosome_name", "start_position", "end_position", "strand", "description"), filters = "chromosome_name", values= "X", mart = ensembl) # Next, we're going to take our binding sites from the bp list and use it to determine the set of genes that contain significantly enriched sites within 2kb of their TSS. # In order to compare enriched sites to TSS sites, we need to write an overlap function where bs represents a binding site position, ts is the annotated TSS and l is the allowed distance of the binding site from the TSS. overlap = function(bs, ts, l) { if ((bs > ts - l) && (bs < ts + l)) { TRUE; } else { FALSE; } } # Now we'll write a function that takes a vector of binding site values, start positions, end positions and strands of the genes on chromosome X # as well as our distance cutoff. l and outputs a logical vector of the genes that contain a enriched site within l bp (i.e., TRUE value) # or do not contain a enriched site (i.e., FALSE value). fivePrimeGenes=function(bs, ts, te, s, l) { fivePrimeVec = logical(); for (i in 1:length(ts)) { fivePrime = FALSE; for (j in 1:length(bs)) { if (s[i] == 1) { fivePrime = fivePrime || overlap(bs[j], ts[i], l); } else { fivePrime = fivePrime || overlap(bs[j], te[i], l); } } fivePrimeVec = c(fivePrimeVec, fivePrime); } fivePrimeVec; } # Using the fivePrimeGenes function, generate a vector of the TSSs and genes that contain enriched site within .2kb of their TSS (i.e., l = 2000). fivePrimeGenesLogical = fivePrimeGenes(bp$npl$X$x, genes.chrX$start_position, genes.chrX$end_position, genes.chrX$strand, 2000) # Find the gene located on the plus strand fivePrimeStartsPlus = genes.chrX$start_position[fivePrimeGenesLogical & genes.chrX$strand == 1] # Find the gene located on the minus strand fivePrimeStartsMinus = genes.chrX$end_position[fivePrimeGenesLogical & genes.chrX$strand == -1] # Combine the start positions together fivePrimeStarts = sort(c(fivePrimeStartsPlus, fivePrimeStartsMinus)) # Get all the gene names fivePrimeGenes = genes.chrX$description[fivePrimeGenesLogical] save.image(file="8.11.RData") # you can use command load("8.11.RData") to load all the objects into R later on. # Explore the vectors fivePrimeStarts and fivePrimeGenes. # Use IGB to check if the genes/starts are the correct ones. # You can use annotations in the biomaRt package to access other tools including Bioconductor to # learn more about the set of genes that contain enriched sites or whatever regulatory factor you're studying. Have fun! # Below are the all functions in the spp package: > library(help=spp) Information on package 'spp' Description: Package: spp Type: Package Title: ChIP-Seq Processing Pipeline Version: 1.15.2 Date: 2018-02-01 Author: Peter K Depends: R (>= 3.3.0), Rcpp Imports: Rsamtools, caTools, parallel, graphics, stats Suggests: methods LinkingTo: Rcpp Maintainer: Peter Kharchenko <Peter_Kharchenko@hms.harvard.edu> Description: R package for analysis of ChIP-seq and other functional sequencing data. License: GPL-2 LazyLoad: yes Note: revised for compliance with CRAN by K.Pal and C.M.Livi Built: R 3.4.1; x86_64-pc-linux-gnu; 2018-05-15 17:49:52 UTC; unix Index: add.broad.peak.regions Calculate chromosome-wide profiles of smoothed tag density densum Do Something find.binding.positions Determine significant point protein binding positions (peaks) get.binding.characteristics Calculate characteristics of observed DNA-binding signal from cross-correlation profiles get.broad.enrichment.clusters Determine broad clusters of enrichment get.conservative.fold.enrichment.profile Estimate minimal fold enrichment/depletion along the chromosomes get.conservative.fold.enrichment.profile2 Return Conservative fold enrichment profile controlling for input and a single background scale get.mser Calculate minimal saturated enrichment fold ratio get.mser.interpolation Interpolate MSER dependency on the tag count get.smoothed.enrichment.mle Calculate chromosome-wide profiles of smoothed

Let us know if you have any questions. Please include your working folder and commands used in your email. Any comment and suggestion are welcome!