spp (1.15.2) for Chip-Seq

1 1 Log on to O2
2 2 Start interactive job with 4G memory and 2 CPU cores , and create working folder
3 3 Load bowtie2 module and run mapping. (note: I already run mapping. Copy them if you don't want to re-run bowtie2: /n/groups/shared_databases/rcbio/spp/input)
4 4 Start R and load spp and bioMaRt packages
5 5 Read in bowtie alignments
6 6 Identify binding characteristics
7 7 Select informative tags based on the binding characteristics
8 8 Remove singular positions with very high tag counts
9 9 Calculate genome-wide tag density and tag enrichment/depletion profiles
10 10 Detecting point binding positions
11 11 Comparing Binding Sites to Annotations Using the biomaRt package

SPP is a R package especially designed for the analysis of Chip-Seq data. The package was developed by Peter Park's group from Harvard Medical School. Here is the the original nature paper:http://www.nature.com/nbt/journal/v26/n12/full/nbt.1508.html

Here is the original tutorial created by Peter Kharchenk from the same group: http://compbio.med.harvard.edu/Supplements/ChIP-seq/tutorial.html

Based paper and tutorial above, we created a guide to use the package on O2 . Please notice: part of this guild was from the material mentioned above.

We will calculate smoothed read enrichment profile and identify sites that significantly enriched compared to control. IGB or IGV can be used to browse the data.

We will also use the biomaRt package to download annotation and identify genes within 2kb of the transcription start site (TSS).

Note: You can copy and paste all the text to your Linux command line to run. Anything with "#" is comment, and will be IGNORED by Linux or R. I have run the pipeline on testing data. All the output files can be found in /n/groups/shared_databases/rcbio/spp/output

1 Log on to O2

If you need help connecting to O2, please review the How to login to O2 wiki page.

From Windows, use MobaXterm (preferred) or PuTTY to connect to o2.hms.harvard.edu and make sure the port is set to the default value of 22.

From a Mac Terminal, use the ssh command, inserting your HMS ID instead of user123:

ssh user123@o2.hms.harvard.edu

2 Start interactive job with 4G memory and 2 CPU cores , and create working folder

srun --pty -p interactive -t 0-12:0:0 --mem 4G -n 2 /bin/bash
mkdir /n/scratch/users/${USER:0:1}/$USER/sppTest && cd /n/scratch/users/${USER:0:1}/$USER/sppTest

3 Load bowtie2 module and run mapping. (note: I already run mapping. Copy them if you don't want to re-run bowtie2: /n/groups/shared_databases/rcbio/spp/input)

# This will setup the path and environmental variables for the pipeline
module load gcc/6.2.0 bowtie2/2.2.9 samtools/0.1.19 

read_file=/n/groups/shared_databases/rcbio/spp/input/SRR1002328_1.fastq
bowtie2 -p 2 -x /n/groups/shared_databases/bowtie2_indexes/d_melanogaster_fb5_22 -U $read_file | samtools view -bS - > SRR1002328.bam

read_file=/n/groups/shared_databases/rcbio/spp/input/SRR1002329_1.fastq
bowtie2 -p 2 -x /n/groups/shared_databases/bowtie2_indexes/d_melanogaster_fb5_22 -U $read_file | samtools view -bS - > SRR1002329.bam

4 Start R and load spp and bioMaRt packages

module load gcc/6.2.0 R/3.4.1
R
.libPaths("/n/shared_db/misc/rcbio/rlib/3.4.1")
library(spp)
library(biomaRt)

5 Read in bowtie alignments

chip.data <- read.bam.tags("SRR1002329.bam")

input.data <- read.bam.tags("SRR1002328.bam")


#

6 Identify binding characteristics

# This command will uses cross-correlation profile to calculate binding peak separation distance,
# and assess whether inclusion of tags with non-perfect alignment quality improves the cross-correlation peak.
# If you would like to accept all aligned tags, specify accept.all.tags=T argument to save time.
# srange gives the possible range for the size of the protected region
# It should be higher than tag length; making the upper boundary too high will increase calculation time
# bin - bin tags within the specified number of base pairs to speed up calculation
# increasing bin size decreases the accuracy of the determined parameters
binding.characteristics <- get.binding.characteristics(chip.data,srange=c(50,500),bin=5, accept.all.tags = T)

# Print out binding peak separation distance
print(paste("binding peak separation distance =",binding.characteristics$peak$x))

# Plot cross-correlation profile, you can use pdf reader to review the plot once it is created.
pdf(file="chip.crosscorrelation.pdf",width=5,height=5)

par(mar = c(3.5,3.5,1.0,0.5), mgp = c(2,0.65,0), cex = 0.8)

plot(binding.characteristics$cross.correlation,type='l',xlab="strand shift",ylab="cross-correlation")

abline(v=binding.characteristics$peak$x,lty=2,col=2)

dev.off()


#

7 Select informative tags based on the binding characteristics

# The following calls will select tags with acceptable alignment quality:
# The chip.data and input.data will be a simple list of tag coordinate vectors. The analysis of binding characteristics described above could have been skipped by accepting
# all of the tag alignments after reading the Eland files (i.e. chip.data.info <- chip.data$tags).
# Since version 1.2, the steps can also be skipped by simply not passing the binding.characteristics argument to the select.informative.tags call.

chip.data.info <- select.informative.tags(chip.data,binding.characteristics)

input.data.info <- select.informative.tags(input.data,binding.characteristics)


#

8 Remove singular positions with very high tag counts

# The method calls below will scan along the chromosomes calculating local density of region (can be specified using window.size parameter, default is 200bp), 
# removing or restricting singular positions with extremely high tag count relative to the neighborhood.
# restrict or remove singular positions with very high tag counts

chip.data.qua <- remove.local.tag.anomalies(chip.data.info)

input.data.qua <- remove.local.tag.anomalies(input.data.info)


#

9 Calculate genome-wide tag density and tag enrichment/depletion profiles

# This will calculate genome-wide smoothed tag density (subtracting re-scaled input).
# Note that the tags are shifted by half of the peak separation distance

tag.shift <- round(binding.characteristics$peak$x/2)

smoothed.density <- get.smoothed.tag.density(chip.data.qua,control.tags=input.data.qua,bandwidth=200,step=100,tag.shift=tag.shift)

writewig(smoothed.density,"chip.smoothed.density.wig","chip smoothed, background-subtracted tag density")

# The WIG file can be read with genome browsers, such as IGB.
# The following calls will scan ChIP? and signal tag density to estimate lower bounds of tag enrichment (and upper bound of tag depletion if it is significant) along the genome.
# The resulting profile gives conservative statistical estimates of log2 fold-enrichment ratios along the genome.
# The example below uses a window of 500bp (and background windows of 1, 5, 25 and 50 times that size) and a confidence interval corresponding to 1%.
# alpha specifies significance level at which confidence intervals will be estimated

enrichment.estimates <- get.conservative.fold.enrichment.profile(chip.data.qua,input.data.qua,fws=500,step=100,alpha=0.01)

writewig(enrichment.estimates,"chip.enrichment.estimates.wig","chip conservative fold-enrichment/depletion estimates shown on log2 scale")

# The WIG file can be read with genome browsers, such as IGB.
# Broad regions of enrichment for a specified scale can be quickly identified using the following command:
broad.clusters <- get.broad.enrichment.clusters(chip.data.qua, input.data.qua, window.size=1e3, z.thr=3,tag.shift=round(binding.characteristics$peak$x/2))

# Write out in broadPeak format
write.broadpeak.info(broad.clusters,"chip.broadPeak")

# We'll also write a bed file that contains broad regions of enrichment on chromosome X for browsing in IGB.
write.table(cbind(rep("X", length(broad.clusters$X$s)), broad.clusters$X$s, broad.clusters$X$e), file="chip_enrich_broad_chrX.bed", quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t")

#

10 Detecting point binding positions

# The example below will use WTD method to call binding positions, using FDR of 5% and a window size estimated by the binding.characteristics:
# binding detection parameters
# desired FDR (5%). Alternatively, an E-value can be supplied to the method calls below instead of the fdr parameter

fdr <- 5e-2

# The binding.characteristics contains the optimized half-size for binding detection window
detection.window.halfsize <- binding.characteristics$whs

# Determine binding positions using wtd method
bp <- find.binding.positions(signal.data=chip.data.qua,control.data=input.data.qua,fdr=fdr,whs=detection.window.halfsize)
print(paste("detected",sum(unlist(lapply(bp$npl,function(d) length(d$x)))),"peaks"))

# Output detected binding positions
output.binding.results(bp,"chip.binding.positions.txt")

# We'll also write a graph file (similar to wig) that contains the binding sites and their associated 10log10(FDR) for browsing in IGB.
write.table(cbind(bp$npl$X$x, -10*log(bp$npl$X$fdr)), file="binding_sites_fdr_chrX.gr", quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t")

# Upload this bed file into IGB by going to File ! Open and browse the results.

11 Comparing Binding Sites to Annotations Using the biomaRt package

#Note: you should make sure that ensembl has the same version of reference as you used in bowtie aligner.
# We will download the ENSEMBLE fruit fly genome annotations and generate a list of ENSEMBLE gene information on chromosome X including start position, end position, strand and description

ensembl = useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")

genes.chrX = getBM(attributes = c("chromosome_name", "start_position", "end_position", "strand", "description"), filters = "chromosome_name", values= "X", mart = ensembl)

# Next, we're going to take our binding sites from the bp list and use it to determine the set of genes that contain significantly enriched sites within 2kb of their TSS.
# In order to compare enriched sites to TSS sites, we need to write an overlap function where bs represents a binding site position, ts is the annotated TSS and l is the allowed 
distance of the binding site from the TSS.

overlap = function(bs, ts, l) { if ((bs > ts - l) && (bs < ts + l)) { TRUE; } else { FALSE; } }

# Now we'll write a function that takes a vector of binding site values, start positions, end positions and strands of the genes on chromosome X
# as well as our distance cutoff. l and outputs a logical vector of the genes that contain a enriched site within l bp (i.e., TRUE value)
# or do not contain a enriched site (i.e., FALSE value).

fivePrimeGenes=function(bs, ts, te, s, l) {
    fivePrimeVec = logical();   
    for (i in 1:length(ts)) {
            fivePrime = FALSE;
            for (j in 1:length(bs)) {
                if (s[i] == 1) {
                    fivePrime = fivePrime || overlap(bs[j], ts[i], l);
                } else {
                    fivePrime = fivePrime || overlap(bs[j], te[i], l);
                }
             }
            fivePrimeVec = c(fivePrimeVec, fivePrime);
    }
    fivePrimeVec;
} 

# Using the fivePrimeGenes function, generate a vector of the TSSs and genes that contain enriched site within .2kb of their TSS (i.e., l = 2000).
fivePrimeGenesLogical = fivePrimeGenes(bp$npl$X$x, genes.chrX$start_position, genes.chrX$end_position, genes.chrX$strand, 2000)

# Find the gene located on the plus strand
fivePrimeStartsPlus = genes.chrX$start_position[fivePrimeGenesLogical & genes.chrX$strand == 1]

# Find the gene located on the minus strand
fivePrimeStartsMinus = genes.chrX$end_position[fivePrimeGenesLogical & genes.chrX$strand == -1]

# Combine the start positions together
fivePrimeStarts = sort(c(fivePrimeStartsPlus, fivePrimeStartsMinus))

# Get all the gene names
fivePrimeGenes = genes.chrX$description[fivePrimeGenesLogical]
save.image(file="8.11.RData") # you can use command load("8.11.RData") to load all the objects into R later on.

# Explore the vectors fivePrimeStarts and fivePrimeGenes.
# Use IGB to check if the genes/starts are the correct ones.
# You can use annotations in the biomaRt package to access other tools including Bioconductor to
# learn more about the set of genes that contain enriched sites or whatever regulatory factor you're studying. Have fun!

# Below are the all functions in the spp package:
> library(help=spp)

                Information on package 'spp'

Description:

Package:       spp
Type:          Package
Title:         ChIP-Seq Processing Pipeline
Version:       1.15.2
Date:          2018-02-01
Author:        Peter K
Depends:       R (>= 3.3.0), Rcpp
Imports:       Rsamtools, caTools, parallel, graphics, stats
Suggests:      methods
LinkingTo:     Rcpp
Maintainer:    Peter Kharchenko <Peter_Kharchenko@hms.harvard.edu>
Description:   R package for analysis of ChIP-seq and other functional
               sequencing data.
License:       GPL-2
LazyLoad:      yes
Note:          revised for compliance with CRAN by K.Pal and C.M.Livi
Built:         R 3.4.1; x86_64-pc-linux-gnu; 2018-05-15 17:49:52 UTC;
               unix

Index:

add.broad.peak.regions
                        Calculate chromosome-wide profiles of smoothed
                        tag density
densum                  Do Something
find.binding.positions
                        Determine significant point protein binding
                        positions (peaks)
get.binding.characteristics
                        Calculate characteristics of observed
                        DNA-binding signal from cross-correlation
                        profiles
get.broad.enrichment.clusters
                        Determine broad clusters of enrichment
get.conservative.fold.enrichment.profile
                        Estimate minimal fold enrichment/depletion
                        along the chromosomes
get.conservative.fold.enrichment.profile2
                        Return Conservative fold enrichment profile
                        controlling for input and a single background
                        scale
get.mser                Calculate minimal saturated enrichment fold
                        ratio
get.mser.interpolation
                        Interpolate MSER dependency on the tag count
get.smoothed.enrichment.mle
                        Calculate chromosome-wide profiles of smoothed

Let us know if you have any questions. Please include your working folder and commands used in your email. Any comment and suggestion are welcome!

HMS IT RC O2