Browse code

moved vignettes into Rmd

Kasper Daniel Hansen authored on 17/10/2017 02:21:02
Showing 5 changed files

... ...
@@ -35,6 +35,7 @@ Suggests:
35 35
     RUnit,
36 36
     bsseqData,
37 37
     BiocStyle,
38
+    rmarkdown,
38 39
     knitr,
39 40
 Collate:
40 41
     utils.R
41 42
new file mode 100644
... ...
@@ -0,0 +1,263 @@
1
+---
2
+title: "The bsseq User's Guide"
3
+shorttitle: "bsseq guide"
4
+author: 
5
+  - Kasper Daniel Hansen
6
+package: bsseq
7
+bibliography: bsseq.bib
8
+abstract: >
9
+  A comprehensive guide to using the bsseq package for analyzing whole-genome bisulfite sequencing
10
+  (WGBS) datta.
11
+vignette: >
12
+  %\VignetteIndexEntry{bsseq User's Guide}
13
+  %\VignetteEngine{knitr::rmarkdown}
14
+  %\VignetteEncoding{UTF-8}
15
+output: 
16
+  BiocStyle::html_document:
17
+    toc_float: true
18
+---
19
+
20
+# Introduction
21
+
22
+This R package is the reference implementation of the BSmooth algorithm for analyzing whole-genome bisulfite sequencing (WGBS) data.  This package does **not** contain alignment software, which is available from [](http://rafalab.jhsph.edu/bsmooth).  This package is not intended for analysis of single-loci bisulfite sequencing (typically Sanger bisulfite sequencing or pyro bisulfite sequencing).
23
+
24
+The package has been used to analyze capture bisulfite sequencing data.  For this type of data, the most useful parts of the package is its data-handling functions.  The BSmooth algorithm itself may not be useful for a standard capture bisulfite sequencing experiment, since it relies heavily on smoothing, which again requires that we have measured methylation in bigger regions of the genome.
25
+
26
+The BSmooth algorithm is described in detail in [@Hansen:2012].  It was applied to human cancer data in [@Hansen:2011] and we have also used it to analyze data from Rhesus Macaque [@Tung:2012].  Specifically, the algorithm uses smoothing to get reliable semi-local methylation estimates from low-coverage bisulfite sequencing.  After smoothing it uses biological replicates to estimate biological variation and identify differentially methylated regions (DMRs).  The smoothing portion could be used even on a single sample, but we believe that variation between individuals is an important aspect of DNA methylation and needs to be accounted for, see also [@Hansen:2011rna] for a relevant discussion.
27
+
28
+The main focus for our development has been the analysis of CpG methylation in humans, and we have successfully used it in other primates [@Tung:2012].  It is highly likely that the approach will work in non-human organisms, but care must be taken: the smoothing parameters (which we have selected based on human data) should be evaluated carefully.  Furthermore, it may not be suitable at all for organisms with vastly different (from humans) methylation structures.
29
+
30
+With respect to non-CpG methylation, the situation is mixed.  We have never used the algorithm to analyze non-CpG methylation, but it should be straightforward to do so.  However, the data structures used in the current code might not conveniently scale from the 28.2M CpGs in the human genome to the roughly 2x585M Cs (it may well be possible to do an separate analysis for each chromosome).  This should be less of an issue for organisms with smaller genomes.  We are considering changing these underlying data structures to allow for easier analysis of non-CpG methylation in humans.
31
+
32
+## System Requirements
33
+
34
+The package requires that all the data is loaded into system memory.  By ``data'' we do not mean the individual reads (which is big for a whole-genome experiment).  Instead, what we need are summarized data given us the number of reads supporting methylation as well as the total number of reads at
35
+each loci.  Focusing on human data, we need to work with objects with a maximum of 28.2 million entries, per sample (since there are roughly 28.2 millions CpGs in the human genome).  This provides us with an upper limit on the data object.
36
+
37
+Based on this, the 8 samples from [@Hansen:2011] including the smoothed values, take up roughly 1.2GB of RAM, meaning an analysis can easily be done with 8GB of RAM.  In order to improve speed, the package allows for easy parallel processing of samples/chromosomes.  This will require multiple copies of the data object for each core used, increasing the memory usage substantially to perhaps even 32GB.  This can be avoided by processing the samples sequentially at the cost of speed.
38
+
39
+On a 64GB node, the 8 samples from [@Hansen:2011] takes roughly one hour to process in parallel using 8 cores (using much less than 64GB of RAM in total).  This does not including parsing the data from the alignment output files.
40
+
41
+## Some terminology
42
+
43
+Because not all methylation happens at CpG sites, we have tried to use the term ``methylation loci'' instead of CpG.  We use this term to refer to a single-base methylation site.
44
+
45
+Some standard terms from the DNA methylation field: differentially methylated region (DMR), hyper methylation (methylation that is higher in one condition/sample than in another), hypo methylation (methylation that is lower in one condition/sample than in another), and finally differentially methylated position (DMP) referring to a single loci.
46
+
47
+## Citation
48
+
49
+If you use this package, please cite our BSmooth paper [@Hansen:2012].
50
+
51
+## Dependencies 
52
+
53
+```{r dependencies, warning=FALSE, message=FALSE}
54
+library(bsseq)
55
+```
56
+
57
+# Overview
58
+
59
+The package assumes that the following data has been extracted from alignments:
60
+
61
+- genomic positions, including chromosome and location, for methylation loci.
62
+- a (matrix) of M (Methylation) values, describing the number of read supporting methylation covering a single loci.  Each row in this matrix is a methylation loci and each column is a sample.
63
+- a (matrix) of Cov (Coverage) values, describing the total number of reads covering a single loci.  Each row in this matrix is a methylation loci and each column is a sample.
64
+
65
+
66
+We can have a look at some data from [@Lister:2009], specifically chromosome 22 from the IMR90 cell line.
67
+
68
+```{r BSchr22}
69
+data(BS.chr22)
70
+BS.chr22
71
+```
72
+
73
+The genomic positions are stored as a `GRanges` object `GRanges` are general genomic regions; we represent a single base methylation loci as an interval of width 1 (which may seem a bit strange, but there are good reasons for this).  For example, the first 4 loci in the Lister data are
74
+
75
+```{r ov-granges}
76
+head(granges(BS.chr22), n = 4)
77
+```
78
+
79
+We also have the M and Cov matrices
80
+
81
+```{r ov-getCoverage}
82
+head(getCoverage(BS.chr22, type = "M"), n = 4)
83
+head(getCoverage(BS.chr22), n = 4)
84
+``` 
85
+
86
+Since CpG methylation is symmetric on the two strands of a chromosome, we aggregated reads on the forward and reverse strand to form a single value, and we assume the genomic position points to the C of the CpG.  It is not crucial in any way to do this, one may easily analyze each strand separately, but CpG methylation is symmetric and this halves the number of loci.
87
+
88
+How to input your methylation data into this data structure (called a `BSseq` object) is described in a section below.  We also have a section on how to operate on these types of objects.
89
+
90
+An analysis typically consists of the following steps.
91
+
92
+1. Smoothing, using the function `BSmooth()`.
93
+2. Compute t-statistics, using the function `BSmooth.tstat()`.  This converts the `BSseq` object into a `BSseqTstat` object.
94
+3. Threshold these t-statistics to identify DMRs, using the function `dmrFinder()`, returning a simple `data.frame`.
95
+
96
+It is usually a good idea to look at the smoothed data either before or after identifying DMRs.  This can be done using the functions `plotRegion()` and `plotManyRegions()`.
97
+
98
+We also have functions for assessing goodness of fit for binomial and poison models; this is like to be of marginal interest to most users.  See the man page for `binomialGoodnessOfFit()`.
99
+
100
+We also allow for easy computation of Fisher's exact test for each loci, using the function `fisherTests()`.
101
+
102
+# Using objects of class BSseq
103
+
104
+## Basic operations
105
+
106
+Objects of class `BSseq` contains a `GRanges` object with the genomic locations.  This `GRanges` object can be obtained by `granges()`.  A number of standard `GRanges` methods works directly on the `BSseq` object, such as `start()`, `end()`, `seqnames()` (chromosome names) etc.
107
+
108
+These objects also contains a `phenoData` object for sample pheno data.  Useful methods are `sampleNames()`, `pData()`.
109
+
110
+Finally, we have methods such as `dim()`, `ncol()` (number of columns; number of samples), `nrow()` (number of rows; number of methylation loci).
111
+
112
+Objects can be subsetted using two indicies `BSseq[i,j]` with the first index being methylation loci and the second index being samples.  Another very useful way of subsetting the object is by using the method `subsetByOverlaps()`.  This selects all methylation loci inside a set of genomic intervals (there is a difference between first and second argument and either can be
113
+`BSseq` or `GRanges`).
114
+
115
+Examples:
116
+```{r using-examples}
117
+head(start(BS.chr22), n = 4)
118
+head(seqnames(BS.chr22), n = 4)
119
+sampleNames(BS.chr22)
120
+pData(BS.chr22)
121
+dim(BS.chr22)
122
+BS.chr22[1:6,1]
123
+subsetByOverlaps(BS.chr22, 
124
+                 GRanges(seqnames = "chr22", 
125
+                         ranges = IRanges(start = 1, end = 2*10^7)))
126
+``` 
127
+
128
+## Data handling 
129
+
130
+We have a number of functions for manipulating one or more `BSseq` objects.
131
+
132
+`BSseq()` instantiates an object of class `BSseq`.  Genomic locations are passed in, either as a `GRanges` object (argument `gr`) or as chromosome and location vectors (arguments `chr` and `pos`).  The arguments `M` and `Cov` accepts matrices, and it is possible to directly give it a `phenoData` object.
133
+
134
+```{r data-bsseq}
135
+M <- matrix(0:8, 3, 3)
136
+Cov <- matrix(1:9, 3, 3)
137
+BStmp <- BSseq(chr = c("chr1", "chrX", "chr1"), pos = 1:3, 
138
+               M = M, Cov = Cov, sampleNames = c("A1","A2", "B"))
139
+``` 
140
+
141
+A `BSseq` object may be ordered by `orderBSseq()`.  This ensures that data from a single chromosome appears in an ordered, contiguous block.  There is also the possibility for specifying the chromosome order (this is less important).  The smoothing functions assumes that the underlying `BSseq` has been ordered.
142
+
143
+```{r data-order}
144
+granges(BStmp)
145
+BStmp <- orderBSseq(BStmp, seqOrder = c("chr1", "chrX"))
146
+granges(BStmp)
147
+``` 
148
+
149
+`chrSelectBSseq()` performs the often useful task of selecting one or more chromosomes and can also order the output.  In case `order = TRUE`, the output is ordered according to the order of the `seqnames` argument.
150
+```{r data-chrSelect}
151
+chrSelectBSseq(BStmp, seqnames = "chr1", order = TRUE)
152
+``` 
153
+Of course, in this case the function does little, since `BS.chr22` already only contains data
154
+from chromosome 22.
155
+
156
+`combine()` combines two `BSseq` objects in the following way: the samples for the return objects is the union of the samples from the two objects, and the methylation loci are the union of the two methylation loci.  The two objects do not need to have measured the same loci (in the example below, `BStmp` has data on chromosome 1 and X).
157
+
158
+```{r data-combine}
159
+BStmp2 <- combine(BStmp, BS.chr22[1:3,])
160
+granges(BStmp2)
161
+getCoverage(BStmp2)
162
+``` 
163
+
164
+`collapseBSseq` performs the often useful task of adding several columns of a `BSseq` object.  This is often used in the beginning of an analysis where each column may correspond to a lane and several such columns represents the data for a single biological sample.
165
+
166
+```{r data-collapse}
167
+collapseBSseq(BStmp, columns = c("A", "A", "B"))
168
+``` 
169
+
170
+## Obtaining coverage (methylation)
171
+
172
+Coverage, either Cov or M values, are obtained by `getCoverage()`, using the `type` argument:
173
+
174
+```{r getCoverage1}
175
+head(getCoverage(BS.chr22, type = "Cov"), n = 4)
176
+head(getCoverage(BS.chr22, type = "M"), n = 4)
177
+``` 
178
+
179
+This will return a -- possibly very large -- matrix.  It is also possible to get region based coverage by using the `regions` argument to the function.  This argument is either a `data.frame` (with columns `chr`, `start` and `end`) or a `GRanges` object.  Let us do an example
180
+```{r regions}
181
+regions <- GRanges(seqnames = c("chr22", "chr22"), 
182
+                   ranges = IRanges(start = 1.5 * 10^7 + c(0,200000), 
183
+                                    width = 1000))
184
+getCoverage(BS.chr22, regions = regions, what = "perRegionTotal")
185
+``` 
186
+When `what` is `perRegionTotal` the return value is the total coverage of each region (and note that regions without coverage return `NA`).  Similarly, `perRegionAverage` yields the average coverage in the region.  However, it is often useful to get the actual values, like
187
+```{r getCoverage2}
188
+getCoverage(BS.chr22, regions = regions, what = "perBase")
189
+``` 
190
+This is the default behaviour, and it returns a list where each element corresponds to a region.  Note that regions with no coverage gets a `NULL`.
191
+
192
+Methylation estimates can be obtained in the same way, using the function `getMeth()`.  If `type` is set to `raw` the function returns simple single-loci methylation estimates (which are M/Cov).  To get smoothed estimates, the `BSseq` object needs to have been smoothed using Bsmooth, and `type` set to `smooth` (default).  The `getCoverage(0`, `getMeth()` has a `regions` and a `what` argument.  For `getMeth()` the `what` argument can be `perBase` or `perRegion` (the later is really the per-region average methylation).  Additionally, confidence intervals can be computed using a method taking possibly low coverage loci into account as well as loci where the methylation percentage might be close to 0 or 1 [@Agresti:1998].  Currently, confidence intervals cannot be computed for region-level summary estimates.  Examples
193
+
194
+```{r getMeth}
195
+getMeth(BS.chr22, regions, type = "raw")
196
+getMeth(BS.chr22, regions[2], type = "raw", what = "perBase")
197
+``` 
198
+
199
+# Reading data
200
+
201
+## Alignment output from the BSmooth alignment suite
202
+
203
+It is straightforward to read output files (summarized evidence) from the BSmooth alignment suite, using `read.bsmooth()`.  This function takes as argument a number of directories, usually corresponding to samples, with the alignment output.  Sometimes, one might want to read only certain chromosomes, which can be accomplished by the `seqnames` argument.  Also, the default values of `qualityCutoff = 20` ensures that we do not use methylation evidence with a base quality strictly lower than 20 (since we may not be confident whether the read really supports methylation or not).  The function `read.bsmooth()` allows for both gzipped and non-gzipped input files. It is faster to read gzipped files, so we recommend gzipping post-alignment.
204
+
205
+During the development of BSmooth we experimented with a number of different output formats.  These legacy formats can be read with `read.umtab()` and `read.umtab2()`.
206
+
207
+## Alignment output from other aligners
208
+
209
+We are interested in adding additional parsers to the package; if your favorite alignment software is not supported, feel free to get in touch.
210
+
211
+In general, we need summarized methylation evidence.  In short, for each methylation loci, we need to know how many reads cover the loci and how many of those reads support methylation.
212
+
213
+As an example, consider the Lister data.  The files posted online looks like
214
+```
215
+> head mc_imr90_r1_22
216
+assembly        position        strand  class   mc      h
217
+22      14430632        +       CG      9       10
218
+22      14430633        -       CG      8       8
219
+22      14430677        +       CG      1       1
220
+22      14430678        -       CG      3       10
221
+22      14430688        -       CG      6       10
222
+22      14430703        -       CG      2       8
223
+22      14431244        +       CG      5       10
224
+22      14431245        -       CG      5       11
225
+```
226
+
227
+For these files, the evidence is split by strand.  It is often easiest to read in one sample at a time, instantiate a `BSseq` object and then use `combine()` and `collapseBSseq()` to combine the samples, since these functions deal objects that have different sets of CpGs.  In the Lister data, note that the position is the position of the "C", so basically, if you want to combine the evidence from both strands, CpGs on the "-" strand needs to have 1 subtracted from their position.  A full script showing how these files are used to create `BS.chr22` can be found in `inst/scripts/get\_BS.chr22.R` in the `r Biocpkg("bsseq")` package.  The path of this file may be found by
228
+```{r locateScript, results="hide"}
229
+system.file("scripts", "get_BS.chr22.R", package = "bsseq")
230
+``` 
231
+
232
+# Analysis
233
+
234
+Computing smoothed methylation levels is simple.  We subset ``BS.chr22` so we only smooth 1 sample, for run-time purpose
235
+
236
+```{r BSmooth}
237
+BS.chr22.1 <- BSmooth(BS.chr22[,"r1"], verbose = TRUE)
238
+BS.chr22.1
239
+``` 
240
+
241
+A number of arguments pertains to using multiple cores.  This is useful if you have multiple samples or chromosomes.  `mc.cores` tells `BSmooth()` how many cores to use and `parallelBy` describes whether the parallelization is over samples or chromosomes.  If the parallelization is over samples, each core will compute the smoothing of the entire genome.  And if it is over samples, each cores will smooth all samples for one or more chromosomes.  The appropriate choice depends on the number of samples and cpu cores available.  If you have the exact same number of samples as cores, the fastest is `parallelBy = "sample"`.  If you have less samples than cores, the fastest is `parallelBy = "chromosome"`.  The argument `mc.preshedule` should not need to be changed (unless perhaps if a small value of `maxGap` is uses); see the man page for `parallel::mclapply`.  Note that setting `mc.cores` to a value greater than 1 is not support on MS Windows due to a limitation of the operating system.
242
+
243
+For a more detailed discussion of the analysis tools, read the companion vignette ``Analyzing WGBS with the bsseq package'', which also finding DMRs and plotting them.
244
+
245
+Fisher's exact tests may be efficiently computed using the function `fisherTests`()`.
246
+
247
+Binomial and poisson goodness of fit tests statistics may be computed using `binomialGoodnessOfFit()` and `poissonGoodnessOfFit()`.
248
+
249
+# sessionInfo()
250
+
251
+```{r sessionInfo, echo=FALSE}
252
+sessionInfo()
253
+```
254
+
255
+# References
256
+
257
+<!--
258
+Local Variables:
259
+% eval: (add-hook 'LaTeX-mode-hook '(lambda () (if (string= (buffer-name) "bsseq.Rnw") (setq fill-column 100))))
260
+% LocalWords: LocalWords bisulfite methylation methylated CpG CpGs DMR
261
+% LocalWords: DMRs differentially
262
+% End:
263
+-->
0 264
deleted file mode 100644
... ...
@@ -1,380 +0,0 @@
1
-%\VignetteIndexEntry{The bsseq user's guide}
2
-%\VignetteDepends{bsseq}
3
-%\VignettePackage{bsseq}
4
-%\VignetteEngine{knitr::knitr}
5
-\documentclass[12pt]{article}
6
-<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
7
-BiocStyle::latex()
8
-@
9
-%% \newcommand{\bold}[1]{\textbf{#1}} % Fix for R bibentry
10
-%% \newcommand{\Rhref}[2]{\href{#1}{#2}}
11
-
12
-\title{The bsseq user's guide}
13
-\author{Kasper Daniel Hansen \\ \texttt{kasperdanielhansen@gmail.com}}
14
-\date{Modified: June 10, 2015.  Compiled: \today}
15
-\begin{document}
16
-
17
-\maketitle
18
-
19
-\section*{Capabilities}
20
-
21
-The packages required for this vignette are
22
-
23
-<<load, warning=FALSE, message=FALSE>>=
24
-library(bsseq)
25
-@ 
26
-
27
-\subsubsection*{Introduction}
28
-
29
-This \R\ package is the reference implementation of the BSmooth algorithm for analyzing whole-genome
30
-bisulfite sequencing (WGBS) data.  This package does \emph{not} contain alignment software, which is
31
-available from \url{http://rafalab.jhsph.edu/bsmooth}.  This package is not intended for analysis of
32
-single-loci bisulfite sequencing (typically Sanger bisulfite sequencing or pyro bisulfite
33
-sequencing).
34
-
35
-The package has been used to analyze capture bisulfite sequencing data.  For this type of data, the
36
-most useful parts of the package is its data-handling functions.  The BSmooth algorithm itself may
37
-not be useful for a standard capture bisulfite sequencing experiment, since it relies heavily on
38
-smoothing, which again requires that we have measured methylation in bigger regions of the
39
-genome.
40
-
41
-The BSmooth algorithm is described in detail in \cite{Hansen:2012}.  It was applied to human cancer
42
-data in \cite{Hansen:2011} and we have also used it to analyze data from Rhesus Macaque
43
-\cite{Tung:2012}.  Specifically, the algorithm uses smoothing to get reliable semi-local methylation
44
-estimates from low-coverage bisulfite sequencing.  After smoothing it uses biological replicates to
45
-estimate biological variation and identify differentially methylated regions (DMRs).  The smoothing
46
-portion could be used even on a single sample, but we believe that variation between individuals
47
-is an important aspect of DNA methylation and needs to be accounted for, see also
48
-\cite{Hansen:2011rna} for a relevant discussion.
49
-
50
-The main focus for our development has been the analysis of CpG methylation in humans, and we have
51
-successfully used it in other primates \cite{Tung:2012}.  It is highly likely that the approach will
52
-work in non-human organisms, but care must be taken: the smoothing parameters (which we have
53
-selected based on human data) should be evaluated carefully.  Furthermore, it may not be suitable at
54
-all for organisms with vastly different (from humans) methylation structures.
55
-
56
-With respect to non-CpG methylation, the situation is mixed.  We have never used the algorithm to
57
-analyze non-CpG methylation, but it should be straightforward to do so.  However, the data
58
-structures used in the current code might not conveniently scale from the 28.2M CpGs in the human
59
-genome to the roughly 2x585M Cs (it may well be possible to do an separate analysis for each
60
-chromosome).  This should be less of an issue for organisms with smaller genomes.  We are
61
-considering changing these underlying data structures to allow for easier analysis of non-CpG
62
-methylation in humans.
63
-
64
-\subsubsection*{System Requirements}
65
-
66
-The package requires that all the data is loaded into system memory.  By ``data'' we do not mean the
67
-individual reads (which is big for a whole-genome experiment).  Instead, what we need are summarized
68
-data given us the number of reads supporting methylation as well as the total number of reads at
69
-each loci.  Focusing on human data, we need to work with objects with a maximum of 28.2 million
70
-entries, per sample (since there are roughly 28.2 millions CpGs in the human genome).  This provides
71
-us with an upper limit on the data object.  
72
-
73
-Based on this, the 8 samples from \cite{Hansen:2011} including the smoothed values, take up roughly
74
-1.2GB of RAM, meaning an analysis can easily be done with 8GB of RAM.  In order to improve speed,
75
-the package allows for easy parallel processing of samples/chromosomes.  This will require multiple
76
-copies of the data object for each core used, increasing the memory usage substantially to perhaps
77
-even 32GB.  This can be avoided by processing the samples sequentially at the cost of speed.
78
-
79
-On a 64GB node, the 8 samples from \cite{Hansen:2011} takes roughly one hour to process in parallel
80
-using 8 cores (using much less than 64GB of RAM in total).  This does not including parsing the data
81
-from the alignment output files.
82
-
83
-\subsubsection*{Some terminology}
84
-
85
-Because not all methylation happens at CpG sites, we have tried to use the term ``methylation loci''
86
-instead of CpG.  We use this term to refer to a single-base methylation site.
87
-
88
-Some standard terms from the DNA methylation field: differentially methylated region (DMR), hyper
89
-methylation (methylation that is higher in one condition/sample than in another), hypo methylation
90
-(methylation that is lower in one condition/sample than in another), and finally differentially
91
-methylated position (DMP) referring to a single loci.
92
-
93
-\subsubsection*{Citation}
94
-
95
-If you use this package, please cite our BSmooth paper \cite{Hansen:2012}.
96
-
97
-\section*{Overview}
98
-
99
-The package assumes that the following data has been extracted from alignments:
100
-\begin{enumerate}
101
-\item genomic positions, including chromosome and location, for methylation loci.
102
-\item a (matrix) of M (Methylation) values, describing the number of read supporting methylation covering a
103
-  single loci.  Each row in this matrix is a methylation loci and each column is a sample.
104
-\item a (matrix) of Cov (Coverage) values, describing the total number of reads covering a
105
-  single loci.  Each row in this matrix is a methylation loci and each column is a sample.
106
-\end{enumerate}
107
-
108
-We can have a look at some data from \cite{Lister:2009}, specifically chromosome 22 from the IMR90
109
-cell line.
110
-
111
-<<BSchr22>>=
112
-data(BS.chr22)
113
-BS.chr22
114
-@ 
115
-
116
-The genomic positions are stored as a \Rcode{GRanges} object \Rcode{GRanges} are general genomic
117
-regions; we represent a single base methylation loci as an interval of width 1 (which may seem a bit
118
-strange, but there are good reasons for this).  For example, the first 4 loci in the Lister data are
119
-
120
-<<ov-granges>>=
121
-head(granges(BS.chr22), n = 4)
122
-@ 
123
-
124
-We also have the M and Cov matrices
125
-<<ov-getCoverage>>=
126
-head(getCoverage(BS.chr22, type = "M"), n = 4)
127
-head(getCoverage(BS.chr22), n = 4)
128
-@ 
129
-
130
-Since CpG methylation is symmetric on the two strands of a chromosome, we aggregated reads on the
131
-forward and reverse strand to form a single value, and we assume the genomic position points to the
132
-C of the CpG.  It is not crucial in any way to do this, one may easily analyze each strand
133
-separately, but CpG methylation is symmetric and this halves the number of loci.
134
-
135
-How to input your methylation data into this data structure (called a \Rcode{BSseq} object) is
136
-described in a section below.  We also have a section on how to operate on these types of objects.
137
-
138
-An analysis typically consists of the following steps.
139
-\begin{enumerate}
140
-\item Smoothing, using the function \Rcode{BSmooth}.
141
-\item Compute t-statistics, using the function \Rcode{BSmooth.tstat}.  This converts the
142
-  \Rcode{BSseq} object into a \Rcode{BSseqTstat} object.
143
-\item Threshold these t-statistics to identify DMRs, using the function \Rcode{dmrFinder}, returning
144
-  a simple \Rcode{data.frame}.
145
-\end{enumerate}
146
-It is usually a good idea to look at the smoothed data either before or after identifying DMRs.
147
-This can be done using the functions \Rcode{plotRegion} and \Rcode{plotManyRegions}.
148
-
149
-We also have functions for assessing goodness of fit for binomial and poison models; this is like to
150
-be of marginal interest to most users.  See the man page for \Rcode{binomialGoodnessOfFit}.
151
-
152
-We also allow for easy computation of Fisher's exact test for each loci, using the function
153
-\Rcode{fisherTests}. 
154
-
155
-\section*{Using objects of class BSseq}
156
-
157
-\subsubsection*{Basic operations}
158
-
159
-Objects of class \Rcode{BSseq} contains a \Rcode{GRanges} object with the genomic locations.  This
160
-\Rcode{GRanges} object can be obtained by \Rcode{granges}.  A number of standard \Rcode{GRanges}
161
-methods works directly on the \Rcode{BSseq} object, such as \Rcode{start}, \Rcode{end},
162
-\Rcode{seqnames} (chromosome names) etc.
163
-
164
-These objects also contains a \Rcode{phenoData} object for sample pheno data.  Useful methods are
165
-\Rcode{sampleNames}, \Rcode{pData}.
166
-
167
-Finally, we have methods such as \Rcode{dim}, \Rcode{ncol} (number of columns; number of samples),
168
-\Rcode{nrow} (number of rows; number of methylation loci).
169
-
170
-Objects can be subsetted using two indicies \Rcode{BSseq[i,j]} with the first index being
171
-methylation loci and the second index being samples.  Another very useful way of subsetting the
172
-object is by using the method \Rcode{subsetByOverlaps}.  This selects all methylation loci inside a
173
-set of genomic intervals (there is a difference between first and second argument and either can be
174
-\Rcode{BSseq} or \Rcode{GRanges}).  
175
-
176
-Examples:
177
-<<using-examples>>=
178
-head(start(BS.chr22), n = 4)
179
-head(seqnames(BS.chr22), n = 4)
180
-sampleNames(BS.chr22)
181
-pData(BS.chr22)
182
-dim(BS.chr22)
183
-BS.chr22[1:6,1]
184
-subsetByOverlaps(BS.chr22, 
185
-                 GRanges(seqnames = "chr22", 
186
-                         ranges = IRanges(start = 1, end = 2*10^7)))
187
-@ 
188
-
189
-\subsubsection*{Data handling}
190
-
191
-We have a number of functions for manipulating one or more \Rcode{BSseq} objects.
192
-
193
-\Rcode{BSseq} instantiates an object of class \Rcode{BSseq}.  Genomic locations are passed in,
194
-either as a \Rcode{GRanges} object (argument \Rcode{gr}) or as chromosome and location vectors
195
-(arguments \Rcode{chr} and \Rcode{pos}).  The arguments \Rcode{M} and \Rcode{Cov} accepts matrices,
196
-and it is possible to directly give it a \Rcode{phenoData} object.
197
-
198
-<<data-bsseq>>=
199
-M <- matrix(0:8, 3, 3)
200
-Cov <- matrix(1:9, 3, 3)
201
-BStmp <- BSseq(chr = c("chr1", "chrX", "chr1"), pos = 1:3, 
202
-               M = M, Cov = Cov, sampleNames = c("A1","A2", "B"))
203
-@ 
204
-
205
-A \Rcode{BSseq} object may be ordered by \Rcode{orderBSseq}.  This ensures that data from a single
206
-chromosome appears in an ordered, contiguous block.  There is also the possibility for specifying
207
-the chromosome order (this is less important).  The smoothing functions assumes that the underlying
208
-\Rcode{BSseq} has been ordered.
209
-
210
-<<data-order>>=
211
-granges(BStmp)
212
-BStmp <- orderBSseq(BStmp, seqOrder = c("chr1", "chrX"))
213
-granges(BStmp)
214
-@ 
215
-
216
-\Rcode{chrSelectBSseq} performs the often useful task of selecting one or more chromosomes and can
217
-also order the output.  In case \Rcode{order = TRUE}, the output is ordered according to the order
218
-of the \Rcode{seqnames} argument.
219
-<<data-chrSelect>>=
220
-chrSelectBSseq(BStmp, seqnames = "chr1", order = TRUE)
221
-@ 
222
-Of course, in this case the function does little, since \Rcode{BS.chr22} already only contains data
223
-from chromosome 22.
224
-
225
-\Rcode{combine} combines two \Rcode{BSseq} objects in the following way: the samples for the return
226
-objects is the union of the samples from the two objects, and the methylation loci are the union of
227
-the two methylation loci.  The two objects do not need to have measured the same loci (in the
228
-example below, \Rcode{BStmp} has data on chromosome 1 and X)..
229
-
230
-<<data-combine>>=
231
-BStmp2 <- combine(BStmp, BS.chr22[1:3,])
232
-granges(BStmp2)
233
-getCoverage(BStmp2)
234
-@ 
235
-
236
-\Rcode{collapseBSseq} performs the often useful task of adding several columns of a \Rcode{BSseq}
237
-object.  This is often used in the beginning of an analysis where each column may correspond to a
238
-lane and several such columns represents the data for a single biological sample.
239
-
240
-<<data-collapse>>=
241
-collapseBSseq(BStmp, columns = c("A", "A", "B"))
242
-@ 
243
-
244
-\subsubsection*{Obtaining coverage (methylation)}
245
-
246
-Coverage, either Cov or M values, are obtained by \Rcode{getCoverage}, using the \Rcode{type} argument:
247
-<<getCoverage1>>=
248
-head(getCoverage(BS.chr22, type = "Cov"), n = 4)
249
-head(getCoverage(BS.chr22, type = "M"), n = 4)
250
-@ 
251
-This will return a -- possibly very large -- matrix.  It is also possible to get region based
252
-coverage by using the \Rcode{regions} argument to the function.  This argument is either a
253
-\Rcode{data.frame} (with columns \Rcode{chr}, \Rcode{start} and \Rcode{end}) or a \Rcode{GRanges}
254
-object.  Let us do an example
255
-<<regions>>=
256
-regions <- GRanges(seqnames = c("chr22", "chr22"), 
257
-                   ranges = IRanges(start = 1.5 * 10^7 + c(0,200000), 
258
-                                    width = 1000))
259
-getCoverage(BS.chr22, regions = regions, what = "perRegionTotal")
260
-@ 
261
-When \Rcode{what} is \Rcode{perRegionTotal} the return value is the total coverage of each region
262
-(and note that regions without coverage return \Rcode{NA}).  Similarly, \Rcode{perRegionAverage}
263
-yields the average coverage in the region.  However, it is often useful to get the actual values, like
264
-<<getCoverage2>>=
265
-getCoverage(BS.chr22, regions = regions, what = "perBase")
266
-@ 
267
-This is the default behaviour, and it returns a list where each element corresponds to a region.
268
-Note that regions with no coverage gets a \Rcode{NULL}.
269
-
270
-Methylation estimates can be obtained in the same way, using the function \Rcode{getMeth}.  If
271
-\Rcode{type} is set to \Rcode{raw} the function returns simple single-loci methylation estimates
272
-(which are $\textrm{M}/\textrm{Cov}$).  To get smoothed estimates, the \Rcode{BSseq} object needs to
273
-have been smoothed using Bsmooth, and \Rcode{type} set to \Rcode{smooth} (default).  The
274
-\Rcode{getCoverage}, \Rcode{getMeth} has a \Rcode{regions} and a \Rcode{what} argument.  For
275
-\Rcode{getMeth} the \Rcode{what} argument can be \Rcode{perBase} or \Rcode{perRegion} (the later is
276
-really the per-region average methylation).  Additionally, confidence intervals can be computed
277
-using a method taking possibly low coverage loci into account as well as loci where the methylation
278
-percentage might be close to 0 or 1 \cite{Agresti:1998}.  Currently, confidence intervals cannot be
279
-computed for region-level summary estimates.  Examples
280
-
281
-<<getMeth>>=
282
-getMeth(BS.chr22, regions, type = "raw")
283
-getMeth(BS.chr22, regions[2], type = "raw", what = "perBase")
284
-@ 
285
-
286
-\section*{Reading data}
287
-
288
-\subsubsection*{Alignment output from the BSmooth alignment suite}
289
-
290
-It is straightforward to read output files (summarized evidence) from the BSmooth alignment suite,
291
-using \Rcode{read.bsmooth}.  This function takes as argument a number of directories, usually
292
-corresponding to samples, with the alignment output.  Sometimes, one might want to read only certain
293
-chromosomes, which can be accomplished by the \Rcode{seqnames} argument.  Also, the default values
294
-of \Rcode{qualityCutoff = 20} ensures that we do not use methylation evidence with a base quality
295
-strictly lower than 20 (since we may not be confident whether the read really supports methylation
296
-or not).  The function \Rcode{read.bsmooth} allows for both gzipped and non-gzipped input files. It
297
-is faster to read gzipped files, so we recommend gzipping post-alignment.
298
-
299
-During the development of BSmooth we experimented with a number of different output formats.  These
300
-legacy formats can be read with \Rcode{read.umtab} and \Rcode{read.umtab2}.
301
-
302
-\subsubsection*{Alignment output from other aligners}
303
-
304
-We are interested in adding additional parsers to the package; if your favorite alignment software
305
-is not supported, feel free to get in touch.
306
-
307
-In general, we need summarized methylation evidence.  In short, for each methylation loci, we need
308
-to know how many reads cover the loci and how many of those reads support methylation.
309
-
310
-As an example, consider the Lister data.  The files posted online looks like
311
-\begin{verbatim}
312
-> head mc_imr90_r1_22
313
-assembly        position        strand  class   mc      h
314
-22      14430632        +       CG      9       10
315
-22      14430633        -       CG      8       8
316
-22      14430677        +       CG      1       1
317
-22      14430678        -       CG      3       10
318
-22      14430688        -       CG      6       10
319
-22      14430703        -       CG      2       8
320
-22      14431244        +       CG      5       10
321
-22      14431245        -       CG      5       11
322
-\end{verbatim}
323
-For these files, the evidence is split by strand.  It is often easiest to read in one sample at a
324
-time, instantiate a \Rcode{BSseq} object and the use \Rcode{combine} and \Rcode{collapseBSseq} to
325
-combine the samples, since these functions deal objects that have different sets of CpGs.  In the
326
-Lister data, note that the position is the position of the ``C'', so basically, if you want to
327
-combine the evidence from both strands, CpGs on the ``-'' strand needs to have 1 subtracted from
328
-their position.  A full script showing how these files are used to create \Rcode{BS.chr22} can be
329
-found in \texttt{inst/scripts/get\_BS.chr22.R} in the \Rpackage{bsseq} package.  The path of this
330
-file may be found by
331
-<<locateScript,results="hide">>=
332
-system.file("scripts", "get_BS.chr22.R", package = "bsseq")
333
-@ 
334
-
335
-\section*{Analysis}
336
-
337
-Computing smoothed methylation levels is simple.  We subset \Rcode{BS.chr22} so we only smooth 1
338
-sample, for run-time purpose
339
-
340
-<<BSmooth>>=
341
-BS.chr22.1 <- BSmooth(BS.chr22[,"r1"], verbose = TRUE)
342
-BS.chr22.1
343
-@ 
344
-
345
-A number of arguments pertains to using multiple cores.  This is useful if you have multiple samples
346
-or chromosomes.  \Rcode{mc.cores} tells \Rcode{BSmooth} how many cores to use and \Rcode{parallelBy}
347
-describes whether the parallelization is over samples or chromosomes.  If the parallelization is
348
-over samples, each core will compute the smoothing of the entire genome.  And if it is over
349
-samples, each cores will smooth all samples for one or more chromosomes.  The appropriate choice
350
-depends on the number of samples and cpu cores available.  If you have the exact same number of
351
-samples as cores, the fastest is \Rcode{parallelBy = "sample"}.  If you have less samples than
352
-cores, the fastest is \Rcode{parallelBy = "chromosome"}.  The argument \Rcode{mc.preshedule} should
353
-not need to be changed (unless perhaps if a small value of \Rcode{maxGap} is uses); see the man page
354
-for \Rcode{parallel::mclapply}.  Note that setting \Rcode{mc.cores} to a value greater than 1 is not
355
-support on MS Windows due to a limitation of the operating system.
356
-
357
-For a more detailed discussion of the analysis tools, read the companion vignette ``Analyzing WGBS
358
-with the bsseq package'', which also finding DMRs and plotting them.
359
-
360
-Fisher's exact tests may be efficiently computed using the function \Rcode{fisherTests}.
361
-
362
-Binomial and poisson goodness of fit tests statistics may be computed using \Rcode{binomialGoodnessOfFit} and \Rcode{poissonGoodnessOfFit}.
363
-
364
-\bibliography{bsseq}
365
-
366
-
367
-\section*{SessionInfo}
368
-
369
-<<sessionInfo,results="asis",eval=TRUE,echo=FALSE>>=
370
-toLatex(sessionInfo())
371
-@ 
372
-
373
-\end{document}
374
-
375
-% Local Variables:
376
-% eval: (add-hook 'LaTeX-mode-hook '(lambda () (if (string= (buffer-name) "bsseq.Rnw") (setq fill-column 100))))
377
-% LocalWords: LocalWords bisulfite methylation methylated CpG CpGs DMR
378
-% LocalWords: DMRs differentially
379
-% End:
380
-
381 0
new file mode 100644
... ...
@@ -0,0 +1,249 @@
1
+---
2
+title: "Analyzing WGBS data with bsseq"
3
+shorttitle: "bsseq analysis"
4
+author: 
5
+  - Kasper Daniel Hansen
6
+package: bsseq
7
+bibliography: bsseq.bib
8
+abstract: >
9
+  An informal guide to analyzing WGBS using bsseq.
10
+  (WGBS) datta.
11
+vignette: >
12
+  %\VignetteIndexEntry{Analyzing WGBS data with bsseq}
13
+  %\VignetteEngine{knitr::rmarkdown}
14
+  %\VignetteEncoding{UTF-8}
15
+output: 
16
+  BiocStyle::html_document:
17
+    toc_float: true
18
+---
19
+
20
+# Introduction
21
+
22
+This document discusses the ins and outs of an analysis of a whole-genome shotgun bisulfite sequencing (WGBS) dataset, using the BSmooth algorithm, which was first used in [@Hansen:2011] and more formally presented and evaluated in [@Hansen:2012].  The intention with the document is to focus on analysis-related tasks and questions.  Basic usage of the `r Biocpkg("bsseq")` package is covered in "The bsseq user's guide".  It may be useful to consult the user's guide while reading this analysis guide.
23
+
24
+In this vignette we analyze chromosome 21 and 22 from [@Hansen:2011].  This is primary data from 3 patients with colon cancer.  For each patient we have normal colon tissue as well as cancer colon.  The samples were run on ABI SOLiD and we generated 50bp single-end reads.  The reads were aligned using the Merman aligner in the BSmooth suite [](http://www.rafalab.org/bsmooth).  See the primary publication for more details [@Hansen:2011].
25
+
26
+This data is contained in the `r Biocpkg("bsseqData")`
27
+
28
+```{r dependencies, warning=FALSE, message=FALSE}
29
+library(bsseq)
30
+library(bsseqData)
31
+```
32
+
33
+The `r Biocpkg("bsseqData")` contains a script, `inst/script/create\_BS.cancer.R`, describing how this data is created from the Merman alignment output (also contained in the package).  Note that the current version of the BSmooth pipeline uses a slightly different alignment output format.
34
+
35
+The following object contains the unsmoothed "raw" summarized alignment data.
36
+```{r showData}
37
+data(BS.cancer.ex)
38
+BS.cancer.ex <- updateObject(BS.cancer.ex)
39
+BS.cancer.ex
40
+pData(BS.cancer.ex)
41
+``` 
42
+
43
+If you use this package, please cite our BSmooth paper [@Hansen:2012].
44
+
45
+# Smoothing
46
+
47
+The first step of the analysis is to smooth the data
48
+
49
+```{r smooth,eval=FALSE}
50
+BS.cancer.ex.fit <- BSmooth(BS.cancer.ex, mc.cores = 1, verbose = TRUE)
51
+``` 
52
+This particular piece of code is not being run when the vignette is being created.  It takes roughly 2 minutes per sample.  If you have 6 cores available, use `mc.cores = 6` and the total run time will be roughly 2 minutes.  Note that setting `mc.cores` to a value greater than 1 is not support on MS Windows due to a limitation of the operating system.
53
+
54
+For ease of use, the `r Biocpkg("bsseqData")` includes the result of this command:
55
+```{r showDataFit}
56
+data(BS.cancer.ex.fit)
57
+BS.cancer.ex.fit <- updateObject(BS.cancer.ex.fit)
58
+BS.cancer.ex.fit
59
+``` 
60
+
61
+This step uses parallelization where each sample is run on a separate core using `mclapply()` from the `r CRANpkg("parallel")` package.  This form of parallelization is built into `r Biocpkg("bsseq")`, and (as written) requires access to a machine with 6 cores and enough RAM.  The smoothing step is being done completely independently on each sample, so if you have a lot of samples (or other circumstances), an alternative is to split the computations manually.  A later subsection shows some example code for doing that.
62
+
63
+Let us discuss coverage and representation.  The `BS.cancer.ex` object contains all annotated CpGs on human chromosome 21 and 22, whether or not there is actual data.  Since we have multiple samples, we can roughly divide the genome into 3 categories: CpGs where all samples have data, CpGs where none of the samples have data and CpGs where some, but not all, of the samples have data.  Examining the object at hand, we get
64
+
65
+```{r cpgNumbers}
66
+## The average coverage of CpGs on the two chromosomes
67
+round(colMeans(getCoverage(BS.cancer.ex)), 1)
68
+## Number of CpGs in two chromosomes
69
+length(BS.cancer.ex)
70
+## Number of CpGs which are covered by at least 1 read in all 6 samples
71
+sum(rowSums(getCoverage(BS.cancer.ex) >= 1) == 6)
72
+## Number of CpGs with 0 coverage in all samples
73
+sum(rowSums(getCoverage(BS.cancer.ex)) == 0)
74
+``` 
75
+
76
+The CpG coverage is roughly 4x, so we would expect many zero coverage CpGs by chance. although that should not necessarily occur in all 6 samples at the same CpG.  If we assume that coverage genome-wide is Poisson distributed with a parameter (lambda) of 4, we would expect
77
+```{r poisson}
78
+logp <- ppois(0, lambda = 4, lower.tail = FALSE, log.p = TRUE)
79
+round(1 - exp(6 * logp), 3)
80
+``` 
81
+of the CpGs to have at least one sample with zero coverage.
82
+
83
+There are roughly 130k CpGs with no data at all in any of the 6 samples.  This can happen either because of chance (although that is unlikely) or because the CpG is unmappable.  Since we are dealing with bisulfite converted reads, the unmappable portion of the genome is greater than with normal DNA-sequencing.  For this experiment we only used 50bp single-end reads (in our experience using 100bp paired-end reads greatly increases the mappable percentage of the genome).  These CpGs (with zero coverage in all samples) are in some sense easy to deal with: one should of course be careful drawing conclusions about CpGs with no data.
84
+
85
+We have roughly 959 - 573 - 136 = 250$ CpGs where some (but not all) of the samples have zero coverage, and these are in some sense harder to deal with.  Since we have very low coverage to begin with, it may happen just by chance that a single sample have zero coverage, and it may be too restrictive to just exclude these CpGs from an analysis.
86
+
87
+Smoothing is done separately for each sample, only using the data where the coverage (for that sample) is non-zero.  This estimates a genome-wide methylation profile, which is then **evaluated** in all CpGs in the `BSseq` object.  As a result, after smoothing, every CpG in the object has an estimated methylation value.  This is very nice for the situation where you want to compare a single CpG across multiple samples, but one or two of the samples have zero coverage by chance.  But note that these smoothed methylation profiles makes less sense in the parts of the genome where there are no covered CpGs nearby.  We fix this by removing these CpGs after smoothing, see below.
88
+
89
+Other arguments to the `BSmooth()` function are `mc.cores`, `mc.preschedule`, `parallelBy` which controls the parallelization built into the function as well as `ns`, `h`, `maxGap` which controls the smoothing.  `ns` is the minimum number of CpGs contained in each window, `h` is half the minimum window with (the actual window width is either 2 times `h` or wide enough to contain `ns` covered CpGs, whichever is greater).  Note that the window width is different at each position in the genome and may also be different for different samples at the same position, since it depends on how many nearby CpGs with non-zero coverage.  Per default, a smoothing cluster is a whole chromosome.  By "cluster" we mean a set of CpGs which are processed together.  This means that even if there is a large distance between two CpGs, we borrow strength between them.  By setting `maxGap` this can be prevented since the argument describes the longest distance between two CpGs before a cluster is broken up into two clusters.
90
+
91
+## Manually splitting the smoothing computation
92
+
93
+An example, only showing sample 1 and 2 for brevity, is (this example is not being run when the vignette is being created):
94
+
95
+```{r smoothSplit,eval=FALSE}
96
+## Split datag
97
+BS1 <- BS.cancer.ex[, 1]
98
+save(BS1, file = "BS1.rda")
99
+BS2 <- BS.cancer.ex[, 2]
100
+save(BS1, file = "BS1.rda")
101
+## done splitting
102
+
103
+## Do the following on each node
104
+
105
+## node 1
106
+load("BS1.rda")
107
+BS1.fit <- BSmooth(BS1)
108
+save(BS1.fit)
109
+save(BS1.fit, file = "BS1.fit.rda")
110
+## done node 1
111
+
112
+## node 2
113
+load("BS2.rda")
114
+BS2.fit <- BSmooth(BS2)
115
+save(BS2.fit, file = "BS2.fit.rda")
116
+## done node 2
117
+
118
+## join; in a new R session
119
+load("BS1.fit.rda")
120
+load("BS2.fit.rda")
121
+BS.fit <- combine(BS1.fit, BS2.fit)
122
+``` 
123
+
124
+This still requires that you have one node with enough RAM to hold all samples in memory.
125
+
126
+
127
+# Computing t-statistics
128
+
129
+Before computing t-statistics, we will remove CpGs with little or no coverage.  If this is not done, you may find many DMRs in areas of the genome with very little coverage, which are most likely false positives.  It is open to personal preferences exactly which CpGs to remove, but for this analysis we will only keep CpGs where at least 2 cancer samples and at least 2 normal samples have at least 2x in coverage.  For readability, we store the coverage in a separate matrix (this is just due to line breaks in the printed output).
130
+
131
+```{r keepLoci}
132
+BS.cov <- getCoverage(BS.cancer.ex.fit)
133
+keepLoci.ex <- which(rowSums(BS.cov[, BS.cancer.ex$Type == "cancer"] >= 2) >= 2 &
134
+                     rowSums(BS.cov[, BS.cancer.ex$Type == "normal"] >= 2) >= 2)
135
+length(keepLoci.ex)
136
+BS.cancer.ex.fit <- BS.cancer.ex.fit[keepLoci.ex,]
137
+``` 
138
+
139
+(the `keepLoci.ex` is also available for direct inspection in the `r Biocpkg("bsseqData")` package.)
140
+
141
+
142
+We are now ready to compute t-statistics, by
143
+```{r BSmooth.tstat}
144
+BS.cancer.ex.tstat <- BSmooth.tstat(BS.cancer.ex.fit, 
145
+                                    group1 = c("C1", "C2", "C3"),
146
+                                    group2 = c("N1", "N2", "N3"), 
147
+                                    estimate.var = "group2",
148
+                                    local.correct = TRUE,
149
+                                    verbose = TRUE)
150
+BS.cancer.ex.tstat
151
+``` 
152
+
153
+(the `BS.cancer.ex.tstat` is also available for direct inspection in the `r Biocpkg("bsseqData")` package.)
154
+
155
+
156
+
157
+The arguments to `BSmooth.tstat()` are simple.  `group1` and `group2` contain the sample names of the two groups being compared (it is always group1 - group2), and indices may be used instead of sample names.  `estimate.var` describes which samples are being used to estimate the variability.  Because this is a cancer dataset, and cancer have higher variability than normals, we only use the normal samples to estimate the variability.  Other choices of `estimate.var` are `same` (assume same variability in each group) and `paired` (do a paired t-test).  The argument `local.correct` describes whether we should use a large-scale (low-frequency) mean correction.  This is especially important in cancer where we have found many large-scale methylation differences between cancer and normals.
158
+
159
+We can look at the marginal distribution of the t-statistic by
160
+
161
+```{r plotTstat,fig.width=4,fig.height=4}
162
+plot(BS.cancer.ex.tstat)
163
+``` 
164
+
165
+The "blocks" of hypomethylation are clearly visible in the marginal distribution of the uncorrected t-statistics.
166
+
167
+Even in comparisons where we do not observe these large-scale methylation differences, it often improves the marginal distribution of the t-statistics to locally correct them ("improves" in the sense of making them more symmetric).
168
+
169
+# Finding DMRs
170
+
171
+Once t-statistics have been computed, we can compute differentially methylated regions (DMRs) by thresholding the t-statistics.  Here we use a cutoff of $4.6$, which was chosen by looking at the quantiles of the t-statistics (for the entire genome)\footnote{See \url{https://support.bioconductor.org/p/78227/} for further discussion on the choice of cutoff. }.
172
+```{r dmrs}
173
+dmrs0 <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6))
174
+dmrs <- subset(dmrs0, n >= 3 & abs(meanDiff) >= 0.1)
175
+nrow(dmrs)
176
+head(dmrs, n = 3)
177
+``` 
178
+
179
+Here, we filter out DMRs that do not have at least 3 CpGs in them and at least a mean difference (across the DMR) in methylation between normal and cancers of at least 0.1.  While the exact values of these two filters can be debated, it is surely a good idea to use something like this.
180
+
181
+Other arguments to `dmrFinder()` are `qcutoff` which chooses a quantile-based cutoff (for example `qcutoff = c(0.01, 0.99)`) and `maxGap` which makes sure that a DMR is being split if there are two CpGs with more than `maxGap` between them (default of 300bp).
182
+
183
+We rank DMRs by the column `areaStat` which is the sum of the t-statistics in each CpG.  This is kind of the area of the DMR, except that it is weighted by the number of CpGs and not by genomic length.  This is currently the best statistic we know, although it is far from perfect (we would like to do something better).
184
+
185
+# Plotting
186
+
187
+It is **always** a good idea to look at the DMRs.  One way of encoding standard plotting parameters like `col`, `lty`, and `lwd` is to add columns to the `pData`, like
188
+
189
+```{r plotSetup}
190
+pData <- pData(BS.cancer.ex.fit)
191
+pData$col <- rep(c("red", "blue"), each = 3)
192
+pData(BS.cancer.ex.fit) <- pData
193
+``` 
194
+
195
+Once this is setup, we can plot a single DMR like
196
+
197
+```{r plotDmr,fig.width=8,fig.height=4}
198
+plotRegion(BS.cancer.ex.fit, dmrs[1,], extend = 5000, addRegions = dmrs)
199
+``` 
200
+
201
+`extend` tells us how many bp to extend to either side of the plotting region.  `addRegions` is a `data.frame` or `GRanges` listing additional regions that should be highlighted.
202
+
203
+Typically, we plot hundreds of DMRs in a single PDF file and use external tools to look at them.  For this purpose, `plotManyRegions()` is very useful since it is much faster than plotting individual DMRs with `plotRegion()`.  An example (not run) is
204
+
205
+```{r plotManyRegions,eval=FALSE}
206
+pdf(file = "dmrs_top200.pdf", width = 10, height = 5)
207
+plotManyRegions(BS.cancer.ex.fit, dmrs[1:200,], extend = 5000, 
208
+                addRegions = dmrs)
209
+dev.off()
210
+``` 
211
+which plots the top200.
212
+
213
+# Question and answers
214
+
215
+> The BSmooth algorithm is supposed to give smooth methylation estimates.  Yet, when I plot the smoothed values, I see jagged lines, which do not look smooth to me.
216
+
217
+We estimate a genome-wide methylation profile that is a smooth function of the genomic position.  However, this profile is not stored in the `BSseq` objects.  Instead, we evaluate this smooth profile in the methylation loci in the object.  An example (made-up values) is
218
+
219
+```
220
+pos  meth
221
+1     0.1
222
+3     0.1
223
+5     0.1
224
+200   0.6
225
+203   0.6
226
+205   0.6
227
+```
228
+
229
+For plotting we do linear interpolation between this points.  The end result is that the methylation profile may appear jagged especially if there is a "big" distance between two CpGs (between pos `5` and `200` above).  If we wanted to plot truly smooth profiles we would have to store the methylation profile evaluated at a regular grid across the genome.  This takes up a lot of space and would add complications to the internal data structures.
230
+
231
+
232
+
233
+# sessionInfo()
234
+
235
+```{r sessionInfo, echo=FALSE}
236
+sessionInfo()
237
+```
238
+
239
+# References
240
+
241
+<!--
242
+% Local Variables:
243
+% LocalWords: LocalWords bisulfite methylation methylated CpG CpGs DMR bsseq bp 
244
+% LocalWords: DMRs differentially ABI SOLiD dataset WGBS BSmooth 
245
+% LocalWords: parallelization Bioconductor hypomethylation genomic pos PDF thresholding
246
+% LocalWords: mappable unmappable indices normals quantile
247
+% End:
248
+-->
249
+
0 250
deleted file mode 100644
... ...
@@ -1,300 +0,0 @@
1
-%\VignetteIndexEntry{Analyzing WGBS with bsseq}
2
-%\VignetteDepends{bsseq}
3
-%\VignetteDepends{bsseqData}
4
-%\VignettePackage{bsseq}
5
-%\VignetteEngine{knitr::knitr}
6
-\documentclass[12pt]{article}
7
-<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
8
-BiocStyle::latex()
9
-@
10
-\title{Analyzing WGBS with the bsseq package}
11
-\author{Kasper Daniel Hansen\\ \texttt{kasperdanielhansen@gmail.com}}
12
-\date{Modified: June 10, 2015.  Compiled: \today}
13
-\begin{document}
14
-
15
-\maketitle
16
-
17
-\section*{Introduction}
18
-
19
-This document discusses the ins and outs of an analysis of a whole-genome shotgun bisulfite sequencing (WGBS) dataset, using the BSmooth algorithm, which was first used in \cite{Hansen:2011} and more formally presented and evaluated in \cite{Hansen:2012}.  The intention with the document is to focus on analysis-related tasks and questions.  Basic usage of the \Rpackage{bsseq} package is covered in ``The bsseq user's guide''.  It may be useful to consult the user's guide while reading this analysis guide.
20
-
21
-In this vignette we analyze chromosome 21 and 22 from \cite{Hansen:2011}.  This is primary data from 3 patients with colon cancer.  For each patient we have normal colon tissue as well as cancer colon.  The samples were run on ABI SOLiD and we generated 50bp single-end reads.  The reads were aligned using the Merman aligner in the BSmooth suite (\url{http://www.rafalab.org/bsmooth}).  See the primary publication for more details \cite{Hansen:2011}.
22
-
23
-This data is contained in the \Rpackage{bsseqData}
24
-
25
-<<load,warnings=FALSE,messages=FALSE>>=
26
-library(bsseq)
27
-library(bsseqData)
28
-@ 
29
-
30
-The \Rpackage{bsseqData} contains a script, \texttt{inst/script/create\_BS.cancer.R}, describing how this data is created from the Merman alignment output (also contained in the package).  Note that the current version of the BSmooth pipeline uses a slightly different alignment output format.
31
-
32
-The following object contains the unsmoothed ``raw'' summarized alignment data.
33
-<<showData>>=
34
-data(BS.cancer.ex)
35
-BS.cancer.ex <- updateObject(BS.cancer.ex)
36
-BS.cancer.ex
37
-pData(BS.cancer.ex)
38
-@ 
39
-
40
-If you use this package, please cite our BSmooth paper \cite{Hansen:2012}.
41
-
42
-\section*{Smoothing}
43
-
44
-The first step of the analysis is to smooth the data
45
-
46
-<<smooth,eval=FALSE>>=
47
-BS.cancer.ex.fit <- BSmooth(BS.cancer.ex, mc.cores = 1, verbose = TRUE)
48
-@ 
49
-This particular piece of code is not being run when the vignette is being created.  It takes roughly
50
-2 minutes per sample.  If you have 6 cores available, use \Rcode{mc.cores = 6} and the total run
51
-time will be roughly 2 minutes.  Note that setting \Rcode{mc.cores} to a value greater than 1 is not
52
-support on MS Windows due to a limitation of the operating system.
53
-
54
-For ease of use, the \Rpackage{bsseqData} includes the result of this command:
55
-<<showDataFit>>=
56
-data(BS.cancer.ex.fit)
57
-BS.cancer.ex.fit <- updateObject(BS.cancer.ex.fit)
58
-BS.cancer.ex.fit
59
-@ 
60
-
61
-This step uses parallelization where each sample is run on a separate core using \Rcode{mclapply} from the \Rpackage{parallel} package.  This form of parallelization is built into \Rpackage{bsseq}, and (as written) requires access to a machine with 6 cores and enough RAM.  The smoothing step is being done completely independently on each sample, so if you have a lot of samples (or other circumstances), an alternative is to split the computations manually.  A later subsection shows some example code for doing that.
62
-
63
-Let us discuss coverage and representation.  The \Rcode{BS.cancer.ex} object contains all annotated CpGs on human chromosome 21 and 22, whether or not there is actual data.  Since we have multiple samples, we can roughly divide the genome into 3 categories: CpGs where all samples have data, CpGs where none of the samples have data and CpGs where some, but not all, of the samples have data.  Examining the object at hand, we get
64
-
65
-<<cpgNumbers>>=
66
-## The average coverage of CpGs on the two chromosomes
67
-round(colMeans(getCoverage(BS.cancer.ex)), 1)
68
-## Number of CpGs in two chromosomes
69
-length(BS.cancer.ex)
70
-## Number of CpGs which are covered by at least 1 read in all 6 samples
71
-sum(rowSums(getCoverage(BS.cancer.ex) >= 1) == 6)
72
-## Number of CpGs with 0 coverage in all samples
73
-sum(rowSums(getCoverage(BS.cancer.ex)) == 0)
74
-@ 
75
-
76
-The CpG coverage is roughly 4x, so we would expect many zero coverage CpGs by
77
-chance. although that should not necessarily occur in all 6 samples at the same CpG.  If we assume
78
-that coverage genome-wide is Poisson distributed with a parameter (lambda) of 4, we would expect
79
-<<poisson>>=
80
-logp <- ppois(0, lambda = 4, lower.tail = FALSE, log.p = TRUE)
81
-round(1 - exp(6 * logp), 3)
82
-@ 
83
-of the CpGs to have at least one sample with zero coverage.
84
-
85
-There are roughly 130k CpGs with no data at all in any of the 6 samples.  This can happen either
86
-because of chance (although that is unlikely) or because the CpG is unmappable.  Since we are
87
-dealing with bisulfite converted reads, the unmappable portion of the genome is greater than with
88
-normal DNA-sequencing.  For this experiment we only used 50bp single-end reads (in our experience
89
-using 100bp paired-end reads greatly increases the mappable percentage of the genome).  These CpGs
90
-(with zero coverage in all samples) are in some sense easy to deal with: one should of course be
91
-careful drawing conclusions about CpGs with no data.
92
-
93
-We have roughly $959 - 573 - 136 = 250$k CpGs where some (but not all) of the samples have zero
94
-coverage, and these are in some sense harder to deal with.  Since we have very low coverage to begin
95
-with, it may happen just by chance that a single sample have zero coverage, and it may be too
96
-restrictive to just exclude these CpGs from an analysis.
97
-
98
-Smoothing is done separately for each sample, only using the data where the coverage (for that
99
-sample) is non-zero.  This estimates a genome-wide methylation profile, which is then
100
-\emph{evaluated} in all CpGs in the \Rclass{BSseq} object.  As a result, after smoothing, every CpG
101
-in the object has an estimated methylation value.  This is very nice for the situation where you
102
-want to compare a single CpG across multiple samples, but one or two of the samples have zero
103
-coverage by chance.  But note that these smoothed methylation profiles makes less sense in the parts
104
-of the genome where there are no covered CpGs nearby.  We fix this by removing these CpGs after
105
-smoothing, see below.
106
-
107
-Other arguments to the \Rcode{BSmooth} function are \Rcode{mc.cores}, \Rcode{mc.preschedule},
108
-\Rcode{parallelBy} which controls the parallelization built into the function as well as \Rcode{ns},
109
-\Rcode{h}, \Rcode{maxGap} which controls the smoothing.  \Rcode{ns} is the minimum number of CpGs
110
-contained in each window, \Rcode{h} is half the minimum window with (the actual window width is
111
-either 2 times \Rcode{h} or wide enough to contain \Rcode{ns} covered CpGs, whichever is greater).
112
-Note that the window width is different at each position in the genome and may also be different for
113
-different samples at the same position, since it depends on how many nearby CpGs with non-zero
114
-coverage.  Per default, a smoothing cluster is a whole chromosome.  By ``cluster'' we mean a set of
115
-CpGs which are processed together.  This means that even if there is a large distance between two
116
-CpGs, we borrow strength between them.  By setting \Rcode{maxGap} this can be prevented since the
117
-argument describes the longest distance between two CpGs before a cluster is broken up into two
118
-clusters.
119
-
120
-\subsubsection*{Manually splitting the smoothing computation}
121
-
122
-
123
-An example, only showing sample 1 and 2 for brevity, is (this example is not being run when the
124
-vignette is being created):
125
-
126
-<<smoothSplit,eval=FALSE>>=
127
-## Split datag
128
-BS1 <- BS.cancer.ex[, 1]
129
-save(BS1, file = "BS1.rda")
130
-BS2 <- BS.cancer.ex[, 2]
131
-save(BS1, file = "BS1.rda")
132
-## done splitting
133
-
134
-## Do the following on each node
135
-
136
-## node 1
137
-load("BS1.rda")
138
-BS1.fit <- BSmooth(BS1)
139
-save(BS1.fit)
140
-save(BS1.fit, file = "BS1.fit.rda")
141
-## done node 1
142
-
143
-## node 2
144
-load("BS2.rda")
145
-BS2.fit <- BSmooth(BS2)
146
-save(BS2.fit, file = "BS2.fit.rda")
147
-## done node 2
148
-
149
-## join; in a new R session
150
-load("BS1.fit.rda")
151
-load("BS2.fit.rda")
152
-BS.fit <- combine(BS1.fit, BS2.fit)
153
-@ 
154
-
155
-This still requires that you have one node with enough RAM to hold all samples in memory.
156
-
157
-
158
-\section*{Computing t-statistics}
159
-
160
-Before computing t-statistics, we will remove CpGs with little or no coverage.  If this is not done, you may find many DMRs in areas of the genome with very little coverage, which are most likely false positives.  It is open to personal preferences exactly which CpGs to remove, but for this analysis we will only keep CpGs where at least 2 cancer samples and at least 2 normal samples have at least 2x in coverage.  For readability, we store the coverage in a separate matrix (this is just due to line breaks in Sweave)
161
-
162
-<<keepLoci>>=
163
-BS.cov <- getCoverage(BS.cancer.ex.fit)
164
-keepLoci.ex <- which(rowSums(BS.cov[, BS.cancer.ex$Type == "cancer"] >= 2) >= 2 &
165
-                     rowSums(BS.cov[, BS.cancer.ex$Type == "normal"] >= 2) >= 2)
166
-length(keepLoci.ex)
167
-BS.cancer.ex.fit <- BS.cancer.ex.fit[keepLoci.ex,]
168
-@ 
169
-
170
-(the \Rcode{keepLoci.ex} is also available for direct inspection in the \Rpackage{bsseqData} package.)
171
-
172
-
173
-We are now ready to compute t-statistics, by
174
-<<BSmooth.tstat>>=
175
-BS.cancer.ex.tstat <- BSmooth.tstat(BS.cancer.ex.fit, 
176
-                                    group1 = c("C1", "C2", "C3"),
177
-                                    group2 = c("N1", "N2", "N3"), 
178
-                                    estimate.var = "group2",
179
-                                    local.correct = TRUE,
180
-                                    verbose = TRUE)
181
-BS.cancer.ex.tstat
182
-@ 
183
-
184
-(the \Rcode{BS.cancer.ex.tstat} is also available for direct inspection in the \Rpackage{bsseqData} package.)
185
-
186
-
187
-
188
-The arguments to \Rcode{BSmooth.tstat} are simple.  \Rcode{group1} and \Rcode{group2} contain the sample names of the two groups being compared (it is always group1 - group2), and indices may be used instead of sample names.  \Rcode{estimate.var} describes which samples are being used to estimate the variability.  Because this is a cancer dataset, and cancer have higher variability than normals, we only use the normal samples to estimate the variability.  Other choices of \Rcode{estimate.var} are \Rcode{same} (assume same variability in each group) and \Rcode{paired} (do a paired t-test).  The argument \Rcode{local.correct} describes whether we should use a large-scale (low-frequency) mean correction.  This is especially important in cancer where we have found many large-scale methylation differences between cancer and normals.
189
-
190
-We can look at the marginal distribution of the t-statistic by
191
-
192
-<<plotTstat,fig.width=4,fig.height=4>>=
193
-plot(BS.cancer.ex.tstat)
194
-@ 
195
-
196
-The ``blocks'' of hypomethylation are clearly visible in the marginal distribution of the uncorrected t-statistics.
197
-
198
-Even in comparisons where we do not observe these large-scale methylation differences, it often improves the marginal distribution of the t-statistics to locally correct them (``improves'' in the sense of making them more symmetric).
199
-
200
-\section*{Finding DMRs}
201
-
202
-Once t-statistics have been computed, we can compute differentially methylated regions (DMRs) by
203
-thresholding the t-statistics.  Here we use a cutoff of $4.6$, which was chosen by looking at the
204
-quantiles of the t-statistics (for the entire genome)\footnote{See \url{https://support.bioconductor.org/p/78227/} for further discussion on the choice of cutoff. }.
205
-<<dmrs>>=
206
-dmrs0 <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6))
207
-dmrs <- subset(dmrs0, n >= 3 & abs(meanDiff) >= 0.1)
208
-nrow(dmrs)
209
-head(dmrs, n = 3)
210
-@ 
211
-
212
-Here, we filter out DMRs that do not have at least 3 CpGs in them and at least a mean difference
213
-(across the DMR) in methylation between normal and cancers of at least 0.1.  While the exact values
214
-of these two filters can be debated, it is surely a good idea to use something like this.
215
-
216
-Other arguments to \Rcode{dmrFinder} are \Rcode{qcutoff} which chooses a quantile-based cutoff (for
217
-example \Rcode{qcutoff = c(0.01, 0.99)}) and \Rcode{maxGap} which makes sure that a DMR is being
218
-split if there are two CpGs with more than \Rcode{maxGap} between them (default of 300bp).
219
-
220
-We rank DMRs by the column \Rcode{areaStat} which is the sum of the t-statistics in each CpG.  This
221
-is kind of the area of the DMR, except that it is weighted by the number of CpGs and not by genomic
222
-length.  This is currently the best statistic we know, although it is far from perfect (we would
223
-like to do something better).
224
-
225
-\section*{Plotting}
226
-
227
-It is \emph{always} a good idea to look at the DMRs.  One way of encoding standard plotting
228
-parameters like \Rcode{col}, \Rcode{lty}, and \Rcode{lwd} is to add columns to the \Rcode{pData},
229
-like 
230
-
231
-<<plotSetup>>=
232
-pData <- pData(BS.cancer.ex.fit)
233
-pData$col <- rep(c("red", "blue"), each = 3)
234
-pData(BS.cancer.ex.fit) <- pData
235
-@ 
236
-
237
-Once this is setup, we can plot a single DMR like
238
-
239
-<<plotDmr,fig.width=8,fig.height=4>>=
240
-plotRegion(BS.cancer.ex.fit, dmrs[1,], extend = 5000, addRegions = dmrs)
241
-@ 
242
-
243
-\Rcode{extend} tells us how many bp to extend to either side of the plotting region.  \Rcode{addRegions} is a \Rclass{data.frame} or \Rcode{GRanges} listing additional regions that should be highlighted.
244
-
245
-Typically, we plot hundreds of DMRs in a single PDF file and use external tools to look at them.  For this purpose, \Rcode{plotManyRegions} is very useful since it is much faster than plotting individual DMRs with \Rcode{plotRegion}.  An example (not run) is
246
-
247
-<<plotManyRegions,eval=FALSE>>=
248
-pdf(file = "dmrs_top200.pdf", width = 10, height = 5)
249
-plotManyRegions(BS.cancer.ex.fit, dmrs[1:200,], extend = 5000, 
250
-                addRegions = dmrs)
251
-dev.off()
252
-@ 
253
-which plots the top200.
254
-
255
-\section*{Question and answers}
256
-
257
-\textbf{1. The BSmooth algorithm is supposed to give smooth methylation estimates.  Yet, when I plot
258
-  the smoothed values, I see jagged lines, which do not look smooth to me.}
259
-
260
-We estimate a genome-wide methylation profile that is a smooth function of the genomic position.
261
-However, this profile is not stored in the \Rclass{BSseq} objects.  Instead, we evaluate this smooth
262
-profile in the methylation loci in the object.  An example (made-up values) is
263
-\begin{verbatim}
264
-pos  meth
265
-1     0.1
266
-3     0.1
267
-5     0.1
268
-200   0.6
269
-203   0.6
270
-205   0.6
271
-\end{verbatim}
272
-
273
-For plotting we do linear interpolation between this points.  The end result is that the methylation
274
-profile may appear jagged especially if there is a ``big'' distance between two CpGs (between pos
275
-\texttt{5} and \texttt{200} above).  If we wanted to plot truly smooth profiles we would have to
276
-store the methylation profile evaluated at a regular grid across the genome.  This takes up a lot of
277
-space and would add complications to the internal data structures.
278
-
279
-
280
-%%
281
-%% Backmatter
282
-%%
283
-
284
-\bibliography{bsseq}
285
-
286
-\section*{SessionInfo}
287
-
288
-<<sessionInfo,results="asis",eval=TRUE,echo=FALSE>>=
289
-toLatex(sessionInfo())
290
-@ 
291
-
292
-\end{document}
293
-
294
-% Local Variables:
295
-% LocalWords: LocalWords bisulfite methylation methylated CpG CpGs DMR bsseq bp 
296
-% LocalWords: DMRs differentially ABI SOLiD dataset WGBS BSmooth 
297
-% LocalWords: parallelization Bioconductor hypomethylation genomic pos PDF thresholding
298
-% LocalWords: mappable unmappable indices normals quantile
299
-% End:
300
-