
moved vignettes into Rmd

Kasper Daniel Hansen authored on 17/10/2017 02:21:02
Showing 1 changed file
1 1
deleted file mode 100644
... ...
@@ -1,300 +0,0 @@
1
-%\VignetteIndexEntry{Analyzing WGBS with bsseq}
2
-%\VignetteDepends{bsseq}
3
-%\VignetteDepends{bsseqData}
4
-%\VignettePackage{bsseq}
5
-%\VignetteEngine{knitr::knitr}
6
-\documentclass[12pt]{article}
7
-<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
8
-BiocStyle::latex()
9
-@
10
-\title{Analyzing WGBS with the bsseq package}
11
-\author{Kasper Daniel Hansen\\ \texttt{kasperdanielhansen@gmail.com}}
12
-\date{Modified: June 10, 2015.  Compiled: \today}
13
-\begin{document}
14
-
15
-\maketitle
16
-
17
-\section*{Introduction}
18
-
19
-This document discusses the ins and outs of an analysis of a whole-genome shotgun bisulfite sequencing (WGBS) dataset, using the BSmooth algorithm, which was first used in \cite{Hansen:2011} and more formally presented and evaluated in \cite{Hansen:2012}.  The intention of this document is to focus on analysis-related tasks and questions.  Basic usage of the \Rpackage{bsseq} package is covered in ``The bsseq user's guide''.  It may be useful to consult the user's guide while reading this analysis guide.
20
-
21
-In this vignette we analyze chromosomes 21 and 22 from \cite{Hansen:2011}.  This is primary data from 3 patients with colon cancer.  For each patient we have normal colon tissue as well as colon cancer tissue.  The samples were run on ABI SOLiD and we generated 50bp single-end reads.  The reads were aligned using the Merman aligner in the BSmooth suite (\url{http://www.rafalab.org/bsmooth}).  See the primary publication for more details \cite{Hansen:2011}.
22
-
23
-This data is contained in the \Rpackage{bsseqData} package.
24
-
25
-<<load,warning=FALSE,message=FALSE>>=
26
-library(bsseq)
27
-library(bsseqData)
28
-@ 
29
-
30
-The \Rpackage{bsseqData} package contains a script, \texttt{inst/script/create\_BS.cancer.R}, describing how this data is created from the Merman alignment output (also contained in the package).  Note that the current version of the BSmooth pipeline uses a slightly different alignment output format.
31
-
32
-The following object contains the unsmoothed ``raw'' summarized alignment data.
33
-<<showData>>=
34
-data(BS.cancer.ex)
35
-BS.cancer.ex <- updateObject(BS.cancer.ex)
36
-BS.cancer.ex
37
-pData(BS.cancer.ex)
38
-@ 
39
-
40
-If you use this package, please cite our BSmooth paper \cite{Hansen:2012}.
41
-
42
-\section*{Smoothing}
43
-
44
-The first step of the analysis is to smooth the data:
45
-
46
-<<smooth,eval=FALSE>>=
47
-BS.cancer.ex.fit <- BSmooth(BS.cancer.ex, mc.cores = 1, verbose = TRUE)
48
-@ 
49
-This particular piece of code is not being run when the vignette is being created.  It takes roughly
50
-2 minutes per sample.  If you have 6 cores available, use \Rcode{mc.cores = 6} and the total run
51
-time will be roughly 2 minutes.  Note that setting \Rcode{mc.cores} to a value greater than 1 is not
52
-supported on MS Windows due to a limitation of the operating system.
53
-
54
-For ease of use, the \Rpackage{bsseqData} package includes the result of this command:
55
-<<showDataFit>>=
56
-data(BS.cancer.ex.fit)
57
-BS.cancer.ex.fit <- updateObject(BS.cancer.ex.fit)
58
-BS.cancer.ex.fit
59
-@ 
60
-
61
-This step uses parallelization where each sample is run on a separate core using \Rcode{mclapply} from the \Rpackage{parallel} package.  This form of parallelization is built into \Rpackage{bsseq}, and (as written) requires access to a machine with 6 cores and enough RAM.  The smoothing step is done completely independently for each sample, so if you have a lot of samples (or other circumstances), an alternative is to split the computations manually.  A later subsection shows some example code for doing that.
62
-
63
-Let us discuss coverage and representation.  The \Rcode{BS.cancer.ex} object contains all annotated CpGs on human chromosomes 21 and 22, whether or not there is actual data.  Since we have multiple samples, we can roughly divide the genome into 3 categories: CpGs where all samples have data, CpGs where none of the samples have data, and CpGs where some, but not all, of the samples have data.  Examining the object at hand, we get
64
-
65
-<<cpgNumbers>>=
66
-## The average coverage of CpGs on the two chromosomes
67
-round(colMeans(getCoverage(BS.cancer.ex)), 1)
68
-## Number of CpGs on the two chromosomes
69
-length(BS.cancer.ex)
70
-## Number of CpGs which are covered by at least 1 read in all 6 samples
71
-sum(rowSums(getCoverage(BS.cancer.ex) >= 1) == 6)
72
-## Number of CpGs with 0 coverage in all samples
73
-sum(rowSums(getCoverage(BS.cancer.ex)) == 0)
74
-@ 
75
-
76
-The CpG coverage is roughly 4x, so we would expect many zero coverage CpGs by
77
-chance, although that should not necessarily occur in all 6 samples at the same CpG.  If we assume
78
-that coverage genome-wide is Poisson distributed with a parameter (lambda) of 4, we would expect
79
-<<poisson>>=
80
-logp <- ppois(0, lambda = 4, lower.tail = FALSE, log.p = TRUE)
81
-round(1 - exp(6 * logp), 3)
82
-@ 
83
-of the CpGs to have at least one sample with zero coverage.
84
-
85
-There are roughly 130k CpGs with no data at all in any of the 6 samples.  This can happen either
86
-because of chance (although that is unlikely) or because the CpG is unmappable.  Since we are
87
-dealing with bisulfite converted reads, the unmappable portion of the genome is greater than with
88
-normal DNA-sequencing.  For this experiment we only used 50bp single-end reads (in our experience
89
-using 100bp paired-end reads greatly increases the mappable percentage of the genome).  These CpGs
90
-(with zero coverage in all samples) are in some sense easy to deal with: one should of course be
91
-careful drawing conclusions about CpGs with no data.
92
-
93
-We have roughly $959 - 573 - 136 = 250$k CpGs where some (but not all) of the samples have zero
94
-coverage, and these are in some sense harder to deal with.  Since we have very low coverage to begin
95
-with, it may happen just by chance that a single sample has zero coverage, and it may be too
96
-restrictive to just exclude these CpGs from an analysis.
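
A direct way to count this middle category is sketched below (not run; the chunk name is arbitrary and the code is only an illustration of the categories described above):

<<partialCoverage,eval=FALSE>>=
## Illustrative sketch: CpGs covered in some, but not all, of the 6 samples
cov <- getCoverage(BS.cancer.ex)
nCovered <- rowSums(cov >= 1)
sum(nCovered > 0 & nCovered < 6)
@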
97
-
98
-Smoothing is done separately for each sample, only using the data where the coverage (for that
99
-sample) is non-zero.  This estimates a genome-wide methylation profile, which is then
100
-\emph{evaluated} in all CpGs in the \Rclass{BSseq} object.  As a result, after smoothing, every CpG
101
-in the object has an estimated methylation value.  This is very nice for the situation where you
102
-want to compare a single CpG across multiple samples, but one or two of the samples have zero
103
-coverage by chance.  But note that these smoothed methylation profiles make less sense in the parts
104
-of the genome where there are no covered CpGs nearby.  We fix this by removing these CpGs after
105
-smoothing; see below.
106
-
107
-Other arguments to the \Rcode{BSmooth} function are \Rcode{mc.cores}, \Rcode{mc.preschedule},
108
-and \Rcode{parallelBy}, which control the parallelization built into the function, as well as \Rcode{ns},
109
-\Rcode{h}, and \Rcode{maxGap}, which control the smoothing.  \Rcode{ns} is the minimum number of CpGs
110
-contained in each window, \Rcode{h} is half the minimum window width (the actual window width is
111
-either 2 times \Rcode{h} or wide enough to contain \Rcode{ns} covered CpGs, whichever is greater).
112
-Note that the window width is different at each position in the genome and may also be different for
113
-different samples at the same position, since it depends on how many nearby CpGs have non-zero
114
-coverage.  By default, a smoothing cluster is a whole chromosome.  By ``cluster'' we mean a set of
115
-CpGs which are processed together.  This means that even if there is a large distance between two
116
-CpGs, we borrow strength between them.  By setting \Rcode{maxGap} this can be prevented since the
117
-argument describes the longest distance between two CpGs before a cluster is broken up into two
118
-clusters.
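
To make the smoothing arguments concrete, a call spelling them out could look like the sketch below (not run; the chunk name and the specific values are illustrative placeholders, and the actual defaults should be checked with \Rcode{?BSmooth}):

<<smoothParams,eval=FALSE>>=
## Illustrative sketch: explicit smoothing arguments (placeholder values,
## not recommendations; see ?BSmooth for the real defaults)
BS.cancer.ex.fit <- BSmooth(BS.cancer.ex,
                            ns = 70,       ## minimum number of CpGs per window
                            h = 1000,      ## half the minimum window width (bp)
                            maxGap = 10^8, ## gaps larger than this split a cluster
                            mc.cores = 1, verbose = TRUE)
@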
119
-
120
-\subsubsection*{Manually splitting the smoothing computation}
121
-
122
-
123
-An example, only showing samples 1 and 2 for brevity, is (this example is not being run when the
124
-vignette is being created):
125
-
126
-<<smoothSplit,eval=FALSE>>=
127
-## Split data
128
-BS1 <- BS.cancer.ex[, 1]
129
-save(BS1, file = "BS1.rda")
130
-BS2 <- BS.cancer.ex[, 2]
131
-save(BS2, file = "BS2.rda")
132
-## done splitting
133
-
134
-## Do the following on each node
135
-
136
-## node 1
137
-load("BS1.rda")
138
-BS1.fit <- BSmooth(BS1)
140
-save(BS1.fit, file = "BS1.fit.rda")
141
-## done node 1
142
-
143
-## node 2
144
-load("BS2.rda")
145
-BS2.fit <- BSmooth(BS2)
146
-save(BS2.fit, file = "BS2.fit.rda")
147
-## done node 2
148
-
149
-## join; in a new R session
150
-load("BS1.fit.rda")
151
-load("BS2.fit.rda")
152
-BS.fit <- combine(BS1.fit, BS2.fit)
153
-@ 
154
-
155
-This still requires that you have one node with enough RAM to hold all samples in memory.
156
-
157
-
158
-\section*{Computing t-statistics}
159
-
160
-Before computing t-statistics, we will remove CpGs with little or no coverage.  If this is not done, you may find many DMRs in areas of the genome with very little coverage, which are most likely false positives.  Exactly which CpGs to remove is a matter of personal preference, but for this analysis we will only keep CpGs where at least 2 cancer samples and at least 2 normal samples have at least 2x coverage.  For readability, we store the coverage in a separate matrix (this is just due to line breaks in Sweave).
161
-
162
-<<keepLoci>>=
163
-BS.cov <- getCoverage(BS.cancer.ex.fit)
164
-keepLoci.ex <- which(rowSums(BS.cov[, BS.cancer.ex$Type == "cancer"] >= 2) >= 2 &
165
-                     rowSums(BS.cov[, BS.cancer.ex$Type == "normal"] >= 2) >= 2)
166
-length(keepLoci.ex)
167
-BS.cancer.ex.fit <- BS.cancer.ex.fit[keepLoci.ex,]
168
-@ 
169
-
170
-(The \Rcode{keepLoci.ex} object is also available for direct inspection in the \Rpackage{bsseqData} package.)
171
-
172
-
173
-We are now ready to compute t-statistics, by
174
-<<BSmooth.tstat>>=
175
-BS.cancer.ex.tstat <- BSmooth.tstat(BS.cancer.ex.fit, 
176
-                                    group1 = c("C1", "C2", "C3"),
177
-                                    group2 = c("N1", "N2", "N3"), 
178
-                                    estimate.var = "group2",
179
-                                    local.correct = TRUE,
180
-                                    verbose = TRUE)
181
-BS.cancer.ex.tstat
182
-@ 
183
-
184
-(The \Rcode{BS.cancer.ex.tstat} object is also available for direct inspection in the \Rpackage{bsseqData} package.)
185
-
186
-
187
-
188
-The arguments to \Rcode{BSmooth.tstat} are simple.  \Rcode{group1} and \Rcode{group2} contain the sample names of the two groups being compared (it is always group1 - group2), and indices may be used instead of sample names.  \Rcode{estimate.var} describes which samples are used to estimate the variability.  Because this is a cancer dataset, and cancers have higher variability than normals, we only use the normal samples to estimate the variability.  Other choices of \Rcode{estimate.var} are \Rcode{same} (assume the same variability in each group) and \Rcode{paired} (do a paired t-test).  The argument \Rcode{local.correct} describes whether we should use a large-scale (low-frequency) mean correction.  This is especially important in cancer, where we have found many large-scale methylation differences between cancers and normals.
189
-
190
-We can look at the marginal distribution of the t-statistic by
191
-
192
-<<plotTstat,fig.width=4,fig.height=4>>=
193
-plot(BS.cancer.ex.tstat)
194
-@ 
195
-
196
-The ``blocks'' of hypomethylation are clearly visible in the marginal distribution of the uncorrected t-statistics.
197
-
198
-Even in comparisons where we do not observe these large-scale methylation differences, locally correcting the t-statistics often improves their marginal distribution (``improves'' in the sense of making it more symmetric).
199
-
200
-\section*{Finding DMRs}
201
-
202
-Once t-statistics have been computed, we can compute differentially methylated regions (DMRs) by
203
-thresholding the t-statistics.  Here we use a cutoff of $4.6$, which was chosen by looking at the
204
-quantiles of the t-statistics (for the entire genome)\footnote{See \url{https://support.bioconductor.org/p/78227/} for further discussion on the choice of cutoff.}.
205
-<<dmrs>>=
206
-dmrs0 <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6))
207
-dmrs <- subset(dmrs0, n >= 3 & abs(meanDiff) >= 0.1)
208
-nrow(dmrs)
209
-head(dmrs, n = 3)
210
-@ 
211
-
212
-Here, we filter out DMRs that do not have at least 3 CpGs in them and a mean methylation difference
213
-(across the DMR) between normals and cancers of at least 0.1.  While the exact values
214
-of these two filters can be debated, it is surely a good idea to use filters of this kind.
215
-
216
-Other arguments to \Rcode{dmrFinder} are \Rcode{qcutoff}, which chooses a quantile-based cutoff (for
217
-example \Rcode{qcutoff = c(0.01, 0.99)}), and \Rcode{maxGap}, which makes sure that a DMR is
218
-split if there are two CpGs with more than \Rcode{maxGap} between them (default of 300bp).
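
For concreteness, a quantile-based call could look like the sketch below (not run; the object names are arbitrary):

<<dmrsQuantile,eval=FALSE>>=
## Illustrative sketch: quantile-based cutoffs instead of fixed ones
dmrs0.q <- dmrFinder(BS.cancer.ex.tstat, qcutoff = c(0.01, 0.99))
dmrs.q <- subset(dmrs0.q, n >= 3 & abs(meanDiff) >= 0.1)
@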
219
-
220
-We rank DMRs by the column \Rcode{areaStat}, which is the sum of the t-statistics of the CpGs in the DMR.  This
221
-is roughly the area of the DMR, except that it is weighted by the number of CpGs and not by genomic
222
-length.  This is currently the best statistic we know of, although it is far from perfect (we would
223
-like to do something better).
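
If the ranking needs to be redone after filtering, a minimal sketch (assuming the \Rcode{dmrs} data.frame from above with its \Rcode{areaStat} column) is

<<rankDmrs,eval=FALSE>>=
## Illustrative sketch: order DMRs by the magnitude of areaStat
dmrs <- dmrs[order(abs(dmrs$areaStat), decreasing = TRUE), ]
head(dmrs, n = 3)
@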
224
-
225
-\section*{Plotting}
226
-
227
-It is \emph{always} a good idea to look at the DMRs.  One way of encoding standard plotting
228
-parameters like \Rcode{col}, \Rcode{lty}, and \Rcode{lwd} is to add columns to the \Rcode{pData},
229
-like 
230
-
231
-<<plotSetup>>=
232
-pData <- pData(BS.cancer.ex.fit)
233
-pData$col <- rep(c("red", "blue"), each = 3)
234
-pData(BS.cancer.ex.fit) <- pData
235
-@ 
236
-
237
-Once this is set up, we can plot a single DMR like
238
-
239
-<<plotDmr,fig.width=8,fig.height=4>>=
240
-plotRegion(BS.cancer.ex.fit, dmrs[1,], extend = 5000, addRegions = dmrs)
241
-@ 
242
-
243
-\Rcode{extend} specifies how many bp to extend on either side of the plotting region.  \Rcode{addRegions} is a \Rclass{data.frame} or \Rcode{GRanges} listing additional regions that should be highlighted.
244
-
245
-Typically, we plot hundreds of DMRs in a single PDF file and use external tools to look at them.  For this purpose, \Rcode{plotManyRegions} is very useful since it is much faster than plotting individual DMRs with \Rcode{plotRegion}.  An example (not run) is
246
-
247
-<<plotManyRegions,eval=FALSE>>=
248
-pdf(file = "dmrs_top200.pdf", width = 10, height = 5)
249
-plotManyRegions(BS.cancer.ex.fit, dmrs[1:200,], extend = 5000, 
250
-                addRegions = dmrs)
251
-dev.off()
252
-@ 
253
-which plots the top 200 DMRs.
254
-
255
-\section*{Questions and answers}
256
-
257
-\textbf{1. The BSmooth algorithm is supposed to give smooth methylation estimates.  Yet, when I plot
258
-  the smoothed values, I see jagged lines, which do not look smooth to me.}
259
-
260
-We estimate a genome-wide methylation profile that is a smooth function of the genomic position.
261
-However, this profile is not stored in the \Rclass{BSseq} objects.  Instead, we evaluate this smooth
262
-profile in the methylation loci in the object.  An example (made-up values) is
263
-\begin{verbatim}
264
-pos  meth
265
-1     0.1
266
-3     0.1
267
-5     0.1
268
-200   0.6
269
-203   0.6
270
-205   0.6
271
-\end{verbatim}
272
-
273
-For plotting we do linear interpolation between these points.  The end result is that the methylation
274
-profile may appear jagged, especially if there is a ``big'' distance between two CpGs (between pos
275
-\texttt{5} and \texttt{200} above).  If we wanted to plot truly smooth profiles we would have to
276
-store the methylation profile evaluated at a regular grid across the genome.  This takes up a lot of
277
-space and would add complications to the internal data structures.
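
The interpolation can be mimicked on the made-up values above with base R's \Rcode{approx} function (a sketch, not run; the chunk name is arbitrary):

<<interpolationSketch,eval=FALSE>>=
## Illustrative sketch: linear interpolation between the tabulated loci
pos  <- c(1, 3, 5, 200, 203, 205)
meth <- c(0.1, 0.1, 0.1, 0.6, 0.6, 0.6)
approx(x = pos, y = meth, xout = seq(1, 205, by = 10))$y
@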
278
-
279
-
280
-%%
281
-%% Backmatter
282
-%%
283
-
284
-\bibliography{bsseq}
285
-
286
-\section*{SessionInfo}
287
-
288
-<<sessionInfo,results="asis",eval=TRUE,echo=FALSE>>=
289
-toLatex(sessionInfo())
290
-@ 
291
-
292
-\end{document}
293
-
294
-% Local Variables:
295
-% LocalWords: LocalWords bisulfite methylation methylated CpG CpGs DMR bsseq bp 
296
-% LocalWords: DMRs differentially ABI SOLiD dataset WGBS BSmooth 
297
-% LocalWords: parallelization Bioconductor hypomethylation genomic pos PDF thresholding
298
-% LocalWords: mappable unmappable indices normals quantile
299
-% End:
300
-

updates from master

From: Kasper Daniel Hansen <khansen@Tanngrisnir.local>

git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/bsseq@116910 bc3139a8-67e5-0310-9ffc-ced21a209358

khansen authored on 29/04/2016 04:18:31
Showing 1 changed file
... ...
@@ -201,7 +201,7 @@ Even in comparisons where we do not observe these large-scale methylation differ
201 201
 
202 202
 Once t-statistics have been computed, we can compute differentially methylated regions (DMRs) by
203 203
 thresholding the t-statistics.  Here we use a cutoff of $4.6$, which was chosen by looking at the
204
-quantiles of the t-statistics (for the entire genome).
204
+quantiles of the t-statistics (for the entire genome)\footnote{See \url{https://support.bioconductor.org/p/78227/} for further discussion on the choice of cutoff. }.
205 205
 <<dmrs>>=
206 206
 dmrs0 <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6))
207 207
 dmrs <- subset(dmrs0, n >= 3 & abs(meanDiff) >= 0.1)

fixing merge

From: Kasper Daniel Hansen <kasperdanielhansen@gmail.com>

git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/bsseq@114621 bc3139a8-67e5-0310-9ffc-ced21a209358

khansen authored on 10/03/2016 17:06:33
Showing 1 changed file
... ...
@@ -7,7 +7,6 @@
7 7
 <<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
8 8
 BiocStyle::latex()
9 9
 @
10
-\newcommand{\bold}[1]{\textbf{#1}} % Fix for R bibentry
11 10
 \title{Analyzing WGBS with the bsseq package}
12 11
 \author{Kasper Daniel Hansen\\ \texttt{kasperdanielhansen@gmail.com}}
13 12
 \date{Modified: June 10, 2015.  Compiled: \today}
... ...
@@ -38,10 +37,7 @@ BS.cancer.ex
38 37
 pData(BS.cancer.ex)
39 38
 @ 
40 39
 
41
-If you use this package, please cite our BSmooth paper
42
-<<citation,echo=FALSE,results="asis">>=
43
-print(citation("bsseq"), style = "latex")
44
-@
40
+If you use this package, please cite our BSmooth paper \cite{Hansen:2012}.
45 41
 
46 42
 \section*{Smoothing}
47 43
 

fixed conflict

From: Kasper Daniel Hansen <kasperdanielhansen@gmail.com>

git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/bsseq@110074 bc3139a8-67e5-0310-9ffc-ced21a209358

khansen authored on 29/10/2015 19:12:43
Showing 1 changed file
... ...
@@ -2,8 +2,9 @@
2 2
 %\VignetteDepends{bsseq}
3 3
 %\VignetteDepends{bsseqData}
4 4
 %\VignettePackage{bsseq}
5
+%\VignetteEngine{knitr::knitr}
5 6
 \documentclass[12pt]{article}
6
-<<style-Sweave, eval=TRUE, echo=FALSE, results=tex>>=
7
+<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
7 8
 BiocStyle::latex()
8 9
 @
9 10
 \newcommand{\bold}[1]{\textbf{#1}} % Fix for R bibentry
... ...
@@ -16,29 +17,18 @@ BiocStyle::latex()
16 17
 
17 18
 \section*{Introduction}
18 19
 
19
-This document discusses the ins and outs of an analysis of a whole-genome shotgun bisulfite
20
-sequencing (WGBS) dataset, using the BSmooth algorithm, which was first used in \cite{Hansen:2011}
21
-and more formally presented and evaluated in \cite{Hansen:2012}.  The intention with the document is
22
-to focus on analysis-related tasks and questions.  Basic usage of the \Rpackage{bsseq} package is
23
-covered in ``The bsseq user's guide''.  It may be useful to consult the user's guide while reading
24
-this analysis guide.
20
+This document discusses the ins and outs of an analysis of a whole-genome shotgun bisulfite sequencing (WGBS) dataset, using the BSmooth algorithm, which was first used in \cite{Hansen:2011} and more formally presented and evaluated in \cite{Hansen:2012}.  The intention with the document is to focus on analysis-related tasks and questions.  Basic usage of the \Rpackage{bsseq} package is covered in ``The bsseq user's guide''.  It may be useful to consult the user's guide while reading this analysis guide.
25 21
 
26
-In this vignette we analyze chromosome 21 and 22 from \cite{Hansen:2011}.  This is primary data from
27
-3 patients with colon cancer.  For each patient we have normal colon tissue as well as cancer colon.
28
-The samples were run on ABI SOLiD and we generated 50bp single-end reads.  The reads were aligned
29
-using the Merman aligner in the BSmooth suite (\url{http://www.rafalab.org/bsmooth}).  See the
30
-primary publication for more details \cite{Hansen:2011}.
22
+In this vignette we analyze chromosome 21 and 22 from \cite{Hansen:2011}.  This is primary data from 3 patients with colon cancer.  For each patient we have normal colon tissue as well as cancer colon.  The samples were run on ABI SOLiD and we generated 50bp single-end reads.  The reads were aligned using the Merman aligner in the BSmooth suite (\url{http://www.rafalab.org/bsmooth}).  See the primary publication for more details \cite{Hansen:2011}.
31 23
 
32 24
 This data is contained in the \Rpackage{bsseqData}
33 25
 
34
-<<load,results=hide>>=
26
+<<load,warnings=FALSE,messages=FALSE>>=
35 27
 library(bsseq)
36 28
 library(bsseqData)
37 29
 @ 
38 30
 
39
-The \Rpackage{bsseqData} contains a script, \texttt{inst/script/create\_BS.cancer.R}, describing how
40
-this data is created from the Merman alignment output (also contained in the package).  Note that
41
-the current version of the BSmooth pipeline uses a slightly different alignment output format.
31
+The \Rpackage{bsseqData} contains a script, \texttt{inst/script/create\_BS.cancer.R}, describing how this data is created from the Merman alignment output (also contained in the package).  Note that the current version of the BSmooth pipeline uses a slightly different alignment output format.
42 32
 
43 33
 The following object contains the unsmoothed ``raw'' summarized alignment data.
44 34
 <<showData>>=
... ...
@@ -49,7 +39,7 @@ pData(BS.cancer.ex)
49 39
 @ 
50 40
 
51 41
 If you use this package, please cite our BSmooth paper
52
-<<citation,echo=FALSE,results=tex>>=
42
+<<citation,echo=FALSE,results="asis">>=
53 43
 print(citation("bsseq"), style = "latex")
54 44
 @
55 45
 
... ...
@@ -72,18 +62,9 @@ BS.cancer.ex.fit <- updateObject(BS.cancer.ex.fit)
72 62
 BS.cancer.ex.fit
73 63
 @ 
74 64
 
75
-This step uses parallelization where each sample is run on a separate core using \Rcode{mclapply}
76
-from the \Rpackage{parallel} package.  This form of parallelization is built into \Rpackage{bsseq},
77
-and (as written) requires access to a machine with 6 cores and enough RAM.  The smoothing step is
78
-being done completely independently on each sample, so if you have a lot of samples (or other
79
-circumstances), an alternative is to split the computations manually.  A later subsection shows some
80
-example code for doing that.
65
+This step uses parallelization where each sample is run on a separate core using \Rcode{mclapply} from the \Rpackage{parallel} package.  This form of parallelization is built into \Rpackage{bsseq}, and (as written) requires access to a machine with 6 cores and enough RAM.  The smoothing step is being done completely independently on each sample, so if you have a lot of samples (or other circumstances), an alternative is to split the computations manually.  A later subsection shows some example code for doing that.
81 66
 
82
-Let us discuss coverage and representation.  The \Rcode{BS.cancer.ex} object contains all annotated
83
-CpGs on human chromosome 21 and 22, whether or not there is actual data.  Since we have multiple
84
-samples, we can roughly divide the genome into 3 categories: CpGs where all samples have data, CpGs
85
-where none of the samples have data and CpGs where some, but not all, of the samples have data.
86
-Examining the object at hand, we get
67
+Let us discuss coverage and representation.  The \Rcode{BS.cancer.ex} object contains all annotated CpGs on human chromosome 21 and 22, whether or not there is actual data.  Since we have multiple samples, we can roughly divide the genome into 3 categories: CpGs where all samples have data, CpGs where none of the samples have data and CpGs where some, but not all, of the samples have data.  Examining the object at hand, we get
87 68
 
88 69
 <<cpgNumbers>>=
89 70
 ## The average coverage of CpGs on the two chromosomes
... ...
@@ -174,17 +155,14 @@ load("BS1.fit.rda")
174 155
 load("BS2.fit.rda")
175 156
 BS.fit <- combine(BS1.fit, BS2.fit)
176 157
 @ 
177
-This still requires that you have one node with enough RAM to hold all samples in memory. 
158
+
159
+This still requires that you have one node with enough RAM to hold all samples in memory.
178 160
 
179 161
 
180 162
 \section*{Computing t-statistics}
181 163
 
182
-Before computing t-statistics, we will remove CpGs with little or no coverage.  If this is not done,
183
-you may find many DMRs in areas of the genome with very little coverage, which are most likely false
184
-positives.  It is open to personal preferences exactly which CpGs to remove, but for this analysis
185
-we will only keep CpGs where at least 2 cancer samples and at least 2 normal samples have at least
186
-2x in coverage.  For readability, we store the coverage in a separate matrix (this is just due to
187
-line breaks in Sweave)
164
+Before computing t-statistics, we will remove CpGs with little or no coverage.  If this is not done, you may find many DMRs in areas of the genome with very little coverage, which are most likely false positives.  It is open to personal preferences exactly which CpGs to remove, but for this analysis we will only keep CpGs where at least 2 cancer samples and at least 2 normal samples have at least 2x in coverage.  For readability, we store the coverage in a separate matrix (this is just due to line breaks in Sweave)
165
+
188 166
 <<keepLoci>>=
189 167
 BS.cov <- getCoverage(BS.cancer.ex.fit)
190 168
 keepLoci.ex <- which(rowSums(BS.cov[, BS.cancer.ex$Type == "cancer"] >= 2) >= 2 &
... ...
@@ -192,8 +170,8 @@ keepLoci.ex <- which(rowSums(BS.cov[, BS.cancer.ex$Type == "cancer"] >= 2) >= 2
192 170
 length(keepLoci.ex)
193 171
 BS.cancer.ex.fit <- BS.cancer.ex.fit[keepLoci.ex,]
194 172
 @ 
195
-(the \Rcode{keepLoci.ex}  is also available for direct inspection in the \Rpackage{bsseqData}
196
-package.)
173
+
174
+(the \Rcode{keepLoci.ex} is also available for direct inspection in the \Rpackage{bsseqData} package.)
197 175
 
198 176
 
199 177
 We are now ready to compute t-statistics, by
... ...
@@ -206,36 +184,22 @@ BS.cancer.ex.tstat <- BSmooth.tstat(BS.cancer.ex.fit,
206 184
                                     verbose = TRUE)
207 185
 BS.cancer.ex.tstat
208 186
 @ 
209
-(the \Rcode{BS.cancer.ex.tstat}  is also available for direct inspection in the \Rpackage{bsseqData}
210
-package.)
211 187
 
188
+(the \Rcode{BS.cancer.ex.tstat} is also available for direct inspection in the \Rpackage{bsseqData} package.)
212 189
 
213 190
 
214
-The arguments to \Rcode{BSmooth.tstat} are simple.  \Rcode{group1} and \Rcode{group2} contain the
215
-sample names of the two groups being compared (it is always group1 - group2), and indices may be
216
-used instead of sample names.  \Rcode{estimate.var} describes which samples are being used to
217
-estimate the variability.  Because this is a cancer dataset, and cancer have higher variability than
218
-normals, we only use the normal samples to estimate the variability.  Other choices of
219
-\Rcode{estimate.var} are \Rcode{same} (assume same variability in each group) and \Rcode{paired} (do
220
-a paired t-test).  The argument \Rcode{local.correct} describes whether we should use a large-scale
221
-(low-frequency) mean correction.  This is especially important in cancer where we have found many
222
-large-scale methylation differences between cancer and normals.
191
+
192
+The arguments to \Rcode{BSmooth.tstat} are simple.  \Rcode{group1} and \Rcode{group2} contain the sample names of the two groups being compared (it is always group1 - group2), and indices may be used instead of sample names.  \Rcode{estimate.var} describes which samples are being used to estimate the variability.  Because this is a cancer dataset, and cancer have higher variability than normals, we only use the normal samples to estimate the variability.  Other choices of \Rcode{estimate.var} are \Rcode{same} (assume same variability in each group) and \Rcode{paired} (do a paired t-test).  The argument \Rcode{local.correct} describes whether we should use a large-scale (low-frequency) mean correction.  This is especially important in cancer where we have found many large-scale methylation differences between cancer and normals.
223 193
 
224 194
 We can look at the marginal distribution of the t-statistic by
225 195
 
226
-\setkeys{Gin}{width=0.5\textwidth}
227
-<<plotTstat,fig=TRUE,width=5,height=5>>=
196
+<<plotTstat,fig.width=4,fig.height=4>>=
228 197
 plot(BS.cancer.ex.tstat)
229 198
 @ 
230
-\setkeys{Gin}{width=0.8\textwidth}
231
-
232 199
 
233
-The ``blocks'' of hypomethylation are clearly visible in the marginal distribution of the
234
-uncorrected t-statistics.
200
+The ``blocks'' of hypomethylation are clearly visible in the marginal distribution of the uncorrected t-statistics.
235 201
 
236
-Even in comparisons where we do not observe these large-scale methylation differences, it often
237
-improves the marginal distribution of the t-statistics to locally correct them (``improves'' in the
238
-sense of making them more symmetric).
202
+Even in comparisons where we do not observe these large-scale methylation differences, it often improves the marginal distribution of the t-statistics to locally correct them (``improves'' in the sense of making them more symmetric).
239 203
 
240 204
 \section*{Finding DMRs}
241 205
 
... ...
@@ -276,20 +240,14 @@ pData(BS.cancer.ex.fit) <- pData
276 240
 
277 241
 Once this is setup, we can plot a single DMR like
278 242
 
279
-\setkeys{Gin}{width=1\textwidth}
280
-<<plotDmr,fig=TRUE,width=10,height=5>>=
243
+<<plotDmr,fig.width=8,fig.height=4>>=
281 244
 plotRegion(BS.cancer.ex.fit, dmrs[1,], extend = 5000, addRegions = dmrs)
282 245
 @ 
283
-\setkeys{Gin}{width=0.8\textwidth}
284
-\vspace*{-6\baselineskip}
285 246
 
286
-\Rcode{extend} tells us how many bp to extend to either side of the plotting region.
287
-\Rcode{addRegions} is a \Rclass{data.frame} or \Rcode{GRanges} listing additional regions that
288
-should be highlighted.
247
+\Rcode{extend} tells us how many bp to extend to either side of the plotting region.  \Rcode{addRegions} is a \Rclass{data.frame} or \Rcode{GRanges} listing additional regions that should be highlighted.
248
+
249
+Typically, we plot hundreds of DMRs in a single PDF file and use external tools to look at them.  For this purpose, \Rcode{plotManyRegions} is very useful since it is much faster than plotting individual DMRs with \Rcode{plotRegion}.  An example (not run) is
289 250
 
290
-Typically, we plot hundreds of DMRs in a single PDF file and use external tools to look at them.
291
-For this purpose, \Rcode{plotManyRegions} is very useful since it is much faster than plotting
292
-individual DMRs with \Rcode{plotRegion}.  An example (not run) is
293 251
 <<plotManyRegions,eval=FALSE>>=
294 252
 pdf(file = "dmrs_top200.pdf", width = 10, height = 5)
295 253
 plotManyRegions(BS.cancer.ex.fit, dmrs[1:200,], extend = 5000, 
... ...
@@ -331,14 +289,13 @@ space and would add complications to the internal data structures.
331 289
 
332 290
 \section*{SessionInfo}
333 291
 
334
-<<sessionInfo,results=tex,eval=TRUE,echo=FALSE>>=
292
+<<sessionInfo,results="asis",eval=TRUE,echo=FALSE>>=
335 293
 toLatex(sessionInfo())
336 294
 @ 
337 295
 
338 296
 \end{document}
339 297
 
340 298
 % Local Variables:
341
-% eval: (add-hook 'LaTeX-mode-hook '(lambda () (if (string= (buffer-name) "bsseq_analysis.Rnw") (setq fill-column 100))))
342 299
 % LocalWords: LocalWords bisulfite methylation methylated CpG CpGs DMR bsseq bp 
343 300
 % LocalWords: DMRs differentially ABI SOLiD dataset WGBS BSmooth 
344 301
 % LocalWords: parallelization Bioconductor hypomethylation genomic pos PDF thresholding

Commit made by the Bioconductor Git-SVN bridge.

Commit id: 3e3c4a483adc9ba1e5d101a35f9a0fef41d2b78b

updated vignettes to use BiocStyle


Commit id: fe749c1df8a2a39ca6c4e1cc3b2358195f307631

updating read.bismark


Commit id: c50eb68c8e7f1be51af4ff411c95064849b52d15

Fixing Description; updating bsseq.Rnw to use BiocStyle


Commit id: f38a77394f0248aae1a6a4e27102610a53ca892e

resaved with compress=xz


Commit id: 2dcedd4fead08fe0af2a0afb7de2c89f4818b962

added Raw parsing of cytosine report files from Bismark



git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/bsseq@104827 bc3139a8-67e5-0310-9ffc-ced21a209358

khansen authored on 11/06/2015 14:04:52
Showing 1 changed file
... ...
@@ -3,30 +3,13 @@
3 3
 %\VignetteDepends{bsseqData}
4 4
 %\VignettePackage{bsseq}
5 5
 \documentclass[12pt]{article}
6
-<<echo=FALSE,results=hide>>=
7
-options(width=70)
8
-@ 
9
-\SweaveOpts{eps=FALSE,echo=TRUE,eval=TRUE}
10
-\usepackage{times}
11
-\usepackage{color,hyperref}
12
-\usepackage{fullpage}
13
-\usepackage[numbers]{natbib}
14
-\usepackage{parskip}
15
-\definecolor{darkblue}{rgb}{0.0,0.0,0.75}
16
-\hypersetup{colorlinks,breaklinks,
17
-            linkcolor=darkblue,urlcolor=darkblue,
18
-            anchorcolor=darkblue,citecolor=darkblue}
19
-
20
-\newcommand{\Rcode}[1]{{\texttt{#1}}}
21
-\newcommand{\Rclass}[1]{{\texttt{"#1"}}}
22
-\newcommand{\Rpackage}[1]{{\texttt{#1}}}
23
-\newcommand{\software}[1]{\textsf{#1}}
24
-\newcommand{\R}{\software{R}}
6
+<<style-Sweave, eval=TRUE, echo=FALSE, results=tex>>=
7
+BiocStyle::latex()
8
+@
25 9
 \newcommand{\bold}[1]{\textbf{#1}} % Fix for R bibentry
26
-
27 10
 \title{Analyzing WGBS with the bsseq package}
28
-\author{Kasper Daniel Hansen\\ \texttt{khansen@jhsph.edu}}
29
-\date{Modified: October 13, 2012.  Compiled: \today}
11
+\author{Kasper Daniel Hansen\\ \texttt{kasperdanielhansen@gmail.com}}
12
+\date{Modified: June 10, 2015.  Compiled: \today}
30 13
 \begin{document}
31 14
 
32 15
 \maketitle
... ...
@@ -344,7 +327,6 @@ space and would add complications to the internal data structures.
344 327
 %% Backmatter
345 328
 %%
346 329
 
347
-\bibliographystyle{unsrturl}
348 330
 \bibliography{bsseq}
349 331
 
350 332
 \section*{SessionInfo}

various bugfixes

git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/bsseq@78387 bc3139a8-67e5-0310-9ffc-ced21a209358

khansen authored on 12/07/2013 20:53:31
Showing 1 changed file
... ...
@@ -204,8 +204,8 @@ we will only keep CpGs where at least 2 cancer samples and at least 2 normal sam
204 204
 line breaks in Sweave)
205 205
 <<keepLoci>>=
206 206
 BS.cov <- getCoverage(BS.cancer.ex.fit)
207
-keepLoci.ex <- which(rowSums(BS.cov[, c("C1", "C2", "C3")] >= 2) >= 2 &
208
-                  rowSums(BS.cov[, c("N1", "N2", "N3")] >= 2) >= 2)
207
+keepLoci.ex <- which(rowSums(BS.cov[, BS.cancer.ex$Type == "cancer"] >= 2) >= 2 &
208
+                     rowSums(BS.cov[, BS.cancer.ex$Type == "normal"] >= 2) >= 2)
209 209
 length(keepLoci.ex)
210 210
 BS.cancer.ex.fit <- BS.cancer.ex.fit[keepLoci.ex,]
211 211
 @ 

moved to a SummarizedExperiment backend for class 'BSseq'

git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/bsseq@78143 bc3139a8-67e5-0310-9ffc-ced21a209358

khansen authored on 06/07/2013 21:26:40
Showing 1 changed file
... ...
@@ -59,6 +59,8 @@ the current version of the BSmooth pipeline uses a slightly different alignment
59 59
 
60 60
 The following object contains the unsmoothed ``raw'' summarized alignment data.
61 61
 <<showData>>=
62
+data(BS.cancer.ex)
63
+BS.cancer.ex <- updateObject(BS.cancer.ex)
62 64
 BS.cancer.ex
63 65
 pData(BS.cancer.ex)
64 66
 @ 
... ...
@@ -83,6 +85,7 @@ support on MS Windows due to a limitation of the operating system.
83 85
 For ease of use, the \Rpackage{bsseqData} includes the result of this command:
84 86
 <<showDataFit>>=
85 87
 data(BS.cancer.ex.fit)
88
+BS.cancer.ex.fit <- updateObject(BS.cancer.ex.fit)
86 89
 BS.cancer.ex.fit
87 90
 @ 
88 91
 

fix a latex error; bumping version

git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/bsseq@70595 bc3139a8-67e5-0310-9ffc-ced21a209358

khansen authored on 16/10/2012 19:35:29
Showing 1 changed file
... ...
@@ -22,6 +22,7 @@ options(width=70)
22 22
 \newcommand{\Rpackage}[1]{{\texttt{#1}}}
23 23
 \newcommand{\software}[1]{\textsf{#1}}
24 24
 \newcommand{\R}{\software{R}}
25
+\newcommand{\bold}[1]{\textbf{#1}} % Fix for R bibentry
25 26
 
26 27
 \title{Analyzing WGBS with the bsseq package}
27 28
 \author{Kasper Daniel Hansen\\ \texttt{khansen@jhsph.edu}}

fixed two bugs reported by Andreas Schoenegger, updated citation information

git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/bsseq@70488 bc3139a8-67e5-0310-9ffc-ced21a209358

khansen authored on 13/10/2012 16:48:16
Showing 1 changed file
... ...
@@ -25,7 +25,7 @@ options(width=70)
25 25
 
26 26
 \title{Analyzing WGBS with the bsseq package}
27 27
 \author{Kasper Daniel Hansen\\ \texttt{khansen@jhsph.edu}}
28
-\date{Modified: June 26, 2011.  Compiled: \today}
28
+\date{Modified: October 13, 2012.  Compiled: \today}
29 29
 \begin{document}
30 30
 
31 31
 \maketitle
... ...
@@ -62,6 +62,11 @@ BS.cancer.ex
62 62
 pData(BS.cancer.ex)
63 63
 @ 
64 64
 
65
+If you use this package, please cite our BSmooth paper
66
+<<citation,echo=FALSE,results=tex>>=
67
+print(citation("bsseq"), style = "latex")
68
+@
69
+
65 70
 \section*{Smoothing}
66 71
 
67 72
 The first step of the analysis is to smooth the data

fixes to vignette and combine

git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/bsseq@68240 bc3139a8-67e5-0310-9ffc-ced21a209358

khansen authored on 06/08/2012 03:03:11
Showing 1 changed file
... ...
@@ -24,7 +24,7 @@ options(width=70)
24 24
 \newcommand{\R}{\software{R}}
25 25
 
26 26
 \title{Analyzing WGBS with the bsseq package}
27
-\author{Kasper Daniel Hansen}
27
+\author{Kasper Daniel Hansen\\ \texttt{khansen@jhsph.edu}}
28 28
 \date{Modified: June 26, 2011.  Compiled: \today}
29 29
 \begin{document}
30 30
 

Adds bsseq and DBCiP to the repos.

git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/bsseq@67929 bc3139a8-67e5-0310-9ffc-ced21a209358

m.carlson authored on 20/07/2012 17:45:09
Showing 1 changed file
1 1
new file mode 100644
... ...
@@ -0,0 +1,356 @@
1
+%\VignetteIndexEntry{Analyzing WGBS with bsseq}
2
+%\VignetteDepends{bsseq}
3
+%\VignetteDepends{bsseqData}
4
+%\VignettePackage{bsseq}
5
+\documentclass[12pt]{article}
6
+<<echo=FALSE,results=hide>>=
7
+options(width=70)
8
+@ 
9
+\SweaveOpts{eps=FALSE,echo=TRUE,eval=TRUE}
10
+\usepackage{times}
11
+\usepackage{color,hyperref}
12
+\usepackage{fullpage}
13
+\usepackage[numbers]{natbib}
14
+\usepackage{parskip}
15
+\definecolor{darkblue}{rgb}{0.0,0.0,0.75}
16
+\hypersetup{colorlinks,breaklinks,
17
+            linkcolor=darkblue,urlcolor=darkblue,
18
+            anchorcolor=darkblue,citecolor=darkblue}
19
+
20
+\newcommand{\Rcode}[1]{{\texttt{#1}}}
21
+\newcommand{\Rclass}[1]{{\texttt{"#1"}}}
22
+\newcommand{\Rpackage}[1]{{\texttt{#1}}}
23
+\newcommand{\software}[1]{\textsf{#1}}
24
+\newcommand{\R}{\software{R}}
25
+
26
+\title{Analyzing WGBS with the bsseq package}
27
+\author{Kasper Daniel Hansen}
28
+\date{Modified: June 26, 2011.  Compiled: \today}
29
+\begin{document}
30
+
31
+\maketitle
32
+
33
+\section*{Introduction}
34
+
35
+This document discusses the ins and outs of an analysis of a whole-genome shotgun bisulfite
36
+sequencing (WGBS) dataset, using the BSmooth algorithm, which was first used in \cite{Hansen:2011}
37
+and more formally presented and evaluated in \cite{Hansen:2012}.  The intention with the document is
38
+to focus on analysis-related tasks and questions.  Basic usage of the \Rpackage{bsseq} package is
39
+covered in ``The bsseq user's guide''.  It may be useful to consult the user's guide while reading
40
+this analysis guide.
41
+
42
 +In this vignette we analyze chromosomes 21 and 22 from \cite{Hansen:2011}.  This is primary data from
43
+3 patients with colon cancer.  For each patient we have normal colon tissue as well as cancer colon.
44
+The samples were run on ABI SOLiD and we generated 50bp single-end reads.  The reads were aligned
45
+using the Merman aligner in the BSmooth suite (\url{http://www.rafalab.org/bsmooth}).  See the
46
+primary publication for more details \cite{Hansen:2011}.
47
+
48
 +This data is contained in the \Rpackage{bsseqData} package.
49
+
50
+<<load,results=hide>>=
51
+library(bsseq)
52
+library(bsseqData)
53
+@ 
54
+
55
 +The \Rpackage{bsseqData} package contains a script, \texttt{inst/script/create\_BS.cancer.R}, describing how
56
+this data is created from the Merman alignment output (also contained in the package).  Note that
57
+the current version of the BSmooth pipeline uses a slightly different alignment output format.
58
+
59
+The following object contains the unsmoothed ``raw'' summarized alignment data.
60
+<<showData>>=
61
+BS.cancer.ex
62
+pData(BS.cancer.ex)
63
+@ 
64
+
65
+\section*{Smoothing}
66
+
67
+The first step of the analysis is to smooth the data
68
+
69
+<<smooth,eval=FALSE>>=
70
+BS.cancer.ex.fit <- BSmooth(BS.cancer.ex, mc.cores = 1, verbose = TRUE)
71
+@ 
72
+This particular piece of code is not being run when the vignette is being created.  It takes roughly
73
+2 minutes per sample.  If you have 6 cores available, use \Rcode{mc.cores = 6} and the total run
74
+time will be roughly 2 minutes.  Note that setting \Rcode{mc.cores} to a value greater than 1 is not
75
 +supported on MS Windows due to a limitation of the operating system.
76
+
77
+For ease of use, the \Rpackage{bsseqData} includes the result of this command:
78
+<<showDataFit>>=
79
+data(BS.cancer.ex.fit)
80
+BS.cancer.ex.fit
81
+@ 
82
+
83
+This step uses parallelization where each sample is run on a separate core using \Rcode{mclapply}
84
+from the \Rpackage{parallel} package.  This form of parallelization is built into \Rpackage{bsseq},
85
+and (as written) requires access to a machine with 6 cores and enough RAM.  The smoothing step is
86
+being done completely independently on each sample, so if you have a lot of samples (or other
87
+circumstances), an alternative is to split the computations manually.  A later subsection shows some
88
+example code for doing that.
89
+
90
+Let us discuss coverage and representation.  The \Rcode{BS.cancer.ex} object contains all annotated
91
+CpGs on human chromosome 21 and 22, whether or not there is actual data.  Since we have multiple
92
+samples, we can roughly divide the genome into 3 categories: CpGs where all samples have data, CpGs
93
+where none of the samples have data and CpGs where some, but not all, of the samples have data.
94
+Examining the object at hand, we get
95
+
96
+<<cpgNumbers>>=
97
+## The average coverage of CpGs on the two chromosomes
98
+round(colMeans(getCoverage(BS.cancer.ex)), 1)
99
+## Number of CpGs in two chromosomes
100
+length(BS.cancer.ex)
101
+## Number of CpGs which are covered by at least 1 read in all 6 samples
102
+sum(rowSums(getCoverage(BS.cancer.ex) >= 1) == 6)
103
+## Number of CpGs with 0 coverage in all samples
104
+sum(rowSums(getCoverage(BS.cancer.ex)) == 0)
105
+@ 
106
+
107
+The CpG coverage is roughly 4x, so we would expect many zero coverage CpGs by
108
 +chance, although that should not necessarily occur in all 6 samples at the same CpG.  If we assume
109
+that coverage genome-wide is Poisson distributed with a parameter (lambda) of 4, we would expect
110
+<<poisson>>=
111
+logp <- ppois(0, lambda = 4, lower.tail = FALSE, log.p = TRUE)
112
+round(1 - exp(6 * logp), 3)
113
+@ 
114
+of the CpGs to have at least one sample with zero coverage.
115
+
116
+There are roughly 130k CpGs with no data at all in any of the 6 samples.  This can happen either
117
+because of chance (although that is unlikely) or because the CpG is unmappable.  Since we are
118
+dealing with bisulfite converted reads, the unmappable portion of the genome is greater than with
119
+normal DNA-sequencing.  For this experiment we only used 50bp single-end reads (in our experience
120
+using 100bp paired-end reads greatly increases the mappable percentage of the genome).  These CpGs
121
+(with zero coverage in all samples) are in some sense easy to deal with: one should of course be
122
+careful drawing conclusions about CpGs with no data.
123
+
124
+We have roughly $959 - 573 - 136 = 250$k CpGs where some (but not all) of the samples have zero
125
+coverage, and these are in some sense harder to deal with.  Since we have very low coverage to begin
126
 +with, it may happen just by chance that a single sample has zero coverage, and it may be too
127
+restrictive to just exclude these CpGs from an analysis.
128
+
129
+Smoothing is done separately for each sample, only using the data where the coverage (for that
130
+sample) is non-zero.  This estimates a genome-wide methylation profile, which is then
131
+\emph{evaluated} in all CpGs in the \Rclass{BSseq} object.  As a result, after smoothing, every CpG
132
+in the object has an estimated methylation value.  This is very nice for the situation where you
133
+want to compare a single CpG across multiple samples, but one or two of the samples have zero
134
 +coverage by chance.  But note that these smoothed methylation profiles make less sense in the parts
135
+of the genome where there are no covered CpGs nearby.  We fix this by removing these CpGs after
136
+smoothing, see below.
137
+
138
+Other arguments to the \Rcode{BSmooth} function are \Rcode{mc.cores}, \Rcode{mc.preschedule},
139
+\Rcode{parallelBy} which controls the parallelization built into the function as well as \Rcode{ns},
140
+\Rcode{h}, \Rcode{maxGap} which controls the smoothing.  \Rcode{ns} is the minimum number of CpGs
141
 +contained in each window, \Rcode{h} is half the minimum window width (the actual window width is
142
+either 2 times \Rcode{h} or wide enough to contain \Rcode{ns} covered CpGs, whichever is greater).
143
+Note that the window width is different at each position in the genome and may also be different for
144
 +different samples at the same position, since it depends on how many nearby CpGs have non-zero
145
 +coverage.  By default, a smoothing cluster is a whole chromosome.  By ``cluster'' we mean a set of
146
+CpGs which are processed together.  This means that even if there is a large distance between two
147
+CpGs, we borrow strength between them.  By setting \Rcode{maxGap} this can be prevented since the
148
+argument describes the longest distance between two CpGs before a cluster is broken up into two
149
+clusters.
150
+
151
+\subsubsection*{Manually splitting the smoothing computation}
152
+
153
+
154
+An example, only showing sample 1 and 2 for brevity, is (this example is not being run when the
155
+vignette is being created):
156
+
157
+<<smoothSplit,eval=FALSE>>=
158
 +## Split data
159
+BS1 <- BS.cancer.ex[, 1]
160
+save(BS1, file = "BS1.rda")
161
+BS2 <- BS.cancer.ex[, 2]
162
 +save(BS2, file = "BS2.rda")
163
+## done splitting
164
+
165
+## Do the following on each node
166
+
167
+## node 1
168
+load("BS1.rda")
169
+BS1.fit <- BSmooth(BS1)
170
+save(BS1.fit)
171
+save(BS1.fit, file = "BS1.fit.rda")
172
+## done node 1
173
+
174
+## node 2
175
+load("BS2.rda")
176
+BS2.fit <- BSmooth(BS2)
177
+save(BS2.fit, file = "BS2.fit.rda")
178
+## done node 2
179
+
180
+## join; in a new R session
181
+load("BS1.fit.rda")
182
+load("BS2.fit.rda")
183
+BS.fit <- combine(BS1.fit, BS2.fit)
184
+@ 
185
+This still requires that you have one node with enough RAM to hold all samples in memory. 
186
+
187
+
188
+\section*{Computing t-statistics}
189
+
190
+Before computing t-statistics, we will remove CpGs with little or no coverage.  If this is not done,
191
+you may find many DMRs in areas of the genome with very little coverage, which are most likely false
192
+positives.  It is open to personal preferences exactly which CpGs to remove, but for this analysis
193
+we will only keep CpGs where at least 2 cancer samples and at least 2 normal samples have at least
194
+2x in coverage.  For readability, we store the coverage in a separate matrix (this is just due to
195
+line breaks in Sweave)
196
+<<keepLoci>>=
197
+BS.cov <- getCoverage(BS.cancer.ex.fit)
198
+keepLoci.ex <- which(rowSums(BS.cov[, c("C1", "C2", "C3")] >= 2) >= 2 &
199
+                  rowSums(BS.cov[, c("N1", "N2", "N3")] >= 2) >= 2)
200
+length(keepLoci.ex)
201
+BS.cancer.ex.fit <- BS.cancer.ex.fit[keepLoci.ex,]
202
+@ 
203
+(the \Rcode{keepLoci.ex}  is also available for direct inspection in the \Rpackage{bsseqData}
204
+package.)
205
+
206
+
207
+We are now ready to compute t-statistics, by
208
+<<BSmooth.tstat>>=
209
+BS.cancer.ex.tstat <- BSmooth.tstat(BS.cancer.ex.fit, 
210
+                                    group1 = c("C1", "C2", "C3"),
211
+                                    group2 = c("N1", "N2", "N3"), 
212
+                                    estimate.var = "group2",
213
+                                    local.correct = TRUE,
214
+                                    verbose = TRUE)
215
+BS.cancer.ex.tstat
216
+@ 
217
+(the \Rcode{BS.cancer.ex.tstat}  is also available for direct inspection in the \Rpackage{bsseqData}
218
+package.)
219
+
220
+
221
+
222
+The arguments to \Rcode{BSmooth.tstat} are simple.  \Rcode{group1} and \Rcode{group2} contain the
223
+sample names of the two groups being compared (it is always group1 - group2), and indices may be
224
+used instead of sample names.  \Rcode{estimate.var} describes which samples are being used to
225
 +estimate the variability.  Because this is a cancer dataset, and cancers have higher variability than
226
+normals, we only use the normal samples to estimate the variability.  Other choices of
227
+\Rcode{estimate.var} are \Rcode{same} (assume same variability in each group) and \Rcode{paired} (do
228
+a paired t-test).  The argument \Rcode{local.correct} describes whether we should use a large-scale
229
+(low-frequency) mean correction.  This is especially important in cancer where we have found many
230
+large-scale methylation differences between cancer and normals.
231
+
232
+We can look at the marginal distribution of the t-statistic by
233
+
234
+\setkeys{Gin}{width=0.5\textwidth}
235
+<<plotTstat,fig=TRUE,width=5,height=5>>=
236
+plot(BS.cancer.ex.tstat)
237
+@ 
238
+\setkeys{Gin}{width=0.8\textwidth}
239
+
240
+
241
+The ``blocks'' of hypomethylation are clearly visible in the marginal distribution of the
242
+uncorrected t-statistics.
243
+
244
+Even in comparisons where we do not observe these large-scale methylation differences, it often
245
+improves the marginal distribution of the t-statistics to locally correct them (``improves'' in the