%\VignetteIndexEntry{The bsseq user's guide}
%\VignetteDepends{bsseq}
%\VignettePackage{bsseq}
%\VignetteEngine{knitr::knitr}
\documentclass[12pt]{article}
<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@
%% \newcommand{\bold}[1]{\textbf{#1}} % Fix for R bibentry
%% \newcommand{\Rhref}[2]{\href{#1}{#2}}

\title{The bsseq user's guide}
\author{Kasper Daniel Hansen \\ \texttt{kasperdanielhansen@gmail.com}}
\date{Modified: June 10, 2015.  Compiled: \today}
\begin{document}

\maketitle

\section*{Capabilities}

The packages required for this vignette are

<<load, warning=FALSE, message=FALSE>>=
library(bsseq)
@ 

\subsubsection*{Introduction}

This \R\ package is the reference implementation of the BSmooth algorithm for analyzing whole-genome
bisulfite sequencing (WGBS) data.  This package does \emph{not} contain alignment software, which is
available from \url{http://rafalab.jhsph.edu/bsmooth}.  This package is not intended for analysis of
single-loci bisulfite sequencing (typically Sanger bisulfite sequencing or pyro bisulfite
sequencing).

The package has been used to analyze capture bisulfite sequencing data.  For this type of data, the
most useful parts of the package is its data-handling functions.  The BSmooth algorithm itself may
not be useful for a standard capture bisulfite sequencing experiment, since it relies heavily on
smoothing, which again requires that we have measured methylation in bigger regions of the
genome.

The BSmooth algorithm is described in detail in \cite{Hansen:2012}.  It was applied to human cancer
data in \cite{Hansen:2011} and we have also used it to analyze data from Rhesus Macaque
\cite{Tung:2012}.  Specifically, the algorithm uses smoothing to get reliable semi-local methylation
estimates from low-coverage bisulfite sequencing.  After smoothing it uses biological replicates to
estimate biological variation and identify differentially methylated regions (DMRs).  The smoothing
portion could be used even on a single sample, but we believe that variation between individuals
is an important aspect of DNA methylation and needs to be accounted for, see also
\cite{Hansen:2011rna} for a relevant discussion.

The main focus for our development has been the analysis of CpG methylation in humans, and we have
successfully used it in other primates \cite{Tung:2012}.  It is highly likely that the approach will
work in non-human organisms, but care must be taken: the smoothing parameters (which we have
selected based on human data) should be evaluated carefully.  Furthermore, it may not be suitable at
all for organisms with vastly different (from humans) methylation structures.

With respect to non-CpG methylation, the situation is mixed.  We have never used the algorithm to
analyze non-CpG methylation, but it should be straightforward to do so.  However, the data
structures used in the current code might not conveniently scale from the 28.2M CpGs in the human
genome to the roughly 2x585M Cs (it may well be possible to do an separate analysis for each
chromosome).  This should be less of an issue for organisms with smaller genomes.  We are
considering changing these underlying data structures to allow for easier analysis of non-CpG
methylation in humans.

\subsubsection*{System Requirements}

The package requires that all the data is loaded into system memory.  By ``data'' we do not mean the
individual reads (which is big for a whole-genome experiment).  Instead, what we need are summarized
data given us the number of reads supporting methylation as well as the total number of reads at
each loci.  Focusing on human data, we need to work with objects with a maximum of 28.2 million
entries, per sample (since there are roughly 28.2 millions CpGs in the human genome).  This provides
us with an upper limit on the data object.  

Based on this, the 8 samples from \cite{Hansen:2011} including the smoothed values, take up roughly
1.2GB of RAM, meaning an analysis can easily be done with 8GB of RAM.  In order to improve speed,
the package allows for easy parallel processing of samples/chromosomes.  This will require multiple
copies of the data object for each core used, increasing the memory usage substantially to perhaps
even 32GB.  This can be avoided by processing the samples sequentially at the cost of speed.

On a 64GB node, the 8 samples from \cite{Hansen:2011} takes roughly one hour to process in parallel
using 8 cores (using much less than 64GB of RAM in total).  This does not including parsing the data
from the alignment output files.

\subsubsection*{Some terminology}

Because not all methylation happens at CpG sites, we have tried to use the term ``methylation loci''
instead of CpG.  We use this term to refer to a single-base methylation site.

Some standard terms from the DNA methylation field: differentially methylated region (DMR), hyper
methylation (methylation that is higher in one condition/sample than in another), hypo methylation
(methylation that is lower in one condition/sample than in another), and finally differentially
methylated position (DMP) referring to a single loci.

\subsubsection*{Citation}

If you use this package, please cite our BSmooth paper \cite{Hansen:2012}.

\section*{Overview}

The package assumes that the following data has been extracted from alignments:
\begin{enumerate}
\item genomic positions, including chromosome and location, for methylation loci.
\item a (matrix) of M (Methylation) values, describing the number of read supporting methylation covering a
  single loci.  Each row in this matrix is a methylation loci and each column is a sample.
\item a (matrix) of Cov (Coverage) values, describing the number of read supporting methylation covering a
  single loci.  Each row in this matrix is a methylation loci and each column is a sample.
\end{enumerate}

We can have a look at some data from \cite{Lister:2009}, specifically chromosome 22 from the IMR90
cell line.

<<BSchr22>>=
data(BS.chr22)
BS.chr22
@ 

The genomic positions are stored as a \Rcode{GRanges} object \Rcode{GRanges} are general genomic
regions; we represent a single base methylation loci as an interval of width 1 (which may seem a bit
strange, but there are good reasons for this).  For example, the first 4 loci in the Lister data are

<<ov-granges>>=
head(granges(BS.chr22), n = 4)
@ 

We also have the M and Cov matrices
<<ov-getCoverage>>=
head(getCoverage(BS.chr22, type = "M"), n = 4)
head(getCoverage(BS.chr22), n = 4)
@ 

Since CpG methylation is symmetric on the two strands of a chromosome, we aggregated reads on the
forward and reverse strand to form a single value, and we assume the genomic position points to the
C of the CpG.  It is not crucial in any way to do this, one may easily analyze each strand
separately, but CpG methylation is symmetric and this halves the number of loci.

How to input your methylation data into this data structure (called a \Rcode{BSseq} object) is
described in a section below.  We also have a section on how to operate on these types of objects.

An analysis typically consists of the following steps.
\begin{enumerate}
\item Smoothing, using the function \Rcode{BSmooth}.
\item Compute t-statistics, using the function \Rcode{BSmooth.tstat}.  This converts the
  \Rcode{BSseq} object into a \Rcode{BSseqTstat} object.
\item Threshold these t-statistics to identify DMRs, using the function \Rcode{dmrFinder}, returning
  a simple \Rcode{data.frame}.
\end{enumerate}
It is usually a good idea to look at the smoothed data either before or after identifying DMRs.
This can be done using the functions \Rcode{plotRegion} and \Rcode{plotManyRegions}.

We also have functions for assessing goodness of fit for binomial and poison models; this is like to
be of marginal interest to most users.  See the man page for \Rcode{binomialGoodnessOfFit}.

We also allow for easy computation of Fisher's exact test for each loci, using the function
\Rcode{fisherTests}. 

\section*{Using objects of class BSseq}

\subsubsection*{Basic operations}

Objects of class \Rcode{BSseq} contains a \Rcode{GRanges} object with the genomic locations.  This
\Rcode{GRanges} object can be obtained by \Rcode{granges}.  A number of standard \Rcode{GRanges}
methods works directly on the \Rcode{BSseq} object, such as \Rcode{start}, \Rcode{end},
\Rcode{seqnames} (chromosome names) etc.

These objects also contains a \Rcode{phenoData} object for sample pheno data.  Useful methods are
\Rcode{sampleNames}, \Rcode{pData}.

Finally, we have methods such as \Rcode{dim}, \Rcode{ncol} (number of columns; number of samples),
\Rcode{nrow} (number of rows; number of methylation loci).

Objects can be subsetted using two indicies \Rcode{BSseq[i,j]} with the first index being
methylation loci and the second index being samples.  Another very useful way of subsetting the
object is by using the method \Rcode{subsetByOverlaps}.  This selects all methylation loci inside a
set of genomic intervals (there is a difference between first and second argument and either can be
\Rcode{BSseq} or \Rcode{GRanges}).  

Examples:
<<using-examples>>=
head(start(BS.chr22), n = 4)
head(seqnames(BS.chr22), n = 4)
sampleNames(BS.chr22)
pData(BS.chr22)
dim(BS.chr22)
BS.chr22[1:6,1]
subsetByOverlaps(BS.chr22, 
                 GRanges(seqnames = "chr22", 
                         ranges = IRanges(start = 1, end = 2*10^7)))
@ 

\subsubsection*{Data handling}

We have a number of functions for manipulating one or more \Rcode{BSseq} objects.

\Rcode{BSseq} instantiates an object of class \Rcode{BSseq}.  Genomic locations are passed in,
either as a \Rcode{GRanges} object (argument \Rcode{gr}) or as chromosome and location vectors
(arguments \Rcode{chr} and \Rcode{pos}).  The arguments \Rcode{M} and \Rcode{Cov} accepts matrices,
and it is possible to directly give it a \Rcode{phenoData} object.

<<data-bsseq>>=
M <- matrix(0:8, 3, 3)
Cov <- matrix(1:9, 3, 3)
BStmp <- BSseq(chr = c("chr1", "chrX", "chr1"), pos = 1:3, 
               M = M, Cov = Cov, sampleNames = c("A1","A2", "B"))
@ 

A \Rcode{BSseq} object may be ordered by \Rcode{orderBSseq}.  This ensures that data from a single
chromosome appears in an ordered, contiguous block.  There is also the possibility for specifying
the chromosome order (this is less important).  The smoothing functions assumes that the underlying
\Rcode{BSseq} has been ordered.

<<data-order>>=
granges(BStmp)
BStmp <- orderBSseq(BStmp, seqOrder = c("chr1", "chrX"))
granges(BStmp)
@ 

\Rcode{chrSelectBSseq} performs the often useful task of selecting one or more chromosomes and can
also order the output.  In case \Rcode{order = TRUE}, the output is ordered according to the order
of the \Rcode{seqnames} argument.
<<data-chrSelect>>=
chrSelectBSseq(BStmp, seqnames = "chr1", order = TRUE)
@ 
Of course, in this case the function does little, since \Rcode{BS.chr22} already only contains data
from chromosome 22.

\Rcode{combine} combines two \Rcode{BSseq} objects in the following way: the samples for the return
objects is the union of the samples from the two objects, and the methylation loci are the union of
the two methylation loci.  The two objects do not need to have measured the same loci (in the
example below, \Rcode{BStmp} has data on chromosome 1 and X)..

<<data-combine>>=
BStmp2 <- combine(BStmp, BS.chr22[1:3,])
granges(BStmp2)
getCoverage(BStmp2)
@ 

\Rcode{collapseBSseq} performs the often useful task of adding several columns of a \Rcode{BSseq}
object.  This is often used in the beginning of an analysis where each column may correspond to a
lane and several such columns represents the data for a single biological sample.

<<data-collapse>>=
collapseBSseq(BStmp, columns = c("A", "A", "B"))
@ 

\subsubsection*{Obtaining coverage (methylation)}

Coverage, either Cov or M values, are obtained by \Rcode{getCoverage}, using the \Rcode{type} argument:
<<getCoverage1>>=
head(getCoverage(BS.chr22, type = "Cov"), n = 4)
head(getCoverage(BS.chr22, type = "M"), n = 4)
@ 
This will return a -- possibly very large -- matrix.  It is also possible to get region based
coverage by using the \Rcode{regions} argument to the function.  This argument is either a
\Rcode{data.frame} (with columns \Rcode{chr}, \Rcode{start} and \Rcode{end}) or a \Rcode{GRanges}
object.  Let us do an example
<<regions>>=
regions <- GRanges(seqnames = c("chr22", "chr22"), 
                   ranges = IRanges(start = 1.5 * 10^7 + c(0,200000), 
                                    width = 1000))
getCoverage(BS.chr22, regions = regions, what = "perRegionTotal")
@ 
When \Rcode{what} is \Rcode{perRegionTotal} the return value is the total coverage of each region
(and note that regions without coverage return \Rcode{NA}).  Similarly, \Rcode{perRegionAverage}
yields the average coverage in the region.  However, it is often useful to get the actual values, like
<<getCoverage2>>=
getCoverage(BS.chr22, regions = regions, what = "perBase")
@ 
This is the default behaviour, and it returns a list where each element corresponds to a region.
Note that regions with no coverage gets a \Rcode{NULL}.

Methylation estimates can be obtained in the same way, using the function \Rcode{getMeth}.  If
\Rcode{type} is set to \Rcode{raw} the function returns simple single-loci methylation estimates
(which are $\textrm{M}/\textrm{Cov}$).  To get smoothed estimates, the \Rcode{BSseq} object needs to
have been smoothed using Bsmooth, and \Rcode{type} set to \Rcode{smooth} (default).  The
\Rcode{getCoverage}, \Rcode{getMeth} has a \Rcode{regions} and a \Rcode{what} argument.  For
\Rcode{getMeth} the \Rcode{what} argument can be \Rcode{perBase} or \Rcode{perRegion} (the later is
really the per-region average methylation).  Additionally, confidence intervals can be computed
using a method taking possibly low coverage loci into account as well as loci where the methylation
percentage might be close to 0 or 1 \cite{Agresti:1998}.  Currently, confidence intervals cannot be
computed for region-level summary estimates.  Examples

<<getMeth>>=
getMeth(BS.chr22, regions, type = "raw")
getMeth(BS.chr22, regions[2], type = "raw", what = "perBase")
@ 

\section*{Reading data}

\subsubsection*{Alignment output from the BSmooth alignment suite}

It is straightforward to read output files (summarized evidence) from the BSmooth alignment suite,
using \Rcode{read.bsmooth}.  This function takes as argument a number of directories, usually
corresponding to samples, with the alignment output.  Sometimes, one might want to read only certain
chromosomes, which can be accomplished by the \Rcode{seqnames} argument.  Also, the default values
of \Rcode{qualityCutoff = 20} ensures that we do not use methylation evidence with a base quality
strictly lower than 20 (since we may not be confident whether the read really supports methylation
or not).  The function \Rcode{read.bsmooth} allows for both gzipped and non-gzipped input files. It
is faster to read gzipped files, so we recommend gzipping post-alignment.

During the development of BSmooth we experimented with a number of different output formats.  These
legacy formats can be read with \Rcode{read.umtab} and \Rcode{read.umtab2}.

\subsubsection*{Alignment output from other aligners}

We are interested in adding additional parsers to the package; if your favorite alignment software
is not supported, feel free to get in touch.

In general, we need summarized methylation evidence.  In short, for each methylation loci, we need
to know how many reads cover the loci and how many of those reads support methylation.

As an example, consider the Lister data.  The files posted online looks like
\begin{verbatim}
> head mc_imr90_r1_22
assembly        position        strand  class   mc      h
22      14430632        +       CG      9       10
22      14430633        -       CG      8       8
22      14430677        +       CG      1       1
22      14430678        -       CG      3       10
22      14430688        -       CG      6       10
22      14430703        -       CG      2       8
22      14431244        +       CG      5       10
22      14431245        -       CG      5       11
\end{verbatim}
For these files, the evidence is split by strand.  It is often easiest to read in one sample at a
time, instantiate a \Rcode{BSseq} object and the use \Rcode{combine} and \Rcode{collapseBSseq} to
combine the samples, since these functions deal objects that have different sets of CpGs.  In the
Lister data, note that the position is the position of the ``C'', so basically, if you want to
combine the evidence from both strands, CpGs on the ``-'' strand needs to have 1 subtracted from
their position.  A full script showing how these files are used to create \Rcode{BS.chr22} can be
found in \texttt{inst/scripts/get\_BS.chr22.R} in the \Rpackage{bsseq} package.  The path of this
file may be found by
<<locateScript,results="hide">>=
system.file("scripts", "get_BS.chr22.R", package = "bsseq")
@ 

\section*{Analysis}

Computing smoothed methylation levels is simple.  We subset \Rcode{BS.chr22} so we only smooth 1
sample, for run-time purpose

<<BSmooth>>=
BS.chr22.1 <- BSmooth(BS.chr22[,"r1"], verbose = TRUE)
BS.chr22.1
@ 

A number of arguments pertains to using multiple cores.  This is useful if you have multiple samples
or chromosomes.  \Rcode{mc.cores} tells \Rcode{BSmooth} how many cores to use and \Rcode{parallelBy}
describes whether the parallelization is over samples or chromosomes.  If the parallelization is
over samples, each core will compute the smoothing of the entire genome.  And if it is over
samples, each cores will smooth all samples for one or more chromosomes.  The appropriate choice
depends on the number of samples and cpu cores available.  If you have the exact same number of
samples as cores, the fastest is \Rcode{parallelBy = "sample"}.  If you have less samples than
cores, the fastest is \Rcode{parallelBy = "chromosome"}.  The argument \Rcode{mc.preshedule} should
not need to be changed (unless perhaps if a small value of \Rcode{maxGap} is uses); see the man page
for \Rcode{parallel::mclapply}.  Note that setting \Rcode{mc.cores} to a value greater than 1 is not
support on MS Windows due to a limitation of the operating system.

For a more detailed discussion of the analysis tools, read the companion vignette ``Analyzing WGBS
with the bsseq package'', which also finding DMRs and plotting them.

Fisher's exact tests may be efficiently computed using the function \Rcode{fisherTests}.

Binomial and poisson goodness of fit tests statistics may be computed using \Rcode{binomialGoodnessOfFit} and \Rcode{poissonGoodnessOfFit}.

\bibliography{bsseq}


\section*{SessionInfo}

<<sessionInfo,results="asis",eval=TRUE,echo=FALSE>>=
toLatex(sessionInfo())
@ 

\end{document}

% Local Variables:
% eval: (add-hook 'LaTeX-mode-hook '(lambda () (if (string= (buffer-name) "bsseq.Rnw") (setq fill-column 100))))
% LocalWords: LocalWords bisulfite methylation methylated CpG CpGs DMR
% LocalWords: DMRs differentially
% End: