inst/scripts/copynumber.Rnw
16fadf14
   Copy number routines in the \crlmm{} package are available for
   Affymetrix 5.0 and 6.0 platforms, as well as several Illumina
   platforms.  This vignette assumes that the arrays have already been
   successfully preprocessed and genotyped as per the instructions in
a3b625d4
   the \verb+AffyGW+ and \verb+IlluminaPreprocessCN+
16fadf14
   vignettes for the Affymetrix and Illumina platforms,
   respectively. While this vignette uses Affymetrix 6.0 arrays for
   illustration, the steps at this point are identical for both
669c6a90
   platforms.  See \citep{Scharpf2011} for details regarding the
16fadf14
   methodology implemented in \crlmm{} for copy number analysis.  In
   addition, a compendium describing copy number analysis using the
   \crlmm{} package is available from the author's website:
0621b63a
   \url{http://www.biostat.jhsph.edu/~rscharpf/crlmmCompendium/index.html}.
e63564bc
 
7ae22f52
 
16fadf14
 \textbf{Limitations:} While a minimum number of samples is not
 required for preprocessing and genotyping, copy number estimation in
 the \crlmm{} package currently requires at least 10 samples per batch.
 The parameter estimates for copy number and the corresponding
 estimates of raw copy number will tend to be more noisy for batches
 with small sample sizes (e.g., $<$ 50).  Chemistry plate or scan date
669c6a90
 are often useful surrogates for batch.  Samples that were processed at
 similar times (e.g., in the same month) can be grouped together in the
 same batch.
518a2908
 
16fadf14
 \section{Quality control}
518a2908
 
16fadf14
 The signal to noise ratio (SNR) estimated by the CRLMM genotyping
 algorithm is an overall measure of the separation of the diallelic
 genotype clusters at polymorphic loci and can be a useful measure of
 array quality.  Small SNR values can indicate possible problems with
 the DNA.  Depending on the size of the dataset and the number of
 samples with low SNR, users may wish to rerun the preprocessing and
 genotyping steps after excluding samples with low SNR.  The SNR is
 stored in the \verb+phenoData+ slot of the \Robject{CNSet} object and
 is available after preprocessing and genotyping. SNR values below 5
 for Affymetrix or below 25 for Illumina may indicate poor sample
 quality.  The following code chunk makes a histogram of the SNR values
 for the HapMap samples.
518a2908
 
16fadf14
 <<snr,fig=TRUE,include=FALSE,width=6, height=4>>=
7c0c9ac5
 library(lattice)
669c6a90
 invisible(open(cnSet$SNR))
16fadf14
 snr <- cnSet$SNR[]
 close(cnSet$SNR)
7c0c9ac5
 print(histogram(~snr,
 		panel=function(...){
 			panel.histogram(...)},
 		breaks=25, xlim=range(snr), xlab="SNR"))
e63564bc
 @
7c0c9ac5
 
2ae7850e
 
0198a9ad
 
16fadf14
 \section{Copy number estimation}
 
808cfecc
 There are two ways to obtain marker-level estimates of copy number
 that are supported by \Rpackage{crlmm}. One approach is to fit a
 linear model to the normalized intensities stratified by the diallelic
 genotype call at each SNP, as described in \cite{Scharpf2011}.
 Another alternative is to compute log R ratio and B allele
 frequencies. The latter is often better supported by downstream hidden
 Markov models such as those in the \Rpackage{VanillaICE} package. We
 describe each approach in the following two sections.
 
 \subsection{Linear model for normalized intensities}
 
 In this section, we fit a linear model to the normalized intensities
 stratified by the diallic genotype call.  The intercept and slope from
 the linear model are both SNP- and batch-specific.  The implementation
 in the \crlmm{} package is encapsulated by the function
 \Rfunction{crlmmCopynumber} that, using the default settings, can be
 called by passing a single object of class \Rclass{CNSet}.
 
16fadf14
 <<LDS_copynumber,cache=TRUE>>=
3be95c2b
 crlmmCopynumber(cnSet)
16fadf14
 @
66900fea
 
16fadf14
 The following steps were performed by the \Rfunction{crlmmCopynumber}
 function:
0198a9ad
 \begin{itemize}
16fadf14
 \item sufficient statistics for the genotype clusters for
   each batch
 \item unobserved genotype centers imputed
 \item posterior summaries of sufficient statistics
 \item intercept and slope for linear model
ac19c848
 \end{itemize}
16fadf14
 Depending on the value of \verb+ocProbesets()+, these summaries are
 computed for subsets of the markers to reduce the required RAM. Note
 that the value returned by the \Rfunction{crlmmCopynumber} function in
 the above example is \verb+TRUE+.  The reason the function returns
 \verb+TRUE+ in the above example is that the elements of the
7c0c9ac5
 \verb+batchStatistics+ slot have the class \Rclass{ff\_matrix}.
 Rather than keep the statistical summaries in memory, the summaries
 are written to files on disk using protocols described in the
16fadf14
 \Rpackage{ff} package. Hence, while the \Robject{cnSet} object itself
 is unchanged as a result of the \Rfunction{crlmmCopynumber} function,
 the data on disk is updated accordingly.  Users that are interested in
 accessing these low-level summaries can refer to the
7c0c9ac5
 \verb+Infrastructure+ vignette.  Note that the data structure depends
 on whether the elements of the \verb+batchStatistics+ slot are
 \Robject{ff} objects or ordinary matrices.  In this example, the
 elements of \verb+batchStatistics+ have the class \Rclass{ff\_matrix}.
518a2908
 
669c6a90
 <<classes>>=
16fadf14
 nms <- ls(batchStatistics(cnSet))
 cls <- rep(NA, length(nms))
 for(i in seq_along(nms)) cls[i] <- class(batchStatistics(cnSet)[[nms[i]]])[1]
 all(cls == "ff_matrix")
 @
 
 The batch-specific statistical summaries computed by
 \Robject{crlmmCopynumber} are written to files on disk using protocols
 described in the \R{} package \Rpackage{ff}. The value returned by
 \Rfunction{crlmmCopynumber} is \verb+TRUE+, indicating that the files
 on disk have been successfully updated.  Note that while the
 \Robject{cnSet} object is unchanged, the values on disk are different.
808cfecc
 
16fadf14
 
 <<chr1index>>=
 chr1.index <- which(chromosome(cnSet) == 1)
453e688a
 open(cnSet)
e63564bc
 @
518a2908
 
16fadf14
 <<subset,cache=TRUE>>=
 cnSet2 <- cnSet[chr1.index, ]
 @
 
 <<valuematrix>>=
 close(cnSet)
 for(i in seq_along(nms)) cls[i] <- class(batchStatistics(cnSet2)[[nms[i]]])[1]
 all(cls == "matrix")
 @
 
 <<clean, echo=FALSE, results=hide>>=
 rm(cnSet2); gc()
 @
 
7c0c9ac5
 \paragraph{Raw total copy number.}
16fadf14
 Several functions are available that will compute relatively quickly
 the allele-specific, \emph{raw} copy number estimates. At allele $k$,
0198a9ad
 marker $i$, sample $j$, and batch $p$, the estimate of allele-specific
 copy number is computed by subtracting the estimated background from
7c0c9ac5
 the normalized intensity and scaling by the slope coefficient. More
 formally, \newcommand{\A}{A} \newcommand{\B}{B}
66900fea
 \begin{eqnarray}
   \label{eq:cnK}
 {\hat c}_{k,ijp} &=& \mbox{max}\left\{\frac{1}{{\hat
     \phi}_{k,ip}}\left(I_{k,ijp}-{\hat \nu}_{k,ip}\right), ~0\right\}
 \mbox{~for~} k \in \{\A, \B\}.
0198a9ad
 \end{eqnarray}
063b3d14
 \noindent See \cite{Scharpf2011} for details.
66900fea
 
16fadf14
 The function \Rfunction{totalCopynumber} translates the normalized
 intensities to an estimate of raw copy number by adding the
 allele-specific summaries in Equation \eqref{eq:cnK}. For large
 datasets, the calculation will not be instantaneous as the I/O can be
 substantial.  Users should specify either a subset of the markers or a
 subset of the samples to avoid using all of the available RAM.  For
 example, in the following code chunk we compute the total copy number
 at all markers for the first 2 samples, and the total copy number for
 chromosome 20 for the first 50 samples.
 
 <<totalCopynumber>>=
7c0c9ac5
 tmp <- totalCopynumber(cnSet, i=seq_len(nrow(cnSet)), j=1:2)
16fadf14
 dim(tmp)
7c0c9ac5
 tmp2 <- totalCopynumber(cnSet, i=which(chromosome(cnSet) == 20), j=seq_len(ncol(cnSet)))
16fadf14
 dim(tmp2)
 @
 
 Alternatively, the functions \Rfunction{CA} and \Rfunction{CB} compute
 the allele-specific copy number.  For instance, the following code
063b3d14
 chunk computes the allele-specific summaries at all polymorphic loci
 for the first 2 samples.
66900fea
 
 <<ca>>=
7eaf43e8
 snp.index <- which(isSnp(cnSet) & !is.na(chromosome(cnSet)))
063b3d14
 ca <- CA(cnSet, i=snp.index, j=1:2)
 cb <- CB(cnSet, i=snp.index, j=1:2)
16fadf14
 @
 
0198a9ad
 
063b3d14
 \subsection{Container for log R ratios and B allele frequencies}
808cfecc
 \label{sec:lrrBaf}
16fadf14
 
 A useful container for storing the \crlmm{} genotypes, genotype
063b3d14
 confidence scores, and the total or relative copy number at each
808cfecc
 marker is the \Rclass{BafLrrSetList} class.  Coercion of a
 \Rclass{CNSet} object to a \Rfunction{BafLrrSetList} object can be
 acheived by the function \Rfunction{BafLrrSetList} as
063b3d14
 illustrated below. Users should note that if the \verb+assayData+
 elements in the \Rclass{CNSet} instance are \Rpackage{ff} objects, the
808cfecc
 \verb+assayData+ elements of each element in the \Rclass{BafLrrSetList}
063b3d14
 object will be \Rpackage{ff}-dervied objects (a new
 \verb+total_cn*.ff+ file will be created in the \verb+ldPath()+
808cfecc
 directory). The following code-chunk is not evalutated.
 
 Estimation of log R ratios and B allele frequencies from an object of class \Rclass{CNSet} instantiated from genotyping Affymetrix CEL files or Illumina iDat files requires running the \R{} function \Rfunction{crlmmCopynumber} as a preliminary step.  Specifying \texttt{fit.linearModel=FALSE} will speed up computation as this function will by default compute summary statistics unnecessary for computing BAFs and log R ratios.
669c6a90
 
808cfecc
 <<callingCrlmmCopynumber,eval=FALSE>>=
 crlmmCopynumber(cnSet, fit.linearModel=FALSE)
 @
 
ccbb01f6
 <<oligoSnpSet>>=
808cfecc
 open(cnSet)
 oligoSetList <- BafLrrSetList(cnSet)
 close(cnSet)
063b3d14
 show(oligoSetList)
 class(oligoSetList)
 ## oligoSnpSet of first chromosome
 oligoSetList[[1]]
16fadf14
 @
 
808cfecc
 \noindent Log R ratios and B allele frequences can be retrieved by the
 accessors \Rfunction{lrr} and \Rfunction{baf}, respectively.
16fadf14
 
ccbb01f6
 <<testEqual>>=
808cfecc
 lrrList <- lrr(oligoSetList)
063b3d14
 class(lrrList)
 dim(lrrList[[1]]) ## log R ratios for chromosome 1.
 bafList <- baf(oligoSetList)
 dim(bafList[[1]]) ## B allele frequencies for chromosome 1
 @
808cfecc
 
 If the \Rfunction{crlmmCopynumber} function is not run prior to the
 \Rfunction{BafLrrSetList} construction, the log R ratios and BAFs will
 be initialized as NAs.