%\VignetteIndexEntry{Infrastructure for copy number analysis}
%\VignetteDepends{crlmm, genomewidesnp6Crlmm}
%\VignetteKeywords{crlmm, SNP 6}
\newcommand{\oligo}{\Rpackage{oligo }}

\title{Infrastructure for copy number analysis in \crlmm{}}
\author{Rob Scharpf}



  This vignette provides an overview of the \Rclass{CNSet} class and a
  brief discussion of the underlying infrastructure for large data
  support with the \Rpackage{ff} package.  This vignette instantiates
  an object of class \Rclass{CNSet} using a trivial dataset with 3
  files.  As this sample size is too small for estimating copy number
  with the \crlmm{} package, the final section of this vignette loads
  an object created by the analysis of 180 HapMap CEL files
  (Affymetrix 6.0 platform).  This object was instantiated by running
  the \verb+AffyGW+ vignette.



\section{Supported platforms}

The supported Affymetrix and Infinium platforms are those for which a
corresponding annotation package is available.  The annotation
packages contain information on the markers, such as physical position
and chromosome, as well as pre-computed parameters estimated from
HapMap used during the preprocessing and genotyping steps. For
Affymetrix, the 5.0 and 6.0 platforms are supported and the
corresponding annotation packages are \Rpackage{genomewidesnp5Crlmm}
and \Rpackage{genomewidesnp6Crlmm}.  Supported Infinium platforms are
listed in the following code chunk.

pkgs <- annotationPackages()
crlmm.pkgs <- pkgs[grep("Crlmm", pkgs)]
crlmm.pkgs[grep("human", crlmm.pkgs)]

\textbf{Large data:} In order to reduce \crlmm{}'s memory footprint
for copy number estimation, we require the \Rpackage{ff}.  The
\Rpackage{ff} package provides infrastructure for accessing and
writing data to disk instead of keeping data in memory.  As the
functions for preprocessing, genotyping, and copy number estimation do
not simultaneously require all samples and all probes in memory,
memory-usage by \crlmm{} can be fine-tuned by reading in and
processing subsets of the markers and/or samples. The functions
\Rfunction{ocSamples} and \Rfunction{ocProbesets} in the
\Rpackage{oligoClasses} package can be used to declare how many
markers and samples to read at once.  In general, specifying smaller
values should reduce the RAM required for a particular job.  In
general, smaller values will increase the run-time. In the following
code-chunk, we declare that \crlmm{} should process 100,000 markers at
a time (when possible) and 200 samples at a time.  If our dataset
contained fewer than 200 samples, the \Rfunction{ocSamples} option
would not have any effect.  One can view the current settings for
these commands, by typing the functions without an argument.


\section{The \Rclass{CNSet} container}

\subsection{Instantiating an object of class \Rclass{CNSet}}

An object of class \Rclass{CNSet} can be instantiated by one of two


\item[Approach 1:] during the preprocessing of the raw intensities for
  Illumina and Affymetrix arrays by the the functions
  \Rfunction{constructInf} and \Rfunction{genotype},
  respectively. (The \Rfunction{genotype} calls the function
  \Rfunction{constructAffy} to initialize a \Rclass{CNSet} object for
  Affymetrix platforms.)

\item[Approach 2:] by subsetting an existing \Robject{CNSet} object.
  As per usual, the `[' method can be used to extract a subset of
  markers $i$ as in `[i, ]', a subset of samples $j$ as in `[, j]', or
  a subset of markers $i$ and samples $j$ as in `[i, j]'.


There are important differences in the underlying data representation
depending on how the \Rclass{CNSet} object was instantiated.  In
particular, objects generated by the functions
\Rfunction{constructInf} and \Rfunction{genotype} store
high-dimensional data on disk rather than in memory through protocols
defined in the \R{} package \ff{}.  For instance, the normalized
intensities and genotype calls in a \Rclass{CNSet}-instance from
approach (1) are \ff{}-derived objects.  By contrast, when such an
objected generated by approach (1) is subset by the `[' method, an
object of the same class is returned but the \Rpackage{ff}-derived
objects are coerced to ordinary matrices. Note, therefore, that both
approaches (1) and (2) may involve substantial I/O and that (2) should
be performed judiciously.

\subsubsection{Approach 1}

To illustrate the first approach, we begin by specifying a local
directory to store output files and setting the \Rfunction{ldPath}
function with this path.

outdir <- paste("/local_data/r00/crlmm/", getRversion(), "/infrastructure", sep="")
if(!file.exists(outdir)) dir.create(outdir)

Next we load the annotation package required, as well as the \R{}
package \Rpackage{hapmapsnp6} that contains 3 example CEL files.  We
use the \Rfunction{system.file} function to find the path to the CEL

require(genomewidesnp6Crlmm) & require(hapmapsnp6)
path <- system.file("celFiles", package="hapmapsnp6")
celfiles <- list.celfiles(path, full.names=TRUE)

Typically, an object of class \Rclass{CNSet} is instantiated as part
of the preprocessing and genotyping by calling the function
\Rfunction{genotype}, as illustrated in the
\verb+AffyGW+ vignette.

exampleSet <- genotype(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6")

Several files with \verb+.ff+ extensions now appear in the directory
indicated by the \Rfunction{ldPath} function.


One could also instantiate an object of class \Rclass{CNSet} without
preprocessing/genotyping by calling the exported function
\Rfunction{constructAffy} directly using the \Rfunction{:::} operator.

tmp <- constructAffy(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6")

The \Rfunction{show} method provides a concise summary of the
\Robject{exampleSet} object. Note the class of the elements in the
\verb+batchStatistics+ and \verb+assayData+ slots is indicated in the
first line of the summary.


% We briefly outline some of the unique aspects of the
% \Rclass{CNSet}-class using in \crlmm{} that may differ from the more
% standard extensions of the virtual class \Rclass{eSet} defined in
% the \Rpackage{Biobase} package.

As the \verb+assayData+ elements of the \Robject{exampleSet} object
are stored on disk rather than in memory, we inspect attributes of the
elements by first opening the file connection using the
\Rfunction{open}. For example, in the following code we extract the
normalized intensities for the A allele by first opening the object
returned by the \Rfunction{A} function.  The name of the file where
the data is stored on disk is provided by the
\Rfunction{filename}. Finally, the normalized intensities can be
pulled from disk to memory by the `[' method.  It can be useful to
wrap the `[' method by the \Rfunction{as.matrix} to ensure that the
output is the desired class.

as.matrix(A(exampleSet)[1:5, ])

\paragraph{Moving \texttt{*.ff} files} Ideally, the files with
\verb+.ff+ extension should not be moved.  However, if this is not
possible, the safest way to move these files is to clone all of the
\ff{} objects using the \Rfunction{clone}, followed by the
\Rfunction{delete} function to remove the original files on disk. An
example of the \Rfunction{delete} function is included at the end of
the \verb+IlluminaPreprocessCN+ vignette. See the documentation for
the \Rfunction{clone} and \Rfunction{delete} functions in the \ff{}
package for additional details.

\paragraph{Order of operations:} For \Rclass{CNSet}-instances derived
by approach (1), users should be mindful of the substantial I/O when
using accessors to extract data from the class.  For example, the
following 2 methods would extract identical results, with the latter
being much more efficient (extra parentheses are added to the second
operation to emphasize the order of operations):

A(exampleSet[1:5, ])
## preferred
(A(exampleSet))[1:5, ]

\subsubsection{Approach 2: using `['}

Here we instantiate a new object of class \Rclass{CNSet} by applying the
`[' method to an existing object of class \Rclass{CNSet}.

cnset.subset <- exampleSet[1:5, ]

Note the class of the \verb+batchStatistics+ and \verb+assayData+
elements of the \Robject{cnset.subset} object printed in the first
line of the summary.


\subsection{Slots of class \Rclass{CNSet}}

Information on physical position, chromosome, and whether the marker
is a SNP can be accessed through accessors defined for the

is.snp <- isSnp(exampleSet)
snp.index <- which(is.snp)
np.index <- which(!is.snp)
chr1.index <- which(chromosome(exampleSet) == 1)


The \verb+assayData+ elements are of the class
\Rclass{ff\_matrix}/\Rclass{ffdf} or \Rclass{matrix}, depending on how
the \Rclass{CNSet} object was instantiated.  Elements in the
\verb+assayData+ environment can be listed using the \Rfunction{ls}


The normalized intensities for the A and B alleles have names
\verb+alleleA+ and \verb+alleleB+ and can be accessed by the methods
\Rfunction{A} and \Rfunction{B}, respectively.  Genotype calls and
confidence scores can be accessed by \Rfunction{snpCall} and
\Rfunction{snpCallProbability}, respectively.  Note that the
confidence scores are represented as integers to reduce the filesize,
but can be tranlated to the probability scale using the \R{} function
\Rfunction{i2p}. For example,

scores <- as.matrix(snpCallProbability(exampleSet)[1:5, 1:2])

Note that for the Affymetrix 6.0 platform the assay data elements each
have a row dimension corresponding to the total number of polymorphic
and nonpolymorphic markers interrogated by the Affymetrix 6.0
platform.  A consequence of keeping the rows of the assay data
elements the same for all of the statistical summaries is that the
matrix used to store genotype calls is larger than necessary.

\subsubsection{\texttt{batch} and \texttt{batchStatistics}}

As defined in Leek \textit{et al.} 2010, \textit{Batch effects are
  sub-groups of measurements that have qualitatively different
  behaviour across conditions and are unrelated to the biological or
  scientific variables in a study}.  The \verb+batchStatistics+ slot
is an environment used to store SNP- and batch-specific summaries,
such as the sufficient statistics for the genotype clusters and the
linear model parameters used for copy number estimation.  The
\verb+batch+ slot is used to store the 'batch name' for each
array. For small studies in which the samples were processed at
similar times (e.g., within a month), all the samples can be
considered to be in the same batch.  For large studies in which the
samples were processed over several months, users should the scan date
of the array or the chemistry plate are useful surrogates. The only
constraint on the \verb+batch+ variable is that it must be a character
vector that is the same length as the number of samples to be
processed.  The \verb+batch+ is specified as an argument to the \R{}
functions \Rfunction{constructInf} and \Rfunction{genotype} that
instantiate \Rclass{CNSet} objects for the Illumina and Affymetrix
platforms, respectively.  The \Rfunction{batch} function can be used
to access the \verb+batch+ information on the samples as in the
following example.


For the \verb+batchStatistics+ slot, the elements in the environment
have the class \Rclass{ff\_matrix}/\Rclass{ffdf} or \Rclass{matrix},
depending on how the \Rclass{CNSet} object was instantiated.  The
dimension of each element is the number of markers (SNPs +
nonpolymorphic markers) $\times$ the number of batches. The names of
the elements in the environment can be list using the \R{} function


Currently, the batch-specific summaries are stored to allow some
flexibility in the choice of downstream analyses of copy number and
visual assessments of model fit.  Documentation for such applications
will be expanded in future versions of \crlmm{}, and are currently not
intended to be accessed directly by the user.

%The \Robject{CNSet} class also contains the slot
%\Robject{batchStatistics} that contains batch-specific summaries
%needed for copy number estimation. In particular, each element is a
%matrix (or an ff object) with R rows and C columns, correspoinding to
%R markers and C batches.  The summaries includes the within genotype
%cluster medians and median absolute deviations (mads), but also
%parameters estimated from the linear model.  (For unobserved
%genotypes, the medians are imputed and the variance is obtained the
%median variance (across markers) within a batch. ) The elements of the
%slot can be listed as follows.


Sample-level summaries obtained during the preprocessing/genotyping
steps include skew (SKW), the signal to noise ratio (SNR), and gender
(1=male, 2=female) are stored in the \verb+phenoData+ slot.  As for
other \Rclass{eSet} extensions, the \verb+$+ method can be used to
extract these summaries.  For \Rclass{CNSet} objects generated by
approach (1), these elements are of the class


The `[' methods without arguments can be used to coerce to a vector.

c("male", "female")[exampleSet$gender[]]


The scan date of the arrays are stored in the \verb+protocolData+.


\section{Trouble shooting with a HapMap example}

This section uses an object of class \Rclass{CNSet} instantiated by
the \verb+AffyGW+ vignette and saved to a local path
on our computing cluster indicated by the object \Robject{outdir}
below.  The \verb+copynumber+ vignette was used to fill out the
\verb+batchStatistics+ slot of the \Robject{cnSet} object.

if(getRversion() < "2.13.0"){
	rpath <- getRversion()
} else rpath <- "trunk"
outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="")

Next, we load the \Robject{cnSet} object.

if(!exists("cnSet")) load(file.path(outdir, "cnSet.rda"))

\subsection{Missing values}

Most often, missing values occur when the genotype confidence scores
for a SNP were below the threshold used by the
\Robject{crlmmCopynumber} function. For the HapMap analysis, we used a
confidence threshold of 0.80 (the default).  In the following code, we
assess NA's appearing for the raw copy number estimates for the first
10 samples.

GT.CONF.THR <- 0.80
autosome.index <- which(isSnp(cnSet) & chromosome(cnSet) < 23)
sample.index <- 1:10
ct <- totalCopynumber(cnSet, i=autosome.index, j=sample.index)
ca <- CA(cnSet, i=autosome.index, j=sample.index)
cb <- CB(cnSet, i=autosome.index, j=sample.index)
missing.ca <- is.na(ca)
missing.cb <- is.na(cb)
(nmissing.ca <- sum(missing.ca))
(nmissing.cb <- sum(missing.cb))
identical(nmissing.ca, nmissing.cb)

If \Robject{nmissing.ca} is nonzero, check the genotype confidence
scores provided by the crlmm genotyping algorithm against the
threshold specified in \Robject{crlmmCopynumber}.

if(nmissing.ca > 0){
	gt.conf <- i2p(snpCallProbability(cnSet)[autosome.index, sample.index])
	below.thr <- gt.conf < GT.CONF.THR
	index.allbelow <- as.integer(which(rowSums(below.thr) == length(sample.index)))
	nmissingBecauseOfGtThr <- length(index.allbelow) * length(sample.index)
	stopifnot(identical(nmissingBecauseOfGtThr, nmissing.ca))
	## or calculate the proportion of missing effected by low crlmm confidence
	length(index.allbelow) * length(sample.index)/nmissing.ca

One could inspect the cluster plots for the 'low confidence' calls.


We repeat the above check for missing values at polymorphic loci on
chromosome X.  In this case, we compare the \Robject{rowSums} of the
missing values to the number of samples to check whether all of the
estimates are missing for a given SNP.

## start with first batch
sample.index <- which(batch(cnSet)==batch(cnSet)[1])
X.index <- which(isSnp(cnSet) & chromosome(cnSet) == 23)
ca.X <- CA(cnSet, i=X.index, j=sample.index)
missing.caX <- is.na(ca.X)
(nmissing.caX <- sum(missing.caX))
missing.snp.index <- which(rowSums(missing.caX) == length(sample.index))
index <- which(rowSums(missing.caX) == length(sample.index))

From the above codechunk, we see that \Sexpr{length(index)} SNPs have
NAs for all the samples. Next, we tally the number of NAs for
polymorphic markers on chromosome X that are below the confidence
threshold.  For the HapMap analysis, all of the missing values arose
from SNPs in which either the men or the women had confidence scores
that were all below the threshold.

F <- which(cnSet$gender[sample.index] == 2)
M <- which(cnSet$gender[sample.index] == 1)
gt.conf <- i2p(snpCallProbability(cnSet)[X.index, sample.index])
below.thr <- gt.conf < GT.CONF.THR
index.allbelowF <- as.integer(which(rowSums(below.thr[, F]) == length(F)))
index.allbelowM <- as.integer(which(rowSums(below.thr[, M]) == length(M)))
all.equal(index.allbelowF, index.allbelowM)
all.equal(as.integer(index.allbelowF), as.integer(index))

For nonpolymorphic loci, the genotype confidence scores are irrelevant
and estimates are available at most markers.

np.index <- which(!isSnp(cnSet) & chromosome(cnSet)==23)
ca.F <- CA(cnSet, i=np.index, j=F)
ca.M <- CA(cnSet, i=np.index, j=M)
## NAs for one marker
ca.F <- ca.F[-match("CN_974939", rownames(ca.F)), ]
ca.M <- ca.M[-match("CN_974939", rownames(ca.M)), ]

%TODO: marker CN\_974939 has NAs for the normalized intensities.  This
%is because CN\_974939 is not in the \Robject{npProbesFid} file in
%\Rpackage{genomewidesnp6Crlmm}.  The \Robject{npProbesFid} file should
%be updated in the next \Rpackage{genomewidesnp6Crlmm} release.

In total, there were \Sexpr{length(missing.snp.index)} polymorphic
markers on chromosome X for which copy number estimates are not
available.  Lowering the confidence threshold would permit estimation
of copy number at most of these loci.  A confidence threshold is
included as a parameter for the copy number estimation as an approach
to reduce the sensitivity of genotype-specific summary statistics,
such as the within-genotype median, to intensities from samples that
do not clearly fall into one of the biallelic genotype clusters. There
are drawbacks to this approach, including variance estimates that can
be a bit optimistic at some loci.  More direct approaches for outlier
detection and removal may be explored in the future.

Copy number estimates for other chromosomes, such as mitochondrial and
chromosome Y, are not currently available in \crlmm{}.


\section{Session information}
<<sessionInfo, results=tex>>=

%  \bibliographystyle{plain}
%  \bibliography{refs}