%\VignetteIndexEntry{crlmm copy number Vignette}
%\VignetteDepends{crlmm, genomewidesnp6Crlmm}
%\VignetteKeywords{crlmm, SNP 6}
\newcommand{\oligo}{\Rpackage{oligo }}

\title{Copy number estimation and genotype calling with \Rpackage{crlmm}}
\author{Rob Scharpf}

<<setup, echo=FALSE, results=hide>>=
options(continue=" ", width=70)

%\section{Estimating copy number}

%At present, software for copy number estimation is provided only for the
%Affymetrix 6.0 platform.

  This vignette estimates copy number for HapMap samples on the
  Affymetrix 6.0 platform.  See \citep{Scharpf2010} for additional


\section{Copy number estimation}

\subsection{Set up}


Several genotyping platforms are currently supported.  Supported
platforms must have a corresponding annotation package (installed
separately) that are listed below.

pkgs <- annotationPackages()
pkgs <- pkgs[grep("Crlmm", pkgs)]

Note that 'pd.genomewidesnp6' and 'genomewidesnp6Crlmm' are both
annotation packages for the Affymetrix 6.0 platform available on
Bioconductor, but the 'genomewidesnp6Crlmm' annotation package must be
used for copy number estimation.  The annotation package is specified
through the 'cdfName' -- the identifier without the 'Crlmm' postfix.
In the following code, we specify the cdf name for Affymetrix 6.0,
provide the complete path to the CEL files, and indicate where
intermediate files from the copy number estimation are to be saved.

cdfName <- "genomewidesnp6"
pathToCels <- "/thumper/ctsa/snpmicroarray/hapmap/raw/affy/1m"
if(getRversion() < "2.13.0"){
	rpath <- getRversion()
} else rpath <- "trunk"
outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="")
dir.create(outdir, recursive=TRUE, showWarnings=FALSE)

All long computations are saved in the output directory
\Robject{outdir}.  Users should change these variables as appropriate.
The following code chunk should fail unless the above arguments have
been set appropriately.

if(!file.exists(outdir)) stop("Please specify a valid directory for storing output")
if(!file.exists(pathToCels)) stop("Please specify the correct path to the CEL files")

Processed data from codechunks requiring long computations are saved
to disk by wrapping function calls in the \Robject{checkExists}
function.  After running this vignette as a batch job, subsequent
calls to \verb@Sweave@ will load the saved computations from disk. See
the \Rfunction{checkExists} help file for additional details.

\subsection{Preprocessing and genotyping.}

In the following code chunk, we provide the complete path to the
Affymetrix CEL files and define a 'batch' variable. The
\Robject{batch} variable will be used to initialize a container for
storing the normalized intensities, the genotype calls, and the
parameter estimates for copy number.  Often the chemistry plate or the
scan date of the array is a useful surrogate for batch. For the HapMap
CEL files in our analysis, the CEPH (C) and Yoruban (Y) samples were
prepared on separate chemistry plates.  In the following code chunk,
we extract the population identifier from the CEL file names and
assign these identifiers to the variable \Robject{batch}.

celFiles <- list.celfiles(pathToCels, full.names=TRUE, pattern=".CEL")
celFiles <- celFiles[substr(basename(celFiles), 13, 13) %in% c("C", "Y")]
batch <- as.factor(substr(basename(celFiles), 13, 13))

The preprocessing steps for copy number estimation includes quantile
normalization of the raw intensities for each probe and a step that
summarizes the intensities of multiple probes at a single locus.  For
example, the Affymetrix 6.0 platform has 3 or 4 identical probes at
each polymorphic locus and the normalized intensities are summarized
by a median.  For the nonpolymorphic markers on Affymetrix 6.0, only
one probe per locus is available and the summarization step is not

After preprocessing the arrays, the \crlmm{} package estimates the
genotype and provides a confidence score at each polymorphic locus.
Unless the dataset is small (e.g., fewer than 50 samples), we suggest
installing and loading the \R{} package \Rpackage{ff} to reduce the
RAM required for preprocessing and genotyping.  Loading the
\Rpackage{ff} package at this point will automatically enable large
data support (LDS).

The function \Rfunction{genotype} checks to see whether the
\Rpackage{ff} is loaded.  If loaded, the normalized intensities and
genotype are stored as \Robject{ff} objects on disk.  Otherwise, the
genotypes and normalized intensities are stored in matrices.  A word
of caution: the \Rfunction{genotype} function without \Rpackage{ff}
requires a potentially large amount of RAM.  A more RAM-friendly
approach to preprocessing and genotyping requires the \Rpackage{ff}
package.  In particular, the functions \Rfunction{ocProbesets} and
\Rfunction{ocSamples} can be used to manage how many probesets and
samples are to processed at a time and can therefore be used to fine
tune the needed RAM for a particular job.  The function
\Rfunction{ldPath} indicates that \Robject{ff} objects will be stored
in the directory \Robject{outdir}.


With LDS enabled, we preprocess and genotype 180 samples from the CEPH
and Yoruban populations.  Users interested only in the genotypes
should instead use the \R{} function \Rfunction{crlmm} or
\Rfunction{crlmm2}. We wrap the call to \Rfunction{genotype} in
\Rfunction{checkExists} so that subsequent calls to \verb@Sweave@ can
be run interactively.

if(!file.exists(file.path(outdir, "cnSet.rda"))){
	gtSet <- checkExists("gtSet",

The value returned by genotype is an instance of the class
\Robject{CNSet}.  In addition to the normalization and genotyping, the
\Robject{genotype} function initializes a container that will store
summary statistics for the batches and parameters needed for copy
number estimation.  At this point, the batch summaries and parameters
for copy number are all NA's.

\subsection{Copy number estimation.}

The \Rfunction{crlmmCopynumber} performs the following steps:
\item computes summary statistics for each batch
\item imputes unobserved genotype centers (for each batch)
\item shrinks the within-genotype variances
\item estimates parameters for allele-specific copy number

  With \texttt{verbose=TRUE}, the above steps for CN estimation are
  displayed during the processing.

cnSet <- checkExists("cnSet", .path=outdir, .FUN=crlmmCopynumber, object=gtSet)

In an effort to reduce I/O, the \Rpackage{crlmmCopynumber} function no
longer stores the allele-specific estimates of copy number as part of
the object.  Rather, several functions are available that will compute
relatively quickly the allele-specific estimates from the stored
normalized intensities and the linear model parameters. At allele $k$,
marker $i$, sample $j$, and batch $p$, the estimate of allele-specific
copy number is computed by subtracting the estimated background from
the observed intensity and scaling by the slope coefficient.
\newcommand{\A}{A} \newcommand{\B}{B}
{\hat c}_{k,ijp} &=& \mbox{max}\left\{\frac{1}{{\hat
    \phi}_{k,ip}}\left(I_{k,ijp}-{\hat \nu}_{k,ip}\right), ~0\right\}
\mbox{~for~} k \in \{\A, \B\}.
\noindent See \cite{Scharpf2010} for details.

For large datasets, the above computation will not be instantaneous as
the I/O can be substantial.  The functions \Rfunction{CA},
\Rfunction{CB}, and \Rfunction{totalCopynumber} should be used to
extract CN estimates from the \Robject{CNSet} container.  The
following code chunks illustrate several examples, as well as some of
the useful accessors for extracting which markers are SNPs, which are
chromosomes, etc.

snp.index <- which(isSnp(cnSet) & !is.na(chromosome(cnSet)))
ca <- CA(cnSet, i=snp.index, j=1:5)
cb <- CB(cnSet, i=snp.index, j=1:5)
ct <- ca+cb

Alternatively, total copy number can be obtained by
ct2 <- totalCopynumber(cnSet, i=snp.index, j=1:5)
stopifnot(all.equal(ct, ct2))

At nonpolymorphic loci, either the \Rfunction{CA} or
\Rfunction{totalCopynumber} functions can be used to obtain estimates
of total copy number.

marker.index <- which(!isSnp(cnSet))
ct <- CA(cnSet, i=marker.index, j=1:5)
## all zeros
stopifnot(all(CB(cnSet, i=marker.index, j=1:5) == 0))
ct2 <- totalCopynumber(cnSet, i=marker.index, j=1:5)
stopifnot(all.equal(ct, ct2))

 Nonpolymorphic markers on chromosome X:

<<nonpolymorphicX, fig=TRUE, width=8, height=4>>=
npx.index <- which(chromosome(cnSet)==23 & !isSnp(cnSet))
M <- sample(which(cnSet$gender==1), 5)
F <- sample(which(cnSet$gender==2), 5)
cn.M <- CA(cnSet, i=npx.index, j=M)
cn.F <- CA(cnSet, i=npx.index, j=F)
boxplot(data.frame(cbind(cn.M, cn.F)), pch=".", col="grey60", outline=FALSE)
\caption{Copy number estimates for nonpolymorphic loci on chromosome
  X (5 men, 5 women).  crlmm assumes that the median copy number
  across samples at a given marker on X is 1 for men and 2 for women.}

  \caption{Copy number estimates for nonpolymorphic loci on chromosome
    X (5 men, 5 women).  crlmm assumes that the median copy number
    across samples at a given marker on X is 1 for men and 2 for women.}

<<polymorphicX, fig=TRUE, width=8, height=4>>=
## copy number estimates on X for SNPs are biased towards small values.
X.markers <- which(isSnp(cnSet) & chromosome(cnSet) == 23)
ca.M <- CA(cnSet, i=X.markers, j=M)
cb.M <- CB(cnSet, i=X.markers, j=M)
ca.F <- CA(cnSet, i=X.markers, j=F)
cb.F <- CB(cnSet, i=X.markers, j=F)
cn.M <- ca.M+cb.M
cn.F <- ca.F+cb.F
boxplot(data.frame(cbind(cn.M, cn.F)), pch=".", outline=FALSE, col=cols, xaxt="n", ylim=c(0,5))
legend("topleft", fill=unique(cols), legend=c("Male", "Female"))
cn2 <- totalCopynumber(cnSet, i=X.markers, j=c(M, F))
stopifnot(all.equal(cbind(cn.M, cn.F), cn2))
\caption{Copy number estimates for polymorphic markers on chromosome
  X. crlmm assumes that the median copy number across samples at a
  given marker on X is 1 for men and 2 for women.}

Accessors for physical position and chromosome are also provided. In
the following codechunk we extract the position and chromosome for the
first 10 markers in the \Robject{cnSet} object.


\section{The CNSet container}

The objects returned by the \Rfunction{genotype} and
\Rfunction{crlmmCopynumber} have assay data elements that are pointers
to \Robject{ff} objects stored in the directory \Robject{outdir}.  Had
we not loaded the \Rpackage{ff} prior to running these functions, the
\Robject{AssayData} elements would be ordinary matrices, though the
RAM required for running the algorithm would be substantial. The
functions \Rfunction{open} and \Rfunction{close} open and close the
connections to the \Robject{assayData} elements. Subsetting an
\Robject{ff} object pulls the data from disk into memory and should be
used with caution. In particular, subsetting the \Robject{gtSet} would
subset each element in the \Robject{assayData} slot, returning an
object of the same class but with \Robject{assayData} elements that
are matrices.  Such an operation can be exceedingly slow when
performed over a network and require subsantial RAM.  The preferred
approach is to extract only the assay data element that is needed. In
the example below, we extract the genotype calls for the the first 50

print(object.size(cnSet), units="Mb")
gt <- calls(cnSet)[, 1:50]

The \Robject{CNSet} class also contains the slot
\Robject{batchStatistics} that contains batch-specific summaries
needed for copy number estimation. In particular, each element is a
matrix (or an ff object) with R rows and C columns, correspoinding to
R markers and C batches.  The summaries includes the within genotype
cluster medians and median absolute deviations (mads), but also
parameters estimated from the linear model.  (For unobserved
genotypes, the medians are imputed and the variance is obtained the
median variance (across markers) within a batch. ) The elements of the
slot can be listed as follows.


Note that for the Affymetrix 6.0 platform the assay data elements each
have a row dimension corresponding to the total number of polymorphic
and nonpolymorphic markers interrogated by the Affymetrix 6.0
platform.  A consequence of keeping the rows of the assay data
elements the same for all of the statistical summaries is that the
matrix used to store genotype calls is larger than necessary.  Also,
note the additional overhead of some operations when using
\Robject{ff} objects.  For instance, the posterior probabilities for
the CRLMM genotype calls are represented as integers. The accessor
\Robject{snpCallProbability} can be used to access these confidence
scores.  When stored as matrices, converting the integer
representation back to the probability scale is straightforward as
shown below.  However, for the \Robject{ff} objects we must first
convert the ff object to a matrix. One could use the function
\Rfunction{[,]} but this could be slow and require a lot of RAM
depending on the size of the dataset. We suggest pulling only the
needed rows and columns from memory. In the following example, we
convert the integer scores to probabilities for the CEPH samples.  As
genotype confidence scores are not applicable to the nonpolymorphic
markers, we extract only the polymorphic markers using the
\Rfunction{isSnp} function.

rows <- which(isSnp(cnSet))
cols <- which(batch == "C")
posterior.prob <- tryCatch(i2p(snpCallProbability(cnSet)),
			   error=function(e) print("This will not work for an ff object."))

Accessors for the quantile normalized intensities for the A allele at
polymorphic loci:

snp.index <- which(isSnp(cnSet))
np.index <- which(!isSnp(cnSet))
a <- (A(cnSet))[snp.index, ]

The extra set of parentheses surrounding \Robject{A(cnSet2)} above is
added to emphasize the appropriate order of operations. Subsetting the
entire \Robject{cnSet} object in the following, unevaluated codechunk
should be avoided for large datasets.

a <- A(cnSet[snp.index, ])

The quantile-normalized intensities for nonpolymorphic loci are
obtained by:

npIntensities <- (A(cnSet))[np.index, ]

Quantile normalized intensities for the B allele at polymorphic loci:

b.snps <- (B(cnSet))[snp.index, ]

Note that NA's are stored in the slot for normalized 'B' allele

all(is.na(B(cnSet)[np.index, ]))

<<clean, echo=FALSE>>=
rm(b.snps, a, npIntensities); gc()

\subsection{Other accessors}

Information on physical position and chromosome can be accessed as follows:

xx <- position(cnSet)
yy <- chromosome(cnSet)

Parameters from the linear model used to estimate copy number are
stored in the slot \Robject{batchStatistics}.

TODO: Describe accessors for batch-level summaries.

\section{Suggested visualizations}


A histogram of the signal to noise ratio for the HapMap samples:

<<plotSnr, fig=TRUE, include=FALSE>>=
hist(cnSet$SNR[, ], xlab="SNR", main="", breaks=25)
  \caption{Signal to noise ratios for the HapMap samples. SNRs below 5
   for the Affymetrix platform are often samples of lower quality.
   Such samples will tend to have much more variable estimates of copy

\paragraph{One sample at a time: locus-level estimates}

Figure \ref{fig:oneSample} plots physical position (horizontal axis)
versus copy number (vertical axis) for the first sample.  There is
less information to estimate copy number at nonpolymorphic loci;
improvements to the univariate prediction regions at nonpolymorphic
loci are a future area of research. If the \Rpackage{SNPchip} is
available, an idiogram can be added to the existing plotting
coordinates as indicated in the following example.

<<oneSample, fig=TRUE, width=8, height=4, include=FALSE>>=
marker.index <- which(chromosome(cnSet) == 1)
cn <- totalCopynumber(cnSet, i=marker.index, j=1)
x <- position(cnSet)[marker.index]
par(las=1, mar=c(4, 5, 4, 2))
plot(x, cn, pch=".",
     cex=2, xaxt="n", col="grey60", ylim=c(0,6),
     ylab="copy number", xlab="physical position (Mb)",
     main=paste(sampleNames(cnSet)[1], ", CHR: 1"))
axis(1, at=pretty(x), labels=pretty(x)/1e6)
invisible(plotCytoband(1, new=FALSE, cytoband.ycoords=c(5.5, 6), label.cytoband=FALSE))

  \caption{\label{fig:oneSample} Total copy number (y-axis) for
    chromosome 1 plotted against physical position (x-axis) for one
    sample.  Estimates at nonpolymorphic loci are plotted in light

\paragraph{One SNP at a time}

Scatterplots of the A and B allele intensities (log-scale) can be
useful for assessing the biallelic genotype calls.  This section of
the vignette is currently under development.

\section{Reasons for missing values}

There are several reasons for estimates of the allele-specific copy
number to have missing values (\texttt{NA}'s). This section briefly
elaborates on the source of missing values in the HapMap analysis and
discusses possible alternatives to reduce the number of missing
values.  Note that allele-specific copy number, 'CA' and 'CB', is not
saved in the \Robject{cnSet} object.  Rather, the respective accessors
calculate 'CA' and 'CB' on the fly from the normalized intensities and
from the marker-specific parameter estimates in the linear model.  In
general, a missing value arises when the background or slope parameter
was not estimated in the linear model. Most often, missing values
occur when the genotype confidence scores for a SNP were below the
threshold used by the \Robject{crlmmCopynumber} function. For the
HapMap analysis, we used a confidence threshold of

autosome.index <- which(isSnp(cnSet) & chromosome(cnSet) < 23)
sample.index <- which(batch(cnSet) == "C")
ca <- CA(cnSet, i=autosome.index, j=sample.index)
missing.ca <- is.na(ca)
(nmissing.ca <- sum(missing.ca))

If \Robject{nmissing.ca} is nonzero, then one could check the genotype
confidence scores provided by the crlmm genotyping algorithm against
the threshold specified in \Robject{crlmmCopynumber}.

if(nmissing.ca > 0){
	gt.conf <- i2p(snpCallProbability(cnSet)[autosome.index, sample.index])
	below.thr <- gt.conf < GT.CONF.THR
	index.allbelow <- as.integer(which(rowSums(below.thr) == length(sample.index)))
	all.equal(index, index.allbelow)
	## or calculate the proportion of missing effected by low crlmm confidence
	sum(index.allbelow %in% index)/length(index)

We repeat the above check for missing values at polymorphic loci on
chromosome X.  In this case, we compare the \Robject{rowSums} of the
missing values to the number of samples to check whether all of the
estimates are missing for a given SNP.

X.index <- which(isSnp(cnSet) & chromosome(cnSet) == 23)
ca.X <- CA(cnSet, i=X.index, j=sample.index)
missing.caX <- is.na(ca.X)
(nmissing.caX <- sum(missing.caX))
missing.snp.index <- which(rowSums(missing.caX) == length(sample.index))
index <- which(rowSums(missing.caX) == length(sample.index))

From the above codechunk, we see that \Sexpr{length(index)} SNPs have
NAs for all the samples. Next, we tally the number of NAs for
polymorphic markers on chromosome X that are below the confidence
threshold.  For the HapMap analysis, all of the missing values arose
from SNPs in which either the men or the women had confidence scores
that were all below the threshold.

if(nmissing.caX > 0){
	gt.conf <- i2p(snpCallProbability(cnSet)[X.index, sample.index])
	below.thr <- gt.conf < GT.CONF.THR
	F <- which(cnSet$gender[sample.index] == 2)
	M <- which(cnSet$gender[sample.index] == 1)
	index.allbelowF <- as.integer(which(rowSums(below.thr[, F]) == length(F)))
	index.allbelowM <- as.integer(which(rowSums(below.thr[, M]) == length(M)))
	index.allbelow <- as.integer(unique(c(index.allbelowF, index.allbelowM)))
	all.equal(index, index.allbelow)
	## or calculate the proportion of missing effected by low crlmm confidence
	sum(index.allbelow %in% index)/length(index)

Next, we verify that the number of missing values is the same for the
'B' allele at autosomal polymorphic loci

cb <- CB(cnSet, i=autosome.index, j=sample.index)
missing.cb <- is.na(cb)

For nonpolymorphic loci, the genotype confidence scores are irrelevant
and estimates are available at most markers.

np.index <- which(!isSnp(cnSet) & chromosome(cnSet)==23)
ca.F <- CA(cnSet, i=np.index, j=F)
ca.M <- CA(cnSet, i=np.index, j=M)
## NAs for one marker
ca.F <- ca.F[-match("CN_974939", rownames(ca.F)), ]
ca.M <- ca.M[-match("CN_974939", rownames(ca.M)), ]

TODO: marker CN\_974939 has NAs for the normalized intensities.  This
is because CN\_974939 is not in the \Robject{npProbesFid} file in
\Rpackage{genomewidesnp6Crlmm}.  The \Robject{npProbesFid} file should
be updated in the next \Rpackage{genomewidesnp6Crlmm} release.

In total, there were \Sexpr{length(missing.snp.index)} polymorphic
markers on chromosome X for which copy number estimates are not
available.  Lowering the confidence threshold would permit estimation
of copy number at most of these loci.  A confidence threshold is
included as a parameter for the copy number estimation as an approach
to reduce the sensitivity of genotype-specific summary statistics,
such as the within-genotype median, to intensities from samples that
do not clearly fall into one of the biallelic genotype clusters. There
are drawbacks to this approach, including variance estimates that can
be a bit optimistic at some loci.  More direct approaches for outlier
detection and removal may be explored in the future.

Copy number estimates for other chromosomes, such as mitochondrial and
chromosome Y, are not currently available in \crlmm{}.


\section{Session information}
<<sessionInfo, results=tex>>=