%\VignetteIndexEntry{crlmm copy number Vignette}
%\VignetteDepends{crlmm, genomewidesnp6Crlmm}
%\VignetteKeywords{crlmm, SNP 6}
\newcommand{\oligo}{\Rpackage{oligo }}

\title{Copy number estimation and genotype calling with \Rpackage{crlmm}}
\author{Rob Scharpf}

<<setup, echo=FALSE, results=hide>>=
options(continue=" ", width=70)

%\section{Estimating copy number}

%At present, software for copy number estimation is provided only for the
%Affymetrix 6.0 platform.

  This vignette estimates copy number for HapMap samples on the
  Affymetrix 6.0 platform.  See \citep{Scharpf2010} for the working


\section{Simple usage}

CRLMM supports the following platforms:

cdfName <- "genomewidesnp6"

\paragraph{Preprocess and genotype.}

In the following code chunk, we provide the complete path to the
Affymetrix CEL files, assign a path to store the normalized
intensities and genotype data, and define a 'batch' variable. The
batch variable will later be useful when we estimate copy number.  We
typically use the 96 well chemistry plate or the scan date of the
array as a surrogate for batch. Adjusting by date or chemistry plate
can be helpful for limiting the influence of batch effects.

celFiles <- list.celfiles("/thumper/ctsa/snpmicroarray/hapmap/raw/affy/1m", full.names=TRUE, pattern=".CEL")
outdir <- "/thumper/ctsa/snpmicroarray/rs/data/hapmap/crlmmVignette/release/affy"
dir.create(outdir, showWarnings=FALSE)
## CEPH and Yoruban
batch <- substr(basename(celFiles), 13, 13)
celFiles <- celFiles[batch %in% c("C", "Y")]
batch <- batch[batch %in% c("C", "Y")]
stopifnot(length(batch) == length(celFiles))

The preprocessing steps for copy number estimation includes quantile
normalization of the polymorphic and nonpolymorphic markers and a
summarization step whereby the quantile normalized intenities are
summarized for each locus.  As the markers at polymorphic loci on the
Affymetrix 6.0 platform are identical, we summarize the intensities by
the median. For the nonpolymorphic markers, no summarization step is
performs.  Next, the \crlmm{} package provides a genotype call and a
confidence score at each polymorphic locus.  Unless the dataset is small
(e.g., fewer than 50 samples), we suggest installing and loading the
\R{} package \Rpackage{ff}.  The function \Rfunction{genotype} checks to
see whether the \Rpackage{ff} is loaded.  If loaded, the normalized
intensities and genotype are stored as \Robject{ff} objects on disk.
Otherwise, the genotypes and normalized intensities are stored in
matrices.  A word of caution: the \Rfunction{genotype} function requires
a potentially large amount of RAM even when the \R{} package
\Rpackage{ff} is loaded.  Users with large datasets may want to skip
this part.  For the 180 HapMap samples processed in this vignette, the
\Rfunction{genotype} function required 25MB RAM. The next two code
chunks should be submitted as batch jobs.

	if(!file.exists(file.path(outdir, "cnSet.rda"))){
		cnSet <- genotype(celFiles[1:50], cdfName=cdfName, copynumber=TRUE, batch=batch[1:50])
		save(cnSet, file=file.path(outdir, "cnSet.rda"))
	} else{
		message("Loading cnSet container")
		load(file.path(outdir, "cnSet.rda"))

A more memory efficient approach to preprocessing and genotyping is
implemented in the \R{} function \Rfunction{genotype2}.  The arguments
to this function are similar to \Rfunction{genotype} but requires
explicit loading of the \R{} package \Rpackage{ff} prior to calling
the function. The functions \Rfunction{ocProbesets} and
\Rfunction{ocSamples} can be used to manage how many probesets and
samples are to be loaded into memory and processed.  Using the default
settings, the following code required 1.9Mb RAM to process 180 CEL

ld.path <- file.path(outdir, "ffobjs")
if(!file.exists(ld.path)) dir.create(ld.path)
	if(!file.exists(file.path(outdir, "cnSet2.rda"))){
		cnSet2 <- genotype2(celFiles, cdfName=cdfName, copynumber=TRUE, batch=batch)
		save(cnSet2, file=file.path(outdir, "cnSet2.rda"))
	} else{
		message("Loading cnSet2 container")
		load(file.path(outdir, "cnSet2.rda"))

The objects returned by the \Rfunction{genotype} and
\Rfunction{genotype2} differ in size as the former returns an object
with matrices in the assay data slot, whereas the latter return an
object with pointers to files on disk.  The functions \Rfunction{open}
and \Rfunction{close} can be used to open and close the connections
stored in the assay data of the \Robject{cnSet2} object, respectively.
Subsetting the \Robject{ff} object pulls the data from disk into
memory.  As subsetting operations pull the data from disk to memory,
these operations must be performed with care.  In particular,
subsetting the \Robject{cnSet2} will subset each assay data element
and this can be slow.  The preferred approach would be to extract the
assay data element that is needed, and then subset this function
object as illustrated below for the genotype calls.

print(object.size(cnSet), units="Mb")
print(object.size(cnSet2), units="Mb")
	replicate(5, system.time(gt1 <- calls(cnSet)[, 1:50])[[1]])
	replicate(5, system.time(gt2 <- calls(cnSet2)[, 1:50])[[1]])
gt1 <- calls(cnSet)[, 1:50]
gt2 <- calls(cnSet2)[, 1:50]
all.equal(gt1, gt2)

\noindent With \Robject{ff} objects the additional I/O time required
for extracting data is substantial and will increase for larger
datasets.  Patience is required.

Note that for the Affymetrix 6.0 platform the assay data elements each
have a row dimension corresponding to the total number of polymorphic
and nonpolymorphic markers interrogated by the Affymetrix 6.0
platform.  A consequence of keeping the rows of the assay data
elements the same for all of the statistical summaries is that the
matrix used to store genotype calls is larger than necessary.  Also,
note the additional overhead of some operations when using
\Robject{ff} objects.  For instance, the posterior probabilities for
the CRLMM genotype calls are represented as integers. The accessor
\Robject{snpCallProbability} can be used to access these confidence
scores.  When stored as matrices, converting the integer
representation back to the probability scale is straightforward as
shown below.  However, for the \Robject{ff} objects we must first
bring the data into memory. To bring all the scores into memory, one
could use the function \Rfunction{[,]} but this could be slow and
require a lot of RAM depending on the size of the dataset.  Instead,
we suggest pulling only the needed rows and columns from memory. In
the following example, we convert the integer scores to probabilities
for the CEPH samples.  As genotype confidence scores are not
applicable to the nonpolymorphic markers, we extract only the
polymorphic markers using the \Rfunction{isSnp} function.  We observe
a slight difference in the posterior probabilities.  The difference
arises because \Robject{cnSet2} was processed with a larger number of
samples -- for some SNPs a larger sample size can improve the genotype
calling algorithm as the AA, AB, and BB clusters are better defined.

integerScoreToProbability <- function(x)  1-2^(-x/1000)
rows <- which(isSnp(cnSet))
cols <- 1:50
posterior.prob <- tryCatch(integerScoreToProbability(snpCallProbability),
			    error=function(e) print("This will not work for an ff object."))
posterior.prob1 <- integerScoreToProbability(snpCallProbability(cnSet)[rows, cols])
posterior.prob2 <- integerScoreToProbability(snpCallProbability(cnSet2)[rows, cols])
all.equal(posterior.prob1, posterior.prob2)

Next, we obtain locus-level estimates of copy number by fitting a
linear model to each SNP. A variable named 'batch' must be indicated
in the \Robject{protocolData} of the \Robject{cnSet} object. As the
inverse variance of the within-genotype normalized intensities are
used as weights in the linear model (and hence the design matrix in
the regression changes for each SNP), the time is linear with the
number of markers on the array. Copy number estimation at
nonpolymorphic markers and polymorphic markers with unobserved
genotypes is more difficult. We refer the interested reader to the
technical report \citep{Scharpf2009}.  Again, we peform the copy
number estimation using the matrix version and the ff version in
parallel and encourage users with large datasets to explore the
latter. As with the preprocessing and genotyping, the following code
should be submitted as part of the batch job as it is too slow for
interactive analysis.

<<cn, echo=FALSE, eval=FALSE>>=
##Matrix format
cnSet <- crlmmCopynumber(cnSet)

<<save.cnset, echo=FALSE, eval=FALSE>>=
save(cnSet, file=file.path(outdir, "cnSet.rda"))

<<cn.ff, echo=FALSE, eval=FALSE>>=
rm(cnSet); gc()
##ff objects
system.time(cnSet2 <- crlmmCopynumber2(cnSet2))
save(cnSet2, file=file.path(outdir, "cnSet2.rda"))

<<timings, eval=FALSE>>=
tryCatch(print(paste("Time for matrix version:", time1)), error=function(e) print("timing for CN estimation not available"))
tryCatch(print(paste("Time for ff version:", time2)), error=function(e) print("timing for CN estimation not available"))
rm(cnSet2); gc()


For the remainder of this vignette, we illustrate accessors and
visualizations using the sample object provided in the
\Rpackage{crlmm} package.

x <- sample.CNSetLM

\subsection{Quantile-normalized intensities}

Accessors for the quantile normalized intensities for the A allele at
polymorphic loci:

snp.index <- which(isSnp(x))
np.index <- which(!isSnp(x))
a <- (A(x))[snp.index, ]

The extra set of parentheses surrounding \Robject{A(cnSet2)} above is
added to emphasize the appropriate order of operations. Subsetting the
entire \Robject{x} object in the following code should be avoided for
large datasets.

a <- A(x[snp.index, ])

The quantile-normalized intensities for nonpolymorphic loci are obtained

npIntensities <- (A(x))[np.index, ]

Quantile normalized intensities for the B allele at polymorphic loci:

b.snps <- (B(x))[snp.index, ]

Note that NAs are recorded in the 'B' assay data element for
nonpolymorphic loci:

all(is.na((B(x))[np.index, ]))

<<clean, echo=FALSE>>=
rm(b.snps, a, npIntensities); gc()

\paragraph{\Robject{SnpSet}: Genotype calls and confidence scores}

Genotype calls:
genotypes <- (snpCall(x))[snp.index, ]
Confidence scores of the genotype calls:
genotypeConf <- integerScoreToProbability(snpCallProbability(x)[snp.index[1:10], ])

\paragraph{\Robject{CopyNumberSet}: allele-specific copy number}

Allele-specific copy number at polymorphic loci:
ca <- CA(x[snp.index, ])/100

Total copy number at nonpolymorphic loci:
cn.nonpolymorphic <- CA(x[np.index, ])/100

Total copy number at both polymorphic and nonpolymorphic loci:
path <- system.file("scripts", package="crlmm")
source(file.path(path, "helperFunctions.R"))
cn <- totalCopyNumber(x, i=c(snp.index, np.index))
apply(cn, 2, median, na.rm=TRUE)

(Note: the accessor totalCopyNumber method is sourced from a few
helper functions. Documentation will be available in the devel version
of the \Rpackage{oligoClasses} package.)

\subsection{Other accessors}

Information on physical position and chromosome can be accessed as follows:

xx <- position(x)
yy <- chromosome(x)

Parameters from the linear model used to estimate copy number are
stored in the slot \Robject{lM}.


\section{Suggested visualizations}


A histogram of the signal to noise ratio for the HapMap samples:

<<plotSnr, fig=TRUE, include=FALSE>>=
hist(x$SNR, xlab="SNR", main="", breaks=25)

  \caption{Signal to noise ratios for the HapMap samples.}

\paragraph{One sample at a time: locus-level estimates}

Figure \ref{fig:oneSample} plots physical position (horizontal axis)
versus copy number (vertical axis) for the first sample.  There is less
information to estimate copy number at nonpolymorphic loci; improvements
to the univariate prediction regions at nonpolymorphic loci are a future
area of research.

<<oneSample, fig=TRUE, width=8, height=4, include=FALSE>>=
par(las=1, mar=c(4, 5, 4, 2))
plot(position(x), copyNumber(x)[, 1]/100, pch=21,
     cex=0.4, xaxt="n", col="grey20", ylim=c(0,5),
     ylab="copy number", xlab="physical position (Mb)",
     main=paste(sampleNames(x)[1], ", CHR:", unique(chromosome(x))))
points(position(x)[!isSnp(x)], copyNumber(x)[!isSnp(x), 1]/100,
       pch=21, cex=0.4, col="lightblue", bg="lightblue")
axis(1, at=pretty(range(position(x))), labels=pretty(range(position(x)))/1e6)

<<idiogram, eval=FALSE, echo=FALSE>>=
plotCytoband(22, new=FALSE, cytoband.ycoords=c(3.8, 4), label.cytoband=FALSE)

  \caption{\label{fig:oneSample} Total copy number (y-axis) for
    chromosome 1 plotted against physical position (x-axis) for one
    sample.  Estimates at nonpolymorphic loci are plotted in light

%<<overlayHmmPredictions, fig=TRUE, include=FALSE>>=
%ask <- FALSE
%op <- par(mfrow=c(3, 1), las=1, mar=c(1, 4, 1, 1), oma=c(3, 1, 1, 1), ask=ask)
%##Put fit on the copy number scale
%cns <- copyNumber(cnSet2)
%cnState <- hmmPredictions - as.integer(1)
%xlim <- c(10*1e6, max(position(cnSet2)))
%cols <- brewer.pal(8, "Dark2")[1:4]
%for(j in 1:3){
%	plot(position(cnSet2), cnState[, j], pch=".", col=cols[2], xaxt="n",
%	     ylab="copy number", xlab="Physical position (Mb)", type="s", lwd=2,
%	     ylim=c(0,6), xlim=xlim)
%	points(position(cnSet2), cns[, j], pch=".", col=cols[3])
%	lines(position(cnSet2), cnState[,j], lwd=2, col=cols[2])
%	axis(1, at=pretty(position(cnSet)),
%	     labels=pretty(position(cnSet))/1e6)
%	abline(h=c(1,3), lty=2, col=cols[1])
%	legend("topright", bty="n", legend=sampleNames(cnSet)[j])
%	legend("topleft", lty=1, col=cols[2], legend="copy number state",
%	       bty="n", lwd=2)
%	plotCytoband(CHR, cytoband.ycoords=c(5, 5.2), new=FALSE,
%		     label.cytoband=FALSE, xlim=xlim)
%  \includegraphics[width=\textwidth]{copynumber-overlayHmmPredictions}
%  \caption{\label{fig:overlayHmmPredictions} Total copy number (y-axis)
%    for chromosome 22 plotted against physical position (x-axis) for
%    three samples.  Estimates at nonpolymorphic loci are plotted in
%    light blue. }

\paragraph{One SNP at a time}

Scatterplots of the A and B allele intensities (log-scale) can be
useful for assessing the biallelic genotype calls.  This section of
the vignette is currently under development.
% The following code chunk is
%displayed in Figure \ref{fig:prediction}.

<<predictionRegions, fig=TRUE, width=8, height=8, include=FALSE, eval=FALSE, echo=FALSE>>=
i <- snp.index[1]
##trace(plotCNSetLM, browser)
plot(i, x, copynumber=2)
##myScatter <- function(object, add=FALSE, ...){
##	A <- log2(A(object))
##	B <- log2(B(object))
##	if(!add){
##		plot(A, B, ...)
##	} else{
##		points(A, B, ...)
##	}
##index <- which(isSnp(cnSet))[1:9]
##xlim <- ylim <- c(6.5,13)
##par(mfrow=c(3,3), las=1, pty="s", ask=FALSE, mar=c(2, 2, 2, 2), oma=c(2, 2, 1, 1))
##for(i in index){
##	gt <- calls(cnSet)[i, ]
##	if(i != 89){
##		myScatter(cnSet[i, ],
##			  pch=pch,
##			  col=colors[snpCall(cnSet)[i, ]],
##			  bg=colors[snpCall(cnSet)[i, ]], cex=cex,
##			  xlim=xlim, ylim=ylim)
##		mtext("A", 1, outer=TRUE, line=1)
##		mtext("B", 2, outer=TRUE, line=1)
##		crlmm:::ellipse.CNSet(cnSet[i, ], copynumber=2, batch="C", lwd=2, col="black")
##		crlmm:::ellipse.CNSet(cnSet[i, ], copynumber=2, batch="Y", lwd=2, col="grey50")
##	} else {
##		plot(0:1, xlim=c(0,1), ylim=c(0,1), type="n", xaxt="n", yaxt="n")
##		legend("center",
##		       legend=c("CN = 2, CEPH", "CN = 2, Yoruban"),
##		       col=c("black", "grey50"), lwd=2, bty="n")
##	}

%  \centering
%  \includegraphics[width=0.8\textwidth]{copynumber-predictionRegions}
%  \caption{\label{fig:prediction} Scatterplots of A versus B
%    intensities.  Each panel displays a single SNP. The ellipses
%    indicate the 95\% probability region for copy number 2 for the CEPH
%    (black) and Yoruban subjects (grey).}

%\section{Details for the copy number estimation procedure}
%See the technical report \citep{Scharpf2009}.

\section{Trouble shooting}

Suppose that you are only interested in estimating copy number at
autosomal markers and you encountered an error using the
crlmmCopynumber2 function for estimating copy number at nonpolymorphic
markers on chromosome X.

Making a subset of a cnSet2 without pulling all the data from disk to
cnSet <- cnSet2

cnSet <- crlmmCopynumber2(cnSet)

marker.index <- which(chromosome(cnSet) < 23)
nr <- length(marker.index)
nc <- ncol(cnSet)

Set up a new directory for storing ff objects so that we can
completely remove all the ff objects in the old directory when we're


Next, initialize the container using utilities from the oligoClasses package.

CA <- initializeBigMatrix("CA", nr, nc)
CB <- initializeBigMatrix("CB", nr, nc)
A <- initializeBigMatrix("A", nr, nc)
B <- initializeBigMatrix("B", nr, nc)
GT <- initializeBigMatrix("GT", nr, nc)
PP <- initializeBigMatrix("confs", nr, nc)
cnSet.autosomes <- new("CNSetLM",
		       featureData=featureData(cnSet)[marker.index, ],

Next, we need to populate the ff objects with data. The easiest way to
do this without requiring too much RAM is to loop through the
samples. Lets do 5 samples at a time.

sample.index <- splitIndicesByLength(1:ncol(cnSet), 5)
for(j in sample.index){
	A(cnSet.autosomes)[, j] <- A(cnSet)[marker.index, j]
	B(cnSet.autosomes)[, j] <- B(cnSet)[marker.index, j]
	calls(cnSet.autosomes)[, j] <- calls(cnSet)[marker.index, j]
	snpCallProbability(cnSet.autosomes)[, j] <- snpCallProbability(cnSet)[marker.index, j]

Check to see that the data is there.

The subset operation would pull all the data from disk and create a
cnSet object with matrices.  If the dataset is very large, the subset
operation would be very slow and potentially create a large amount of

\section{Session information}
<<sessionInfo, results=tex>>=