inst/scripts/Infrastructure.Rnw
7c5db426
 %\VignetteIndexEntry{Infrastructure for copy number analysis}
16fadf14
 %\VignetteDepends{crlmm, genomewidesnp6Crlmm}
 %\VignetteKeywords{crlmm, SNP 6}
 %\VignettePackage{crlmm}
 \documentclass{article}
 \usepackage{graphicx}
 \usepackage{natbib}
 \usepackage{url}
 \newcommand{\Rfunction}[1]{{\texttt{#1}}}
 \newcommand{\Rmethod}[1]{{\texttt{#1}}}
 \newcommand{\Rcode}[1]{{\texttt{#1}}}
 \newcommand{\Robject}[1]{{\texttt{#1}}}
 \newcommand{\Rpackage}[1]{{\textsf{#1}}}
 \newcommand{\Rclass}[1]{{\textit{#1}}}
 \newcommand{\oligo}{\Rpackage{oligo }}
 \newcommand{\R}{\textsf{R}}
 \newcommand{\crlmm}{\Rpackage{crlmm}}
 \newcommand{\ff}{\Rpackage{ff}}
 \usepackage[margin=1in]{geometry}
 
 \begin{document}
 \title{Infrastructure for copy number analysis in \crlmm{}}
 \date{\today}
 \author{Rob Scharpf}
669c6a90
 
16fadf14
 \maketitle
 
669c6a90
 \begin{abstract}
16fadf14
 
669c6a90
   This vignette provides an overview of the \Rclass{CNSet} class and a
   brief discussion of the underlying infrastructure for large data
e34f6c9e
   support with the \Rpackage{ff} package.  This vignette instantiates
669c6a90
   an object of class \Rclass{CNSet} using a trivial dataset with 3
   files.  As this sample size is too small for estimating copy number
   with the \crlmm{} package, the final section of this vignette loads
   an object created by the analysis of 180 HapMap CEL files
e34f6c9e
   (Affymetrix 6.0 platform).  This object was instantiated by running
   the \verb+AffyGW+ vignette.
16fadf14
 
669c6a90
 \end{abstract}
16fadf14
 
 <<libraries,results=hide>>=
 library(ff)
 library(crlmm)
 @
 
 \section{Supported platforms}
 
 The supported Affymetrix and Infinium platforms are those for which a
 corresponding annotation package is available.  The annotation
 packages contain information on the markers, such as physical position
 and chromosome, as well as pre-computed parameters estimated from
 HapMap used during the preprocessing and genotyping steps. For
 Affymetrix, the 5.0 and 6.0 platforms are supported and the
 corresponding annotation packages are \Rpackage{genomewidesnp5Crlmm}
 and \Rpackage{genomewidesnp6Crlmm}.  Supported Infinium platforms are
 listed in the following code chunk.
 
 <<supportedPlatforms>>=
 pkgs <- annotationPackages()
 crlmm.pkgs <- pkgs[grep("Crlmm", pkgs)]
 crlmm.pkgs[grep("human", crlmm.pkgs)]
 @
 
 \textbf{Large data:} In order to reduce \crlmm{}'s memory footprint
 for copy number estimation, we require the \Rpackage{ff}.  The
 \Rpackage{ff} package provides infrastructure for accessing and
 writing data to disk instead of keeping data in memory.  As the
 functions for preprocessing, genotyping, and copy number estimation do
 not simultaneously require all samples and all probes in memory,
 memory-usage by \crlmm{} can be fine-tuned by reading in and
 processing subsets of the markers and/or samples. The functions
 \Rfunction{ocSamples} and \Rfunction{ocProbesets} in the
 \Rpackage{oligoClasses} package can be used to declare how many
 markers and samples to read at once.  In general, specifying smaller
 values should reduce the RAM required for a particular job.  In
 general, smaller values will increase the run-time. In the following
e34f6c9e
 code-chunk, we declare that \crlmm{} should process 100,000 markers at
 a time (when possible) and 200 samples at a time.  If our dataset
 contained fewer than 200 samples, the \Rfunction{ocSamples} option
16fadf14
 would not have any effect.  One can view the current settings for
 these commands, by typing the functions without an argument.
 
 <<ram>>=
e34f6c9e
 ocProbesets(100e3)
16fadf14
 ocSamples(200)
 @
 
 \section{The \Rclass{CNSet} container}
 
669c6a90
 \subsection{Instantiating an object of class \Rclass{CNSet}}
16fadf14
 
669c6a90
 An object of class \Rclass{CNSet} can be instantiated by one of two
 methods:
16fadf14
 
669c6a90
 \begin{itemize}
16fadf14
 
e34f6c9e
 \item[Approach 1:] during the preprocessing of the raw intensities for
   Illumina and Affymetrix arrays by the the functions
   \Rfunction{constructInf} and \Rfunction{genotype},
   respectively. (The \Rfunction{genotype} calls the function
   \Rfunction{constructAffy} to initialize a \Rclass{CNSet} object for
   Affymetrix platforms.)
16fadf14
 
669c6a90
 \item[Approach 2:] by subsetting an existing \Robject{CNSet} object.
   As per usual, the `[' method can be used to extract a subset of
   markers $i$ as in `[i, ]', a subset of samples $j$ as in `[, j]', or
   a subset of markers $i$ and samples $j$ as in `[i, j]'.
16fadf14
 
669c6a90
 \end{itemize}
16fadf14
 
 There are important differences in the underlying data representation
e34f6c9e
 depending on how the \Rclass{CNSet} object was instantiated.  In
 particular, objects generated by the functions
 \Rfunction{constructInf} and \Rfunction{genotype} store
 high-dimensional data on disk rather than in memory through protocols
 defined in the \R{} package \ff{}.  For instance, the normalized
 intensities and genotype calls in a \Rclass{CNSet}-instance from
 approach (1) are \ff{}-derived objects.  By contrast, when such an
 objected generated by approach (1) is subset by the `[' method, an
 object of the same class is returned but the \Rpackage{ff}-derived
 objects are coerced to ordinary matrices. Note, therefore, that both
 approaches (1) and (2) may involve substantial I/O and that (2) should
 be performed judiciously.
16fadf14
 
669c6a90
 \subsubsection{Approach 1}
 
 To illustrate the first approach, we begin by specifying a local
 directory to store output files and setting the \Rfunction{ldPath}
 function with this path.
 
 <<path>>=
063b3d14
 outdir <- paste("/local_data/r00/crlmm/", getRversion(), "/infrastructure", sep="")
669c6a90
 ldPath(outdir)
063b3d14
 if(!file.exists(outdir)) dir.create(outdir)
16fadf14
 @
 
 
669c6a90
 Next we load the annotation package required, as well as the \R{}
 package \Rpackage{hapmapsnp6} that contains 3 example CEL files.  We
 use the \Rfunction{system.file} function to find the path to the CEL
 files.
 
 <<annotation_metadata>>=
 require(genomewidesnp6Crlmm) & require(hapmapsnp6)
 path <- system.file("celFiles", package="hapmapsnp6")
 celfiles <- list.celfiles(path, full.names=TRUE)
 @
 
 Typically, an object of class \Rclass{CNSet} is instantiated as part
 of the preprocessing and genotyping by calling the function
 \Rfunction{genotype}, as illustrated in the
e34f6c9e
 \verb+AffyGW+ vignette.
669c6a90
 
 <<instantiateToyExample,cache=TRUE>>=
 exampleSet <- genotype(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6")
16fadf14
 @
 
669c6a90
 Several files with \verb+.ff+ extensions now appear in the directory
 indicated by the \Rfunction{ldPath} function.
 
 <<ldpath>>=
 ldPath()
 @
 
 One could also instantiate an object of class \Rclass{CNSet} without
e34f6c9e
 preprocessing/genotyping by calling the exported function
669c6a90
 \Rfunction{constructAffy} directly using the \Rfunction{:::} operator.
 
 <<constructAffy,eval=FALSE>>=
e34f6c9e
 tmp <- constructAffy(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6")
669c6a90
 @
 
 The \Rfunction{show} method provides a concise summary of the
 \Robject{exampleSet} object. Note the class of the elements in the
 \verb+batchStatistics+ and \verb+assayData+ slots is indicated in the
 first line of the summary.
 
 <<show>>=
 invisible(open(exampleSet))
 exampleSet
 @
 
 % We briefly outline some of the unique aspects of the
 % \Rclass{CNSet}-class using in \crlmm{} that may differ from the more
 % standard extensions of the virtual class \Rclass{eSet} defined in
 % the \Rpackage{Biobase} package.
 
 As the \verb+assayData+ elements of the \Robject{exampleSet} object
 are stored on disk rather than in memory, we inspect attributes of the
 elements by first opening the file connection using the
 \Rfunction{open}. For example, in the following code we extract the
 normalized intensities for the A allele by first opening the object
 returned by the \Rfunction{A} function.  The name of the file where
 the data is stored on disk is provided by the
 \Rfunction{filename}. Finally, the normalized intensities can be
 pulled from disk to memory by the `[' method.  It can be useful to
 wrap the `[' method by the \Rfunction{as.matrix} to ensure that the
 output is the desired class.
 
 <<approach1>>=
 invisible(open(exampleSet))
 class(A(exampleSet))
 filename(A(exampleSet))
 as.matrix(A(exampleSet)[1:5, ])
 @
 
 \paragraph{Moving \texttt{*.ff} files} Ideally, the files with
 \verb+.ff+ extension should not be moved.  However, if this is not
 possible, the safest way to move these files is to clone all of the
 \ff{} objects using the \Rfunction{clone}, followed by the
 \Rfunction{delete} function to remove the original files on disk. An
 example of the \Rfunction{delete} function is included at the end of
 the \verb+IlluminaPreprocessCN+ vignette. See the documentation for
 the \Rfunction{clone} and \Rfunction{delete} functions in the \ff{}
 package for additional details.
 
16fadf14
 
 \paragraph{Order of operations:} For \Rclass{CNSet}-instances derived
 by approach (1), users should be mindful of the substantial I/O when
 using accessors to extract data from the class.  For example, the
 following 2 methods would extract identical results, with the latter
 being much more efficient (extra parentheses are added to the second
 operation to emphasize the order of operations):
 
 <<orderOfOperations>>=
 ##inefficient
669c6a90
 ##invisible(open(cnSet))
 A(exampleSet[1:5, ])
16fadf14
 ## preferred
669c6a90
 (A(exampleSet))[1:5, ]
16fadf14
 @
 
 
669c6a90
 \subsubsection{Approach 2: using `['}
 
 Here we instantiate a new object of class \Rclass{CNSet} by applying the
 `[' method to an existing object of class \Rclass{CNSet}.
 
 <<approach2>>=
 cnset.subset <- exampleSet[1:5, ]
 @
 
 Note the class of the \verb+batchStatistics+ and \verb+assayData+
 elements of the \Robject{cnset.subset} object printed in the first
 line of the summary.
 
 <<show.subset>>=
 show(cnset.subset)
 @
 
 \subsection{Slots of class \Rclass{CNSet}}
 \subsubsection{\texttt{featureData}}
16fadf14
 
 Information on physical position, chromosome, and whether the marker
 is a SNP can be accessed through accessors defined for the
 \verb+featureData+.
 
 <<featuredataAccessors>>=
063b3d14
 library(Biobase)
669c6a90
 fvarLabels(exampleSet)
 position(exampleSet)[1:10]
 chromosome(exampleSet)[1:10]
 is.snp <- isSnp(exampleSet)
16fadf14
 table(is.snp)
 snp.index <- which(is.snp)
 np.index <- which(!is.snp)
669c6a90
 chr1.index <- which(chromosome(exampleSet) == 1)
16fadf14
 @
 
669c6a90
 \subsubsection{\texttt{assayData}}
16fadf14
 
 The \verb+assayData+ elements are of the class
 \Rclass{ff\_matrix}/\Rclass{ffdf} or \Rclass{matrix}, depending on how
 the \Rclass{CNSet} object was instantiated.  Elements in the
 \verb+assayData+ environment can be listed using the \Rfunction{ls}
 function.
 
 <<assayData>>=
669c6a90
 ls(assayData(exampleSet))
16fadf14
 @
 
 The normalized intensities for the A and B alleles have names
 \verb+alleleA+ and \verb+alleleB+ and can be accessed by the methods
 \Rfunction{A} and \Rfunction{B}, respectively.  Genotype calls and
 confidence scores can be accessed by \Rfunction{snpCall} and
 \Rfunction{snpCallProbability}, respectively.  Note that the
 confidence scores are represented as integers to reduce the filesize,
 but can be tranlated to the probability scale using the \R{} function
 \Rfunction{i2p}. For example,
 
 <<i2p>>=
669c6a90
 scores <- as.matrix(snpCallProbability(exampleSet)[1:5, 1:2])
16fadf14
 i2p(scores)
 @
 
 Note that for the Affymetrix 6.0 platform the assay data elements each
 have a row dimension corresponding to the total number of polymorphic
 and nonpolymorphic markers interrogated by the Affymetrix 6.0
 platform.  A consequence of keeping the rows of the assay data
 elements the same for all of the statistical summaries is that the
 matrix used to store genotype calls is larger than necessary.
 
669c6a90
 \subsubsection{\texttt{batch} and \texttt{batchStatistics}}
16fadf14
 
 As defined in Leek \textit{et al.} 2010, \textit{Batch effects are
   sub-groups of measurements that have qualitatively different
   behaviour across conditions and are unrelated to the biological or
   scientific variables in a study}.  The \verb+batchStatistics+ slot
 is an environment used to store SNP- and batch-specific summaries,
 such as the sufficient statistics for the genotype clusters and the
 linear model parameters used for copy number estimation.  The
 \verb+batch+ slot is used to store the 'batch name' for each
 array. For small studies in which the samples were processed at
 similar times (e.g., within a month), all the samples can be
 considered to be in the same batch.  For large studies in which the
 samples were processed over several months, users should the scan date
 of the array or the chemistry plate are useful surrogates. The only
 constraint on the \verb+batch+ variable is that it must be a character
 vector that is the same length as the number of samples to be
 processed.  The \verb+batch+ is specified as an argument to the \R{}
 functions \Rfunction{constructInf} and \Rfunction{genotype} that
 instantiate \Rclass{CNSet} objects for the Illumina and Affymetrix
 platforms, respectively.  The \Rfunction{batch} function can be used
 to access the \verb+batch+ information on the samples as in the
 following example.
 
 <<batchAccessor>>=
669c6a90
 batch(exampleSet)
16fadf14
 @
 
 For the \verb+batchStatistics+ slot, the elements in the environment
 have the class \Rclass{ff\_matrix}/\Rclass{ffdf} or \Rclass{matrix},
 depending on how the \Rclass{CNSet} object was instantiated.  The
 dimension of each element is the number of markers (SNPs +
 nonpolymorphic markers) $\times$ the number of batches. The names of
 the elements in the environment can be list using the \R{} function
 \Rfunction{ls}.
 
 <<batchStatistics>>=
669c6a90
 ls(batchStatistics(exampleSet))
16fadf14
 @
 
 Currently, the batch-specific summaries are stored to allow some
 flexibility in the choice of downstream analyses of copy number and
 visual assessments of model fit.  Documentation for such applications
 will be expanded in future versions of \crlmm{}, and are currently not
 intended to be accessed directly by the user.
 
 %The \Robject{CNSet} class also contains the slot
 %\Robject{batchStatistics} that contains batch-specific summaries
 %needed for copy number estimation. In particular, each element is a
 %matrix (or an ff object) with R rows and C columns, correspoinding to
 %R markers and C batches.  The summaries includes the within genotype
 %cluster medians and median absolute deviations (mads), but also
 %parameters estimated from the linear model.  (For unobserved
 %genotypes, the medians are imputed and the variance is obtained the
 %median variance (across markers) within a batch. ) The elements of the
 %slot can be listed as follows.
 
669c6a90
 \subsubsection{\texttt{phenoData}}
16fadf14
 
 Sample-level summaries obtained during the preprocessing/genotyping
 steps include skew (SKW), the signal to noise ratio (SNR), and gender
 (1=male, 2=female) are stored in the \verb+phenoData+ slot.  As for
 other \Rclass{eSet} extensions, the \verb+$+ method can be used to
 extract these summaries.  For \Rclass{CNSet} objects generated by
 approach (1), these elements are of the class
 \Rclass{ff\_vector}.
 
 <<phenodataAccessors>>=
669c6a90
 varLabels(exampleSet)
 class(exampleSet$gender)
 invisible(open(exampleSet$gender))
 exampleSet$gender
16fadf14
 @
 
 The `[' methods without arguments can be used to coerce to a vector.
 
 <<vector>>=
669c6a90
 c("male", "female")[exampleSet$gender[]]
 invisible(close(exampleSet$gender))
16fadf14
 @
 
669c6a90
 
 
 
 \subsubsection{\texttt{protocolData}}
16fadf14
 
 The scan date of the arrays are stored in the \verb+protocolData+.
 
 <<protocolData>>=
669c6a90
 varLabels(protocolData(exampleSet))
 protocolData(exampleSet)$ScanDate
16fadf14
 @
 
669c6a90
 \section{Trouble shooting with a HapMap example}
 
 This section uses an object of class \Rclass{CNSet} instantiated by
e34f6c9e
 the \verb+AffyGW+ vignette and saved to a local path
669c6a90
 on our computing cluster indicated by the object \Robject{outdir}
 below.  The \verb+copynumber+ vignette was used to fill out the
 \verb+batchStatistics+ slot of the \Robject{cnSet} object.
 
 <<path>>=
 if(getRversion() < "2.13.0"){
 	rpath <- getRversion()
 } else rpath <- "trunk"
 outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="")
 @
 
 Next, we load the \Robject{cnSet} object.
 
 <<loadcnset>>=
 if(!exists("cnSet")) load(file.path(outdir, "cnSet.rda"))
 invisible(open(cnSet))
 @
 
16fadf14
 
 \subsection{Missing values}
 
 Most often, missing values occur when the genotype confidence scores
 for a SNP were below the threshold used by the
 \Robject{crlmmCopynumber} function. For the HapMap analysis, we used a
669c6a90
 confidence threshold of 0.80 (the default).  In the following code, we
 assess NA's appearing for the raw copy number estimates for the first
 10 samples.
16fadf14
 
 <<NA>>=
 GT.CONF.THR <- 0.80
 autosome.index <- which(isSnp(cnSet) & chromosome(cnSet) < 23)
669c6a90
 sample.index <- 1:10
 ct <- totalCopynumber(cnSet, i=autosome.index, j=sample.index)
16fadf14
 ca <- CA(cnSet, i=autosome.index, j=sample.index)
 cb <- CB(cnSet, i=autosome.index, j=sample.index)
 missing.ca <- is.na(ca)
 missing.cb <- is.na(cb)
 (nmissing.ca <- sum(missing.ca))
 (nmissing.cb <- sum(missing.cb))
 identical(nmissing.ca, nmissing.cb)
 @
 
 If \Robject{nmissing.ca} is nonzero, check the genotype confidence
 scores provided by the crlmm genotyping algorithm against the
 threshold specified in \Robject{crlmmCopynumber}.
 
 <<NAconfidenceScores>>=
 if(nmissing.ca > 0){
669c6a90
 	##invisible(open(snpCallProbability(cnSet)))
16fadf14
 	gt.conf <- i2p(snpCallProbability(cnSet)[autosome.index, sample.index])
669c6a90
 	##invisible(close(snpCallProbability(cnSet)))
16fadf14
 	below.thr <- gt.conf < GT.CONF.THR
 	index.allbelow <- as.integer(which(rowSums(below.thr) == length(sample.index)))
 	nmissingBecauseOfGtThr <- length(index.allbelow) * length(sample.index)
 	stopifnot(identical(nmissingBecauseOfGtThr, nmissing.ca))
 	## or calculate the proportion of missing effected by low crlmm confidence
 	length(index.allbelow) * length(sample.index)/nmissing.ca
 }
 @
 
 One could inspect the cluster plots for the 'low confidence' calls.
 
 <<clusterPlotsLowConfidence>>=
 ## TODO
 @
 
 We repeat the above check for missing values at polymorphic loci on
 chromosome X.  In this case, we compare the \Robject{rowSums} of the
 missing values to the number of samples to check whether all of the
 estimates are missing for a given SNP.
 
 <<NAchromosomeX>>=
669c6a90
 ## start with first batch
063b3d14
 sample.index <- which(batch(cnSet)==batch(cnSet)[1])
16fadf14
 X.index <- which(isSnp(cnSet) & chromosome(cnSet) == 23)
 ca.X <- CA(cnSet, i=X.index, j=sample.index)
 missing.caX <- is.na(ca.X)
 (nmissing.caX <- sum(missing.caX))
 missing.snp.index <- which(rowSums(missing.caX) == length(sample.index))
 index <- which(rowSums(missing.caX) == length(sample.index))
669c6a90
 ##length(index)*length(sample.index)/nmissing.caX
16fadf14
 @
 
 From the above codechunk, we see that \Sexpr{length(index)} SNPs have
 NAs for all the samples. Next, we tally the number of NAs for
 polymorphic markers on chromosome X that are below the confidence
 threshold.  For the HapMap analysis, all of the missing values arose
 from SNPs in which either the men or the women had confidence scores
 that were all below the threshold.
 
 <<NAcrlmmConfidenceScore>>=
669c6a90
 invisible(open(cnSet$gender))
 F <- which(cnSet$gender[sample.index] == 2)
 M <- which(cnSet$gender[sample.index] == 1)
 gt.conf <- i2p(snpCallProbability(cnSet)[X.index, sample.index])
 below.thr <- gt.conf < GT.CONF.THR
 index.allbelowF <- as.integer(which(rowSums(below.thr[, F]) == length(F)))
 index.allbelowM <- as.integer(which(rowSums(below.thr[, M]) == length(M)))
 all.equal(index.allbelowF, index.allbelowM)
 all.equal(as.integer(index.allbelowF), as.integer(index))
16fadf14
 @
 
 For nonpolymorphic loci, the genotype confidence scores are irrelevant
 and estimates are available at most markers.
 
 <<NAnonpolymorphic.autosome>>=
 np.index <- which(!isSnp(cnSet) & chromosome(cnSet)==23)
 ca.F <- CA(cnSet, i=np.index, j=F)
 ca.M <- CA(cnSet, i=np.index, j=M)
 ## NAs for one marker
 ca.F <- ca.F[-match("CN_974939", rownames(ca.F)), ]
 ca.M <- ca.M[-match("CN_974939", rownames(ca.M)), ]
 sum(is.na(ca.F))
 sum(is.na(ca.M))
 @
 
 %TODO: marker CN\_974939 has NAs for the normalized intensities.  This
 %is because CN\_974939 is not in the \Robject{npProbesFid} file in
 %\Rpackage{genomewidesnp6Crlmm}.  The \Robject{npProbesFid} file should
 %be updated in the next \Rpackage{genomewidesnp6Crlmm} release.
 
 In total, there were \Sexpr{length(missing.snp.index)} polymorphic
 markers on chromosome X for which copy number estimates are not
 available.  Lowering the confidence threshold would permit estimation
 of copy number at most of these loci.  A confidence threshold is
 included as a parameter for the copy number estimation as an approach
 to reduce the sensitivity of genotype-specific summary statistics,
 such as the within-genotype median, to intensities from samples that
 do not clearly fall into one of the biallelic genotype clusters. There
 are drawbacks to this approach, including variance estimates that can
 be a bit optimistic at some loci.  More direct approaches for outlier
 detection and removal may be explored in the future.
 
 Copy number estimates for other chromosomes, such as mitochondrial and
 chromosome Y, are not currently available in \crlmm{}.
 
669c6a90
 <<close>>=
16fadf14
 invisible(close(cnSet))
669c6a90
 invisible(close(cnSet$gender))
16fadf14
 @
 
 
 \section{Session information}
 <<sessionInfo, results=tex>>=
 toLatex(sessionInfo())
 @
 
 %\section*{References}
 %
 %%\begin{bibliography}
 %  \bibliographystyle{plain}
 %  \bibliography{refs}
 %\end{bibliography}
 
 
 \end{document}