inst/scripts/IlluminaPreprocessCN.Rnw
16fadf14
 %\VignetteIndexEntry{IlluminaPreprocessCN Vignette for Illumina}
 %\VignetteDepends{crlmm, ff, cacheSweave}
 %\VignetteKeywords{crlmm, illumina, copy number, SNP}
2ae7850e
 %\VignettePackage{crlmm}
 \documentclass{article}
 \usepackage{graphicx}
 \usepackage{natbib}
7c0c9ac5
 \usepackage{amsmath}
063b3d14
 \usepackage{url}
adde216c
 \usepackage[margin=1in]{geometry}
 \newcommand{\crlmm}{\Rpackage{crlmm}}
2ae7850e
 \newcommand{\Rfunction}[1]{{\texttt{#1}}}
 \newcommand{\Rmethod}[1]{{\texttt{#1}}}
 \newcommand{\Rcode}[1]{{\texttt{#1}}}
 \newcommand{\Robject}[1]{{\texttt{#1}}}
 \newcommand{\Rpackage}[1]{{\textsf{#1}}}
 \newcommand{\Rclass}[1]{{\textit{#1}}}
 \newcommand{\oligo}{\Rpackage{oligo }}
 \newcommand{\R}{\textsf{R}}
f0f8ba16
 \newcommand{\ff}{\Rpackage{ff}}
2ae7850e
 
 \begin{document}
16fadf14
 \title{Preprocessing and Genotyping Illumina Arrays for Copy Number Analysis}
f86cba61
 
f1e14c0e
 \date{\today}
f86cba61
 
2ae7850e
 \author{Rob Scharpf}
 \maketitle
 
5cee1b4f
 
 \begin{abstract}
16fadf14
 
f0f8ba16
   This vignette illustrates the steps required prior to copy number
   analysis for Infinium platforms.  Specifically, we require
   construction of a container to store processed forms of the raw
   data, preprocessing to normalize the arrays, and genotyping using
   the CRLMM algorithm.  After completing these steps, users can refer
16fadf14
   to the \verb+copynumber+ vignette.
 
5cee1b4f
 \end{abstract}
 
16fadf14
 
 \section{Set up}
f0f8ba16
 
5cee1b4f
 <<crlmm, results=hide, echo=FALSE>>=
68686d90
 library(crlmm)
f0f8ba16
 options(width=70)
 options(continue=" ")
68686d90
 @
 
284565dd
 %\textbf{Supported platforms:} The supported Infinium platforms are
 %those for which a corresponding annotation package is available.  The
 %annotation packages contain information on the markers, such as
 %physical position and chromosome, as well as pre-computed parameters
 %estimated from HapMap used during the preprocessing and genotyping
 %steps. Currently supported Infinium platforms are listed in the
 %following code chunk.
 The following codechunk declares a directory for saving \Robject{ff}
 files that will contain the normalized intensities and the genotype
 calls.
f0f8ba16
 
 <<ldpath,results=hide>>=
6a6b67c5
 library(ff)
3d159620
 options(ffcaching="ffeachflush")
a1394042
 outdir <- paste("/local_data/r00/crlmm/", getRversion(), "/illumina_vignette", sep="")
f0f8ba16
 ldPath(outdir)
a2d94778
 dir.create(outdir, recursive=TRUE, showWarnings=FALSE)
f1e14c0e
 @
e4e0b214
 
284565dd
 We will also store cached computations in the directory \verb+outdir+.
3e4db4b1
 
f0f8ba16
 <<cacheSweave, echo=FALSE, results=hide>>=
 library(cacheSweave)
 setCacheDir(outdir)
f1e14c0e
 @
518a2908
 
284565dd
 We declare that \crlmm{} should process 150,000 markers at a time
16fadf14
 and/or 500 samples at a time (when possible) to reduce the memory
 footprint.  As our example dataset in this vignette contains fewer
 than 500 samples, all samples will be processed simultaneously.
284565dd
 
 <<ram>>=
 ocProbesets(150e3)
 ocSamples(500)
 @
 
7c0c9ac5
 %\section{Initializing a container for storing processed data}
 %
 %This section will initialize a container for storing processed forms
 %of the data, including the normalized intensities for the A and B
 %alleles and the CRLMM genotype calls and confidence scores.  In
 %addition, the container will store information on the markers
 %(physical position, chromosome, and a SNP indicator), the batch, and
 %the samples (e.g., gender).  To construct this container for Infinium
 %platforms, several steps are required.
f0f8ba16
 
 We begin by specifying the path containing the raw IDAT files for a
 set of samples from the Infinium 370k platform.
 
284565dd
 <<datadir>>=
f0f8ba16
 datadir <- "/thumper/ctsa/snpmicroarray/illumina/IDATS/370k"
 @
 
 For Infinium platforms, an Illumina sample sheet containing
 information for reading the raw IDAT files is required. Please refer
 to the BeadStudio Genotyping guide, Appendix A, for additional
 information.  The following code reads in the samplesheet for the IDAT
16fadf14
 files on our local server.
5cee1b4f
 
453e688a
 <<samplesheet>>=
3ddde61b
 samplesheet = read.csv(file.path(datadir, "HumanHap370Duo_Sample_Map.csv"), header=TRUE, as.is=TRUE)
16fadf14
 @
 
 For the purposes of this vignette, we indicate that we only wish to
 process a subset of the arrays.  For our dataset, the file extensions
 are `Grn.dat' and `Red.idat'.  We store the complete path to the
 filename without the file extension in the \Robject{arrayNames} and
 check that all of the green and red IDAT files exists.
 
 <<subsetArrays>>=
3ddde61b
 samplesheet <- samplesheet[-c(28:46,61:75,78:79), ]
 arrayNames <- file.path(datadir, unique(samplesheet[, "SentrixPosition"]))
f0f8ba16
 all(file.exists(paste(arrayNames, "_Grn.idat", sep="")))
 all(file.exists(paste(arrayNames, "_Red.idat", sep="")))
 arrayInfo <- list(barcode=NULL, position="SentrixPosition")
453e688a
 @
 
16fadf14
 All supported platforms have a corresponding annotation package.  The
 appropriate annotation package is specified by the platform identifier
 without the \verb+Crlmm+ postfix.
453e688a
 
f0f8ba16
 <<cdfname>>=
 cdfName <- "human370v1c"
d5b39e21
 @
 
f0f8ba16
 Next, we construct a character vector that specifies the batch for
 each of the \Sexpr{length(arrayNames)} arrays.  Here, we have a small
 dataset and process the samples in a single batch. Processing the
 samples as a single batch is generally reasonable if the samples were
 processed at similar times (e.g., within a few weeks).
5cee1b4f
 
f0f8ba16
 <<batch>>=
 batch <- rep("1", nrow(samplesheet))
1507ac0c
 @
73fd7374
 
7c0c9ac5
 %Finally, we initialize an object of class \Robject{CNSet} using the
 %function \Rfunction{constructInf}.
 %
 %<<container,cache=TRUE>>=
 %cnSet <- constructInf(sampleSheet=samplesheet,
 %		      arrayNames=arrayNames,
 %		      batch=batch,
 %		      arrayInfoColNames=arrayInfo,
 %		      cdfName=cdfName,
 %		      verbose=TRUE,
 %		      saveDate=TRUE)
 %@
 %
 %A concise summary of the object's contents can be viewed with the
 %\Rfunction{print} function.
 %
 %<<cnset>>=
 %print(cnSet)
 %@
 %
 %Note that the above object does not yet contain any processed data
 %(only \verb+NA+'s).  As the elements of the \verb+assayData+ slot are
 %\Robject{ff} objects (not matrices), several \verb+.ff+ files now
 %appear in the \verb+outdir+. The \verb+.ff+ files should not be
 %removed and can be listed using the \R{} function
 %\Rfunction{list.files}.  %For the most part, the \emph{appearance} that
16fadf14
 %the data is stored in memory is preserved.
7c0c9ac5
 %
 %<<listff>>=
 %sapply(assayData(cnSet), function(x) class(x)[1])
 %list.files(outdir, pattern=".ff")[1:5]
 %@
 %
 \section{Preprocessing and genotyping}
5cee1b4f
 
f0f8ba16
 The raw intensities from the Infinium IDAT files are read and
 normalized using the function \Rfunction{preprocessInf}.  The function
 \Rfunction{preprocessInf} returns a \Robject{ff} object containing the
 parameters for the mixture model used by the CRLMM genotyping
7c0c9ac5
 algorithm.  Following preprocessing, the \Rfunction{genotypeInf}
 genotypes the samples. The function \Rfunction{genotype.Illumina} is a
 wrapper to the above functions and returns an object of class
 \Rclass{CNSet}.
 
 %<<preprocess,cache=TRUE, results=hide>>=
 %mixtureParams <- preprocessInf(cnSet=cnSet, sampleSheet=samplesheet,
 %			       arrayNames=arrayNames,
 %			       arrayInfoColNames=arrayInfo,
 %			       cdfName=cdfName)
 %@
 %
 %<<showMixtureParams>>=
 %invisible(open(mixtureParams))
 %str(mixtureParams[])
 %invisible(close(mixtureParams))
 %@
 %
 %Note that the normalized intensities for the A and B alleles are no
 %longer \verb+NA+s and can be inspected using the methods \Rfunction{A}
 %and \Rfunction{B}, respectively.
 %
 %<<intensities>>=
 %invisible(open(A(cnSet)))
 %invisible(open(B(cnSet)))
 %as.matrix(A(cnSet)[1:5, 1:5])
 %as.matrix(B(cnSet)[1:5, 1:5])
 %invisible(close(A(cnSet)))
 %invisible(close(B(cnSet)))
 %@
 %
 %\section{Genotyping}
 %
 %CRLMM genotype calls and confidence scores are estimated using the
 %function \Rfunction{genotypeInf}.
 %
 %<<genotype,cache=TRUE>>=
 %updated <- genotypeInf(cnSet, mixtureParams=mixtureParams, cdfName=cdfName)
 %@
 %\vspace{-0.5em}
 %<<>>=
 %updated
 %@
 %
 %\textbf{Wrapper:} As an alternative to calling the functions
 %\Rfunction{constructInf}, \Rfunction{preprocessInf} and
 %\Rfunction{genotypeInf} in sequence, a convenience function called
 %\Rfunction{genotype.Illumina} is a wrapper for the above functions and
 %produces identical results.
f0f8ba16
 
 <<genotype.Illumina,cache=TRUE,results=hide>>=
7c0c9ac5
 cnSet <- genotype.Illumina(sampleSheet=samplesheet,
063b3d14
 			   arrayNames=arrayNames,
 			   arrayInfoColNames=arrayInfo,
 			   cdfName="human370v1c",
 			   batch=batch)
f1e14c0e
 @
2e87b991
 
adde216c
 
7c0c9ac5
 Note, to fully remove the data associated with the \Robject{cnSet2}
 object, one should use the \Rfunction{delete} function in the
 \Rpackage{ff} package followed by the \Rfunction{rm} function.  The
 following code is not evaluated is it would change the results of the
 cached computations in the previous code chunk.
f0f8ba16
 
 <<delete, eval=FALSE>>=
7c0c9ac5
 lapply(assayData(cnSet), delete)
 lapply(batchStatistics(cnSet), delete)
 delete(cnSet$gender)
 delete(cnSet$SNR)
 delete(cnSet$SKW)
 rm(cnSet)
f0f8ba16
 @
adde216c
 
7c0c9ac5
 %\section{Copy number estimation}
 
 \SweaveInput{copynumber}
16fadf14
 
 \section{Session information}
 <<sessionInfo, results=tex>>=
 toLatex(sessionInfo())
 @
 
7c0c9ac5
 \begin{figure}[f]
   \begin{center}
   \includegraphics[width=0.6\textwidth]{IlluminaPreprocessCN-snr.pdf}
   \caption{The signal to noise ratio (SNR) for 180 HapMap samples. For
     Affymetrix platforms, SNR values below 5 can indicate possible
     problems with sample quality.  In some circumstances, it may be
     more helpful to exclude samples with poor DNA quality.}
 \end{center}
 \end{figure}
 
 
 \bibliography{refs}
 \bibliographystyle{plain}
 
2ae7850e
 \end{document}