Browse code

Commit made by the Bioconductor Git-SVN bridge. Consists of 1 commit.

Commit information:

Commit id: 736588e3d47fd58358f61ce2241bb2c85b4d9a0c

update to vignette to process previously unevaluated codechunks

Committed by: Rob Scharpf
Author Name: Rob Scharpf
Commit date: 2014-09-19 10:14:04 -0400
Author date: 2014-09-19 10:14:04 -0400


git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@94285 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 19/09/2014 14:22:04
Showing1 changed files
... ...
@@ -200,8 +200,7 @@ Estimation of log R ratios and B allele frequencies from an object of class \Rcl
200 200
 crlmmCopynumber(cnSet, fit.linearModel=FALSE)
201 201
 @
202 202
 
203
-<<oligoSnpSet,eval=FALSE>>=
204
-library(VanillaICE)
203
+<<oligoSnpSet>>=
205 204
 open(cnSet)
206 205
 oligoSetList <- BafLrrSetList(cnSet)
207 206
 close(cnSet)
... ...
@@ -214,7 +213,7 @@ oligoSetList[[1]]
214 213
 \noindent Log R ratios and B allele frequences can be retrieved by the
215 214
 accessors \Rfunction{lrr} and \Rfunction{baf}, respectively.
216 215
 
217
-<<testEqual,eval=FALSE>>=
216
+<<testEqual>>=
218 217
 lrrList <- lrr(oligoSetList)
219 218
 class(lrrList)
220 219
 dim(lrrList[[1]]) ## log R ratios for chromosome 1.
Browse code

Commit made by the Bioconductor Git-SVN bridge. Consists of 1 commit.

Commit information:

Commit id: 7a7b52509b1e09bf8d3f517244f90f9045d102ca
Commit message: Set enough.men and enough.women to FALSE by default in fit.lm3 function.
Committed by: Rob Scharpf
Author Name: Rob Scharpf
Commit date: 2014-08-25 21:44:19 -0400
Author date: 2014-08-25 21:44:19 -0400



git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@93599 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 26/08/2014 01:45:03
Showing1 changed files
... ...
@@ -225,4 +225,3 @@ dim(bafList[[1]]) ## B allele frequencies for chromosome 1
225 225
 If the \Rfunction{crlmmCopynumber} function is not run prior to the
226 226
 \Rfunction{BafLrrSetList} construction, the log R ratios and BAFs will
227 227
 be initialized as NAs.
228
-
Browse code

merging from collab

* collab:
add warning in vignette about NAs with BafLrrSetList function
Added Human Omni Express Exome 8 v1.1b as a supported chip
updated version number of pacakge and man pages to reflect these changes
skeleton for krlmm capability added. genotype.Illumina() can now take and XY object as input
update copynumber.Rnw to use BafLrrSetList
updates to vignettes
update namespace

# Please enter a commit message to explain why this merge is necessary,
# especially if it merges an updated upstream into a topic branch.
#
# Lines starting with '#' will be ignored, and an empty message aborts
# the commit.

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@79138 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 31/07/2013 01:37:34
Showing1 changed files
... ...
@@ -53,15 +53,24 @@ print(histogram(~snr,
53 53
 
54 54
 \section{Copy number estimation}
55 55
 
56
-As described in \cite{Scharpf2011}, the CRLMM-CopyNumber algorithm
57
-fits a linear model to the normalized intensities stratified by the
58
-diallic genotype call.  The intercept and slope from the linear model
59
-are both SNP- and batch-specific.  The implementation in the \crlmm{}
60
-package is encapsulated by the function \Rfunction{crlmmCopynumber}
61
-that, using the default settings, can be called by passing a single
62
-object of class \Rclass{CNSet}.  See the appropriate
63
-preprocessing/genotyping vignette for the construction of an object of
64
-class \Rclass{CNSet}.
56
+There are two ways to obtain marker-level estimates of copy number
57
+that are supported by \Rpackage{crlmm}. One approach is to fit a
58
+linear model to the normalized intensities stratified by the diallelic
59
+genotype call at each SNP, as described in \cite{Scharpf2011}.
60
+Another alternative is to compute log R ratio and B allele
61
+frequencies. The latter is often better supported by downstream hidden
62
+Markov models such as those in the \Rpackage{VanillaICE} package. We
63
+describe each approach in the following two sections.
64
+
65
+\subsection{Linear model for normalized intensities}
66
+
67
+In this section, we fit a linear model to the normalized intensities
68
+stratified by the diallic genotype call.  The intercept and slope from
69
+the linear model are both SNP- and batch-specific.  The implementation
70
+in the \crlmm{} package is encapsulated by the function
71
+\Rfunction{crlmmCopynumber} that, using the default settings, can be
72
+called by passing a single object of class \Rclass{CNSet}.
73
+
65 74
 <<LDS_copynumber,cache=TRUE>>=
66 75
 crlmmCopynumber(cnSet)
67 76
 @
... ...
@@ -105,12 +114,7 @@ described in the \R{} package \Rpackage{ff}. The value returned by
105 114
 \Rfunction{crlmmCopynumber} is \verb+TRUE+, indicating that the files
106 115
 on disk have been successfully updated.  Note that while the
107 116
 \Robject{cnSet} object is unchanged, the values on disk are different.
108
-On the other hand, subsetting the \Robject{cnSet} with the `[' method
109
-coerces all of the elements to class \Rclass{matrix}. The
110
-batch-specific summaries are now ordinary matrices stored in RAM. The
111
-object returned by \Robject{crlmmCopynumber} is an object of class
112
-\Rclass{CNSet} with the matrices in the \verb+batchStatistics+ slot
113
-updated.
117
+
114 118
 
115 119
 <<chr1index>>=
116 120
 chr1.index <- which(chromosome(cnSet) == 1)
... ...
@@ -127,20 +131,10 @@ for(i in seq_along(nms)) cls[i] <- class(batchStatistics(cnSet2)[[nms[i]]])[1]
127 131
 all(cls == "matrix")
128 132
 @
129 133
 
130
-<<matrix_copynumber,cache=TRUE>>=
131
-cnSet3 <- crlmmCopynumber(cnSet2)
132
-class(cnSet3)
133
-@
134
-
135 134
 <<clean, echo=FALSE, results=hide>>=
136 135
 rm(cnSet2); gc()
137 136
 @
138 137
 
139
-\subsection{Marker-specific estimates}
140
-%\subsection{Raw copy number}
141
-
142
-%\paragraph{Log R ratios and B allele frequencies.}
143
-
144 138
 \paragraph{Raw total copy number.}
145 139
 Several functions are available that will compute relatively quickly
146 140
 the allele-specific, \emph{raw} copy number estimates. At allele $k$,
... ...
@@ -186,38 +180,49 @@ cb <- CB(cnSet, i=snp.index, j=1:2)
186 180
 
187 181
 
188 182
 \subsection{Container for log R ratios and B allele frequencies}
183
+\label{sec:lrrBaf}
189 184
 
190 185
 A useful container for storing the \crlmm{} genotypes, genotype
191 186
 confidence scores, and the total or relative copy number at each
192
-marker is the \Rclass{oligoSetList} class.  Coercion of a
193
-\Rclass{CNSet} object to a \Rfunction{oligoSnpSet} object can be
194
-acheived by using the function \Rfunction{constructOligoSetFrom} as
187
+marker is the \Rclass{BafLrrSetList} class.  Coercion of a
188
+\Rclass{CNSet} object to a \Rfunction{BafLrrSetList} object can be
189
+acheived by the function \Rfunction{BafLrrSetList} as
195 190
 illustrated below. Users should note that if the \verb+assayData+
196 191
 elements in the \Rclass{CNSet} instance are \Rpackage{ff} objects, the
197
-\verb+assayData+ elements of each element in the \Rclass{oligoSetList}
192
+\verb+assayData+ elements of each element in the \Rclass{BafLrrSetList}
198 193
 object will be \Rpackage{ff}-dervied objects (a new
199 194
 \verb+total_cn*.ff+ file will be created in the \verb+ldPath()+
200
-directory).
195
+directory). The following code-chunk is not evalutated.
196
+
197
+Estimation of log R ratios and B allele frequencies from an object of class \Rclass{CNSet} instantiated from genotyping Affymetrix CEL files or Illumina iDat files requires running the \R{} function \Rfunction{crlmmCopynumber} as a preliminary step.  Specifying \texttt{fit.linearModel=FALSE} will speed up computation as this function will by default compute summary statistics unnecessary for computing BAFs and log R ratios.
201 198
 
202
-<<oligoSnpSet>>=
199
+<<callingCrlmmCopynumber,eval=FALSE>>=
200
+crlmmCopynumber(cnSet, fit.linearModel=FALSE)
201
+@
202
+
203
+<<oligoSnpSet,eval=FALSE>>=
203 204
 library(VanillaICE)
204
-open(cnSet3)
205
-oligoSetList <- constructOligoSetListFrom(cnSet3, batch.name=batch(cnSet3)[1])
206
-close(cnSet3)
205
+open(cnSet)
206
+oligoSetList <- BafLrrSetList(cnSet)
207
+close(cnSet)
207 208
 show(oligoSetList)
208 209
 class(oligoSetList)
209 210
 ## oligoSnpSet of first chromosome
210 211
 oligoSetList[[1]]
211 212
 @
212 213
 
213
-\noindent Note that log R ratios stored in the \Robject{oligoSnpSet}
214
-object can be retrieved by the \Rfunction{copyNumber} accessor.  B
215
-allele frequences are retrieved by the \Rfunction{baf} accessor.
214
+\noindent Log R ratios and B allele frequences can be retrieved by the
215
+accessors \Rfunction{lrr} and \Rfunction{baf}, respectively.
216 216
 
217
-<<testEqual>>=
218
-lrrList <- copyNumber(oligoSetList)
217
+<<testEqual,eval=FALSE>>=
218
+lrrList <- lrr(oligoSetList)
219 219
 class(lrrList)
220 220
 dim(lrrList[[1]]) ## log R ratios for chromosome 1.
221 221
 bafList <- baf(oligoSetList)
222 222
 dim(bafList[[1]]) ## B allele frequencies for chromosome 1
223 223
 @
224
+
225
+If the \Rfunction{crlmmCopynumber} function is not run prior to the
226
+\Rfunction{BafLrrSetList} construction, the log R ratios and BAFs will
227
+be initialized as NAs.
228
+
Browse code

merge with collab branch containing fix for dqrls and bug-fix for computeRBaf that can misalign sample index with batch index (when batch is not in alphabetical order)

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@71295 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 17/11/2012 15:24:46
Showing1 changed files
... ...
@@ -2,7 +2,7 @@
2 2
   Affymetrix 5.0 and 6.0 platforms, as well as several Illumina
3 3
   platforms.  This vignette assumes that the arrays have already been
4 4
   successfully preprocessed and genotyped as per the instructions in
5
-  the \verb+AffymetrixPreprocessCN+ and \verb+IlluminaPreprocessCN+
5
+  the \verb+AffyGW+ and \verb+IlluminaPreprocessCN+
6 6
   vignettes for the Affymetrix and Illumina platforms,
7 7
   respectively. While this vignette uses Affymetrix 6.0 arrays for
8 8
   illustration, the steps at this point are identical for both
Browse code

Merge branch 'collab'

* collab:
Update AffyGW.pdf and IlluminaPreprocessCN.pdf
crlmmIlluminaV2 calls crlmmGT. comment call to crlmmGT2
Update AffyGW vignette to step through construction of CNSet, normalization, and genotyping
v 1.15.27: Fix bug in crlmmGT2. clean up comments in crlmmIllumina.
update getFeatureData

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@69561 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 19/09/2012 12:31:51
Showing1 changed files
... ...
@@ -63,7 +63,7 @@ object of class \Rclass{CNSet}.  See the appropriate
63 63
 preprocessing/genotyping vignette for the construction of an object of
64 64
 class \Rclass{CNSet}.
65 65
 <<LDS_copynumber,cache=TRUE>>=
66
-(cnSet.updated <- crlmmCopynumber(cnSet))
66
+crlmmCopynumber(cnSet)
67 67
 @
68 68
 
69 69
 The following steps were performed by the \Rfunction{crlmmCopynumber}
Browse code

Merge branch 'collab'

* collab: (34 commits)
revert change to IlluminaPreprocessCN
fix bug in isValidCdfName
print warning when all features in a batch of probes are flagged, but allow processing to continue
add utility cleancdfnames
Add validCdfNames.Rd
export validCdfNames
imputeGender fix when chromosome Y not available
Use splitIndicesByLength(index, ocSamples/getDoParWorkers())
Can not allocate vector of size XG with genotype.Illumina. Use splitIndicesByNode() only if the length of the list is greater than the split from splitIndicesByLength(). Otherwise, split by length using ocSamples()
update .gitignore
Add make.unique for sampleSheet$Sample_ID in readIdatFiles
bug in description
ensure sample ids stored in samplesheet are unique when constructing cnSet object
update oligoClasses dependency
update unit test for genotype.Illumina
revert change in constructInf call from genotype.Illumina
Update genotype.Rd
edit ACN function
1.15.6 use make.unique(basename(arrayNames)) to allow processing of Illumina samples with duplicated barcodes
check that sample identifies are unique in crlmm function
...

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@67435 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 08/07/2012 19:00:03
Showing1 changed files
... ...
@@ -154,7 +154,7 @@ formally, \newcommand{\A}{A} \newcommand{\B}{B}
154 154
     \phi}_{k,ip}}\left(I_{k,ijp}-{\hat \nu}_{k,ip}\right), ~0\right\}
155 155
 \mbox{~for~} k \in \{\A, \B\}.
156 156
 \end{eqnarray}
157
-\noindent See \cite{Scharpf2010} for details.
157
+\noindent See \cite{Scharpf2011} for details.
158 158
 
159 159
 The function \Rfunction{totalCopynumber} translates the normalized
160 160
 intensities to an estimate of raw copy number by adding the
... ...
@@ -175,42 +175,49 @@ dim(tmp2)
175 175
 
176 176
 Alternatively, the functions \Rfunction{CA} and \Rfunction{CB} compute
177 177
 the allele-specific copy number.  For instance, the following code
178
-chunk computes the allele-specific summaries at all polymorphic loci.
178
+chunk computes the allele-specific summaries at all polymorphic loci
179
+for the first 2 samples.
179 180
 
180 181
 <<ca>>=
181 182
 snp.index <- which(isSnp(cnSet) & !is.na(chromosome(cnSet)))
182
-ca <- CA(cnSet, i=snp.index, j=seq_len(ncol(cnSet)))
183
-cb <- CB(cnSet, i=snp.index, j=seq_len(ncol(cnSet)))
183
+ca <- CA(cnSet, i=snp.index, j=1:2)
184
+cb <- CB(cnSet, i=snp.index, j=1:2)
184 185
 @
185 186
 
186 187
 
187
-\subsection{Container for raw copy number}
188
+\subsection{Container for log R ratios and B allele frequencies}
188 189
 
189 190
 A useful container for storing the \crlmm{} genotypes, genotype
190
-confidence scores, and the total copy number at each marker is the
191
-\Rclass{oligoSnpSet} class.  Coercion of a \Rclass{CNSet} object to a
192
-\Rfunction{oligoSnpSet} object can be acheived by using the method
193
-\Rfunction{as} (as illustrated below). Users should note that if the
194
-\verb+assayData+ elements in the \Rclass{CNSet} instance are
195
-\Rpackage{ff} objects, the \verb+assayData+ elements of the
196
-instantiated \Rclass{oligoSnpSet} will also be \Rpackage{ff}-dervied
197
-objects (a new \verb+total_cn*.ff+ file will be created in the
198
-\verb+ldPath()+ directory).
191
+confidence scores, and the total or relative copy number at each
192
+marker is the \Rclass{oligoSetList} class.  Coercion of a
193
+\Rclass{CNSet} object to a \Rfunction{oligoSnpSet} object can be
194
+acheived by using the function \Rfunction{constructOligoSetFrom} as
195
+illustrated below. Users should note that if the \verb+assayData+
196
+elements in the \Rclass{CNSet} instance are \Rpackage{ff} objects, the
197
+\verb+assayData+ elements of each element in the \Rclass{oligoSetList}
198
+object will be \Rpackage{ff}-dervied objects (a new
199
+\verb+total_cn*.ff+ file will be created in the \verb+ldPath()+
200
+directory).
199 201
 
200 202
 <<oligoSnpSet>>=
203
+library(VanillaICE)
201 204
 open(cnSet3)
202
-oligoSet <- as(cnSet3, "oligoSnpSet")
205
+oligoSetList <- constructOligoSetListFrom(cnSet3, batch.name=batch(cnSet3)[1])
203 206
 close(cnSet3)
204
-class(copyNumber(oligoSet))
207
+show(oligoSetList)
208
+class(oligoSetList)
209
+## oligoSnpSet of first chromosome
210
+oligoSetList[[1]]
205 211
 @
206 212
 
207
-\noindent Note that the raw copy number estimates stored in the
208
-\Robject{oligoSnpSet} object can be retrieved by the
209
-\Rfunction{copyNumber} accessor and is equivalent to that returned by
210
-the \Rfunction{totalCopynumber} function defined over the same row and
211
-column indices.
213
+\noindent Note that log R ratios stored in the \Robject{oligoSnpSet}
214
+object can be retrieved by the \Rfunction{copyNumber} accessor.  B
215
+allele frequences are retrieved by the \Rfunction{baf} accessor.
212 216
 
213 217
 <<testEqual>>=
214
-total.cn3 <- totalCopynumber(cnSet3, i=1:nrow(cnSet3), j=seq_len(ncol(cnSet3)))
215
-all.equal(copyNumber(oligoSet), total.cn3)
216
-@
217 218
\ No newline at end of file
219
+lrrList <- copyNumber(oligoSetList)
220
+class(lrrList)
221
+dim(lrrList[[1]]) ## log R ratios for chromosome 1.
222
+bafList <- baf(oligoSetList)
223
+dim(bafList[[1]]) ## B allele frequencies for chromosome 1
224
+@
Browse code

Merge branch 'mymac'

* mymac:
add AffyGW.pdf
update vignettes in inst/scripts
Change argument of validCEL to celfiles
Update constructInf to accommodate GenomeDataFrame class for featureData
bump version to 1.13.7
Add doRUnit.R
Add celfile-utils.Rd
Streamlne some of the Rd files
add validCEL function that checks whether all celfiles can be read
getFeatureData returns GenomeAnnotatedDataFrame
Remove imports from methods. Remove pdf of illumina_copynumber.pdf (large file) and copynumber.pdf
getFeatureDAta returns GenomeAnnotatedDataFrame
Remove separate vignette for copy number in inst/scripts. Include copynumber section in both affy and illumina pipelines.
update documentation files for genotype.Illumina, preprocessInf, and genotypeInf (cdfName added as argument. Indicate that 'batch' should be a character string)
pass cdfName to genotypeInf and preprocessInf
add unitTests and cn-functions for 'simple usage'
Combine AffyPreprocess and copynumber. Combine IlluminaPreprocess and copynumber
remove depency on ff to allow installation on my mac

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@62108 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 17/01/2012 19:13:44
Showing1 changed files
... ...
@@ -1,35 +1,3 @@
1
-%\VignetteIndexEntry{crlmm copy number Vignette}
2
-%\VignetteDepends{crlmm, genomewidesnp6Crlmm}
3
-%\VignetteKeywords{crlmm, SNP 6}
4
-%\VignettePackage{crlmm}
5
-\documentclass{article}
6
-\usepackage{graphicx}
7
-\usepackage{natbib}
8
-\usepackage{url}
9
-\usepackage{amsmath,amssymb}
10
-\newcommand{\Rfunction}[1]{{\texttt{#1}}}
11
-\newcommand{\Rmethod}[1]{{\texttt{#1}}}
12
-\newcommand{\Rcode}[1]{{\texttt{#1}}}
13
-\newcommand{\Robject}[1]{{\texttt{#1}}}
14
-\newcommand{\Rpackage}[1]{{\textsf{#1}}}
15
-\newcommand{\Rclass}[1]{{\textit{#1}}}
16
-\newcommand{\oligo}{\Rpackage{oligo }}
17
-\newcommand{\R}{\textsf{R}}
18
-\newcommand{\crlmm}{\Rpackage{crlmm}}
19
-\usepackage[margin=1in]{geometry}
20
-
21
-\begin{document}
22
-\title{Copy number estimation with \Rpackage{crlmm}}
23
-\date{\today}
24
-\author{Rob Scharpf}
25
-\maketitle
26
-
27
-<<setup, echo=FALSE, results=hide>>=
28
-options(continue=" ", width=70)
29
-@
30
-
31
-\begin{abstract}
32
-
33 1
   Copy number routines in the \crlmm{} package are available for
34 2
   Affymetrix 5.0 and 6.0 platforms, as well as several Illumina
35 3
   platforms.  This vignette assumes that the arrays have already been
... ...
@@ -44,35 +12,6 @@ options(continue=" ", width=70)
44 12
   \crlmm{} package is available from the author's website:
45 13
   \url{http://www.biostat.jhsph.edu/~rscharpf/crlmmCompendium/index.html}.
46 14
 
47
-\end{abstract}
48
-
49
-\section{Set up}
50
-
51
-<<libraries,results=hide>>=
52
-library(ff)
53
-library(crlmm)
54
-library(lattice)
55
-library(cacheSweave)
56
-if(getRversion() < "2.13.0"){
57
-	rpath <- getRversion()
58
-} else rpath <- "trunk"
59
-outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="")
60
-@
61
-
62
-<<ram>>=
63
-ldPath(outdir)
64
-setCacheDir(outdir)
65
-ocProbesets(150e3)
66
-ocSamples(200)
67
-@
68
-
69
-We begin by loading the \Robject{cnSet} object created by the
70
-\verb+AffymetrixPreprocessCN+ vignette.
71
-
72
-<<loadcnset>>=
73
-if(!exists("cnSet")) load(file.path(outdir, "cnSet.rda"))
74
-@
75
-
76 15
 
77 16
 \textbf{Limitations:} While a minimum number of samples is not
78 17
 required for preprocessing and genotyping, copy number estimation in
... ...
@@ -100,24 +39,16 @@ quality.  The following code chunk makes a histogram of the SNR values
100 39
 for the HapMap samples.
101 40
 
102 41
 <<snr,fig=TRUE,include=FALSE,width=6, height=4>>=
42
+library(lattice)
103 43
 invisible(open(cnSet$SNR))
104 44
 snr <- cnSet$SNR[]
105 45
 close(cnSet$SNR)
106
-print(histogram(~snr, panel=function(...){
107
-	panel.histogram(...)
108
-	panel.abline(v=5, col="grey",lty=2)
109
-},
110
-		breaks=25,xlim=c(4.5,10), xlab="SNR"))
46
+print(histogram(~snr,
47
+		panel=function(...){
48
+			panel.histogram(...)},
49
+		breaks=25, xlim=range(snr), xlab="SNR"))
111 50
 @
112
-\begin{figure}[t!]
113
-  \begin{center}
114
-  \includegraphics[width=0.6\textwidth]{copynumber-snr.pdf}
115
-  \caption{The signal to noise ratio (SNR) for 180 HapMap samples. For
116
-    Affymetrix platforms, SNR values below 5 can indicate possible
117
-    problems with sample quality.  In some circumstances, it may be
118
-    more helpful to exclude samples with poor DNA quality.}
119
-\end{center}
120
-\end{figure}
51
+
121 52
 
122 53
 
123 54
 \section{Copy number estimation}
... ...
@@ -149,29 +80,17 @@ computed for subsets of the markers to reduce the required RAM. Note
149 80
 that the value returned by the \Rfunction{crlmmCopynumber} function in
150 81
 the above example is \verb+TRUE+.  The reason the function returns
151 82
 \verb+TRUE+ in the above example is that the elements of the
152
-\verb+batchStatistics+ slot have the class \Rclass{ff\_matrix}.  Rather
153
-than keep the statistical summaries in memory, the summaries are
154
-written to files on disk using protocols described in the
83
+\verb+batchStatistics+ slot have the class \Rclass{ff\_matrix}.
84
+Rather than keep the statistical summaries in memory, the summaries
85
+are written to files on disk using protocols described in the
155 86
 \Rpackage{ff} package. Hence, while the \Robject{cnSet} object itself
156 87
 is unchanged as a result of the \Rfunction{crlmmCopynumber} function,
157 88
 the data on disk is updated accordingly.  Users that are interested in
158 89
 accessing these low-level summaries can refer to the
159
-\verb+Infrastructure+ vignette.  Computation of the raw copy number
160
-estimates for each allele is described in the following section.
161
-
162
-For users that are interested in the analysis of a specific chromosome
163
-(subset of markers) or a s
164
-
165
-pointers to files on disk, are stored in the \verb+batchStatistics+
166
-slot of the class \Rclass{CNSet}.  Using the default settings for the
167
-\Rfunction{crlmmCopynumber} function, only an object of class
168
-\Rclass{CNSet} is required.  %(See \verb+AffyPreprocessCN+ or
169
-%\verb+IlluminaPreprocessCN+ vignettes for details.)
170
-
171
-Note that depends on whether the elements of the \verb+batchStatistics+
172
-slot are \Robject{ff} objects or ordinary matrices.  In this example,
173
-the elements of \verb+batchStatistics+ have the class
174
-\Rclass{ff\_matrix}.
90
+\verb+Infrastructure+ vignette.  Note that the data structure depends
91
+on whether the elements of the \verb+batchStatistics+ slot are
92
+\Robject{ff} objects or ordinary matrices.  In this example, the
93
+elements of \verb+batchStatistics+ have the class \Rclass{ff\_matrix}.
175 94
 
176 95
 <<classes>>=
177 96
 nms <- ls(batchStatistics(cnSet))
... ...
@@ -186,10 +105,8 @@ described in the \R{} package \Rpackage{ff}. The value returned by
186 105
 \Rfunction{crlmmCopynumber} is \verb+TRUE+, indicating that the files
187 106
 on disk have been successfully updated.  Note that while the
188 107
 \Robject{cnSet} object is unchanged, the values on disk are different.
189
-
190
-
191
-\noindent On the other hand, subsetting the \Robject{cnSet} with the
192
-`[' method coerces all of the elements to class \Rclass{matrix}. The
108
+On the other hand, subsetting the \Robject{cnSet} with the `[' method
109
+coerces all of the elements to class \Rclass{matrix}. The
193 110
 batch-specific summaries are now ordinary matrices stored in RAM. The
194 111
 object returned by \Robject{crlmmCopynumber} is an object of class
195 112
 \Rclass{CNSet} with the matrices in the \verb+batchStatistics+ slot
... ...
@@ -219,14 +136,18 @@ class(cnSet3)
219 136
 rm(cnSet2); gc()
220 137
 @
221 138
 
222
-\subsection{Raw copy number}
139
+\subsection{Marker-specific estimates}
140
+%\subsection{Raw copy number}
141
+
142
+%\paragraph{Log R ratios and B allele frequencies.}
223 143
 
144
+\paragraph{Raw total copy number.}
224 145
 Several functions are available that will compute relatively quickly
225 146
 the allele-specific, \emph{raw} copy number estimates. At allele $k$,
226 147
 marker $i$, sample $j$, and batch $p$, the estimate of allele-specific
227 148
 copy number is computed by subtracting the estimated background from
228
-the normalized intensity and scaling by the slope coefficient. More formally,
229
-\newcommand{\A}{A} \newcommand{\B}{B}
149
+the normalized intensity and scaling by the slope coefficient. More
150
+formally, \newcommand{\A}{A} \newcommand{\B}{B}
230 151
 \begin{eqnarray}
231 152
   \label{eq:cnK}
232 153
 {\hat c}_{k,ijp} &=& \mbox{max}\left\{\frac{1}{{\hat
... ...
@@ -246,9 +167,9 @@ at all markers for the first 2 samples, and the total copy number for
246 167
 chromosome 20 for the first 50 samples.
247 168
 
248 169
 <<totalCopynumber>>=
249
-tmp <- totalCopynumber(cnSet, i=1:nrow(cnSet), j=1:2)
170
+tmp <- totalCopynumber(cnSet, i=seq_len(nrow(cnSet)), j=1:2)
250 171
 dim(tmp)
251
-tmp2 <- totalCopynumber(cnSet, i=which(chromosome(cnSet) == 20), j=1:50)
172
+tmp2 <- totalCopynumber(cnSet, i=which(chromosome(cnSet) == 20), j=seq_len(ncol(cnSet)))
252 173
 dim(tmp2)
253 174
 @
254 175
 
... ...
@@ -258,98 +179,12 @@ chunk computes the allele-specific summaries at all polymorphic loci.
258 179
 
259 180
 <<ca>>=
260 181
 snp.index <- which(isSnp(cnSet) & !is.na(chromosome(cnSet)))
261
-ca <- CA(cnSet, i=snp.index, j=1:5)
262
-cb <- CB(cnSet, i=snp.index, j=1:5)
263
-@
264
-
265
-\noindent Note the equivalence of the following calculations.
266
-
267
-<<totalCopynumber>>=
268
-ct <- ca+cb
269
-ct2 <- totalCopynumber(cnSet, i=snp.index, j=1:5)
270
-stopifnot(all.equal(ct, ct2))
271
-@
272
-
273
-At nonpolymorphic loci, \Rfunction{CA} function returns the total copy
274
-number and, by construction, the \Rfunction{CB} function returns 0.
275
-
276
-<<nonpolymorphicAutosomes>>=
277
-marker.index <- which(!isSnp(cnSet))
278
-ct <- CA(cnSet, i=marker.index, j=1:5)
279
-## all zeros
280
-stopifnot(all(CB(cnSet, i=marker.index, j=1:5) == 0))
281
-ct2 <- totalCopynumber(cnSet, i=marker.index, j=1:5)
282
-stopifnot(all.equal(ct, ct2))
283
-@
284
-
285
-In the following code chunk, we extract estimates of the total copy
286
-number at nonpolymorphic markers on chromosome X.
287
-
288
-<<npx>>=
289
-set.seed(123)
290
-npx.index <- which(chromosome(cnSet)==23 & !isSnp(cnSet))
291
-M <- sample(which(cnSet$gender[]==1), 5)
292
-F <- sample(which(cnSet$gender[]==2), 5)
293
-cn.M <- CA(cnSet, i=npx.index, j=M)
294
-cn.F <- CA(cnSet, i=npx.index, j=F)
295
-@
296
-
297
-\noindent Again, the function \Rfunction{totalCopynumber} is
298
-equivalent.
299
-
300
-<<npx2>>=
301
-cnX <- cbind(cn.M, cn.F)
302
-cnX2 <- totalCopynumber(cnSet, i=npx.index, j=c(M, F))
303
-stopifnot(all.equal(cnX, cnX2))
182
+ca <- CA(cnSet, i=snp.index, j=seq_len(ncol(cnSet)))
183
+cb <- CB(cnSet, i=snp.index, j=seq_len(ncol(cnSet)))
304 184
 @
305 185
 
306
-<<nonpolymorphicX,fig=TRUE,include=FALSE, echo=FALSE, eval=FALSE>>=
307
-df <- data.frame(cn=as.numeric(cnX), id=factor(rep(sampleNames(cnSet)[c(M,F)], each=nrow(cnX)), levels=sampleNames(cnSet)[c(M,F)], ordered=T))
308
-library(RColorBrewer)
309
-cols <- brewer.pal(8, "Accent")[c(1, 5)]
310
-print(bwplot(cn~id,df, panel=function(x,y,...){
311
-	panel.grid(v=-10,h=0)
312
-	panel.bwplot(x,y, ...)
313
-	panel.abline(h=1:2, col="grey70",lwd=2)
314
-},
315
-	scales=list(x=list(rot=90)), cex=0.5, ylab="total copy number", main="Chr X SNPs",
316
-       fill=cols[cnSet$gender[c(M,F)]], key=mykey))
317
-@
318
-
319
-%\begin{figure}[t!]
320
-% \centering
321
-% \includegraphics[width=0.5\textwidth]{copynumber-nonpolymorphicX}
322
-% \caption{Copy number estimates for nonpolymorphic loci on chromosome
323
-%   X (5 men, 5 women).  We assume that the median copy number across
324
-%   samples at a given marker on X is 1 for men and 2 for women.}
325
-%\end{figure}
326
-
327
-Polymorphic markers on chromosome X:
328 186
 
329
-<<polymorphicX,fig=TRUE,include=FALSE>>=
330
-library(RColorBrewer)
331
-cols <- brewer.pal(8, "Accent")[c(1, 5)]
332
-X.markers <- which(isSnp(cnSet) & chromosome(cnSet) == 23)
333
-cnX <- totalCopynumber(cnSet, i=X.markers, j=c(M,F))
334
-df <- data.frame(cn=as.numeric(cnX), id=factor(rep(sampleNames(cnSet)[c(M,F)], each=length(X.markers)), levels=sampleNames(cnSet)[c(M,F)], ordered=T))
335
-mykey <- simpleKey(c("male", "female"), points=FALSE,col=cols)
336
-print(bwplot(cn~id,df, panel=function(x,y,...){
337
-	panel.grid(v=-10,h=0)
338
-	panel.bwplot(x,y, ...)
339
-	panel.abline(h=1:2, col="grey70",lwd=2)
340
-},
341
-	scales=list(x=list(rot=90)), cex=0.5, ylab="total copy number", main="Chr X SNPs",
342
-       fill=cols[cnSet$gender[c(M,F)]], key=mykey))
343
-@
344
-\begin{figure}[t!]
345
- \centering
346
- \includegraphics[width=0.5\textwidth]{copynumber-polymorphicX}
347
- \caption{Copy number estimates for polymorphic markers on chromosome
348
-   X. crlmm assumes that the median copy number across samples at a
349
-   given marker on X is 1 for men and 2 for women.}
350
-\end{figure}
351
-
352
-\subsection{A container for raw copy number}
187
+\subsection{Container for raw copy number}
353 188
 
354 189
 A useful container for storing the \crlmm{} genotypes, genotype
355 190
 confidence scores, and the total copy number at each marker is the
... ...
@@ -362,28 +197,6 @@ instantiated \Rclass{oligoSnpSet} will also be \Rpackage{ff}-dervied
362 197
 objects (a new \verb+total_cn*.ff+ file will be created in the
363 198
 \verb+ldPath()+ directory).
364 199
 
365
-%The advantage of this approach is that the raw copy number estimates
366
-%can be computed for all markers and all samples simultaneously without
367
-%excessive RAM.  The disadvantage is that such a step creates
368
-%additional I/O for reading and writing the raw copy number estimates
369
-%from disk.  For very large data sets (1000+ samples), the time
370
-%required for I/O may outweigh the benefits.  The alternative is to
371
-%perform downstream processing (such as smoothing) on the total copy
372
-%number for subsets of markers and/or samples.  As indicated
373
-%previously, the subset `[' method applied to an object of class
374
-%\Robject{CNSet} will return an object of the same class with simple
375
-%matrices instead of \Rpackage{ff} objects.  Therefore, one can
376
-%instantiate a \Rpackage{oligoSnpSet} object from a \Robject{CNSet}
377
-%object without the I/O overhead for storing the total copy number
378
-%estimates by first subsetting the \Rclass{CNSet} object.  One could
379
-%then smooth the raw copy number estimates and store a summary copy
380
-%number for the interval, thereby substantially reducing the
381
-%dimensionality of the copy number estimates.  Examples of smoothing
382
-%applications such as circular binary segmentation implemented in the
383
-%\Rpackage{DNAcopy} package or a hidden Markov model implemented in the
384
-%\Rpackage{VanillaICE} package are provided in the \verb+SmoothingCN+
385
-%vignette.
386
-
387 200
 <<oligoSnpSet>>=
388 201
 open(cnSet3)
389 202
 oligoSet <- as(cnSet3, "oligoSnpSet")
... ...
@@ -398,19 +211,6 @@ the \Rfunction{totalCopynumber} function defined over the same row and
398 211
 column indices.
399 212
 
400 213
 <<testEqual>>=
401
-total.cn3 <- totalCopynumber(cnSet3, i=1:nrow(cnSet3), j=1:ncol(cnSet3))
214
+total.cn3 <- totalCopynumber(cnSet3, i=1:nrow(cnSet3), j=seq_len(ncol(cnSet3)))
402 215
 all.equal(copyNumber(oligoSet), total.cn3)
403
-@
404
-
405
-
406
-\section{Session information}
407
-<<sessionInfo, results=tex>>=
408
-toLatex(sessionInfo())
409
-@
410
-
411
-%\begin{bibliography}
412
-  \bibliographystyle{plain}
413
-  \bibliography{refs}
414
-%\end{bibliography}
415
-
416
-\end{document}
216
+@
417 217
\ No newline at end of file
Browse code

Update copy number vignettes. Pass cdfName to cnrmaAffy function.

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54327 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 31/03/2011 16:52:11
Showing1 changed files
... ...
@@ -38,7 +38,7 @@ options(continue=" ", width=70)
38 38
   vignettes for the Affymetrix and Illumina platforms,
39 39
   respectively. While this vignette uses Affymetrix 6.0 arrays for
40 40
   illustration, the steps at this point are identical for both
41
-  platforms.  See \citep{Scharpf2010} for details regarding the
41
+  platforms.  See \citep{Scharpf2011} for details regarding the
42 42
   methodology implemented in \crlmm{} for copy number analysis.  In
43 43
   addition, a compendium describing copy number analysis using the
44 44
   \crlmm{} package is available from the author's website:
... ...
@@ -80,7 +80,9 @@ the \crlmm{} package currently requires at least 10 samples per batch.
80 80
 The parameter estimates for copy number and the corresponding
81 81
 estimates of raw copy number will tend to be more noisy for batches
82 82
 with small sample sizes (e.g., $<$ 50).  Chemistry plate or scan date
83
-are often useful surrogates for batch.
83
+are often useful surrogates for batch.  Samples that were processed at
84
+similar times (e.g., in the same month) can be grouped together in the
85
+same batch.
84 86
 
85 87
 \section{Quality control}
86 88
 
... ...
@@ -98,7 +100,7 @@ quality.  The following code chunk makes a histogram of the SNR values
98 100
 for the HapMap samples.
99 101
 
100 102
 <<snr,fig=TRUE,include=FALSE,width=6, height=4>>=
101
-open(cnSet$SNR)
103
+invisible(open(cnSet$SNR))
102 104
 snr <- cnSet$SNR[]
103 105
 close(cnSet$SNR)
104 106
 print(histogram(~snr, panel=function(...){
... ...
@@ -120,13 +122,13 @@ print(histogram(~snr, panel=function(...){
120 122
 
121 123
 \section{Copy number estimation}
122 124
 
123
-As described in \cite{Scharpf2010}, the CRLMM-CopyNumber algorithm
125
+As described in \cite{Scharpf2011}, the CRLMM-CopyNumber algorithm
124 126
 fits a linear model to the normalized intensities stratified by the
125 127
 diallic genotype call.  The intercept and slope from the linear model
126 128
 are both SNP- and batch-specific.  The implementation in the \crlmm{}
127 129
 package is encapsulated by the function \Rfunction{crlmmCopynumber}
128 130
 that, using the default settings, can be called by passing a single
129
-object of class \code{CNSet}.  See the appropriate
131
+object of class \Rclass{CNSet}.  See the appropriate
130 132
 preprocessing/genotyping vignette for the construction of an object of
131 133
 class \Rclass{CNSet}.
132 134
 <<LDS_copynumber,cache=TRUE>>=
... ...
@@ -147,7 +149,7 @@ computed for subsets of the markers to reduce the required RAM. Note
147 149
 that the value returned by the \Rfunction{crlmmCopynumber} function in
148 150
 the above example is \verb+TRUE+.  The reason the function returns
149 151
 \verb+TRUE+ in the above example is that the elements of the
150
-\verb+batchStatistics+ slot have the class \Rclass{ff_matrix}.  Rather
152
+\verb+batchStatistics+ slot have the class \Rclass{ff\_matrix}.  Rather
151 153
 than keep the statistical summaries in memory, the summaries are
152 154
 written to files on disk using protocols described in the
153 155
 \Rpackage{ff} package. Hence, while the \Robject{cnSet} object itself
... ...
@@ -169,9 +171,9 @@ slot of the class \Rclass{CNSet}.  Using the default settings for the
169 171
 Note that depends on whether the elements of the \verb+batchStatistics+
170 172
 slot are \Robject{ff} objects or ordinary matrices.  In this example,
171 173
 the elements of \verb+batchStatistics+ have the class
172
-\Rclass{ff_matrix}.
174
+\Rclass{ff\_matrix}.
173 175
 
174
-<<LDS_copynumber,cache=TRUE>>=
176
+<<classes>>=
175 177
 nms <- ls(batchStatistics(cnSet))
176 178
 cls <- rep(NA, length(nms))
177 179
 for(i in seq_along(nms)) cls[i] <- class(batchStatistics(cnSet)[[nms[i]]])[1]
... ...
@@ -284,6 +286,7 @@ In the following code chunk, we extract estimates of the total copy
284 286
 number at nonpolymorphic markers on chromosome X.
285 287
 
286 288
 <<npx>>=
289
+set.seed(123)
287 290
 npx.index <- which(chromosome(cnSet)==23 & !isSnp(cnSet))
288 291
 M <- sample(which(cnSet$gender[]==1), 5)
289 292
 F <- sample(which(cnSet$gender[]==2), 5)
... ...
@@ -341,9 +344,9 @@ print(bwplot(cn~id,df, panel=function(x,y,...){
341 344
 \begin{figure}[t!]
342 345
  \centering
343 346
  \includegraphics[width=0.5\textwidth]{copynumber-polymorphicX}
344
-\caption{Copy number estimates for polymorphic markers on chromosome
345
-  X. crlmm assumes that the median copy number across samples at a
346
-  given marker on X is 1 for men and 2 for women.}
347
+ \caption{Copy number estimates for polymorphic markers on chromosome
348
+   X. crlmm assumes that the median copy number across samples at a
349
+   given marker on X is 1 for men and 2 for women.}
347 350
 \end{figure}
348 351
 
349 352
 \subsection{A container for raw copy number}
... ...
@@ -355,29 +358,31 @@ confidence scores, and the total copy number at each marker is the
355 358
 \Rfunction{as} (as illustrated below). Users should note that if the
356 359
 \verb+assayData+ elements in the \Rclass{CNSet} instance are
357 360
 \Rpackage{ff} objects, the \verb+assayData+ elements of the
358
-instantiated \Rclass{oligoSnpSet} will also be \Rpackage{ff} object (a
359
-new \verb+total_cn*.ff+ file will be created in the \verb+ldPath()+
360
-directory). The advantage of this approach is that the raw copy number
361
-estimates can be computed for all markers and all samples
362
-simultaneously without excessive RAM.  The disadvantage is that such a
363
-step creates additional I/O for reading and writing the raw copy
364
-number estimates from disk.  For very large data sets (1000+ samples),
365
-the time required for I/O may outweigh the benefits.  The alternative
366
-is to perform downstream processing (such as smoothing) on the total
367
-copy number for subsets of markers and/or samples.  As indicated
368
-previously, the subset `[' method applied to an object of class
369
-\Robject{CNSet} will return an object of the same class with simple
370
-matrices instead of \Rpackage{ff} objects.  Therefore, one can
371
-instantiate a \Rpackage{oligoSnpSet} object from a \Robject{CNSet}
372
-object without the I/O overhead for storing the total copy number
373
-estimates by first subsetting the \Rclass{CNSet} object.  One could
374
-then smooth the raw copy number estimates and store a summary copy
375
-number for the interval, thereby substantially reducing the
376
-dimensionality of the copy number estimates.  Examples of smoothing
377
-applications such as circular binary segmentation implemented in the
378
-\Rpackage{DNAcopy} package or a hidden Markov model implemented in the
379
-\Rpackage{VanillaICE} package are provided in the \verb+SmoothingCN+
380
-vignette.
361
+instantiated \Rclass{oligoSnpSet} will also be \Rpackage{ff}-dervied
362
+objects (a new \verb+total_cn*.ff+ file will be created in the
363
+\verb+ldPath()+ directory).
364
+
365
+%The advantage of this approach is that the raw copy number estimates
366
+%can be computed for all markers and all samples simultaneously without
367
+%excessive RAM.  The disadvantage is that such a step creates
368
+%additional I/O for reading and writing the raw copy number estimates
369
+%from disk.  For very large data sets (1000+ samples), the time
370
+%required for I/O may outweigh the benefits.  The alternative is to
371
+%perform downstream processing (such as smoothing) on the total copy
372
+%number for subsets of markers and/or samples.  As indicated
373
+%previously, the subset `[' method applied to an object of class
374
+%\Robject{CNSet} will return an object of the same class with simple
375
+%matrices instead of \Rpackage{ff} objects.  Therefore, one can
376
+%instantiate a \Rpackage{oligoSnpSet} object from a \Robject{CNSet}
377
+%object without the I/O overhead for storing the total copy number
378
+%estimates by first subsetting the \Rclass{CNSet} object.  One could
379
+%then smooth the raw copy number estimates and store a summary copy
380
+%number for the interval, thereby substantially reducing the
381
+%dimensionality of the copy number estimates.  Examples of smoothing
382
+%applications such as circular binary segmentation implemented in the
383
+%\Rpackage{DNAcopy} package or a hidden Markov model implemented in the
384
+%\Rpackage{VanillaICE} package are provided in the \verb+SmoothingCN+
385
+%vignette.
381 386
 
382 387
 <<oligoSnpSet>>=
383 388
 open(cnSet3)
... ...
@@ -388,7 +393,7 @@ class(copyNumber(oligoSet))
388 393
 
389 394
 \noindent Note that the raw copy number estimates stored in the
390 395
 \Robject{oligoSnpSet} object can be retrieved by the
391
-\Rfunction{copyNumber} accessor and is equivalanet to that returned by
396
+\Rfunction{copyNumber} accessor and is equivalent to that returned by
392 397
 the \Rfunction{totalCopynumber} function defined over the same row and
393 398
 column indices.
394 399
 
Browse code

Add vignettes for copy number analysis. Add crlmmGT2.R file

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54171 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 30/03/2011 02:41:00
Showing1 changed files
... ...
@@ -6,6 +6,7 @@
6 6
 \usepackage{graphicx}
7 7
 \usepackage{natbib}
8 8
 \usepackage{url}
9
+\usepackage{amsmath,amssymb}
9 10
 \newcommand{\Rfunction}[1]{{\texttt{#1}}}
10 11
 \newcommand{\Rmethod}[1]{{\texttt{#1}}}
11 12
 \newcommand{\Rcode}[1]{{\texttt{#1}}}
... ...
@@ -18,7 +19,7 @@
18 19
 \usepackage[margin=1in]{geometry}
19 20
 
20 21
 \begin{document}
21
-\title{Copy number estimation and genotype calling with \Rpackage{crlmm}}
22
+\title{Copy number estimation with \Rpackage{crlmm}}
22 23
 \date{\today}
23 24
 \author{Rob Scharpf}
24 25
 \maketitle
... ...
@@ -27,178 +28,202 @@
27 28
 options(continue=" ", width=70)
28 29
 @
29 30
 
30
-%\section{Estimating copy number}
31
-
32
-%At present, software for copy number estimation is provided only for the
33
-%Affymetrix 6.0 platform.
34
-
35 31
 \begin{abstract}
36
-  This vignette estimates copy number for HapMap samples on the
37
-  Affymetrix 6.0 platform.  See \citep{Scharpf2010} for details
38
-  regarding the methodology implemented in \crlmm{}.  In addition, a
39
-  compendium describing copy number analysis using the \crlmm{}
40
-  package is available from the author's website:
32
+
33
+  Copy number routines in the \crlmm{} package are available for
34
+  Affymetrix 5.0 and 6.0 platforms, as well as several Illumina
35
+  platforms.  This vignette assumes that the arrays have already been
36
+  successfully preprocessed and genotyped as per the instructions in
37
+  the \verb+AffymetrixPreprocessCN+ and \verb+IlluminaPreprocessCN+
38
+  vignettes for the Affymetrix and Illumina platforms,
39
+  respectively. While this vignette uses Affymetrix 6.0 arrays for
40
+  illustration, the steps at this point are identical for both
41
+  platforms.  See \citep{Scharpf2010} for details regarding the
42
+  methodology implemented in \crlmm{} for copy number analysis.  In
43
+  addition, a compendium describing copy number analysis using the
44
+  \crlmm{} package is available from the author's website:
41 45
   \url{http://www.biostat.jhsph.edu/~rscharpf/crlmmCompendium/index.html}.
42 46
 
43 47
 \end{abstract}
44 48
 
45
-\section{Copy number estimation}
46
-
47
-\subsection{Set up}
49
+\section{Set up}
48 50
 
49
-<<cdfname>>=
51
+<<libraries,results=hide>>=
52
+library(ff)
50 53
 library(crlmm)
54
+library(lattice)
51 55
 library(cacheSweave)
52
-@
53
-
54
-Several genotyping platforms are currently supported.  Supported
55
-platforms must have a corresponding annotation package (installed
56
-separately) that are listed below.
57
-
58
-<<annotationPackages>>=
59
-pkgs <- annotationPackages()
60
-pkgs <- pkgs[grep("Crlmm", pkgs)]
61
-pkgs
62
-@
63
-
64
-Note that 'pd.genomewidesnp6' and 'genomewidesnp6Crlmm' are both
65
-annotation packages for the Affymetrix 6.0 platform available on
66
-Bioconductor, but the 'genomewidesnp6Crlmm' annotation package must be
67
-used for copy number estimation.  The annotation package is specified
68
-through the 'cdfName' -- the identifier without the 'Crlmm' postfix.
69
-In the following code, we specify the cdf name for Affymetrix 6.0,
70
-provide the complete path to the CEL files, and indicate where
71
-intermediate files from the copy number estimation are to be saved.
72
-
73
-<<setup>>=
74
-cdfName <- "genomewidesnp6"
75
-pathToCels <- "/thumper/ctsa/snpmicroarray/hapmap/raw/affy/1m"
76 56
 if(getRversion() < "2.13.0"){
77 57
 	rpath <- getRversion()
78 58
 } else rpath <- "trunk"
79 59
 outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="")
80
-dir.create(outdir, recursive=TRUE, showWarnings=FALSE)
60
+@
61
+
62
+<<ram>>=
63
+ldPath(outdir)
81 64
 setCacheDir(outdir)
65
+ocProbesets(150e3)
66
+ocSamples(200)
82 67
 @
83 68
 
84
-All long computations are saved in the output directory
85
-\Robject{outdir}.  Users should change these variables as appropriate.
86
-The following code chunk should fail unless the above arguments have
87
-been set appropriately.
69
+We begin by loading the \Robject{cnSet} object created by the
70
+\verb+AffymetrixPreprocessCN+ vignette.
88 71
 
89
-<<checkSetup>>=
90
-if(!file.exists(outdir)) stop("Please specify a valid directory for storing output")
91
-if(!file.exists(pathToCels)) stop("Please specify the correct path to the CEL files")
72
+<<loadcnset>>=
73
+if(!exists("cnSet")) load(file.path(outdir, "cnSet.rda"))
92 74
 @
93 75
 
94
-Processed data from codechunks requiring long computations are saved
95
-to disk by wrapping function calls in the \Robject{checkExists}
96
-function.  After running this vignette as a batch job, subsequent
97
-calls to \verb@Sweave@ will load the saved computations from disk. See
98
-the \Rfunction{checkExists} help file for additional details.
99
-
100
-\subsection{Preprocessing and genotyping.}
101
-
102
-In the following code chunk, we provide the complete path to the
103
-Affymetrix CEL files and define a 'batch' variable. The
104
-\Robject{batch} variable will be used to initialize a container for
105
-storing the normalized intensities, the genotype calls, and the
106
-parameter estimates for copy number.  Often the chemistry plate or the
107
-scan date of the array is a useful surrogate for batch. For the HapMap
108
-CEL files in our analysis, the CEPH (C) and Yoruban (Y) samples were
109
-prepared on separate chemistry plates.  In the following code chunk,
110
-we extract the population identifier from the CEL file names and
111
-assign these identifiers to the variable \Robject{batch}.
112
-
113
-<<celfiles>>=
114
-celFiles <- list.celfiles(pathToCels, full.names=TRUE, pattern=".CEL")
115
-celFiles <- celFiles[substr(basename(celFiles), 13, 13) %in% c("C", "Y")]
116
-plates <- substr(basename(celFiles), 13, 13)
117
-@
118 76
 
119
-The preprocessing steps for copy number estimation includes quantile
120
-normalization of the raw intensities for each probe and a step that
121
-summarizes the intensities of multiple probes at a single locus.  For
122
-example, the Affymetrix 6.0 platform has 3 or 4 identical probes at
123
-each polymorphic locus and the normalized intensities are summarized
124
-by a median.  For the nonpolymorphic markers on Affymetrix 6.0, only
125
-one probe per locus is available and the summarization step is not
126
-needed.
127
-
128
-After preprocessing the arrays, the \crlmm{} package estimates the
129
-genotype and provides a confidence score at each polymorphic locus.
130
-Unless the dataset is small (e.g., fewer than 50 samples), we suggest
131
-installing and loading the \R{} package \Rpackage{ff} to reduce the
132
-RAM required for preprocessing and genotyping.  Loading the
133
-\Rpackage{ff} package at this point will automatically enable large
134
-data support (LDS).
135
-
136
-The function \Rfunction{genotype} checks to see whether the
137
-\Rpackage{ff} is loaded.  If loaded, the normalized intensities and
138
-genotype are stored as \Robject{ff} objects on disk.  Otherwise, the
139
-genotypes and normalized intensities are stored in matrices.  A word
140
-of caution: the \Rfunction{genotype} function without \Rpackage{ff}
141
-requires a potentially large amount of RAM.  A more RAM-friendly
142
-approach to preprocessing and genotyping requires the \Rpackage{ff}
143
-package.  In particular, the functions \Rfunction{ocProbesets} and
144
-\Rfunction{ocSamples} can be used to manage how many probesets and
145
-samples are to processed at a time and can therefore be used to fine
146
-tune the needed RAM for a particular job.  The function
147
-\Rfunction{ldPath} indicates that \Robject{ff} objects will be stored
148
-in the directory \Robject{outdir}.
149
-
150
-<<ff>>=
151
-library(ff)
152
-ldPath(outdir)
153
-ocProbesets(100000)
154
-ocSamples(200)
155
-@
77
+\textbf{Limitations:} While a minimum number of samples is not
78
+required for preprocessing and genotyping, copy number estimation in
79
+the \crlmm{} package currently requires at least 10 samples per batch.
80
+The parameter estimates for copy number and the corresponding
81
+estimates of raw copy number will tend to be more noisy for batches
82
+with small sample sizes (e.g., $<$ 50).  Chemistry plate or scan date
83
+are often useful surrogates for batch.
156 84
 
157
-With LDS enabled, we preprocess and genotype 180 samples from the CEPH
158
-and Yoruban populations.  Users interested only in the genotypes
159
-should instead use the \R{} function \Rfunction{crlmm} or
160
-\Rfunction{crlmm2}. We wrap the call to \Rfunction{genotype} in
161
-\Rfunction{checkExists} so that subsequent calls to \verb@Sweave@ can
162
-be run interactively.
85
+\section{Quality control}
163 86
 
87
+The signal to noise ratio (SNR) estimated by the CRLMM genotyping
88
+algorithm is an overall measure of the separation of the diallelic
89
+genotype clusters at polymorphic loci and can be a useful measure of
90
+array quality.  Small SNR values can indicate possible problems with
91
+the DNA.  Depending on the size of the dataset and the number of
92
+samples with low SNR, users may wish to rerun the preprocessing and
93
+genotyping steps after excluding samples with low SNR.  The SNR is
94
+stored in the \verb+phenoData+ slot of the \Robject{CNSet} object and
95
+is available after preprocessing and genotyping. SNR values below 5
96
+for Affymetrix or below 25 for Illumina may indicate poor sample
97
+quality.  The following code chunk makes a histogram of the SNR values
98
+for the HapMap samples.
164 99
 
165
-<<LDS_genotype, cache=TRUE>>=
166
-cnSet <- genotype(celFiles, batch=plates, cdfName=cdfName)
100
+<<snr,fig=TRUE,include=FALSE,width=6, height=4>>=
101
+open(cnSet$SNR)
102
+snr <- cnSet$SNR[]
103
+close(cnSet$SNR)
104
+print(histogram(~snr, panel=function(...){
105
+	panel.histogram(...)
106
+	panel.abline(v=5, col="grey",lty=2)
107
+},
108
+		breaks=25,xlim=c(4.5,10), xlab="SNR"))
167 109
 @
110
+\begin{figure}[t!]
111
+  \begin{center}
112
+  \includegraphics[width=0.6\textwidth]{copynumber-snr.pdf}
113
+  \caption{The signal to noise ratio (SNR) for 180 HapMap samples. For
114
+    Affymetrix platforms, SNR values below 5 can indicate possible
115
+    problems with sample quality.  In some circumstances, it may be
116
+    more helpful to exclude samples with poor DNA quality.}
117
+\end{center}
118
+\end{figure}
168 119
 
169
-The value returned by genotype is an instance of the class
170
-\Robject{CNSet}.  In addition to the normalization and genotyping, the
171
-\Robject{genotype} function initializes a container that will store
172
-summary statistics for the batches and parameters needed for copy
173
-number estimation.  At this point, the batch summaries and parameters
174
-for copy number are all NA's.
175 120
 
176
-\subsection{Copy number estimation.}
121
+\section{Copy number estimation}
122
+
123
+As described in \cite{Scharpf2010}, the CRLMM-CopyNumber algorithm
124
+fits a linear model to the normalized intensities stratified by the
125
+diallic genotype call.  The intercept and slope from the linear model
126
+are both SNP- and batch-specific.  The implementation in the \crlmm{}
127
+package is encapsulated by the function \Rfunction{crlmmCopynumber}
128
+that, using the default settings, can be called by passing a single
129
+object of class \code{CNSet}.  See the appropriate
130
+preprocessing/genotyping vignette for the construction of an object of
131
+class \Rclass{CNSet}.
132
+<<LDS_copynumber,cache=TRUE>>=
133
+(cnSet.updated <- crlmmCopynumber(cnSet))
134
+@
177 135
 
178
-The \Rfunction{crlmmCopynumber} performs the following steps:
136
+The following steps were performed by the \Rfunction{crlmmCopynumber}
137
+function:
179 138
 \begin{itemize}
180
-\item computes summary statistics for each batch
181
-\item imputes unobserved genotype centers (for each batch)
182
-\item shrinks the within-genotype variances
183
-\item estimates parameters for allele-specific copy number
139
+\item sufficient statistics for the genotype clusters for
140
+  each batch
141
+\item unobserved genotype centers imputed
142
+\item posterior summaries of sufficient statistics
143
+\item intercept and slope for linear model
184 144
 \end{itemize}
185
-
186
-  With \texttt{verbose=TRUE}, the above steps for CN estimation are
187
-  displayed during the processing.
145
+Depending on the value of \verb+ocProbesets()+, these summaries are
146
+computed for subsets of the markers to reduce the required RAM. Note
147
+that the value returned by the \Rfunction{crlmmCopynumber} function in
148
+the above example is \verb+TRUE+.  The reason the function returns
149
+\verb+TRUE+ in the above example is that the elements of the
150
+\verb+batchStatistics+ slot have the class \Rclass{ff_matrix}.  Rather
151
+than keep the statistical summaries in memory, the summaries are
152
+written to files on disk using protocols described in the
153
+\Rpackage{ff} package. Hence, while the \Robject{cnSet} object itself
154
+is unchanged as a result of the \Rfunction{crlmmCopynumber} function,
155
+the data on disk is updated accordingly.  Users that are interested in
156
+accessing these low-level summaries can refer to the
157
+\verb+Infrastructure+ vignette.  Computation of the raw copy number
158
+estimates for each allele is described in the following section.
159
+
160
+For users that are interested in the analysis of a specific chromosome
161
+(subset of markers) or a s
162
+
163
+pointers to files on disk, are stored in the \verb+batchStatistics+
164
+slot of the class \Rclass{CNSet}.  Using the default settings for the
165
+\Rfunction{crlmmCopynumber} function, only an object of class
166
+\Rclass{CNSet} is required.  %(See \verb+AffyPreprocessCN+ or
167
+%\verb+IlluminaPreprocessCN+ vignettes for details.)
168
+
169
+Note that depends on whether the elements of the \verb+batchStatistics+
170
+slot are \Robject{ff} objects or ordinary matrices.  In this example,
171
+the elements of \verb+batchStatistics+ have the class
172
+\Rclass{ff_matrix}.
188 173
 
189 174
 <<LDS_copynumber,cache=TRUE>>=
175
+nms <- ls(batchStatistics(cnSet))
176
+cls <- rep(NA, length(nms))
177
+for(i in seq_along(nms)) cls[i] <- class(batchStatistics(cnSet)[[nms[i]]])[1]
178
+all(cls == "ff_matrix")
179
+@
180
+
181
+The batch-specific statistical summaries computed by
182
+\Robject{crlmmCopynumber} are written to files on disk using protocols
183
+described in the \R{} package \Rpackage{ff}. The value returned by
184
+\Rfunction{crlmmCopynumber} is \verb+TRUE+, indicating that the files
185
+on disk have been successfully updated.  Note that while the
186
+\Robject{cnSet} object is unchanged, the values on disk are different.
187
+
188
+
189
+\noindent On the other hand, subsetting the \Robject{cnSet} with the
190
+`[' method coerces all of the elements to class \Rclass{matrix}. The
191
+batch-specific summaries are now ordinary matrices stored in RAM. The
192
+object returned by \Robject{crlmmCopynumber} is an object of class
193
+\Rclass{CNSet} with the matrices in the \verb+batchStatistics+ slot
194
+updated.
195
+
196
+<<chr1index>>=
197
+chr1.index <- which(chromosome(cnSet) == 1)
190 198
 open(cnSet)
191
-cnSet.updated <- crlmmCopynumber(cnSet)
192 199
 @
193 200
 
194
-In an effort to reduce I/O, the \Rpackage{crlmmCopynumber} function no
195
-longer stores the allele-specific estimates of copy number as part of
196
-the object.  Rather, several functions are available that will compute
197
-relatively quickly the allele-specific estimates from the stored
198
-normalized intensities and the linear model parameters. At allele $k$,
201
+<<subset,cache=TRUE>>=
202
+cnSet2 <- cnSet[chr1.index, ]
203
+@
204
+
205
+<<valuematrix>>=
206
+close(cnSet)
207
+for(i in seq_along(nms)) cls[i] <- class(batchStatistics(cnSet2)[[nms[i]]])[1]
208
+all(cls == "matrix")
209
+@
210
+
211
+<<matrix_copynumber,cache=TRUE>>=
212
+cnSet3 <- crlmmCopynumber(cnSet2)
213
+class(cnSet3)
214
+@
215
+
216
+<<clean, echo=FALSE, results=hide>>=
217
+rm(cnSet2); gc()
218
+@
219
+
220
+\subsection{Raw copy number}
221
+
222
+Several functions are available that will compute relatively quickly
223
+the allele-specific, \emph{raw} copy number estimates. At allele $k$,
199 224
 marker $i$, sample $j$, and batch $p$, the estimate of allele-specific
200 225
 copy number is computed by subtracting the estimated background from
201
-the observed intensity and scaling by the slope coefficient.
226
+the normalized intensity and scaling by the slope coefficient. More formally,
202 227
 \newcommand{\A}{A} \newcommand{\B}{B}
203 228
 \begin{eqnarray}
204 229
   \label{eq:cnK}
... ...
@@ -208,30 +233,43 @@ the observed intensity and scaling by the slope coefficient.
208 233
 \end{eqnarray}
209 234
 \noindent See \cite{Scharpf2010} for details.
210 235
 
211
-For large datasets, the above computation will not be instantaneous as
212
-the I/O can be substantial.  The functions \Rfunction{CA},
213
-\Rfunction{CB}, and \Rfunction{totalCopynumber} should be used to
214
-extract CN estimates from the \Robject{CNSet} container.  The
215
-following code chunks illustrate several examples, as well as some of
216
-the useful accessors for extracting which markers are SNPs, which are
217
-chromosomes, etc.
236
+The function \Rfunction{totalCopynumber} translates the normalized
237
+intensities to an estimate of raw copy number by adding the
238
+allele-specific summaries in Equation \eqref{eq:cnK}. For large
239
+datasets, the calculation will not be instantaneous as the I/O can be
240
+substantial.  Users should specify either a subset of the markers or a
241
+subset of the samples to avoid using all of the available RAM.  For
242
+example, in the following code chunk we compute the total copy number
243
+at all markers for the first 2 samples, and the total copy number for
244
+chromosome 20 for the first 50 samples.
245
+
246
+<<totalCopynumber>>=
247
+tmp <- totalCopynumber(cnSet, i=1:nrow(cnSet), j=1:2)
248
+dim(tmp)
249
+tmp2 <- totalCopynumber(cnSet, i=which(chromosome(cnSet) == 20), j=1:50)
250
+dim(tmp2)
251
+@
252
+
253
+Alternatively, the functions \Rfunction{CA} and \Rfunction{CB} compute
254
+the allele-specific copy number.  For instance, the following code
255
+chunk computes the allele-specific summaries at all polymorphic loci.
218 256
 
219 257
 <<ca>>=
220 258
 snp.index <- which(isSnp(cnSet) & !is.na(chromosome(cnSet)))
221 259
 ca <- CA(cnSet, i=snp.index, j=1:5)
222 260
 cb <- CB(cnSet, i=snp.index, j=1:5)
223
-ct <- ca+cb
224 261
 @
225 262
 
226
-Alternatively, total copy number can be obtained by
227
-<<totalCopynumber.snps>>=
263
+\noindent Note the equivalence of the following calculations.
264
+
265
+<<totalCopynumber>>=
266
+ct <- ca+cb
228 267
 ct2 <- totalCopynumber(cnSet, i=snp.index, j=1:5)
229 268
 stopifnot(all.equal(ct, ct2))
230 269
 @
231 270
 
232
-At nonpolymorphic loci, either the \Rfunction{CA} or
233
-\Rfunction{totalCopynumber} functions can be used to obtain estimates
234
-of total copy number.
271
+At nonpolymorphic loci, \Rfunction{CA} function returns the total copy
272
+number and, by construction, the \Rfunction{CB} function returns 0.
235 273
 
236 274
 <<nonpolymorphicAutosomes>>=
237 275
 marker.index <- which(!isSnp(cnSet))
... ...
@@ -242,392 +280,121 @@ ct2 <- totalCopynumber(cnSet, i=marker.index, j=1:5)
242 280
 stopifnot(all.equal(ct, ct2))
243 281
 @
244 282
 
283
+In the following code chunk, we extract estimates of the total copy
284
+number at nonpolymorphic markers on chromosome X.
245 285
 
246
-Nonpolymorphic markers on chromosome X:
247
-
248
-<<nonpolymorphicX, fig=TRUE, width=8, height=4>>=
286
+<<npx>>=
249 287
 npx.index <- which(chromosome(cnSet)==23 & !isSnp(cnSet))
250
-M <- sample(which(cnSet$gender==1), 5)
251
-F <- sample(which(cnSet$gender==2), 5)
288
+M <- sample(which(cnSet$gender[]==1), 5)
289
+F <- sample(which(cnSet$gender[]==2), 5)
252 290
 cn.M <- CA(cnSet, i=npx.index, j=M)
253 291
 cn.F <- CA(cnSet, i=npx.index, j=F)
254
-boxplot(data.frame(cbind(cn.M, cn.F)), pch=".", col="grey60", outline=FALSE)
255 292
 @
256 293
 
294
+\noindent Again, the function \Rfunction{totalCopynumber} is
295
+equivalent.
257 296
 
258
-\begin{figure}[t!]
259
- \centering
260
- \includegraphics[width=0.8\textwidth]{copynumber-nonpolymorphicX}
261
-  \caption{Copy number estimates for nonpolymorphic loci on chromosome
262
-    X (5 men, 5 women).  crlmm assumes that the median copy number
263
-    across samples at a given marker on X is 1 for men and 2 for women.}
264
-\end{figure}
297
+<<npx2>>=
298
+cnX <- cbind(cn.M, cn.F)
299
+cnX2 <- totalCopynumber(cnSet, i=npx.index, j=c(M, F))
300
+stopifnot(all.equal(cnX, cnX2))
301
+@
302
+
303
+<<nonpolymorphicX,fig=TRUE,include=FALSE, echo=FALSE, eval=FALSE>>=
304
+df <- data.frame(cn=as.numeric(cnX), id=factor(rep(sampleNames(cnSet)[c(M,F)], each=nrow(cnX)), levels=sampleNames(cnSet)[c(M,F)], ordered=T))
305
+library(RColorBrewer)
306
+cols <- brewer.pal(8, "Accent")[c(1, 5)]
307
+print(bwplot(cn~id,df, panel=function(x,y,...){
308
+	panel.grid(v=-10,h=0)
309
+	panel.bwplot(x,y, ...)
310
+	panel.abline(h=1:2, col="grey70",lwd=2)
311
+},
312
+	scales=list(x=list(rot=90)), cex=0.5, ylab="total copy number", main="Chr X SNPs",
313
+       fill=cols[cnSet$gender[c(M,F)]], key=mykey))
314
+@
315
+
316
+%\begin{figure}[t!]
317
+% \centering
318
+% \includegraphics[width=0.5\textwidth]{copynumber-nonpolymorphicX}
319
+% \caption{Copy number estimates for nonpolymorphic loci on chromosome
320
+%   X (5 men, 5 women).  We assume that the median copy number across
321
+%   samples at a given marker on X is 1 for men and 2 for women.}
322
+%\end{figure}
265 323
 
266 324
 Polymorphic markers on chromosome X:
267 325
 
268
-\begin{figure}[t!]
269
- \centering
270
-<<polymorphicX, fig=TRUE, width=8, height=4>>=
271
-## copy number estimates on X for SNPs are biased towards small values.
326
+<<polymorphicX,fig=TRUE,include=FALSE>>=
327
+library(RColorBrewer)
328
+cols <- brewer.pal(8, "Accent")[c(1, 5)]
272 329
 X.markers <- which(isSnp(cnSet) & chromosome(cnSet) == 23)
273
-ca.M <- CA(cnSet, i=X.markers, j=M)
274
-cb.M <- CB(cnSet, i=X.markers, j=M)
275
-ca.F <- CA(cnSet, i=X.markers, j=F)
276
-cb.F <- CB(cnSet, i=X.markers, j=F)
277
-cn.M <- ca.M+cb.M
278
-cn.F <- ca.F+cb.F
279
-par(las=1)
280
-boxplot(data.frame(cbind(cn.M, cn.F)), pch=".", outline=FALSE, col=cnSet$gender[c(M, F)], xaxt="n", ylim=c(0,5))
281
-legend("topleft", fill=unique(cnSet$gender[c(M,F)]), legend=c("Male", "Female"))
282
-abline(h=c(1,2))
283
-cn2 <- totalCopynumber(cnSet, i=X.markers, j=c(M, F))
284
-stopifnot(all.equal(cbind(cn.M, cn.F), cn2))
330
+cnX <- totalCopynumber(cnSet, i=X.markers, j=c(M,F))
331
+df <- data.frame(cn=as.numeric(cnX), id=factor(rep(sampleNames(cnSet)[c(M,F)], each=length(X.markers)), levels=sampleNames(cnSet)[c(M,F)], ordered=T))
332
+mykey <- simpleKey(c("male", "female"), points=FALSE,col=cols)
333
+print(bwplot(cn~id,df, panel=function(x,y,...){
334
+	panel.grid(v=-10,h=0)
335
+	panel.bwplot(x,y, ...)
336
+	panel.abline(h=1:2, col="grey70",lwd=2)
337
+},
338
+	scales=list(x=list(rot=90)), cex=0.5, ylab="total copy number", main="Chr X SNPs",
339
+       fill=cols[cnSet$gender[c(M,F)]], key=mykey))
285 340
 @
341
+\begin{figure}[t!]
342
+ \centering
343
+ \includegraphics[width=0.5\textwidth]{copynumber-polymorphicX}
286 344
 \caption{Copy number estimates for polymorphic markers on chromosome
287 345
   X. crlmm assumes that the median copy number across samples at a
288 346
   given marker on X is 1 for men and 2 for women.}
289 347
 \end{figure}
290 348
 
291
-Accessors for physical position and chromosome are also provided. In
292
-the following codechunk we extract the position and chromosome for the
293
-first 10 markers in the \Robject{cnSet} object.
294
-
295
-<<fdAccessors>>=
296
-position(cnSet)[1:10]
297
-chromosome(cnSet)[1:10]
298
-@
299
-
300
-\section{The CNSet container}
301
-
302
-The objects returned by the \Rfunction{genotype} and
303
-\Rfunction{crlmmCopynumber} have assay data elements that are pointers
304
-to \Robject{ff} objects stored in the directory \Robject{outdir}.  Had
305
-we not loaded the \Rpackage{ff} prior to running these functions, the
306
-\Robject{AssayData} elements would be ordinary matrices, though the
307
-RAM required for running the algorithm would be substantial. The
308
-functions \Rfunction{open} and \Rfunction{close} open and close the
309
-connections to the \Robject{assayData} elements. Subsetting an
310
-\Robject{ff} object pulls the data from disk into memory and should be
311
-used with caution. In particular, subsetting the \Robject{gtSet} would
312
-subset each element in the \Robject{assayData} slot, returning an
313
-object of the same class but with \Robject{assayData} elements that
314
-are matrices.  Such an operation can be exceedingly slow when
315
-performed over a network and require subsantial RAM.  The preferred
316
-approach is to extract only the assay data element that is needed. In
317
-the example below, we extract the genotype calls for the the first 50
318
-samples.
319
-
320
-<<check>>=
321
-dims(cnSet)
322
-print(object.size(cnSet), units="Mb")
323
-invisible(open(calls(cnSet)))
324
-gt <- calls(cnSet)[, 1:50]
325
-@
326
-
327
-The \Robject{CNSet} class also contains the slot
328
-\Robject{batchStatistics} that contains batch-specific summaries
329
-needed for copy number estimation. In particular, each element is a
330
-matrix (or an ff object) with R rows and C columns, correspoinding to
331
-R markers and C batches.  The summaries includes the within genotype
332
-cluster medians and median absolute deviations (mads), but also
333
-parameters estimated from the linear model.  (For unobserved
334
-genotypes, the medians are imputed and the variance is obtained the
335
-median variance (across markers) within a batch. ) The elements of the
336
-slot can be listed as follows.
337
-
338
-<<batchStatistics>>=
339
-assayDataElementNames(batchStatistics(cnSet))
340
-@
341
-
342
-Note that for the Affymetrix 6.0 platform the assay data elements each
343
-have a row dimension corresponding to the total number of polymorphic
344
-and nonpolymorphic markers interrogated by the Affymetrix 6.0
345
-platform.  A consequence of keeping the rows of the assay data
346
-elements the same for all of the statistical summaries is that the
347
-matrix used to store genotype calls is larger than necessary.  Also,
348
-note the additional overhead of some operations when using
349
-\Robject{ff} objects.  For instance, the posterior probabilities for
350
-the CRLMM genotype calls are represented as integers. The accessor
351
-\Robject{snpCallProbability} can be used to access these confidence
352
-scores.  When stored as matrices, converting the integer
353
-representation back to the probability scale is straightforward as
354
-shown below.  However, for the \Robject{ff} objects we must first
355
-convert the ff object to a matrix. One could use the function
356
-\Rfunction{[,]} but this could be slow and require a lot of RAM
357
-depending on the size of the dataset. We suggest pulling only the
358
-needed rows and columns from memory. In the following example, we
359
-convert the integer scores to probabilities for the CEPH samples.  As
360
-genotype confidence scores are not applicable to the nonpolymorphic
361
-markers, we extract only the polymorphic markers using the
362
-\Rfunction{isSnp} function.
363
-
364
-<<assayData>>=
365
-rows <- which(isSnp(cnSet))
366
-cols <- which(batch(cnSet) == "C")
367
-invisible(open(snpCallProbability(cnSet)))
368
-posterior.prob <- tryCatch(i2p(snpCallProbability(cnSet)),
369
-			   error=function(e) print("This will not work for an ff object."))
370
-@
371
-
372
-Accessors for the quantile normalized intensities for the A allele at
373
-polymorphic loci:
374
-
375
-<<accessors>>=
376
-snp.index <- which(isSnp(cnSet))
377
-np.index <- which(!isSnp(cnSet))
378
-invisible(open(A(cnSet)))
379
-a <- (A(cnSet))[snp.index, ]
380
-dim(a)
381
-@
382
-
383
-The extra set of parentheses surrounding \Robject{A(cnSet2)} above is
384
-added to emphasize the appropriate order of operations. Subsetting the
385
-entire \Robject{cnSet} object in the following, unevaluated codechunk
386
-should be avoided for large datasets.
387
-
388
-<<eval=FALSE>>=
389
-a <- A(cnSet[snp.index, ])
390
-@
391
-
392
-The quantile-normalized intensities for nonpolymorphic loci are
393
-obtained by:
394
-
395
-<<>>=
396
-npIntensities <- (A(cnSet))[np.index, ]
397
-invisible(close(A(cnSet)))
398
-@
399
-
400
-Quantile normalized intensities for the B allele at polymorphic loci:
401
-
402
-<<B.polymorphic>>=
403
-invisible(open(B(cnSet)))
404
-b.snps <- (B(cnSet))[snp.index, ]
405
-@
406
-
407
-Note that NA's are stored in the slot for normalized 'B' allele
408
-intensities:
409
-
410