Browse code

merge collab branch

* collab:
update vignettes/Makefile
update .gitignore
update table in CopyNumberViewiew
bump dependency on oligoClasses version
Update alias for "[" method for PredictionRegion objects
Replaced CopyNumberOverview and Infrastructure vignettes with those in the release branch. The versions in release have removed AffymetrixPreprocessCN, which is no longer available

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@71322 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 19/11/2012 04:49:34
Showing1 changed files
... ...
@@ -29,14 +29,13 @@
29 29
 
30 30
   This vignette provides an overview of the \Rclass{CNSet} class and a
31 31
   brief discussion of the underlying infrastructure for large data
32
-  support with the \Rpackage{ff} package.  This package instantiates
32
+  support with the \Rpackage{ff} package.  This vignette instantiates
33 33
   an object of class \Rclass{CNSet} using a trivial dataset with 3
34 34
   files.  As this sample size is too small for estimating copy number
35 35
   with the \crlmm{} package, the final section of this vignette loads
36 36
   an object created by the analysis of 180 HapMap CEL files
37
-  (Affymetrix 6.0 platform).  In particular, this object was
38
-  instantiated by running (1) the \verb+AffymetrixPreprocessCN+
39
-  vignette and (2) the \verb+copynumber+ vignette.
37
+  (Affymetrix 6.0 platform).  This object was instantiated by running
38
+  the \verb+AffyGW+ vignette.
40 39
 
41 40
 \end{abstract}
42 41
 
... ...
@@ -76,14 +75,14 @@ processing subsets of the markers and/or samples. The functions
76 75
 markers and samples to read at once.  In general, specifying smaller
77 76
 values should reduce the RAM required for a particular job.  In
78 77
 general, smaller values will increase the run-time. In the following
79
-code-chunk, we declare that \crlmm{} should process 150,000 markers at
80
-a time (when possible) and 500 samples at a time.  If our dataset
81
-contained fewer than 500 samples, the \Rfunction{ocSamples} option
78
+code-chunk, we declare that \crlmm{} should process 100,000 markers at
79
+a time (when possible) and 200 samples at a time.  If our dataset
80
+contained fewer than 200 samples, the \Rfunction{ocSamples} option
82 81
 would not have any effect.  One can view the current settings for
83 82
 these commands, by typing the functions without an argument.
84 83
 
85 84
 <<ram>>=
86
-ocProbesets(50e3)
85
+ocProbesets(100e3)
87 86
 ocSamples(200)
88 87
 @
89 88
 
... ...
@@ -96,11 +95,12 @@ methods:
96 95
 
97 96
 \begin{itemize}
98 97
 
99
-\item[Approach 1:] during the preprocessing of the raw intensities for Illumina and
100
-  Affymetrix arrays by the the functions \Rfunction{constructInf} and
101
-  \Rfunction{genotype}, respectively. (The \Rfunction{genotype} calls
102
-  the non-exported function \Rfunction{constructAffy} to initialize a
103
-  \Rclass{CNSet} object for Affymetrix platforms.)
98
+\item[Approach 1:] during the preprocessing of the raw intensities for
99
+  Illumina and Affymetrix arrays by the the functions
100
+  \Rfunction{constructInf} and \Rfunction{genotype},
101
+  respectively. (The \Rfunction{genotype} calls the function
102
+  \Rfunction{constructAffy} to initialize a \Rclass{CNSet} object for
103
+  Affymetrix platforms.)
104 104
 
105 105
 \item[Approach 2:] by subsetting an existing \Robject{CNSet} object.
106 106
   As per usual, the `[' method can be used to extract a subset of
... ...
@@ -110,17 +110,18 @@ methods:
110 110
 \end{itemize}
111 111
 
112 112
 There are important differences in the underlying data representation
113
-depending on how the object was instantiated.  In particular, objects
114
-generated by the functions \Rfunction{constructInf} and
115
-\Rfunction{genotype} store high-dimensional data on disk rather than
116
-in memory through protocols defined in the \R{} package \ff{}.  For
117
-instance, the normalized intensities and genotype calls in a
118
-\Rclass{CNSet}-instance from approach (1) are \ff{}-derived objects.
119
-By contrast, when such an objected generated by approach (1) is subset
120
-by the `[' method, an object of the same class is returned but the
121
-\Rpackage{ff}-derived objects are coerced to ordinary matrices. Note,
122
-therefore, that both approaches (1) and (2) may involve substantial
123
-I/O.
113
+depending on how the \Rclass{CNSet} object was instantiated.  In
114
+particular, objects generated by the functions
115
+\Rfunction{constructInf} and \Rfunction{genotype} store
116
+high-dimensional data on disk rather than in memory through protocols
117
+defined in the \R{} package \ff{}.  For instance, the normalized
118
+intensities and genotype calls in a \Rclass{CNSet}-instance from
119
+approach (1) are \ff{}-derived objects.  By contrast, when such an
120
+objected generated by approach (1) is subset by the `[' method, an
121
+object of the same class is returned but the \Rpackage{ff}-derived
122
+objects are coerced to ordinary matrices. Note, therefore, that both
123
+approaches (1) and (2) may involve substantial I/O and that (2) should
124
+be performed judiciously.
124 125
 
125 126
 \subsubsection{Approach 1}
126 127
 
... ...
@@ -149,7 +150,7 @@ celfiles <- list.celfiles(path, full.names=TRUE)
149 150
 Typically, an object of class \Rclass{CNSet} is instantiated as part
150 151
 of the preprocessing and genotyping by calling the function
151 152
 \Rfunction{genotype}, as illustrated in the
152
-\verb+AffymtrixPreprocessCN+ vignette.
153
+\verb+AffyGW+ vignette.
153 154
 
154 155
 <<instantiateToyExample,cache=TRUE>>=
155 156
 exampleSet <- genotype(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6")
... ...
@@ -163,11 +164,11 @@ ldPath()
163 164
 @
164 165
 
165 166
 One could also instantiate an object of class \Rclass{CNSet} without
166
-preprocessing/genotyping by calling the non-exported function
167
+preprocessing/genotyping by calling the exported function
167 168
 \Rfunction{constructAffy} directly using the \Rfunction{:::} operator.
168 169
 
169 170
 <<constructAffy,eval=FALSE>>=
170
-tmp <- crlmm:::constructAffy(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6")
171
+tmp <- constructAffy(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6")
171 172
 @
172 173
 
173 174
 The \Rfunction{show} method provides a concise summary of the
... ...
@@ -300,15 +301,6 @@ platform.  A consequence of keeping the rows of the assay data
300 301
 elements the same for all of the statistical summaries is that the
301 302
 matrix used to store genotype calls is larger than necessary.
302 303
 
303
-% only true for affy.  for illumina, we have intensities stored in 'B'
304
-%Note that NA's are stored in the slot for normalized 'B' allele
305
-%intensities:
306
-
307
-%<<B.NAs>>=
308
-%np.index <- which(!is.snp)
309
-%stopifnot(all(is.na(B(cnSet)[np.index, ])))
310
-%@
311
-
312 304
 \subsubsection{\texttt{batch} and \texttt{batchStatistics}}
313 305
 
314 306
 As defined in Leek \textit{et al.} 2010, \textit{Batch effects are
... ...
@@ -402,73 +394,10 @@ varLabels(protocolData(exampleSet))
402 394
 protocolData(exampleSet)$ScanDate
403 395
 @
404 396
 
405
-
406
-%\section{Suggested visualizations}
407
-%
408
-%\paragraph{SNR.}
409
-%
410
-%A histogram of the signal to noise ratio for the HapMap samples:
411
-%
412
-%<<plotSnr, fig=TRUE, include=FALSE>>=
413
-%open(cnSet$SNR)
414
-%hist(cnSet$SNR[, ], xlab="SNR", main="", breaks=25, col="lightblue", xlim=c(3, max(cnSet$SNR[])))
415
-%abline(v=5, lty=2, col="grey")
416
-%text(3,5, label="SNR range for low \n quality arrays", adj=0, col="grey40")
417
-%@
418
-%\begin{figure}
419
-%  \centering
420
-%  \includegraphics[width=0.6\textwidth]{copynumber-plotSnr}
421
-%  \caption{Signal to noise ratios for the HapMap samples. SNRs below 5
422
-%   for the Affymetrix platform are often samples of lower quality.
423
-%   Such samples will tend to have much more variable estimates of copy
424
-%   number.}
425
-%\end{figure}
426
-
427
-
428
-%\paragraph{One sample at a time: locus-level estimates}
429
-%
430
-%Figure \ref{fig:oneSample} plots physical position (horizontal axis)
431
-%versus copy number (vertical axis) for the first sample.  There is
432
-%less information to estimate copy number at nonpolymorphic loci;
433
-%improvements to the univariate prediction regions at nonpolymorphic
434
-%loci are a future area of research. If the \Rpackage{SNPchip} is
435
-%available, an idiogram can be added to the existing plotting
436
-%coordinates as indicated in the following example.
437
-%
438
-%<<oneSample, fig=TRUE, width=8, height=4, include=FALSE>>=
439
-%GT.CONF.THR <- 0.8
440
-%marker.index <- which(chromosome(cnSet) == 1)
441
-%cn <- totalCopynumber(cnSet, i=marker.index, j=1)
442
-%x <- position(cnSet)[marker.index]
443
-%par(las=1, mar=c(4, 5, 4, 2))
444
-%plot(x, cn, pch=".",
445
-%     cex=2, xaxt="n", col="grey60", ylim=c(0,6),
446
-%     ylab="copy number", xlab="physical position (Mb)",
447
-%     main=paste(sampleNames(cnSet)[1], ", CHR: 1"))
448
-%axis(1, at=pretty(x), labels=pretty(x)/1e6)
449
-%require(SNPchip)
450
-%invisible(plotCytoband(1, new=FALSE, cytoband.ycoords=c(5.5, 6), label.cytoband=FALSE))
451
-%@
452
-
453
-%\begin{figure}
454
-%  \includegraphics[width=0.9\textwidth]{copynumber-oneSample}
455
-%  \caption{\label{fig:oneSample} Total copy number (y-axis) for
456
-%    chromosome 1 plotted against physical position (x-axis) for one
457
-%    sample.  Estimates at nonpolymorphic loci are plotted in light
458
-%    blue.}
459
-%\end{figure}
460
-%
461
-%\clearpage
462
-%\paragraph{One SNP at a time}
463
-%
464
-%Scatterplots of the A and B allele intensities (log-scale) can be
465
-%useful for assessing the biallelic genotype calls.  This section of
466
-%the vignette is currently under development.
467
-
468 397
 \section{Trouble shooting with a HapMap example}
469 398
 
470 399
 This section uses an object of class \Rclass{CNSet} instantiated by
471
-the \verb+AffymetrixPreprocessCN+ vignette and saved to a local path
400
+the \verb+AffyGW+ vignette and saved to a local path
472 401
 on our computing cluster indicated by the object \Robject{outdir}
473 402
 below.  The \verb+copynumber+ vignette was used to fill out the
474 403
 \verb+batchStatistics+ slot of the \Robject{cnSet} object.
... ...
@@ -490,17 +419,6 @@ invisible(open(cnSet))
490 419
 
491 420
 \subsection{Missing values}
492 421
 
493
-% There are several reasons for estimates of the allele-specific copy
494
-% number to have missing values (\texttt{NA}'s). This section briefly
495
-% elaborates on the source of missing values in the HapMap analysis
496
-% and discusses possible alternatives to reduce the number of missing
497
-% values.  Note that allele-specific copy number, 'CA' and 'CB', is
498
-% not saved in the \Robject{cnSet} object.  Rather, the respective
499
-% accessors calculate 'CA' and 'CB' on the fly from the normalized
500
-% intensities and from the marker-specific parameter estimates in the
501
-% linear model.  In general, a missing value arises when the
502
-% background or slope parameter was not estimated in the linear
503
-% model.
504 422
 Most often, missing values occur when the genotype confidence scores
505 423
 for a SNP were below the threshold used by the
506 424
 \Robject{crlmmCopynumber} function. For the HapMap analysis, we used a
Browse code

Merge branch 'collab'

* collab: (34 commits)
revert change to IlluminaPreprocessCN
fix bug in isValidCdfName
print warning when all features in a batch of probes are flagged, but allow processing to continue
add utility cleancdfnames
Add validCdfNames.Rd
export validCdfNames
imputeGender fix when chromosome Y not available
Use splitIndicesByLength(index, ocSamples/getDoParWorkers())
Can not allocate vector of size XG with genotype.Illumina. Use splitIndicesByNode() only if the length of the list is greater than the split from splitIndicesByLength(). Otherwise, split by length using ocSamples()
update .gitignore
Add make.unique for sampleSheet$Sample_ID in readIdatFiles
bug in description
ensure sample ids stored in samplesheet are unique when constructing cnSet object
update oligoClasses dependency
update unit test for genotype.Illumina
revert change in constructInf call from genotype.Illumina
Update genotype.Rd
edit ACN function
1.15.6 use make.unique(basename(arrayNames)) to allow processing of Illumina samples with duplicated barcodes
check that sample identifies are unique in crlmm function
...

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@67435 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 08/07/2012 19:00:03
Showing1 changed files
... ...
@@ -129,11 +129,9 @@ directory to store output files and setting the \Rfunction{ldPath}
129 129
 function with this path.
130 130
 
131 131
 <<path>>=
132
-if(getRversion() < "2.13.0"){
133
-	rpath <- getRversion()
134
-} else rpath <- "trunk"
135
-outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="")
132
+outdir <- paste("/local_data/r00/crlmm/", getRversion(), "/infrastructure", sep="")
136 133
 ldPath(outdir)
134
+if(!file.exists(outdir)) dir.create(outdir)
137 135
 @
138 136
 
139 137
 
... ...
@@ -258,6 +256,7 @@ is a SNP can be accessed through accessors defined for the
258 256
 \verb+featureData+.
259 257
 
260 258
 <<featuredataAccessors>>=
259
+library(Biobase)
261 260
 fvarLabels(exampleSet)
262 261
 position(exampleSet)[1:10]
263 262
 chromosome(exampleSet)[1:10]
... ...
@@ -554,7 +553,7 @@ estimates are missing for a given SNP.
554 553
 
555 554
 <<NAchromosomeX>>=
556 555
 ## start with first batch
557
-sample.index <- 1:90
556
+sample.index <- which(batch(cnSet)==batch(cnSet)[1])
558 557
 X.index <- which(isSnp(cnSet) & chromosome(cnSet) == 23)
559 558
 ca.X <- CA(cnSet, i=X.index, j=sample.index)
560 559
 missing.caX <- is.na(ca.X)
Browse code

Update copy number vignettes. Pass cdfName to cnrmaAffy function.

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54327 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 31/03/2011 16:52:11
Showing1 changed files
... ...
@@ -22,36 +22,27 @@
22 22
 \title{Infrastructure for copy number analysis in \crlmm{}}
23 23
 \date{\today}
24 24
 \author{Rob Scharpf}
25
+
25 26
 \maketitle
26 27
 
28
+\begin{abstract}
27 29
 
28
-\section{Set up}
30
+  This vignette provides an overview of the \Rclass{CNSet} class and a
31
+  brief discussion of the underlying infrastructure for large data
32
+  support with the \Rpackage{ff} package.  This package instantiates
33
+  an object of class \Rclass{CNSet} using a trivial dataset with 3
34
+  files.  As this sample size is too small for estimating copy number
35
+  with the \crlmm{} package, the final section of this vignette loads
36
+  an object created by the analysis of 180 HapMap CEL files
37
+  (Affymetrix 6.0 platform).  In particular, this object was
38
+  instantiated by running (1) the \verb+AffymetrixPreprocessCN+
39
+  vignette and (2) the \verb+copynumber+ vignette.
29 40
 
30
-As in the previous vignettes, we load the required libraries and
31
-specify a path for storing output files.
41
+\end{abstract}
32 42
 
33 43
 <<libraries,results=hide>>=
34 44
 library(ff)
35 45
 library(crlmm)
36
-library(cacheSweave)
37
-require(DNAcopy)
38
-require(VanillaICE)
39
-if(getRversion() < "2.13.0"){
40
-	rpath <- getRversion()
41
-} else rpath <- "trunk"
42
-outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="")
43
-@
44
-
45
-<<ram>>=
46
-ldPath(outdir)
47
-setCacheDir(outdir)
48
-@
49
-
50
-We begin by loading the \Robject{cnSet} object created by the
51
-\verb+AffymetrixPreprocessCN+ vignette.
52
-
53
-<<loadcnset>>=
54
-if(!exists("cnSet")) load(file.path(outdir, "cnSet.rda"))
55 46
 @
56 47
 
57 48
 \section{Supported platforms}
... ...
@@ -98,31 +89,25 @@ ocSamples(200)
98 89
 
99 90
 \section{The \Rclass{CNSet} container}
100 91
 
101
-The \Rfunction{show} method provides a concise summary of the
102
-\Robject{cnSet} object.
92
+\subsection{Instantiating an object of class \Rclass{CNSet}}
103 93
 
104
-<<show>>=
105
-cnSet
106
-@
94
+An object of class \Rclass{CNSet} can be instantiated by one of two
95
+methods:
107 96
 
108
-We briefly outline some of the unique aspects of the
109
-\Rclass{CNSet}-class using in \crlmm{} that may differ from the more
110
-standard extensions of the virtual class \Rclass{eSet} defined in the
111
-\Rpackage{Biobase} package.
97
+\begin{itemize}
112 98
 
113
-\paragraph{Instantiating a \Rclass{CNSet}:} An object of class
114
-\Rclass{CNSet} can be instantiated by one of two methods:
115
-\begin{enumerate}
116
-\item during the preprocessing of the raw intensities for Illumina and
99
+\item[Approach 1:] during the preprocessing of the raw intensities for Illumina and
117 100
   Affymetrix arrays by the the functions \Rfunction{constructInf} and
118
-  \Rfunction{genotype}, respectively.
101
+  \Rfunction{genotype}, respectively. (The \Rfunction{genotype} calls
102
+  the non-exported function \Rfunction{constructAffy} to initialize a
103
+  \Rclass{CNSet} object for Affymetrix platforms.)
119 104
 
120
-\item by subsetting an existing \Robject{CNSet} object.  As per usual,
121
-  the `[' method can be used to extract a subset of markers $i$ as in
122
-  `[i, ]', a subset of samples $j$ as in `[, j]', or a subset of
123
-  markers $i$ and samples $j$ as in `[i, j]'.
105
+\item[Approach 2:] by subsetting an existing \Robject{CNSet} object.
106
+  As per usual, the `[' method can be used to extract a subset of
107
+  markers $i$ as in `[i, ]', a subset of samples $j$ as in `[, j]', or
108
+  a subset of markers $i$ and samples $j$ as in `[i, j]'.
124 109
 
125
-\end{enumerate}
110
+\end{itemize}
126 111
 
127 112
 There are important differences in the underlying data representation
128 113
 depending on how the object was instantiated.  In particular, objects
... ...
@@ -131,36 +116,106 @@ generated by the functions \Rfunction{constructInf} and
131 116
 in memory through protocols defined in the \R{} package \ff{}.  For
132 117
 instance, the normalized intensities and genotype calls in a
133 118
 \Rclass{CNSet}-instance from approach (1) are \ff{}-derived objects.
134
-However, when such an object is subset by the `[' method, the data is
135
-pulled from disk to memory and stored as an ordinary matrix in \R{}.
136
-To illustrate, the \Robject{cnSet} loaded at the start of this session
137
-was created by the function \Rfunction{genotype}.  As the data is
138
-stored on disk rather than in memory, we inspect attributes of the
139
-object representing the normalized intensities for the A allele by
140
-first opening the file connection and then extracting the
141
-\ff{}-derived class of the object as well as where the data is stored
142
-on disk.
119
+By contrast, when such an objected generated by approach (1) is subset
120
+by the `[' method, an object of the same class is returned but the
121
+\Rpackage{ff}-derived objects are coerced to ordinary matrices. Note,
122
+therefore, that both approaches (1) and (2) may involve substantial
123
+I/O.
143 124
 
144
-<<approach1>>=
145
-invisible(open(cnSet))
146
-class(A(cnSet))
147
-filename(A(cnSet))
125
+\subsubsection{Approach 1}
126
+
127
+To illustrate the first approach, we begin by specifying a local
128
+directory to store output files and setting the \Rfunction{ldPath}
129
+function with this path.
130
+
131
+<<path>>=
132
+if(getRversion() < "2.13.0"){
133
+	rpath <- getRversion()
134
+} else rpath <- "trunk"
135
+outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="")
136
+ldPath(outdir)
148 137
 @
149 138
 
150
-Using approach (2), we construct a new \Robject{CNSet} object.
151 139
 
152
-<<approach2>>=
153
-cnset.subset <- cnSet[1:5, 1:10]
154
-invisible(close(cnSet))
155
-class(A(cnset.subset))
140
+Next we load the annotation package required, as well as the \R{}
141
+package \Rpackage{hapmapsnp6} that contains 3 example CEL files.  We
142
+use the \Rfunction{system.file} function to find the path to the CEL
143
+files.
144
+
145
+<<annotation_metadata>>=
146
+require(genomewidesnp6Crlmm) & require(hapmapsnp6)
147
+path <- system.file("celFiles", package="hapmapsnp6")
148
+celfiles <- list.celfiles(path, full.names=TRUE)
149
+@
150
+
151
+Typically, an object of class \Rclass{CNSet} is instantiated as part
152
+of the preprocessing and genotyping by calling the function
153
+\Rfunction{genotype}, as illustrated in the
154
+\verb+AffymtrixPreprocessCN+ vignette.
155
+
156
+<<instantiateToyExample,cache=TRUE>>=
157
+exampleSet <- genotype(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6")
156 158
 @
157 159
 
158
-The files with \verb+.ff+ extension should not be removed or
159
-relocated.  The safest way to move these files if necessary is to
160
-clone all of the \ff{} objects using the \Rfunction{clone}, followed
161
-by the \Rfunction{delete} function to remove the original files on
162
-disk. See the documentation in the \ff{} package for additional
163
-details.
160
+Several files with \verb+.ff+ extensions now appear in the directory
161
+indicated by the \Rfunction{ldPath} function.
162
+
163
+<<ldpath>>=
164
+ldPath()
165
+@
166
+
167
+One could also instantiate an object of class \Rclass{CNSet} without
168
+preprocessing/genotyping by calling the non-exported function
169
+\Rfunction{constructAffy} directly using the \Rfunction{:::} operator.
170
+
171
+<<constructAffy,eval=FALSE>>=
172
+tmp <- crlmm:::constructAffy(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6")
173
+@
174
+
175
+The \Rfunction{show} method provides a concise summary of the
176
+\Robject{exampleSet} object. Note the class of the elements in the
177
+\verb+batchStatistics+ and \verb+assayData+ slots is indicated in the
178
+first line of the summary.
179
+
180
+<<show>>=
181
+invisible(open(exampleSet))
182
+exampleSet
183
+@
184
+
185
+% We briefly outline some of the unique aspects of the
186
+% \Rclass{CNSet}-class using in \crlmm{} that may differ from the more
187
+% standard extensions of the virtual class \Rclass{eSet} defined in
188
+% the \Rpackage{Biobase} package.
189
+
190
+As the \verb+assayData+ elements of the \Robject{exampleSet} object
191
+are stored on disk rather than in memory, we inspect attributes of the
192
+elements by first opening the file connection using the
193
+\Rfunction{open}. For example, in the following code we extract the
194
+normalized intensities for the A allele by first opening the object
195
+returned by the \Rfunction{A} function.  The name of the file where
196
+the data is stored on disk is provided by the
197
+\Rfunction{filename}. Finally, the normalized intensities can be
198
+pulled from disk to memory by the `[' method.  It can be useful to
199
+wrap the `[' method by the \Rfunction{as.matrix} to ensure that the
200
+output is the desired class.
201
+
202
+<<approach1>>=
203
+invisible(open(exampleSet))
204
+class(A(exampleSet))
205
+filename(A(exampleSet))
206
+as.matrix(A(exampleSet)[1:5, ])
207
+@
208
+
209
+\paragraph{Moving \texttt{*.ff} files} Ideally, the files with
210
+\verb+.ff+ extension should not be moved.  However, if this is not
211
+possible, the safest way to move these files is to clone all of the
212
+\ff{} objects using the \Rfunction{clone}, followed by the
213
+\Rfunction{delete} function to remove the original files on disk. An
214
+example of the \Rfunction{delete} function is included at the end of
215
+the \verb+IlluminaPreprocessCN+ vignette. See the documentation for
216
+the \Rfunction{clone} and \Rfunction{delete} functions in the \ff{}
217
+package for additional details.
218
+
164 219
 
165 220
 \paragraph{Order of operations:} For \Rclass{CNSet}-instances derived
166 221
 by approach (1), users should be mindful of the substantial I/O when
... ...
@@ -171,30 +226,49 @@ operation to emphasize the order of operations):
171 226
 
172 227
 <<orderOfOperations>>=
173 228
 ##inefficient
174
-A(cnSet[1:5, 1:5])
229
+##invisible(open(cnSet))
230
+A(exampleSet[1:5, ])
175 231
 ## preferred
176
-(A(cnSet))[1:5, 1:5]
232
+(A(exampleSet))[1:5, ]
177 233
 @
178 234
 
179 235
 
180
-\textbf{\texttt{featureData}}
236
+\subsubsection{Approach 2: using `['}
237
+
238
+Here we instantiate a new object of class \Rclass{CNSet} by applying the
239
+`[' method to an existing object of class \Rclass{CNSet}.
240
+
241
+<<approach2>>=
242
+cnset.subset <- exampleSet[1:5, ]
243
+@
244
+
245
+Note the class of the \verb+batchStatistics+ and \verb+assayData+
246
+elements of the \Robject{cnset.subset} object printed in the first
247
+line of the summary.
248
+
249
+<<show.subset>>=
250
+show(cnset.subset)
251
+@
252
+
253
+\subsection{Slots of class \Rclass{CNSet}}
254
+\subsubsection{\texttt{featureData}}
181 255
 
182 256
 Information on physical position, chromosome, and whether the marker
183 257
 is a SNP can be accessed through accessors defined for the
184 258
 \verb+featureData+.
185 259
 
186 260
 <<featuredataAccessors>>=
187
-fvarLabels(cnSet)
188
-position(cnSet)[1:10]
189
-chromosome(cnSet)[1:10]
190
-is.snp <- isSnp(cnSet)
261
+fvarLabels(exampleSet)
262
+position(exampleSet)[1:10]
263
+chromosome(exampleSet)[1:10]
264
+is.snp <- isSnp(exampleSet)
191 265
 table(is.snp)
192 266
 snp.index <- which(is.snp)
193 267
 np.index <- which(!is.snp)
194
-chr1.index <- which(chromosome(cnSet) == 1)
268
+chr1.index <- which(chromosome(exampleSet) == 1)
195 269
 @
196 270
 
197
-\textbf{\texttt{assayData}}
271
+\subsubsection{\texttt{assayData}}
198 272
 
199 273
 The \verb+assayData+ elements are of the class
200 274
 \Rclass{ff\_matrix}/\Rclass{ffdf} or \Rclass{matrix}, depending on how
... ...
@@ -203,7 +277,7 @@ the \Rclass{CNSet} object was instantiated.  Elements in the
203 277
 function.
204 278
 
205 279
 <<assayData>>=
206
-ls(assayData(cnSet))
280
+ls(assayData(exampleSet))
207 281
 @
208 282
 
209 283
 The normalized intensities for the A and B alleles have names
... ...
@@ -216,7 +290,7 @@ but can be tranlated to the probability scale using the \R{} function
216 290
 \Rfunction{i2p}. For example,
217 291
 
218 292
 <<i2p>>=
219
-scores <- as.matrix(snpCallProbability(cnSet)[1:5, 1:2])
293
+scores <- as.matrix(snpCallProbability(exampleSet)[1:5, 1:2])
220 294
 i2p(scores)
221 295
 @
222 296
 
... ...
@@ -227,15 +301,16 @@ platform.  A consequence of keeping the rows of the assay data
227 301
 elements the same for all of the statistical summaries is that the
228 302
 matrix used to store genotype calls is larger than necessary.
229 303
 
230
-Note that NA's are stored in the slot for normalized 'B' allele
231
-intensities:
304
+% only true for affy.  for illumina, we have intensities stored in 'B'
305
+%Note that NA's are stored in the slot for normalized 'B' allele
306
+%intensities:
232 307
 
233
-<<B.NAs>>=
234
-np.index <- which(!is.snp)
235
-stopifnot(all(is.na(B(cnSet)[np.index, ])))
236
-@
308
+%<<B.NAs>>=
309
+%np.index <- which(!is.snp)
310
+%stopifnot(all(is.na(B(cnSet)[np.index, ])))
311
+%@
237 312
 
238
-\textbf{\texttt{batch} and \texttt{batchStatistics}}
313
+\subsubsection{\texttt{batch} and \texttt{batchStatistics}}
239 314
 
240 315
 As defined in Leek \textit{et al.} 2010, \textit{Batch effects are
241 316
   sub-groups of measurements that have qualitatively different
... ...
@@ -260,7 +335,7 @@ to access the \verb+batch+ information on the samples as in the
260 335
 following example.
261 336
 
262 337
 <<batchAccessor>>=
263
-table(batch(cnSet))
338
+batch(exampleSet)
264 339
 @
265 340
 
266 341
 For the \verb+batchStatistics+ slot, the elements in the environment
... ...
@@ -272,7 +347,7 @@ the elements in the environment can be list using the \R{} function
272 347
 \Rfunction{ls}.
273 348
 
274 349
 <<batchStatistics>>=
275
-ls(batchStatistics(cnSet))
350
+ls(batchStatistics(exampleSet))
276 351
 @
277 352
 
278 353
 Currently, the batch-specific summaries are stored to allow some
... ...
@@ -292,7 +367,7 @@ intended to be accessed directly by the user.
292 367
 %median variance (across markers) within a batch. ) The elements of the
293 368
 %slot can be listed as follows.
294 369
 
295
-\textbf{\texttt{phenoData}}
370
+\subsubsection{\texttt{phenoData}}
296 371
 
297 372
 Sample-level summaries obtained during the preprocessing/genotyping
298 373
 steps include skew (SKW), the signal to noise ratio (SNR), and gender
... ...
@@ -303,26 +378,29 @@ approach (1), these elements are of the class
303 378
 \Rclass{ff\_vector}.
304 379
 
305 380
 <<phenodataAccessors>>=
306
-varLabels(cnSet)
307
-class(cnSet$gender)
308
-invisible(open(cnSet$gender))
309
-cnSet$gender
381
+varLabels(exampleSet)
382
+class(exampleSet$gender)
383
+invisible(open(exampleSet$gender))
384
+exampleSet$gender
310 385
 @
311 386
 
312 387
 The `[' methods without arguments can be used to coerce to a vector.
313 388
 
314 389
 <<vector>>=
315
-cnSet$gender[]
316
-invisible(close(cnSet$gender))
390
+c("male", "female")[exampleSet$gender[]]
391
+invisible(close(exampleSet$gender))
317 392
 @
318 393
 
319
-\textbf{\texttt{protocolData}}
394
+
395
+
396
+
397
+\subsubsection{\texttt{protocolData}}
320 398
 
321 399
 The scan date of the arrays are stored in the \verb+protocolData+.
322 400
 
323 401
 <<protocolData>>=
324
-varLabels(protocolData(cnSet))
325
-protocolData(cnSet)$ScanDate[1:10]
402
+varLabels(protocolData(exampleSet))
403
+protocolData(exampleSet)$ScanDate
326 404
 @
327 405
 
328 406
 
... ...
@@ -388,7 +466,28 @@ protocolData(cnSet)$ScanDate[1:10]
388 466
 %useful for assessing the biallelic genotype calls.  This section of
389 467
 %the vignette is currently under development.
390 468
 
391
-\section{Trouble shooting}
469
+\section{Trouble shooting with a HapMap example}
470
+
471
+This section uses an object of class \Rclass{CNSet} instantiated by
472
+the \verb+AffymetrixPreprocessCN+ vignette and saved to a local path
473
+on our computing cluster indicated by the object \Robject{outdir}
474
+below.  The \verb+copynumber+ vignette was used to fill out the
475
+\verb+batchStatistics+ slot of the \Robject{cnSet} object.
476
+
477
+<<path>>=
478
+if(getRversion() < "2.13.0"){
479
+	rpath <- getRversion()
480
+} else rpath <- "trunk"
481
+outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="")
482
+@
483
+
484
+Next, we load the \Robject{cnSet} object.
485
+
486
+<<loadcnset>>=
487
+if(!exists("cnSet")) load(file.path(outdir, "cnSet.rda"))
488
+invisible(open(cnSet))
489
+@
490
+
392 491
 
393 492
 \subsection{Missing values}
394 493
 
... ...
@@ -406,12 +505,15 @@ protocolData(cnSet)$ScanDate[1:10]
406 505
 Most often, missing values occur when the genotype confidence scores
407 506
 for a SNP were below the threshold used by the
408 507
 \Robject{crlmmCopynumber} function. For the HapMap analysis, we used a
409
-confidence threshold of 0.80 (the default).
508
+confidence threshold of 0.80 (the default).  In the following code, we
509
+assess NA's appearing for the raw copy number estimates for the first
510
+10 samples.
410 511
 
411 512
 <<NA>>=
412 513
 GT.CONF.THR <- 0.80
413 514
 autosome.index <- which(isSnp(cnSet) & chromosome(cnSet) < 23)
414
-sample.index <- which(batch(cnSet) == "C")
515
+sample.index <- 1:10
516
+ct <- totalCopynumber(cnSet, i=autosome.index, j=sample.index)
415 517
 ca <- CA(cnSet, i=autosome.index, j=sample.index)
416 518
 cb <- CB(cnSet, i=autosome.index, j=sample.index)
417 519
 missing.ca <- is.na(ca)
... ...
@@ -427,9 +529,9 @@ threshold specified in \Robject{crlmmCopynumber}.
427 529
 
428 530
 <<NAconfidenceScores>>=
429 531
 if(nmissing.ca > 0){
430
-	invisible(open(snpCallProbability(cnSet)))
532
+	##invisible(open(snpCallProbability(cnSet)))
431 533
 	gt.conf <- i2p(snpCallProbability(cnSet)[autosome.index, sample.index])
432
-	invisible(close(snpCallProbability(cnSet)))
534
+	##invisible(close(snpCallProbability(cnSet)))
433 535
 	below.thr <- gt.conf < GT.CONF.THR
434 536
 	index.allbelow <- as.integer(which(rowSums(below.thr) == length(sample.index)))
435 537
 	nmissingBecauseOfGtThr <- length(index.allbelow) * length(sample.index)
... ...
@@ -451,13 +553,15 @@ missing values to the number of samples to check whether all of the
451 553
 estimates are missing for a given SNP.
452 554
 
453 555
 <<NAchromosomeX>>=
556
+## start with first batch
557
+sample.index <- 1:90
454 558
 X.index <- which(isSnp(cnSet) & chromosome(cnSet) == 23)
455 559
 ca.X <- CA(cnSet, i=X.index, j=sample.index)
456 560
 missing.caX <- is.na(ca.X)
457 561
 (nmissing.caX <- sum(missing.caX))
458 562
 missing.snp.index <- which(rowSums(missing.caX) == length(sample.index))
459 563
 index <- which(rowSums(missing.caX) == length(sample.index))
460
-length(index)*length(sample.index)/nmissing.caX
564
+##length(index)*length(sample.index)/nmissing.caX
461 565
 @
462 566
 
463 567
 From the above codechunk, we see that \Sexpr{length(index)} SNPs have
... ...
@@ -468,21 +572,15 @@ from SNPs in which either the men or the women had confidence scores
468 572
 that were all below the threshold.
469 573
 
470 574
 <<NAcrlmmConfidenceScore>>=
471
-if(nmissing.caX > 0){
472
-	invisible(open(snpCallProbability(cnSet)))
473
-	gt.conf <- i2p(snpCallProbability(cnSet)[X.index, sample.index])
474
-	invisible(close(snpCallProbability(cnSet)))
475
-	below.thr <- gt.conf < GT.CONF.THR
476
-	F <- which(cnSet$gender[sample.index] == 2)
477
-	M <- which(cnSet$gender[sample.index] == 1)
478
-	##nus <- nuA(cnSet)[X.index, "C"]
479
-	index.allbelowF <- as.integer(which(rowSums(below.thr[, F]) == length(F)))
480
-	index.allbelowM <- as.integer(which(rowSums(below.thr[, M]) == length(M)))
481
-	index.allbelow <- as.integer(unique(c(index.allbelowF, index.allbelowM)))
482
-	##all.equal(index, index.allbelow)
483
-	## or calculate the proportion of missing effected by low crlmm confidence
484
-	sum(index.allbelow %in% index)/length(index)
485
-}
575
+invisible(open(cnSet$gender))
576
+F <- which(cnSet$gender[sample.index] == 2)
577
+M <- which(cnSet$gender[sample.index] == 1)
578
+gt.conf <- i2p(snpCallProbability(cnSet)[X.index, sample.index])
579
+below.thr <- gt.conf < GT.CONF.THR
580
+index.allbelowF <- as.integer(which(rowSums(below.thr[, F]) == length(F)))
581
+index.allbelowM <- as.integer(which(rowSums(below.thr[, M]) == length(M)))
582
+all.equal(index.allbelowF, index.allbelowM)
583
+all.equal(as.integer(index.allbelowF), as.integer(index))
486 584
 @
487 585
 
488 586
 For nonpolymorphic loci, the genotype confidence scores are irrelevant
... ...
@@ -519,8 +617,9 @@ detection and removal may be explored in the future.
519 617
 Copy number estimates for other chromosomes, such as mitochondrial and
520 618
 chromosome Y, are not currently available in \crlmm{}.
521 619
 
522
-<<echo=FALSE>>=
620
+<<close>>=
523 621
 invisible(close(cnSet))
622
+invisible(close(cnSet$gender))
524 623
 @
525 624
 
526 625
 
Browse code

Add empty vignettes for copy number vignettes to inst/doc. Update VignetteIndexEntry.

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54326 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 31/03/2011 16:51:58
Showing1 changed files
... ...
@@ -1,4 +1,4 @@
1
-%\VignetteIndexEntry{Infrastructure for crlmm copy number Vignette}
1
+%\VignetteIndexEntry{Infrastructure for copy number analysis}
2 2
 %\VignetteDepends{crlmm, genomewidesnp6Crlmm}
3 3
 %\VignetteKeywords{crlmm, SNP 6}
4 4
 %\VignettePackage{crlmm}
Browse code

Add vignettes for copy number analysis. Add crlmmGT2.R file

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54171 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 30/03/2011 02:41:00
Showing1 changed files
1 1
new file mode 100644
... ...
@@ -0,0 +1,540 @@
1
+%\VignetteIndexEntry{Infrastructure for crlmm copy number Vignette}
2
+%\VignetteDepends{crlmm, genomewidesnp6Crlmm}
3
+%\VignetteKeywords{crlmm, SNP 6}
4
+%\VignettePackage{crlmm}
5
+\documentclass{article}
6
+\usepackage{graphicx}
7
+\usepackage{natbib}
8
+\usepackage{url}
9
+\newcommand{\Rfunction}[1]{{\texttt{#1}}}
10
+\newcommand{\Rmethod}[1]{{\texttt{#1}}}
11
+\newcommand{\Rcode}[1]{{\texttt{#1}}}
12
+\newcommand{\Robject}[1]{{\texttt{#1}}}
13
+\newcommand{\Rpackage}[1]{{\textsf{#1}}}
14
+\newcommand{\Rclass}[1]{{\textit{#1}}}
15
+\newcommand{\oligo}{\Rpackage{oligo }}
16
+\newcommand{\R}{\textsf{R}}
17
+\newcommand{\crlmm}{\Rpackage{crlmm}}
18
+\newcommand{\ff}{\Rpackage{ff}}
19
+\usepackage[margin=1in]{geometry}
20
+
21
+\begin{document}
22
+\title{Infrastructure for copy number analysis in \crlmm{}}
23
+\date{\today}
24
+\author{Rob Scharpf}
25
+\maketitle
26
+
27
+
28
+\section{Set up}
29
+
30
+As in the previous vignettes, we load the required libraries and
31
+specify a path for storing output files.
32
+
33
+<<libraries,results=hide>>=
34
+library(ff)
35
+library(crlmm)
36
+library(cacheSweave)
37
+require(DNAcopy)
38
+require(VanillaICE)
39
+if(getRversion() < "2.13.0"){
40
+	rpath <- getRversion()
41
+} else rpath <- "trunk"
42
+outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="")
43
+@
44
+
45
+<<ram>>=
46
+ldPath(outdir)
47
+setCacheDir(outdir)
48
+@
49
+
50
+We begin by loading the \Robject{cnSet} object created by the
51
+\verb+AffymetrixPreprocessCN+ vignette.
52
+
53
+<<loadcnset>>=
54
+if(!exists("cnSet")) load(file.path(outdir, "cnSet.rda"))
55
+@
56
+
57
+\section{Supported platforms}
58
+
59
+The supported Affymetrix and Infinium platforms are those for which a
60
+corresponding annotation package is available.  The annotation
61
+packages contain information on the markers, such as physical position
62
+and chromosome, as well as pre-computed parameters estimated from
63
+HapMap used during the preprocessing and genotyping steps. For
64
+Affymetrix, the 5.0 and 6.0 platforms are supported and the
65
+corresponding annotation packages are \Rpackage{genomewidesnp5Crlmm}
66
+and \Rpackage{genomewidesnp6Crlmm}.  Supported Infinium platforms are
67
+listed in the following code chunk.
68
+
69
+<<supportedPlatforms>>=
70
+pkgs <- annotationPackages()
71
+crlmm.pkgs <- pkgs[grep("Crlmm", pkgs)]
72
+crlmm.pkgs[grep("human", crlmm.pkgs)]
73
+@
74
+
75
+\textbf{Large data:} In order to reduce \crlmm{}'s memory footprint
76
+for copy number estimation, we require the \Rpackage{ff}.  The
77
+\Rpackage{ff} package provides infrastructure for accessing and
78
+writing data to disk instead of keeping data in memory.  As the
79
+functions for preprocessing, genotyping, and copy number estimation do
80
+not simultaneously require all samples and all probes in memory,
81
+memory-usage by \crlmm{} can be fine-tuned by reading in and
82
+processing subsets of the markers and/or samples. The functions
83
+\Rfunction{ocSamples} and \Rfunction{ocProbesets} in the
84
+\Rpackage{oligoClasses} package can be used to declare how many
85
+markers and samples to read at once.  In general, specifying smaller
86
+values should reduce the RAM required for a particular job.  In
87
+general, smaller values will increase the run-time. In the following
88
+code-chunk, we declare that \crlmm{} should process 150,000 markers at
89
+a time (when possible) and 500 samples at a time.  If our dataset
90
+contained fewer than 500 samples, the \Rfunction{ocSamples} option
91
+would not have any effect.  One can view the current settings for
92
+these commands, by typing the functions without an argument.
93
+
94
+<<ram>>=
95
+ocProbesets(50e3)
96
+ocSamples(200)
97
+@
98
+
99
+\section{The \Rclass{CNSet} container}
100
+
101
+The \Rfunction{show} method provides a concise summary of the
102
+\Robject{cnSet} object.
103
+
104
+<<show>>=
105
+cnSet
106
+@
107
+
108
+We briefly outline some of the unique aspects of the
109
+\Rclass{CNSet}-class using in \crlmm{} that may differ from the more
110
+standard extensions of the virtual class \Rclass{eSet} defined in the
111
+\Rpackage{Biobase} package.
112
+
113
+\paragraph{Instantiating a \Rclass{CNSet}:} An object of class
114
+\Rclass{CNSet} can be instantiated by one of two methods:
115
+\begin{enumerate}
116
+\item during the preprocessing of the raw intensities for Illumina and
117
+  Affymetrix arrays by the the functions \Rfunction{constructInf} and
118
+  \Rfunction{genotype}, respectively.
119
+
120
+\item by subsetting an existing \Robject{CNSet} object.  As per usual,
121
+  the `[' method can be used to extract a subset of markers $i$ as in
122
+  `[i, ]', a subset of samples $j$ as in `[, j]', or a subset of
123
+  markers $i$ and samples $j$ as in `[i, j]'.
124
+
125
+\end{enumerate}
126
+
127
+There are important differences in the underlying data representation
128
+depending on how the object was instantiated.  In particular, objects
129
+generated by the functions \Rfunction{constructInf} and
130
+\Rfunction{genotype} store high-dimensional data on disk rather than
131
+in memory through protocols defined in the \R{} package \ff{}.  For
132
+instance, the normalized intensities and genotype calls in a
133
+\Rclass{CNSet}-instance from approach (1) are \ff{}-derived objects.
134
+However, when such an object is subset by the `[' method, the data is
135
+pulled from disk to memory and stored as an ordinary matrix in \R{}.
136
+To illustrate, the \Robject{cnSet} loaded at the start of this session
137
+was created by the function \Rfunction{genotype}.  As the data is
138
+stored on disk rather than in memory, we inspect attributes of the
139
+object representing the normalized intensities for the A allele by
140
+first opening the file connection and then extracting the
141
+\ff{}-derived class of the object as well as where the data is stored
142
+on disk.
143
+
144
+<<approach1>>=
145
+invisible(open(cnSet))
146
+class(A(cnSet))
147
+filename(A(cnSet))
148
+@
149
+
150
+Using approach (2), we construct a new \Robject{CNSet} object.
151
+
152
+<<approach2>>=
153
+cnset.subset <- cnSet[1:5, 1:10]
154
+invisible(close(cnSet))
155
+class(A(cnset.subset))
156
+@
157
+
158
+The files with \verb+.ff+ extension should not be removed or
159
+relocated.  The safest way to move these files if necessary is to
160
+clone all of the \ff{} objects using the \Rfunction{clone}, followed
161
+by the \Rfunction{delete} function to remove the original files on
162
+disk. See the documentation in the \ff{} package for additional
163
+details.
164
+
165
+\paragraph{Order of operations:} For \Rclass{CNSet}-instances derived
166
+by approach (1), users should be mindful of the substantial I/O when
167
+using accessors to extract data from the class.  For example, the
168
+following 2 methods would extract identical results, with the latter
169
+being much more efficient (extra parentheses are added to the second
170
+operation to emphasize the order of operations):
171
+
172
+<<orderOfOperations>>=
173
+##inefficient
174
+A(cnSet[1:5, 1:5])
175
+## preferred
176
+(A(cnSet))[1:5, 1:5]
177
+@
178
+
179
+
180
+\textbf{\texttt{featureData}}
181
+
182
+Information on physical position, chromosome, and whether the marker
183
+is a SNP can be accessed through accessors defined for the
184
+\verb+featureData+.
185
+
186
+<<featuredataAccessors>>=
187
+fvarLabels(cnSet)
188
+position(cnSet)[1:10]
189
+chromosome(cnSet)[1:10]
190
+is.snp <- isSnp(cnSet)
191
+table(is.snp)
192
+snp.index <- which(is.snp)
193
+np.index <- which(!is.snp)
194
+chr1.index <- which(chromosome(cnSet) == 1)
195
+@
196
+
197
+\textbf{\texttt{assayData}}
198
+
199
+The \verb+assayData+ elements are of the class
200
+\Rclass{ff\_matrix}/\Rclass{ffdf} or \Rclass{matrix}, depending on how
201
+the \Rclass{CNSet} object was instantiated.  Elements in the
202
+\verb+assayData+ environment can be listed using the \Rfunction{ls}
203
+function.
204
+
205
+<<assayData>>=
206
+ls(assayData(cnSet))
207
+@
208
+
209
+The normalized intensities for the A and B alleles have names
210
+\verb+alleleA+ and \verb+alleleB+ and can be accessed by the methods
211
+\Rfunction{A} and \Rfunction{B}, respectively.  Genotype calls and
212
+confidence scores can be accessed by \Rfunction{snpCall} and
213
+\Rfunction{snpCallProbability}, respectively.  Note that the
214
+confidence scores are represented as integers to reduce the filesize,
215
+but can be tranlated to the probability scale using the \R{} function
216
+\Rfunction{i2p}. For example,
217
+
218
+<<i2p>>=
219
+scores <- as.matrix(snpCallProbability(cnSet)[1:5, 1:2])
220
+i2p(scores)
221
+@
222
+
223
+Note that for the Affymetrix 6.0 platform the assay data elements each
224
+have a row dimension corresponding to the total number of polymorphic
225
+and nonpolymorphic markers interrogated by the Affymetrix 6.0
226
+platform.  A consequence of keeping the rows of the assay data
227
+elements the same for all of the statistical summaries is that the
228
+matrix used to store genotype calls is larger than necessary.
229
+
230
+Note that NA's are stored in the slot for normalized 'B' allele
231
+intensities:
232
+
233
+<<B.NAs>>=
234
+np.index <- which(!is.snp)
235
+stopifnot(all(is.na(B(cnSet)[np.index, ])))
236
+@
237
+
238
+\textbf{\texttt{batch} and \texttt{batchStatistics}}
239
+
240
+As defined in Leek \textit{et al.} 2010, \textit{Batch effects are
241
+  sub-groups of measurements that have qualitatively different
242
+  behaviour across conditions and are unrelated to the biological or
243
+  scientific variables in a study}.  The \verb+batchStatistics+ slot
244
+is an environment used to store SNP- and batch-specific summaries,
245
+such as the sufficient statistics for the genotype clusters and the
246
+linear model parameters used for copy number estimation.  The
247
+\verb+batch+ slot is used to store the 'batch name' for each
248
+array. For small studies in which the samples were processed at
249
+similar times (e.g., within a month), all the samples can be
250
+considered to be in the same batch.  For large studies in which the
251
+samples were processed over several months, users should the scan date
252
+of the array or the chemistry plate are useful surrogates. The only
253
+constraint on the \verb+batch+ variable is that it must be a character
254
+vector that is the same length as the number of samples to be
255
+processed.  The \verb+batch+ is specified as an argument to the \R{}
256
+functions \Rfunction{constructInf} and \Rfunction{genotype} that
257
+instantiate \Rclass{CNSet} objects for the Illumina and Affymetrix
258
+platforms, respectively.  The \Rfunction{batch} function can be used
259
+to access the \verb+batch+ information on the samples as in the
260
+following example.
261
+
262
+<<batchAccessor>>=
263
+table(batch(cnSet))
264
+@
265
+
266
+For the \verb+batchStatistics+ slot, the elements in the environment
267
+have the class \Rclass{ff\_matrix}/\Rclass{ffdf} or \Rclass{matrix},
268
+depending on how the \Rclass{CNSet} object was instantiated.  The
269
+dimension of each element is the number of markers (SNPs +
270
+nonpolymorphic markers) $\times$ the number of batches. The names of
271
+the elements in the environment can be list using the \R{} function
272
+\Rfunction{ls}.
273
+
274
+<<batchStatistics>>=
275
+ls(batchStatistics(cnSet))
276
+@
277
+
278
+Currently, the batch-specific summaries are stored to allow some
279
+flexibility in the choice of downstream analyses of copy number and
280
+visual assessments of model fit.  Documentation for such applications
281
+will be expanded in future versions of \crlmm{}, and are currently not
282
+intended to be accessed directly by the user.
283
+
284
+%The \Robject{CNSet} class also contains the slot
285
+%\Robject{batchStatistics} that contains batch-specific summaries
286
+%needed for copy number estimation. In particular, each element is a
287
+%matrix (or an ff object) with R rows and C columns, correspoinding to
288
+%R markers and C batches.  The summaries includes the within genotype
289
+%cluster medians and median absolute deviations (mads), but also
290
+%parameters estimated from the linear model.  (For unobserved
291
+%genotypes, the medians are imputed and the variance is obtained the
292
+%median variance (across markers) within a batch. ) The elements of the
293
+%slot can be listed as follows.
294
+
295
+\textbf{\texttt{phenoData}}
296
+
297
+Sample-level summaries obtained during the preprocessing/genotyping
298
+steps include skew (SKW), the signal to noise ratio (SNR), and gender
299
+(1=male, 2=female) are stored in the \verb+phenoData+ slot.  As for
300
+other \Rclass{eSet} extensions, the \verb+$+ method can be used to
301
+extract these summaries.  For \Rclass{CNSet} objects generated by
302
+approach (1), these elements are of the class
303
+\Rclass{ff\_vector}.
304
+
305
+<<phenodataAccessors>>=
306
+varLabels(cnSet)
307
+class(cnSet$gender)
308
+invisible(open(cnSet$gender))
309
+cnSet$gender
310
+@
311
+
312
+The `[' methods without arguments can be used to coerce to a vector.
313
+
314
+<<vector>>=
315
+cnSet$gender[]
316
+invisible(close(cnSet$gender))
317
+@
318
+
319
+\textbf{\texttt{protocolData}}
320
+
321
+The scan date of the arrays are stored in the \verb+protocolData+.
322
+
323
+<<protocolData>>=
324
+varLabels(protocolData(cnSet))
325
+protocolData(cnSet)$ScanDate[1:10]
326
+@
327
+
328
+
329
+%\section{Suggested visualizations}
330
+%
331
+%\paragraph{SNR.}
332
+%
333
+%A histogram of the signal to noise ratio for the HapMap samples:
334
+%
335
+%<<plotSnr, fig=TRUE, include=FALSE>>=
336
+%open(cnSet$SNR)
337
+%hist(cnSet$SNR[, ], xlab="SNR", main="", breaks=25, col="lightblue", xlim=c(3, max(cnSet$SNR[])))
338
+%abline(v=5, lty=2, col="grey")
339
+%text(3,5, label="SNR range for low \n quality arrays", adj=0, col="grey40")
340
+%@
341
+%\begin{figure}
342
+%  \centering
343
+%  \includegraphics[width=0.6\textwidth]{copynumber-plotSnr}
344
+%  \caption{Signal to noise ratios for the HapMap samples. SNRs below 5
345
+%   for the Affymetrix platform are often samples of lower quality.
346
+%   Such samples will tend to have much more variable estimates of copy
347
+%   number.}
348
+%\end{figure}
349
+
350
+
351
+%\paragraph{One sample at a time: locus-level estimates}
352
+%
353
+%Figure \ref{fig:oneSample} plots physical position (horizontal axis)
354
+%versus copy number (vertical axis) for the first sample.  There is
355
+%less information to estimate copy number at nonpolymorphic loci;
356
+%improvements to the univariate prediction regions at nonpolymorphic
357
+%loci are a future area of research. If the \Rpackage{SNPchip} is
358
+%available, an idiogram can be added to the existing plotting
359
+%coordinates as indicated in the following example.
360
+%
361
+%<<oneSample, fig=TRUE, width=8, height=4, include=FALSE>>=
362
+%GT.CONF.THR <- 0.8
363
+%marker.index <- which(chromosome(cnSet) == 1)
364
+%cn <- totalCopynumber(cnSet, i=marker.index, j=1)
365
+%x <- position(cnSet)[marker.index]
366
+%par(las=1, mar=c(4, 5, 4, 2))
367
+%plot(x, cn, pch=".",
368
+%     cex=2, xaxt="n", col="grey60", ylim=c(0,6),
369
+%     ylab="copy number", xlab="physical position (Mb)",
370
+%     main=paste(sampleNames(cnSet)[1], ", CHR: 1"))
371
+%axis(1, at=pretty(x), labels=pretty(x)/1e6)
372
+%require(SNPchip)
373
+%invisible(plotCytoband(1, new=FALSE, cytoband.ycoords=c(5.5, 6), label.cytoband=FALSE))
374
+%@
375
+
376
+%\begin{figure}
377
+%  \includegraphics[width=0.9\textwidth]{copynumber-oneSample}
378
+%  \caption{\label{fig:oneSample} Total copy number (y-axis) for
379
+%    chromosome 1 plotted against physical position (x-axis) for one
380
+%    sample.  Estimates at nonpolymorphic loci are plotted in light
381
+%    blue.}
382
+%\end{figure}
383
+%
384
+%\clearpage
385
+%\paragraph{One SNP at a time}
386
+%
387
+%Scatterplots of the A and B allele intensities (log-scale) can be
388
+%useful for assessing the biallelic genotype calls.  This section of
389
+%the vignette is currently under development.
390
+
391
+\section{Trouble shooting}
392
+
393
+\subsection{Missing values}
394
+
395
+% There are several reasons for estimates of the allele-specific copy
396
+% number to have missing values (\texttt{NA}'s). This section briefly
397
+% elaborates on the source of missing values in the HapMap analysis
398
+% and discusses possible alternatives to reduce the number of missing
399
+% values.  Note that allele-specific copy number, 'CA' and 'CB', is
400
+% not saved in the \Robject{cnSet} object.  Rather, the respective
401
+% accessors calculate 'CA' and 'CB' on the fly from the normalized
402
+% intensities and from the marker-specific parameter estimates in the
403
+% linear model.  In general, a missing value arises when the
404
+% background or slope parameter was not estimated in the linear
405
+% model.
406
+Most often, missing values occur when the genotype confidence scores
407
+for a SNP were below the threshold used by the
408
+\Robject{crlmmCopynumber} function. For the HapMap analysis, we used a
409
+confidence threshold of 0.80 (the default).
410
+
411
+<<NA>>=
412
+GT.CONF.THR <- 0.80
413
+autosome.index <- which(isSnp(cnSet) & chromosome(cnSet) < 23)
414
+sample.index <- which(batch(cnSet) == "C")
415
+ca <- CA(cnSet, i=autosome.index, j=sample.index)
416
+cb <- CB(cnSet, i=autosome.index, j=sample.index)
417
+missing.ca <- is.na(ca)
418
+missing.cb <- is.na(cb)
419
+(nmissing.ca <- sum(missing.ca))
420
+(nmissing.cb <- sum(missing.cb))
421
+identical(nmissing.ca, nmissing.cb)
422
+@
423
+
424
+If \Robject{nmissing.ca} is nonzero, check the genotype confidence
425
+scores provided by the crlmm genotyping algorithm against the
426
+threshold specified in \Robject{crlmmCopynumber}.
427
+
428
+<<NAconfidenceScores>>=
429
+if(nmissing.ca > 0){
430
+	invisible(open(snpCallProbability(cnSet)))
431
+	gt.conf <- i2p(snpCallProbability(cnSet)[autosome.index, sample.index])
432
+	invisible(close(snpCallProbability(cnSet)))
433
+	below.thr <- gt.conf < GT.CONF.THR
434
+	index.allbelow <- as.integer(which(rowSums(below.thr) == length(sample.index)))
435
+	nmissingBecauseOfGtThr <- length(index.allbelow) * length(sample.index)
436
+	stopifnot(identical(nmissingBecauseOfGtThr, nmissing.ca))
437
+	## or calculate the proportion of missing effected by low crlmm confidence
438
+	length(index.allbelow) * length(sample.index)/nmissing.ca
439
+}
440
+@
441
+
442
+One could inspect the cluster plots for the 'low confidence' calls.
443
+
444
+<<clusterPlotsLowConfidence>>=
445
+## TODO
446
+@
447
+
448
+We repeat the above check for missing values at polymorphic loci on
449
+chromosome X.  In this case, we compare the \Robject{rowSums} of the
450
+missing values to the number of samples to check whether all of the
451
+estimates are missing for a given SNP.
452
+
453
+<<NAchromosomeX>>=
454
+X.index <- which(isSnp(cnSet) & chromosome(cnSet) == 23)
455
+ca.X <- CA(cnSet, i=X.index, j=sample.index)
456
+missing.caX <- is.na(ca.X)
457
+(nmissing.caX <- sum(missing.caX))
458
+missing.snp.index <- which(rowSums(missing.caX) == length(sample.index))
459
+index <- which(rowSums(missing.caX) == length(sample.index))
460
+length(index)*length(sample.index)/nmissing.caX
461
+@
462
+
463
+From the above codechunk, we see that \Sexpr{length(index)} SNPs have
464
+NAs for all the samples. Next, we tally the number of NAs for
465
+polymorphic markers on chromosome X that are below the confidence
466
+threshold.  For the HapMap analysis, all of the missing values arose
467
+from SNPs in which either the men or the women had confidence scores
468
+that were all below the threshold.
469
+
470
+<<NAcrlmmConfidenceScore>>=
471
+if(nmissing.caX > 0){
472
+	invisible(open(snpCallProbability(cnSet)))
473
+	gt.conf <- i2p(snpCallProbability(cnSet)[X.index, sample.index])
474
+	invisible(close(snpCallProbability(cnSet)))
475
+	below.thr <- gt.conf < GT.CONF.THR
476
+	F <- which(cnSet$gender[sample.index] == 2)
477
+	M <- which(cnSet$gender[sample.index] == 1)
478
+	##nus <- nuA(cnSet)[X.index, "C"]
479
+	index.allbelowF <- as.integer(which(rowSums(below.thr[, F]) == length(F)))
480
+	index.allbelowM <- as.integer(which(rowSums(below.thr[, M]) == length(M)))
481
+	index.allbelow <- as.integer(unique(c(index.allbelowF, index.allbelowM)))
482
+	##all.equal(index, index.allbelow)
483
+	## or calculate the proportion of missing effected by low crlmm confidence
484
+	sum(index.allbelow %in% index)/length(index)
485
+}
486
+@
487
+
488
+For nonpolymorphic loci, the genotype confidence scores are irrelevant
489
+and estimates are available at most markers.
490
+
491
+<<NAnonpolymorphic.autosome>>=
492
+np.index <- which(!isSnp(cnSet) & chromosome(cnSet)==23)
493
+ca.F <- CA(cnSet, i=np.index, j=F)
494
+ca.M <- CA(cnSet, i=np.index, j=M)
495
+## NAs for one marker
496
+ca.F <- ca.F[-match("CN_974939", rownames(ca.F)), ]
497
+ca.M <- ca.M[-match("CN_974939", rownames(ca.M)), ]
498
+sum(is.na(ca.F))
499
+sum(is.na(ca.M))
500
+@
501
+
502
+%TODO: marker CN\_974939 has NAs for the normalized intensities.  This
503
+%is because CN\_974939 is not in the \Robject{npProbesFid} file in
504
+%\Rpackage{genomewidesnp6Crlmm}.  The \Robject{npProbesFid} file should
505
+%be updated in the next \Rpackage{genomewidesnp6Crlmm} release.
506
+
507
+In total, there were \Sexpr{length(missing.snp.index)} polymorphic
508
+markers on chromosome X for which copy number estimates are not
509
+available.  Lowering the confidence threshold would permit estimation
510
+of copy number at most of these loci.  A confidence threshold is
511
+included as a parameter for the copy number estimation as an approach
512
+to reduce the sensitivity of genotype-specific summary statistics,
513
+such as the within-genotype median, to intensities from samples that
514
+do not clearly fall into one of the biallelic genotype clusters. There
515
+are drawbacks to this approach, including variance estimates that can
516
+be a bit optimistic at some loci.  More direct approaches for outlier
517
+detection and removal may be explored in the future.
518
+
519
+Copy number estimates for other chromosomes, such as mitochondrial and
520
+chromosome Y, are not currently available in \crlmm{}.
521
+
522
+<<echo=FALSE>>=
523
+invisible(close(cnSet))
524
+@
525
+
526
+
527
+\section{Session information}
528
+<<sessionInfo, results=tex>>=
529
+toLatex(sessionInfo())
530
+@
531
+
532
+%\section*{References}
533
+%
534
+%%\begin{bibliography}
535
+%  \bibliographystyle{plain}
536
+%  \bibliography{refs}
537
+%\end{bibliography}
538
+
539
+
540
+\end{document}