Update CopyNumberOverview and Infrastructure vignettes

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/branches/RELEASE_2_11/madman/Rpacks/crlmm@71165 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 13/11/2012 19:59:49
Showing 3 changed files

... ...
@@ -1,7 +1,7 @@
1 1
 Package: crlmm
2 2
 Type: Package
3 3
 Title: Genotype Calling (CRLMM) and Copy Number Analysis tool for Affymetrix SNP 5.0 and 6.0 and Illumina arrays.
4
-Version: 1.16.4
4
+Version: 1.16.5
5 5
 Author: Benilton S Carvalho, Robert Scharpf, Matt Ritchie, Ingo Ruczinski, Rafael A Irizarry
6 6
 Maintainer: Benilton S Carvalho <Benilton.Carvalho@cancer.org.uk>, Robert Scharpf <rscharpf@jhsph.edu>, Matt Ritchie <mritchie@wehi.EDU.AU>
7 7
 Description: Faster implementation of CRLMM specific to SNP 5.0 and 6.0 arrays, as well as a copy number tool specific to 5.0, 6.0, and Illumina platforms
... ...
@@ -29,14 +29,13 @@
29 29
 
30 30
   This vignette provides an overview of the \Rclass{CNSet} class and a
31 31
   brief discussion of the underlying infrastructure for large data
32
-  support with the \Rpackage{ff} package.  This package instantiates
32
+  support with the \Rpackage{ff} package.  This vignette instantiates
33 33
   an object of class \Rclass{CNSet} using a trivial dataset with 3
34 34
   files.  As this sample size is too small for estimating copy number
35 35
   with the \crlmm{} package, the final section of this vignette loads
36 36
   an object created by the analysis of 180 HapMap CEL files
37
-  (Affymetrix 6.0 platform).  In particular, this object was
38
-  instantiated by running (1) the \verb+AffymetrixPreprocessCN+
39
-  vignette and (2) the \verb+copynumber+ vignette.
37
+  (Affymetrix 6.0 platform).  This object was instantiated by running
38
+  the \verb+AffyGW+ vignette.
40 39
 
41 40
 \end{abstract}
42 41
 
... ...
@@ -76,14 +75,14 @@ processing subsets of the markers and/or samples. The functions
76 75
 markers and samples to read at once.  In general, specifying smaller
77 76
 values should reduce the RAM required for a particular job.  In
78 77
 general, smaller values will increase the run-time. In the following
79
-code-chunk, we declare that \crlmm{} should process 150,000 markers at
80
-a time (when possible) and 500 samples at a time.  If our dataset
81
-contained fewer than 500 samples, the \Rfunction{ocSamples} option
78
+code-chunk, we declare that \crlmm{} should process 100,000 markers at
79
+a time (when possible) and 200 samples at a time.  If our dataset
80
+contained fewer than 200 samples, the \Rfunction{ocSamples} option
82 81
 would not have any effect.  One can view the current settings for
83 82
 these commands, by typing the functions without an argument.
84 83
 
85 84
 <<ram>>=
86
-ocProbesets(50e3)
85
+ocProbesets(100e3)
87 86
 ocSamples(200)
88 87
 @
89 88
 
... ...
@@ -96,11 +95,12 @@ methods:
96 95
 
97 96
 \begin{itemize}
98 97
 
99
-\item[Approach 1:] during the preprocessing of the raw intensities for Illumina and
100
-  Affymetrix arrays by the the functions \Rfunction{constructInf} and
101
-  \Rfunction{genotype}, respectively. (The \Rfunction{genotype} calls
102
-  the non-exported function \Rfunction{constructAffy} to initialize a
103
-  \Rclass{CNSet} object for Affymetrix platforms.)
98
+\item[Approach 1:] during the preprocessing of the raw intensities for
99
+  Illumina and Affymetrix arrays by the the functions
100
+  \Rfunction{constructInf} and \Rfunction{genotype},
101
+  respectively. (The \Rfunction{genotype} calls the function
102
+  \Rfunction{constructAffy} to initialize a \Rclass{CNSet} object for
103
+  Affymetrix platforms.)
104 104
 
105 105
 \item[Approach 2:] by subsetting an existing \Robject{CNSet} object.
106 106
   As per usual, the `[' method can be used to extract a subset of
... ...
@@ -110,17 +110,18 @@ methods:
110 110
 \end{itemize}
111 111
 
112 112
 There are important differences in the underlying data representation
113
-depending on how the object was instantiated.  In particular, objects
114
-generated by the functions \Rfunction{constructInf} and
115
-\Rfunction{genotype} store high-dimensional data on disk rather than
116
-in memory through protocols defined in the \R{} package \ff{}.  For
117
-instance, the normalized intensities and genotype calls in a
118
-\Rclass{CNSet}-instance from approach (1) are \ff{}-derived objects.
119
-By contrast, when such an objected generated by approach (1) is subset
120
-by the `[' method, an object of the same class is returned but the
121
-\Rpackage{ff}-derived objects are coerced to ordinary matrices. Note,
122
-therefore, that both approaches (1) and (2) may involve substantial
123
-I/O.
113
+depending on how the \Rclass{CNSet} object was instantiated.  In
114
+particular, objects generated by the functions
115
+\Rfunction{constructInf} and \Rfunction{genotype} store
116
+high-dimensional data on disk rather than in memory through protocols
117
+defined in the \R{} package \ff{}.  For instance, the normalized
118
+intensities and genotype calls in a \Rclass{CNSet}-instance from
119
+approach (1) are \ff{}-derived objects.  By contrast, when such an
120
+objected generated by approach (1) is subset by the `[' method, an
121
+object of the same class is returned but the \Rpackage{ff}-derived
122
+objects are coerced to ordinary matrices. Note, therefore, that both
123
+approaches (1) and (2) may involve substantial I/O and that (2) should
124
+be performed judiciously.
124 125
 
125 126
 \subsubsection{Approach 1}
126 127
 
... ...
@@ -149,7 +150,7 @@ celfiles <- list.celfiles(path, full.names=TRUE)
149 150
 Typically, an object of class \Rclass{CNSet} is instantiated as part
150 151
 of the preprocessing and genotyping by calling the function
151 152
 \Rfunction{genotype}, as illustrated in the
152
-\verb+AffymtrixPreprocessCN+ vignette.
153
+\verb+AffyGW+ vignette.
153 154
 
154 155
 <<instantiateToyExample,cache=TRUE>>=
155 156
 exampleSet <- genotype(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6")
... ...
@@ -163,11 +164,11 @@ ldPath()
163 164
 @
164 165
 
165 166
 One could also instantiate an object of class \Rclass{CNSet} without
166
-preprocessing/genotyping by calling the non-exported function
167
+preprocessing/genotyping by calling the exported function
167 168
 \Rfunction{constructAffy} directly using the \Rfunction{:::} operator.
168 169
 
169 170
 <<constructAffy,eval=FALSE>>=
170
-tmp <- crlmm:::constructAffy(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6")
171
+tmp <- constructAffy(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6")
171 172
 @
172 173
 
173 174
 The \Rfunction{show} method provides a concise summary of the
... ...
@@ -300,15 +301,6 @@ platform.  A consequence of keeping the rows of the assay data
300 301
 elements the same for all of the statistical summaries is that the
301 302
 matrix used to store genotype calls is larger than necessary.
302 303
 
303
-% only true for affy.  for illumina, we have intensities stored in 'B'
304
-%Note that NA's are stored in the slot for normalized 'B' allele
305
-%intensities:
306
-
307
-%<<B.NAs>>=
308
-%np.index <- which(!is.snp)
309
-%stopifnot(all(is.na(B(cnSet)[np.index, ])))
310
-%@
311
-
312 304
 \subsubsection{\texttt{batch} and \texttt{batchStatistics}}
313 305
 
314 306
 As defined in Leek \textit{et al.} 2010, \textit{Batch effects are
... ...
@@ -402,73 +394,10 @@ varLabels(protocolData(exampleSet))
402 394
 protocolData(exampleSet)$ScanDate
403 395
 @
404 396
 
405
-
406
-%\section{Suggested visualizations}
407
-%
408
-%\paragraph{SNR.}
409
-%
410
-%A histogram of the signal to noise ratio for the HapMap samples:
411
-%
412
-%<<plotSnr, fig=TRUE, include=FALSE>>=
413
-%open(cnSet$SNR)
414
-%hist(cnSet$SNR[, ], xlab="SNR", main="", breaks=25, col="lightblue", xlim=c(3, max(cnSet$SNR[])))
415
-%abline(v=5, lty=2, col="grey")
416
-%text(3,5, label="SNR range for low \n quality arrays", adj=0, col="grey40")
417
-%@
418
-%\begin{figure}
419
-%  \centering
420
-%  \includegraphics[width=0.6\textwidth]{copynumber-plotSnr}
421
-%  \caption{Signal to noise ratios for the HapMap samples. SNRs below 5
422
-%   for the Affymetrix platform are often samples of lower quality.
423
-%   Such samples will tend to have much more variable estimates of copy
424
-%   number.}
425
-%\end{figure}
426
-
427
-
428
-%\paragraph{One sample at a time: locus-level estimates}
429
-%
430
-%Figure \ref{fig:oneSample} plots physical position (horizontal axis)
431
-%versus copy number (vertical axis) for the first sample.  There is
432
-%less information to estimate copy number at nonpolymorphic loci;
433
-%improvements to the univariate prediction regions at nonpolymorphic
434
-%loci are a future area of research. If the \Rpackage{SNPchip} is
435
-%available, an idiogram can be added to the existing plotting
436
-%coordinates as indicated in the following example.
437
-%
438
-%<<oneSample, fig=TRUE, width=8, height=4, include=FALSE>>=
439
-%GT.CONF.THR <- 0.8
440
-%marker.index <- which(chromosome(cnSet) == 1)
441
-%cn <- totalCopynumber(cnSet, i=marker.index, j=1)
442
-%x <- position(cnSet)[marker.index]
443
-%par(las=1, mar=c(4, 5, 4, 2))
444
-%plot(x, cn, pch=".",
445
-%     cex=2, xaxt="n", col="grey60", ylim=c(0,6),
446
-%     ylab="copy number", xlab="physical position (Mb)",
447
-%     main=paste(sampleNames(cnSet)[1], ", CHR: 1"))
448
-%axis(1, at=pretty(x), labels=pretty(x)/1e6)
449
-%require(SNPchip)
450
-%invisible(plotCytoband(1, new=FALSE, cytoband.ycoords=c(5.5, 6), label.cytoband=FALSE))
451
-%@
452
-
453
-%\begin{figure}
454
-%  \includegraphics[width=0.9\textwidth]{copynumber-oneSample}
455
-%  \caption{\label{fig:oneSample} Total copy number (y-axis) for
456
-%    chromosome 1 plotted against physical position (x-axis) for one
457
-%    sample.  Estimates at nonpolymorphic loci are plotted in light
458
-%    blue.}
459
-%\end{figure}
460
-%
461
-%\clearpage
462
-%\paragraph{One SNP at a time}
463
-%
464
-%Scatterplots of the A and B allele intensities (log-scale) can be
465
-%useful for assessing the biallelic genotype calls.  This section of
466
-%the vignette is currently under development.
467
-
468 397
 \section{Trouble shooting with a HapMap example}
469 398
 
470 399
 This section uses an object of class \Rclass{CNSet} instantiated by
471
-the \verb+AffymetrixPreprocessCN+ vignette and saved to a local path
400
+the \verb+AffyGW+ vignette and saved to a local path
472 401
 on our computing cluster indicated by the object \Robject{outdir}
473 402
 below.  The \verb+copynumber+ vignette was used to fill out the
474 403
 \verb+batchStatistics+ slot of the \Robject{cnSet} object.
... ...
@@ -490,17 +419,6 @@ invisible(open(cnSet))
490 419
 
491 420
 \subsection{Missing values}
492 421
 
493
-% There are several reasons for estimates of the allele-specific copy
494
-% number to have missing values (\texttt{NA}'s). This section briefly
495
-% elaborates on the source of missing values in the HapMap analysis
496
-% and discusses possible alternatives to reduce the number of missing
497
-% values.  Note that allele-specific copy number, 'CA' and 'CB', is
498
-% not saved in the \Robject{cnSet} object.  Rather, the respective
499
-% accessors calculate 'CA' and 'CB' on the fly from the normalized
500
-% intensities and from the marker-specific parameter estimates in the
501
-% linear model.  In general, a missing value arises when the
502
-% background or slope parameter was not estimated in the linear
503
-% model.
504 422
 Most often, missing values occur when the genotype confidence scores
505 423
 for a SNP were below the threshold used by the
506 424
 \Robject{crlmmCopynumber} function. For the HapMap analysis, we used a
... ...
@@ -44,40 +44,17 @@ appropriate.
44 44
 \hline
45 45
  Vignette                &  Platform            &  Annotation package                        &  Scope                                               \\
46 46
 \hline
47
-Infrastructure          & Affy/Illumina                    &
48
-&  The CNSet container / large data support using the \Rpackage{ff} package            \\
49
- AffymetrixPreprocessCN  &  Affy 5.0, 6.0       &  genomewidesnp5Crlmm, genomewidesnp6Crlmm  &  Preprocessing and genotyping                        \\
50
- IlluminaPreprocessCN    &  Illumina  &  several$^\dagger$                                   &  Preprocessing and genotyping                        \\
51
- copynumber              &  Affy/Illumina       &  N/A                                       &  raw copy number estimates                           \\
52
-% SmoothingRawCN          &  Affy/Illumina       &  N/A                                       &  smoothing via segmentation or hidden Markov models  \\
47
+Infrastructure          & Affy/Illumina                    & &  The CNSet container / large data support using the \Rpackage{ff} package            \\
48
+ AffyGW  &  Affy 5.0, 6.0       &  genomewidesnp5Crlmm, genomewidesnp6Crlmm  &  Preprocessing, genotyping, CN estimation\\
49
+ IlluminaPreprocessCN    &  Illumina  &  several$^\dagger$ &  Preprocessing, genotyping, CN estimation
53 50
 \hline
54 51
 \end{tabular}
55 52
 \end{center}
56 53
 \caption{\label{overview} Vignettes for copy number
57
-  estimation. $^\dagger$ Annotation packages available for the
58
-  Illumina platform include  \Rpackage{human370v1cCrlmm}, \Rpackage{human370quadv3cCrlmm},
59
-  \Rpackage{human550v3bCrlmm}, \Rpackage{human650v3aCrlmm}, \Rpackage{human610quadv1bCrlmm},
60
-  \Rpackage{human660quadv1aCrlmm}, \Rpackage{human1mduov3bCrlmm}, and \Rpackage{humanomni1quadv1bCrlmm}}
54
+  estimation. $^\dagger$ See \texttt{annotationPackages()} for a
55
+  complete listing of supported Illumina/Affy platforms}
61 56
 \end{table}
62 57
 
63
-%We make use of the \R{} package \Rpackage{cacheSweave} for cacheing
64
-%code chunks that are computationally intensive.  In addition, we
65
-%indicate that the cached files should be stored in the directory
66
-%\verb+outdir+.
67
-
68
-In general, the workflow is
69
-\begin{enumerate}
70
-\item preprocess and genotype the arrays
71
-  (\verb+AffymetrixPreprocessCN+ for Affymetrix and
72
-  \verb+IlluminaPreprocessCN+ vignettes for Illumina)
73
-\item copy number estimation (\verb+copynumber+ vignette)
74
-%\item inferring regions of copy number gain and loss
75
-%  (\verb+SmoothingRawCN+ vignette)
76
-\end{enumerate}
77
-%The \verb+SmoothingRawCN+ vignette illustrates one approach for
78
-%interfacing with packages such as \Rpackage{DNAcopy} and
79
-%\Rpackage{VanillaICE} for identifying regions of copy number gain or
80
-%loss.
81 58
 The \verb+Infrastructure+ vignette provides additional details on the
82 59
 \Rclass{CNSet} container used to organize the processed data as well
83 60
 as a brief discussion regarding large data support through the \ff{}