git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/branches/RELEASE_2_11/madman/Rpacks/crlmm@71165 bc3139a8-67e5-0310-9ffc-ced21a209358
... | ... |
@@ -1,7 +1,7 @@ |
1 | 1 |
Package: crlmm |
2 | 2 |
Type: Package |
3 | 3 |
Title: Genotype Calling (CRLMM) and Copy Number Analysis tool for Affymetrix SNP 5.0 and 6.0 and Illumina arrays. |
4 |
-Version: 1.16.4 |
|
4 |
+Version: 1.16.5 |
|
5 | 5 |
Author: Benilton S Carvalho, Robert Scharpf, Matt Ritchie, Ingo Ruczinski, Rafael A Irizarry |
6 | 6 |
Maintainer: Benilton S Carvalho <Benilton.Carvalho@cancer.org.uk>, Robert Scharpf <rscharpf@jhsph.edu>, Matt Ritchie <mritchie@wehi.EDU.AU> |
7 | 7 |
Description: Faster implementation of CRLMM specific to SNP 5.0 and 6.0 arrays, as well as a copy number tool specific to 5.0, 6.0, and Illumina platforms |
... | ... |
@@ -29,14 +29,13 @@ |
29 | 29 |
|
30 | 30 |
This vignette provides an overview of the \Rclass{CNSet} class and a |
31 | 31 |
brief discussion of the underlying infrastructure for large data |
32 |
- support with the \Rpackage{ff} package. This package instantiates |
|
32 |
+ support with the \Rpackage{ff} package. This vignette instantiates |
|
33 | 33 |
an object of class \Rclass{CNSet} using a trivial dataset with 3 |
34 | 34 |
files. As this sample size is too small for estimating copy number |
35 | 35 |
with the \crlmm{} package, the final section of this vignette loads |
36 | 36 |
an object created by the analysis of 180 HapMap CEL files |
37 |
- (Affymetrix 6.0 platform). In particular, this object was |
|
38 |
- instantiated by running (1) the \verb+AffymetrixPreprocessCN+ |
|
39 |
- vignette and (2) the \verb+copynumber+ vignette. |
|
37 |
+ (Affymetrix 6.0 platform). This object was instantiated by running |
|
38 |
+ the \verb+AffyGW+ vignette. |
|
40 | 39 |
|
41 | 40 |
\end{abstract} |
42 | 41 |
|
... | ... |
@@ -76,14 +75,14 @@ processing subsets of the markers and/or samples. The functions |
76 | 75 |
markers and samples to read at once. In general, specifying smaller |
77 | 76 |
values should reduce the RAM required for a particular job but |
78 | 77 |
increase the run-time. In the following |
79 |
-code-chunk, we declare that \crlmm{} should process 150,000 markers at |
|
80 |
-a time (when possible) and 500 samples at a time. If our dataset |
|
81 |
-contained fewer than 500 samples, the \Rfunction{ocSamples} option |
|
78 |
+code-chunk, we declare that \crlmm{} should process 100,000 markers at |
|
79 |
+a time (when possible) and 200 samples at a time. If our dataset |
|
80 |
+contained fewer than 200 samples, the \Rfunction{ocSamples} option |
|
82 | 81 |
would not have any effect. One can view the current settings for |
83 | 82 |
these options by calling the functions without an argument. |
84 | 83 |
|
85 | 84 |
<<ram>>= |
86 |
-ocProbesets(50e3) |
|
85 |
+ocProbesets(100e3) |
|
87 | 86 |
ocSamples(200) |
88 | 87 |
@ |
89 | 88 |
|
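The trade-off between RAM and run-time described above can be
sketched in base \R{}. This is only an illustration of chunked
processing, not the code path used by \crlmm{}: the helper
\Rfunction{chunkIndices} and the marker count are hypothetical.

<<chunkSketch,eval=FALSE>>=
## Hypothetical helper: split 1..n into consecutive chunks of at most
## 'size' indices, mimicking "process 'size' markers at a time".
chunkIndices <- function(n, size) {
  idx <- seq_len(n)
  split(idx, ceiling(idx / size))
}
nMarkers <- 250e3  # assumed number of markers, for illustration only
chunks <- chunkIndices(nMarkers, 100e3)
length(chunks)          # number of passes over the data
sapply(chunks, length)  # markers held in memory during each pass
@

A smaller chunk size lowers the number of markers held in memory at
once but raises the number of passes, which is why smaller values
tend to reduce RAM and increase run-time.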
... | ... |
@@ -96,11 +95,12 @@ methods: |
96 | 95 |
|
97 | 96 |
\begin{itemize} |
98 | 97 |
|
99 |
-\item[Approach 1:] during the preprocessing of the raw intensities for Illumina and |
|
100 |
- Affymetrix arrays by the the functions \Rfunction{constructInf} and |
|
101 |
- \Rfunction{genotype}, respectively. (The \Rfunction{genotype} calls |
|
102 |
- the non-exported function \Rfunction{constructAffy} to initialize a |
|
103 |
- \Rclass{CNSet} object for Affymetrix platforms.) |
|
98 |
+\item[Approach 1:] during the preprocessing of the raw intensities for |
|
99 |
+ Illumina and Affymetrix arrays by the functions |
|
100 |
+ \Rfunction{constructInf} and \Rfunction{genotype}, |
|
101 |
+ respectively. (The function \Rfunction{genotype} calls |
|
102 |
+ \Rfunction{constructAffy} to initialize a \Rclass{CNSet} object for |
|
103 |
+ Affymetrix platforms.) |
|
104 | 104 |
|
105 | 105 |
\item[Approach 2:] by subsetting an existing \Robject{CNSet} object. |
106 | 106 |
As per usual, the `[' method can be used to extract a subset of |
... | ... |
@@ -110,17 +110,18 @@ methods: |
110 | 110 |
\end{itemize} |
111 | 111 |
|
112 | 112 |
There are important differences in the underlying data representation |
113 |
-depending on how the object was instantiated. In particular, objects |
|
114 |
-generated by the functions \Rfunction{constructInf} and |
|
115 |
-\Rfunction{genotype} store high-dimensional data on disk rather than |
|
116 |
-in memory through protocols defined in the \R{} package \ff{}. For |
|
117 |
-instance, the normalized intensities and genotype calls in a |
|
118 |
-\Rclass{CNSet}-instance from approach (1) are \ff{}-derived objects. |
|
119 |
-By contrast, when such an objected generated by approach (1) is subset |
|
120 |
-by the `[' method, an object of the same class is returned but the |
|
121 |
-\Rpackage{ff}-derived objects are coerced to ordinary matrices. Note, |
|
122 |
-therefore, that both approaches (1) and (2) may involve substantial |
|
123 |
-I/O. |
|
113 |
+depending on how the \Rclass{CNSet} object was instantiated. In |
|
114 |
+particular, objects generated by the functions |
|
115 |
+\Rfunction{constructInf} and \Rfunction{genotype} store |
|
116 |
+high-dimensional data on disk rather than in memory through protocols |
|
117 |
+defined in the \R{} package \ff{}. For instance, the normalized |
|
118 |
+intensities and genotype calls in a \Rclass{CNSet}-instance from |
|
119 |
+approach (1) are \ff{}-derived objects. By contrast, when such an |
|
120 |
+object generated by approach (1) is subset by the `[' method, an |
|
121 |
+object of the same class is returned but the \Rpackage{ff}-derived |
|
122 |
+objects are coerced to ordinary matrices. Note, therefore, that both |
|
123 |
+approaches (1) and (2) may involve substantial I/O and that (2) should |
|
124 |
+be performed judiciously. |
|
124 | 125 |
|
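The coercion described above can be seen with a toy \ff{} object. The
sketch below is a minimal illustration that assumes the
\Rpackage{ff} package is installed; it uses a small \ff{} matrix
rather than a \Rclass{CNSet} object, and the object names are
hypothetical.

<<ffSketch,eval=FALSE>>=
library(ff)
## A small file-backed matrix (10 markers x 3 samples) stored on disk.
x <- ff(initdata=rnorm(30), dim=c(10, 3))
is.ff(x)      # TRUE: the data live in a file rather than in RAM
## Subsetting with '[' reads the requested block into memory and
## returns an ordinary matrix.
y <- x[1:5, ]
is.matrix(y)  # TRUE
is.ff(y)      # FALSE
@

Each such subset copies the selected block from disk into RAM, so
extracting large blocks repeatedly can be costly.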
125 | 126 |
\subsubsection{Approach 1} |
126 | 127 |
|
... | ... |
@@ -149,7 +150,7 @@ celfiles <- list.celfiles(path, full.names=TRUE) |
149 | 150 |
Typically, an object of class \Rclass{CNSet} is instantiated as part |
150 | 151 |
of the preprocessing and genotyping by calling the function |
151 | 152 |
\Rfunction{genotype}, as illustrated in the |
152 |
-\verb+AffymtrixPreprocessCN+ vignette. |
|
153 |
+\verb+AffyGW+ vignette. |
|
153 | 154 |
|
154 | 155 |
<<instantiateToyExample,cache=TRUE>>= |
155 | 156 |
exampleSet <- genotype(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6") |
... | ... |
@@ -163,11 +164,11 @@ ldPath() |
163 | 164 |
@ |
164 | 165 |
|
165 | 166 |
One could also instantiate an object of class \Rclass{CNSet} without |
166 |
-preprocessing/genotyping by calling the non-exported function |
|
167 |
+preprocessing/genotyping by calling the exported function |
|
167 | 168 |
\Rfunction{constructAffy} directly. |
168 | 169 |
|
169 | 170 |
<<constructAffy,eval=FALSE>>= |
170 |
-tmp <- crlmm:::constructAffy(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6") |
|
171 |
+tmp <- constructAffy(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6") |
|
171 | 172 |
@ |
172 | 173 |
|
173 | 174 |
The \Rfunction{show} method provides a concise summary of the |
... | ... |
@@ -300,15 +301,6 @@ platform. A consequence of keeping the rows of the assay data |
300 | 301 |
elements the same for all of the statistical summaries is that the |
301 | 302 |
matrix used to store genotype calls is larger than necessary. |
302 | 303 |
|
303 |
-% only true for affy. for illumina, we have intensities stored in 'B' |
|
304 |
-%Note that NA's are stored in the slot for normalized 'B' allele |
|
305 |
-%intensities: |
|
306 |
- |
|
307 |
-%<<B.NAs>>= |
|
308 |
-%np.index <- which(!is.snp) |
|
309 |
-%stopifnot(all(is.na(B(cnSet)[np.index, ]))) |
|
310 |
-%@ |
|
311 |
- |
|
312 | 304 |
\subsubsection{\texttt{batch} and \texttt{batchStatistics}} |
313 | 305 |
|
314 | 306 |
As defined in Leek \textit{et al.} 2010, \textit{Batch effects are |
... | ... |
@@ -402,73 +394,10 @@ varLabels(protocolData(exampleSet)) |
402 | 394 |
protocolData(exampleSet)$ScanDate |
403 | 395 |
@ |
404 | 396 |
|
405 |
- |
|
406 |
-%\section{Suggested visualizations} |
|
407 |
-% |
|
408 |
-%\paragraph{SNR.} |
|
409 |
-% |
|
410 |
-%A histogram of the signal to noise ratio for the HapMap samples: |
|
411 |
-% |
|
412 |
-%<<plotSnr, fig=TRUE, include=FALSE>>= |
|
413 |
-%open(cnSet$SNR) |
|
414 |
-%hist(cnSet$SNR[, ], xlab="SNR", main="", breaks=25, col="lightblue", xlim=c(3, max(cnSet$SNR[]))) |
|
415 |
-%abline(v=5, lty=2, col="grey") |
|
416 |
-%text(3,5, label="SNR range for low \n quality arrays", adj=0, col="grey40") |
|
417 |
-%@ |
|
418 |
-%\begin{figure} |
|
419 |
-% \centering |
|
420 |
-% \includegraphics[width=0.6\textwidth]{copynumber-plotSnr} |
|
421 |
-% \caption{Signal to noise ratios for the HapMap samples. SNRs below 5 |
|
422 |
-% for the Affymetrix platform are often samples of lower quality. |
|
423 |
-% Such samples will tend to have much more variable estimates of copy |
|
424 |
-% number.} |
|
425 |
-%\end{figure} |
|
426 |
- |
|
427 |
- |
|
428 |
-%\paragraph{One sample at a time: locus-level estimates} |
|
429 |
-% |
|
430 |
-%Figure \ref{fig:oneSample} plots physical position (horizontal axis) |
|
431 |
-%versus copy number (vertical axis) for the first sample. There is |
|
432 |
-%less information to estimate copy number at nonpolymorphic loci; |
|
433 |
-%improvements to the univariate prediction regions at nonpolymorphic |
|
434 |
-%loci are a future area of research. If the \Rpackage{SNPchip} is |
|
435 |
-%available, an idiogram can be added to the existing plotting |
|
436 |
-%coordinates as indicated in the following example. |
|
437 |
-% |
|
438 |
-%<<oneSample, fig=TRUE, width=8, height=4, include=FALSE>>= |
|
439 |
-%GT.CONF.THR <- 0.8 |
|
440 |
-%marker.index <- which(chromosome(cnSet) == 1) |
|
441 |
-%cn <- totalCopynumber(cnSet, i=marker.index, j=1) |
|
442 |
-%x <- position(cnSet)[marker.index] |
|
443 |
-%par(las=1, mar=c(4, 5, 4, 2)) |
|
444 |
-%plot(x, cn, pch=".", |
|
445 |
-% cex=2, xaxt="n", col="grey60", ylim=c(0,6), |
|
446 |
-% ylab="copy number", xlab="physical position (Mb)", |
|
447 |
-% main=paste(sampleNames(cnSet)[1], ", CHR: 1")) |
|
448 |
-%axis(1, at=pretty(x), labels=pretty(x)/1e6) |
|
449 |
-%require(SNPchip) |
|
450 |
-%invisible(plotCytoband(1, new=FALSE, cytoband.ycoords=c(5.5, 6), label.cytoband=FALSE)) |
|
451 |
-%@ |
|
452 |
- |
|
453 |
-%\begin{figure} |
|
454 |
-% \includegraphics[width=0.9\textwidth]{copynumber-oneSample} |
|
455 |
-% \caption{\label{fig:oneSample} Total copy number (y-axis) for |
|
456 |
-% chromosome 1 plotted against physical position (x-axis) for one |
|
457 |
-% sample. Estimates at nonpolymorphic loci are plotted in light |
|
458 |
-% blue.} |
|
459 |
-%\end{figure} |
|
460 |
-% |
|
461 |
-%\clearpage |
|
462 |
-%\paragraph{One SNP at a time} |
|
463 |
-% |
|
464 |
-%Scatterplots of the A and B allele intensities (log-scale) can be |
|
465 |
-%useful for assessing the biallelic genotype calls. This section of |
|
466 |
-%the vignette is currently under development. |
|
467 |
- |
|
468 | 397 |
\section{Troubleshooting with a HapMap example} |
469 | 398 |
|
470 | 399 |
This section uses an object of class \Rclass{CNSet} instantiated by |
471 |
-the \verb+AffymetrixPreprocessCN+ vignette and saved to a local path |
|
400 |
+the \verb+AffyGW+ vignette and saved to a local path |
|
472 | 401 |
on our computing cluster indicated by the object \Robject{outdir} |
473 | 402 |
below. The \verb+copynumber+ vignette was used to fill out the |
474 | 403 |
\verb+batchStatistics+ slot of the \Robject{cnSet} object. |
... | ... |
@@ -490,17 +419,6 @@ invisible(open(cnSet)) |
490 | 419 |
|
491 | 420 |
\subsection{Missing values} |
492 | 421 |
|
493 |
-% There are several reasons for estimates of the allele-specific copy |
|
494 |
-% number to have missing values (\texttt{NA}'s). This section briefly |
|
495 |
-% elaborates on the source of missing values in the HapMap analysis |
|
496 |
-% and discusses possible alternatives to reduce the number of missing |
|
497 |
-% values. Note that allele-specific copy number, 'CA' and 'CB', is |
|
498 |
-% not saved in the \Robject{cnSet} object. Rather, the respective |
|
499 |
-% accessors calculate 'CA' and 'CB' on the fly from the normalized |
|
500 |
-% intensities and from the marker-specific parameter estimates in the |
|
501 |
-% linear model. In general, a missing value arises when the |
|
502 |
-% background or slope parameter was not estimated in the linear |
|
503 |
-% model. |
|
504 | 422 |
Most often, missing values occur when the genotype confidence scores |
505 | 423 |
for a SNP were below the threshold used by the |
506 | 424 |
\Robject{crlmmCopynumber} function. For the HapMap analysis, we used a |
... | ... |
@@ -44,40 +44,17 @@ appropriate. |
44 | 44 |
\hline |
45 | 45 |
Vignette & Platform & Annotation package & Scope \\ |
46 | 46 |
\hline |
47 |
-Infrastructure & Affy/Illumina & |
|
48 |
-& The CNSet container / large data support using the \Rpackage{ff} package \\ |
|
49 |
- AffymetrixPreprocessCN & Affy 5.0, 6.0 & genomewidesnp5Crlmm, genomewidesnp6Crlmm & Preprocessing and genotyping \\ |
|
50 |
- IlluminaPreprocessCN & Illumina & several$^\dagger$ & Preprocessing and genotyping \\ |
|
51 |
- copynumber & Affy/Illumina & N/A & raw copy number estimates \\ |
|
52 |
-% SmoothingRawCN & Affy/Illumina & N/A & smoothing via segmentation or hidden Markov models \\ |
|
47 |
+Infrastructure & Affy/Illumina & & The CNSet container / large data support using the \Rpackage{ff} package \\ |
|
48 |
+ AffyGW & Affy 5.0, 6.0 & genomewidesnp5Crlmm, genomewidesnp6Crlmm & Preprocessing, genotyping, CN estimation\\ |
|
49 |
+ IlluminaPreprocessCN & Illumina & several$^\dagger$ & Preprocessing, genotyping, CN estimation \\ |
|
53 | 50 |
\hline |
54 | 51 |
\end{tabular} |
55 | 52 |
\end{center} |
56 | 53 |
\caption{\label{overview} Vignettes for copy number |
57 |
- estimation. $^\dagger$ Annotation packages available for the |
|
58 |
- Illumina platform include \Rpackage{human370v1cCrlmm}, \Rpackage{human370quadv3cCrlmm}, |
|
59 |
- \Rpackage{human550v3bCrlmm}, \Rpackage{human650v3aCrlmm}, \Rpackage{human610quadv1bCrlmm}, |
|
60 |
- \Rpackage{human660quadv1aCrlmm}, \Rpackage{human1mduov3bCrlmm}, and \Rpackage{humanomni1quadv1bCrlmm}} |
|
54 |
+ estimation. $^\dagger$ See \texttt{annotationPackages()} for a |
|
55 |
+ complete listing of supported Illumina/Affy platforms} |
|
61 | 56 |
\end{table} |
62 | 57 |
|
63 |
-%We make use of the \R{} package \Rpackage{cacheSweave} for cacheing |
|
64 |
-%code chunks that are computationally intensive. In addition, we |
|
65 |
-%indicate that the cached files should be stored in the directory |
|
66 |
-%\verb+outdir+. |
|
67 |
- |
|
68 |
-In general, the workflow is |
|
69 |
-\begin{enumerate} |
|
70 |
-\item preprocess and genotype the arrays |
|
71 |
- (\verb+AffymetrixPreprocessCN+ for Affymetrix and |
|
72 |
- \verb+IlluminaPreprocessCN+ vignettes for Illumina) |
|
73 |
-\item copy number estimation (\verb+copynumber+ vignette) |
|
74 |
-%\item inferring regions of copy number gain and loss |
|
75 |
-% (\verb+SmoothingRawCN+ vignette) |
|
76 |
-\end{enumerate} |
|
77 |
-%The \verb+SmoothingRawCN+ vignette illustrates one approach for |
|
78 |
-%interfacing with packages such as \Rpackage{DNAcopy} and |
|
79 |
-%\Rpackage{VanillaICE} for identifying regions of copy number gain or |
|
80 |
-%loss. |
|
81 | 58 |
The \verb+Infrastructure+ vignette provides additional details on the |
82 | 59 |
\Rclass{CNSet} container used to organize the processed data as well |
83 | 60 |
as a brief discussion regarding large data support through the \ff{} |