* collab:
update vignettes/Makefile
update .gitignore
update table in CopyNumberViewiew
bump dependency on oligoClasses version
Update alias for "[" method for PredictionRegion objects
Replaced CopyNumberOverview and Infrastructure vignettes with those in the release branch. The versions in release have removed AffymetrixPreprocessCN, which is no longer available
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@71322 bc3139a8-67e5-0310-9ffc-ced21a209358
... | ... |
@@ -29,14 +29,13 @@ |
29 | 29 |
|
30 | 30 |
This vignette provides an overview of the \Rclass{CNSet} class and a |
31 | 31 |
brief discussion of the underlying infrastructure for large data |
32 |
- support with the \Rpackage{ff} package. This package instantiates |
|
32 |
+ support with the \Rpackage{ff} package. This vignette instantiates |
|
33 | 33 |
an object of class \Rclass{CNSet} using a trivial dataset with 3 |
34 | 34 |
files. As this sample size is too small for estimating copy number |
35 | 35 |
with the \crlmm{} package, the final section of this vignette loads |
36 | 36 |
an object created by the analysis of 180 HapMap CEL files |
37 |
- (Affymetrix 6.0 platform). In particular, this object was |
|
38 |
- instantiated by running (1) the \verb+AffymetrixPreprocessCN+ |
|
39 |
- vignette and (2) the \verb+copynumber+ vignette. |
|
37 |
+ (Affymetrix 6.0 platform). This object was instantiated by running |
|
38 |
+ the \verb+AffyGW+ vignette. |
|
40 | 39 |
|
41 | 40 |
\end{abstract} |
42 | 41 |
|
... | ... |
@@ -76,14 +75,14 @@ processing subsets of the markers and/or samples. The functions |
76 | 75 |
markers and samples to read at once. In general, specifying smaller |
77 | 76 |
values should reduce the RAM required for a particular job. In |
78 | 77 |
general, smaller values will increase the run-time. In the following |
79 |
-code-chunk, we declare that \crlmm{} should process 150,000 markers at |
|
80 |
-a time (when possible) and 500 samples at a time. If our dataset |
|
81 |
-contained fewer than 500 samples, the \Rfunction{ocSamples} option |
|
78 |
+code-chunk, we declare that \crlmm{} should process 100,000 markers at |
|
79 |
+a time (when possible) and 200 samples at a time. If our dataset |
|
80 |
+contained fewer than 200 samples, the \Rfunction{ocSamples} option |
|
82 | 81 |
would not have any effect. One can view the current settings for |
83 | 82 |
these commands, by typing the functions without an argument. |
84 | 83 |
|
85 | 84 |
<<ram>>= |
86 |
-ocProbesets(50e3) |
|
85 |
+ocProbesets(100e3) |
|
87 | 86 |
ocSamples(200) |
88 | 87 |
@ |
89 | 88 |
|
... | ... |
@@ -96,11 +95,12 @@ methods: |
96 | 95 |
|
97 | 96 |
\begin{itemize} |
98 | 97 |
|
99 |
-\item[Approach 1:] during the preprocessing of the raw intensities for Illumina and |
|
100 |
- Affymetrix arrays by the the functions \Rfunction{constructInf} and |
|
101 |
- \Rfunction{genotype}, respectively. (The \Rfunction{genotype} calls |
|
102 |
- the non-exported function \Rfunction{constructAffy} to initialize a |
|
103 |
- \Rclass{CNSet} object for Affymetrix platforms.) |
|
98 |
+\item[Approach 1:] during the preprocessing of the raw intensities for |
|
99 |
+ Illumina and Affymetrix arrays by the the functions |
|
100 |
+ \Rfunction{constructInf} and \Rfunction{genotype}, |
|
101 |
+ respectively. (The \Rfunction{genotype} calls the function |
|
102 |
+ \Rfunction{constructAffy} to initialize a \Rclass{CNSet} object for |
|
103 |
+ Affymetrix platforms.) |
|
104 | 104 |
|
105 | 105 |
\item[Approach 2:] by subsetting an existing \Robject{CNSet} object. |
106 | 106 |
As per usual, the `[' method can be used to extract a subset of |
... | ... |
@@ -110,17 +110,18 @@ methods: |
110 | 110 |
\end{itemize} |
111 | 111 |
|
112 | 112 |
There are important differences in the underlying data representation |
113 |
-depending on how the object was instantiated. In particular, objects |
|
114 |
-generated by the functions \Rfunction{constructInf} and |
|
115 |
-\Rfunction{genotype} store high-dimensional data on disk rather than |
|
116 |
-in memory through protocols defined in the \R{} package \ff{}. For |
|
117 |
-instance, the normalized intensities and genotype calls in a |
|
118 |
-\Rclass{CNSet}-instance from approach (1) are \ff{}-derived objects. |
|
119 |
-By contrast, when such an objected generated by approach (1) is subset |
|
120 |
-by the `[' method, an object of the same class is returned but the |
|
121 |
-\Rpackage{ff}-derived objects are coerced to ordinary matrices. Note, |
|
122 |
-therefore, that both approaches (1) and (2) may involve substantial |
|
123 |
-I/O. |
|
113 |
+depending on how the \Rclass{CNSet} object was instantiated. In |
|
114 |
+particular, objects generated by the functions |
|
115 |
+\Rfunction{constructInf} and \Rfunction{genotype} store |
|
116 |
+high-dimensional data on disk rather than in memory through protocols |
|
117 |
+defined in the \R{} package \ff{}. For instance, the normalized |
|
118 |
+intensities and genotype calls in a \Rclass{CNSet}-instance from |
|
119 |
+approach (1) are \ff{}-derived objects. By contrast, when such an |
|
120 |
+objected generated by approach (1) is subset by the `[' method, an |
|
121 |
+object of the same class is returned but the \Rpackage{ff}-derived |
|
122 |
+objects are coerced to ordinary matrices. Note, therefore, that both |
|
123 |
+approaches (1) and (2) may involve substantial I/O and that (2) should |
|
124 |
+be performed judiciously. |
|
124 | 125 |
|
125 | 126 |
\subsubsection{Approach 1} |
126 | 127 |
|
... | ... |
@@ -149,7 +150,7 @@ celfiles <- list.celfiles(path, full.names=TRUE) |
149 | 150 |
Typically, an object of class \Rclass{CNSet} is instantiated as part |
150 | 151 |
of the preprocessing and genotyping by calling the function |
151 | 152 |
\Rfunction{genotype}, as illustrated in the |
152 |
-\verb+AffymtrixPreprocessCN+ vignette. |
|
153 |
+\verb+AffyGW+ vignette. |
|
153 | 154 |
|
154 | 155 |
<<instantiateToyExample,cache=TRUE>>= |
155 | 156 |
exampleSet <- genotype(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6") |
... | ... |
@@ -163,11 +164,11 @@ ldPath() |
163 | 164 |
@ |
164 | 165 |
|
165 | 166 |
One could also instantiate an object of class \Rclass{CNSet} without |
166 |
-preprocessing/genotyping by calling the non-exported function |
|
167 |
+preprocessing/genotyping by calling the exported function |
|
167 | 168 |
\Rfunction{constructAffy} directly using the \Rfunction{:::} operator. |
168 | 169 |
|
169 | 170 |
<<constructAffy,eval=FALSE>>= |
170 |
-tmp <- crlmm:::constructAffy(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6") |
|
171 |
+tmp <- constructAffy(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6") |
|
171 | 172 |
@ |
172 | 173 |
|
173 | 174 |
The \Rfunction{show} method provides a concise summary of the |
... | ... |
@@ -300,15 +301,6 @@ platform. A consequence of keeping the rows of the assay data |
300 | 301 |
elements the same for all of the statistical summaries is that the |
301 | 302 |
matrix used to store genotype calls is larger than necessary. |
302 | 303 |
|
303 |
-% only true for affy. for illumina, we have intensities stored in 'B' |
|
304 |
-%Note that NA's are stored in the slot for normalized 'B' allele |
|
305 |
-%intensities: |
|
306 |
- |
|
307 |
-%<<B.NAs>>= |
|
308 |
-%np.index <- which(!is.snp) |
|
309 |
-%stopifnot(all(is.na(B(cnSet)[np.index, ]))) |
|
310 |
-%@ |
|
311 |
- |
|
312 | 304 |
\subsubsection{\texttt{batch} and \texttt{batchStatistics}} |
313 | 305 |
|
314 | 306 |
As defined in Leek \textit{et al.} 2010, \textit{Batch effects are |
... | ... |
@@ -402,73 +394,10 @@ varLabels(protocolData(exampleSet)) |
402 | 394 |
protocolData(exampleSet)$ScanDate |
403 | 395 |
@ |
404 | 396 |
|
405 |
- |
|
406 |
-%\section{Suggested visualizations} |
|
407 |
-% |
|
408 |
-%\paragraph{SNR.} |
|
409 |
-% |
|
410 |
-%A histogram of the signal to noise ratio for the HapMap samples: |
|
411 |
-% |
|
412 |
-%<<plotSnr, fig=TRUE, include=FALSE>>= |
|
413 |
-%open(cnSet$SNR) |
|
414 |
-%hist(cnSet$SNR[, ], xlab="SNR", main="", breaks=25, col="lightblue", xlim=c(3, max(cnSet$SNR[]))) |
|
415 |
-%abline(v=5, lty=2, col="grey") |
|
416 |
-%text(3,5, label="SNR range for low \n quality arrays", adj=0, col="grey40") |
|
417 |
-%@ |
|
418 |
-%\begin{figure} |
|
419 |
-% \centering |
|
420 |
-% \includegraphics[width=0.6\textwidth]{copynumber-plotSnr} |
|
421 |
-% \caption{Signal to noise ratios for the HapMap samples. SNRs below 5 |
|
422 |
-% for the Affymetrix platform are often samples of lower quality. |
|
423 |
-% Such samples will tend to have much more variable estimates of copy |
|
424 |
-% number.} |
|
425 |
-%\end{figure} |
|
426 |
- |
|
427 |
- |
|
428 |
-%\paragraph{One sample at a time: locus-level estimates} |
|
429 |
-% |
|
430 |
-%Figure \ref{fig:oneSample} plots physical position (horizontal axis) |
|
431 |
-%versus copy number (vertical axis) for the first sample. There is |
|
432 |
-%less information to estimate copy number at nonpolymorphic loci; |
|
433 |
-%improvements to the univariate prediction regions at nonpolymorphic |
|
434 |
-%loci are a future area of research. If the \Rpackage{SNPchip} is |
|
435 |
-%available, an idiogram can be added to the existing plotting |
|
436 |
-%coordinates as indicated in the following example. |
|
437 |
-% |
|
438 |
-%<<oneSample, fig=TRUE, width=8, height=4, include=FALSE>>= |
|
439 |
-%GT.CONF.THR <- 0.8 |
|
440 |
-%marker.index <- which(chromosome(cnSet) == 1) |
|
441 |
-%cn <- totalCopynumber(cnSet, i=marker.index, j=1) |
|
442 |
-%x <- position(cnSet)[marker.index] |
|
443 |
-%par(las=1, mar=c(4, 5, 4, 2)) |
|
444 |
-%plot(x, cn, pch=".", |
|
445 |
-% cex=2, xaxt="n", col="grey60", ylim=c(0,6), |
|
446 |
-% ylab="copy number", xlab="physical position (Mb)", |
|
447 |
-% main=paste(sampleNames(cnSet)[1], ", CHR: 1")) |
|
448 |
-%axis(1, at=pretty(x), labels=pretty(x)/1e6) |
|
449 |
-%require(SNPchip) |
|
450 |
-%invisible(plotCytoband(1, new=FALSE, cytoband.ycoords=c(5.5, 6), label.cytoband=FALSE)) |
|
451 |
-%@ |
|
452 |
- |
|
453 |
-%\begin{figure} |
|
454 |
-% \includegraphics[width=0.9\textwidth]{copynumber-oneSample} |
|
455 |
-% \caption{\label{fig:oneSample} Total copy number (y-axis) for |
|
456 |
-% chromosome 1 plotted against physical position (x-axis) for one |
|
457 |
-% sample. Estimates at nonpolymorphic loci are plotted in light |
|
458 |
-% blue.} |
|
459 |
-%\end{figure} |
|
460 |
-% |
|
461 |
-%\clearpage |
|
462 |
-%\paragraph{One SNP at a time} |
|
463 |
-% |
|
464 |
-%Scatterplots of the A and B allele intensities (log-scale) can be |
|
465 |
-%useful for assessing the biallelic genotype calls. This section of |
|
466 |
-%the vignette is currently under development. |
|
467 |
- |
|
468 | 397 |
\section{Trouble shooting with a HapMap example} |
469 | 398 |
|
470 | 399 |
This section uses an object of class \Rclass{CNSet} instantiated by |
471 |
-the \verb+AffymetrixPreprocessCN+ vignette and saved to a local path |
|
400 |
+the \verb+AffyGW+ vignette and saved to a local path |
|
472 | 401 |
on our computing cluster indicated by the object \Robject{outdir} |
473 | 402 |
below. The \verb+copynumber+ vignette was used to fill out the |
474 | 403 |
\verb+batchStatistics+ slot of the \Robject{cnSet} object. |
... | ... |
@@ -490,17 +419,6 @@ invisible(open(cnSet)) |
490 | 419 |
|
491 | 420 |
\subsection{Missing values} |
492 | 421 |
|
493 |
-% There are several reasons for estimates of the allele-specific copy |
|
494 |
-% number to have missing values (\texttt{NA}'s). This section briefly |
|
495 |
-% elaborates on the source of missing values in the HapMap analysis |
|
496 |
-% and discusses possible alternatives to reduce the number of missing |
|
497 |
-% values. Note that allele-specific copy number, 'CA' and 'CB', is |
|
498 |
-% not saved in the \Robject{cnSet} object. Rather, the respective |
|
499 |
-% accessors calculate 'CA' and 'CB' on the fly from the normalized |
|
500 |
-% intensities and from the marker-specific parameter estimates in the |
|
501 |
-% linear model. In general, a missing value arises when the |
|
502 |
-% background or slope parameter was not estimated in the linear |
|
503 |
-% model. |
|
504 | 422 |
Most often, missing values occur when the genotype confidence scores |
505 | 423 |
for a SNP were below the threshold used by the |
506 | 424 |
\Robject{crlmmCopynumber} function. For the HapMap analysis, we used a |
* collab: (34 commits)
revert change to IlluminaPreprocessCN
fix bug in isValidCdfName
print warning when all features in a batch of probes are flagged, but allow processing to continue
add utility cleancdfnames
Add validCdfNames.Rd
export validCdfNames
imputeGender fix when chromosome Y not available
Use splitIndicesByLength(index, ocSamples/getDoParWorkers())
Can not allocate vector of size XG with genotype.Illumina. Use splitIndicesByNode() only if the length of the list is greater than the split from splitIndicesByLength(). Otherwise, split by length using ocSamples()
update .gitignore
Add make.unique for sampleSheet$Sample_ID in readIdatFiles
bug in description
ensure sample ids stored in samplesheet are unique when constructing cnSet object
update oligoClasses dependency
update unit test for genotype.Illumina
revert change in constructInf call from genotype.Illumina
Update genotype.Rd
edit ACN function
1.15.6 use make.unique(basename(arrayNames)) to allow processing of Illumina samples with duplicated barcodes
check that sample identifies are unique in crlmm function
...
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@67435 bc3139a8-67e5-0310-9ffc-ced21a209358
... | ... |
@@ -129,11 +129,9 @@ directory to store output files and setting the \Rfunction{ldPath} |
129 | 129 |
function with this path. |
130 | 130 |
|
131 | 131 |
<<path>>= |
132 |
-if(getRversion() < "2.13.0"){ |
|
133 |
- rpath <- getRversion() |
|
134 |
-} else rpath <- "trunk" |
|
135 |
-outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="") |
|
132 |
+outdir <- paste("/local_data/r00/crlmm/", getRversion(), "/infrastructure", sep="") |
|
136 | 133 |
ldPath(outdir) |
134 |
+if(!file.exists(outdir)) dir.create(outdir) |
|
137 | 135 |
@ |
138 | 136 |
|
139 | 137 |
|
... | ... |
@@ -258,6 +256,7 @@ is a SNP can be accessed through accessors defined for the |
258 | 256 |
\verb+featureData+. |
259 | 257 |
|
260 | 258 |
<<featuredataAccessors>>= |
259 |
+library(Biobase) |
|
261 | 260 |
fvarLabels(exampleSet) |
262 | 261 |
position(exampleSet)[1:10] |
263 | 262 |
chromosome(exampleSet)[1:10] |
... | ... |
@@ -554,7 +553,7 @@ estimates are missing for a given SNP. |
554 | 553 |
|
555 | 554 |
<<NAchromosomeX>>= |
556 | 555 |
## start with first batch |
557 |
-sample.index <- 1:90 |
|
556 |
+sample.index <- which(batch(cnSet)==batch(cnSet)[1]) |
|
558 | 557 |
X.index <- which(isSnp(cnSet) & chromosome(cnSet) == 23) |
559 | 558 |
ca.X <- CA(cnSet, i=X.index, j=sample.index) |
560 | 559 |
missing.caX <- is.na(ca.X) |
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54327 bc3139a8-67e5-0310-9ffc-ced21a209358
... | ... |
@@ -22,36 +22,27 @@ |
22 | 22 |
\title{Infrastructure for copy number analysis in \crlmm{}} |
23 | 23 |
\date{\today} |
24 | 24 |
\author{Rob Scharpf} |
25 |
+ |
|
25 | 26 |
\maketitle |
26 | 27 |
|
28 |
+\begin{abstract} |
|
27 | 29 |
|
28 |
-\section{Set up} |
|
30 |
+ This vignette provides an overview of the \Rclass{CNSet} class and a |
|
31 |
+ brief discussion of the underlying infrastructure for large data |
|
32 |
+ support with the \Rpackage{ff} package. This package instantiates |
|
33 |
+ an object of class \Rclass{CNSet} using a trivial dataset with 3 |
|
34 |
+ files. As this sample size is too small for estimating copy number |
|
35 |
+ with the \crlmm{} package, the final section of this vignette loads |
|
36 |
+ an object created by the analysis of 180 HapMap CEL files |
|
37 |
+ (Affymetrix 6.0 platform). In particular, this object was |
|
38 |
+ instantiated by running (1) the \verb+AffymetrixPreprocessCN+ |
|
39 |
+ vignette and (2) the \verb+copynumber+ vignette. |
|
29 | 40 |
|
30 |
-As in the previous vignettes, we load the required libraries and |
|
31 |
-specify a path for storing output files. |
|
41 |
+\end{abstract} |
|
32 | 42 |
|
33 | 43 |
<<libraries,results=hide>>= |
34 | 44 |
library(ff) |
35 | 45 |
library(crlmm) |
36 |
-library(cacheSweave) |
|
37 |
-require(DNAcopy) |
|
38 |
-require(VanillaICE) |
|
39 |
-if(getRversion() < "2.13.0"){ |
|
40 |
- rpath <- getRversion() |
|
41 |
-} else rpath <- "trunk" |
|
42 |
-outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="") |
|
43 |
-@ |
|
44 |
- |
|
45 |
-<<ram>>= |
|
46 |
-ldPath(outdir) |
|
47 |
-setCacheDir(outdir) |
|
48 |
-@ |
|
49 |
- |
|
50 |
-We begin by loading the \Robject{cnSet} object created by the |
|
51 |
-\verb+AffymetrixPreprocessCN+ vignette. |
|
52 |
- |
|
53 |
-<<loadcnset>>= |
|
54 |
-if(!exists("cnSet")) load(file.path(outdir, "cnSet.rda")) |
|
55 | 46 |
@ |
56 | 47 |
|
57 | 48 |
\section{Supported platforms} |
... | ... |
@@ -98,31 +89,25 @@ ocSamples(200) |
98 | 89 |
|
99 | 90 |
\section{The \Rclass{CNSet} container} |
100 | 91 |
|
101 |
-The \Rfunction{show} method provides a concise summary of the |
|
102 |
-\Robject{cnSet} object. |
|
92 |
+\subsection{Instantiating an object of class \Rclass{CNSet}} |
|
103 | 93 |
|
104 |
-<<show>>= |
|
105 |
-cnSet |
|
106 |
-@ |
|
94 |
+An object of class \Rclass{CNSet} can be instantiated by one of two |
|
95 |
+methods: |
|
107 | 96 |
|
108 |
-We briefly outline some of the unique aspects of the |
|
109 |
-\Rclass{CNSet}-class using in \crlmm{} that may differ from the more |
|
110 |
-standard extensions of the virtual class \Rclass{eSet} defined in the |
|
111 |
-\Rpackage{Biobase} package. |
|
97 |
+\begin{itemize} |
|
112 | 98 |
|
113 |
-\paragraph{Instantiating a \Rclass{CNSet}:} An object of class |
|
114 |
-\Rclass{CNSet} can be instantiated by one of two methods: |
|
115 |
-\begin{enumerate} |
|
116 |
-\item during the preprocessing of the raw intensities for Illumina and |
|
99 |
+\item[Approach 1:] during the preprocessing of the raw intensities for Illumina and |
|
117 | 100 |
Affymetrix arrays by the the functions \Rfunction{constructInf} and |
118 |
- \Rfunction{genotype}, respectively. |
|
101 |
+ \Rfunction{genotype}, respectively. (The \Rfunction{genotype} calls |
|
102 |
+ the non-exported function \Rfunction{constructAffy} to initialize a |
|
103 |
+ \Rclass{CNSet} object for Affymetrix platforms.) |
|
119 | 104 |
|
120 |
-\item by subsetting an existing \Robject{CNSet} object. As per usual, |
|
121 |
- the `[' method can be used to extract a subset of markers $i$ as in |
|
122 |
- `[i, ]', a subset of samples $j$ as in `[, j]', or a subset of |
|
123 |
- markers $i$ and samples $j$ as in `[i, j]'. |
|
105 |
+\item[Approach 2:] by subsetting an existing \Robject{CNSet} object. |
|
106 |
+ As per usual, the `[' method can be used to extract a subset of |
|
107 |
+ markers $i$ as in `[i, ]', a subset of samples $j$ as in `[, j]', or |
|
108 |
+ a subset of markers $i$ and samples $j$ as in `[i, j]'. |
|
124 | 109 |
|
125 |
-\end{enumerate} |
|
110 |
+\end{itemize} |
|
126 | 111 |
|
127 | 112 |
There are important differences in the underlying data representation |
128 | 113 |
depending on how the object was instantiated. In particular, objects |
... | ... |
@@ -131,36 +116,106 @@ generated by the functions \Rfunction{constructInf} and |
131 | 116 |
in memory through protocols defined in the \R{} package \ff{}. For |
132 | 117 |
instance, the normalized intensities and genotype calls in a |
133 | 118 |
\Rclass{CNSet}-instance from approach (1) are \ff{}-derived objects. |
134 |
-However, when such an object is subset by the `[' method, the data is |
|
135 |
-pulled from disk to memory and stored as an ordinary matrix in \R{}. |
|
136 |
-To illustrate, the \Robject{cnSet} loaded at the start of this session |
|
137 |
-was created by the function \Rfunction{genotype}. As the data is |
|
138 |
-stored on disk rather than in memory, we inspect attributes of the |
|
139 |
-object representing the normalized intensities for the A allele by |
|
140 |
-first opening the file connection and then extracting the |
|
141 |
-\ff{}-derived class of the object as well as where the data is stored |
|
142 |
-on disk. |
|
119 |
+By contrast, when such an objected generated by approach (1) is subset |
|
120 |
+by the `[' method, an object of the same class is returned but the |
|
121 |
+\Rpackage{ff}-derived objects are coerced to ordinary matrices. Note, |
|
122 |
+therefore, that both approaches (1) and (2) may involve substantial |
|
123 |
+I/O. |
|
143 | 124 |
|
144 |
-<<approach1>>= |
|
145 |
-invisible(open(cnSet)) |
|
146 |
-class(A(cnSet)) |
|
147 |
-filename(A(cnSet)) |
|
125 |
+\subsubsection{Approach 1} |
|
126 |
+ |
|
127 |
+To illustrate the first approach, we begin by specifying a local |
|
128 |
+directory to store output files and setting the \Rfunction{ldPath} |
|
129 |
+function with this path. |
|
130 |
+ |
|
131 |
+<<path>>= |
|
132 |
+if(getRversion() < "2.13.0"){ |
|
133 |
+ rpath <- getRversion() |
|
134 |
+} else rpath <- "trunk" |
|
135 |
+outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="") |
|
136 |
+ldPath(outdir) |
|
148 | 137 |
@ |
149 | 138 |
|
150 |
-Using approach (2), we construct a new \Robject{CNSet} object. |
|
151 | 139 |
|
152 |
-<<approach2>>= |
|
153 |
-cnset.subset <- cnSet[1:5, 1:10] |
|
154 |
-invisible(close(cnSet)) |
|
155 |
-class(A(cnset.subset)) |
|
140 |
+Next we load the annotation package required, as well as the \R{} |
|
141 |
+package \Rpackage{hapmapsnp6} that contains 3 example CEL files. We |
|
142 |
+use the \Rfunction{system.file} function to find the path to the CEL |
|
143 |
+files. |
|
144 |
+ |
|
145 |
+<<annotation_metadata>>= |
|
146 |
+require(genomewidesnp6Crlmm) & require(hapmapsnp6) |
|
147 |
+path <- system.file("celFiles", package="hapmapsnp6") |
|
148 |
+celfiles <- list.celfiles(path, full.names=TRUE) |
|
149 |
+@ |
|
150 |
+ |
|
151 |
+Typically, an object of class \Rclass{CNSet} is instantiated as part |
|
152 |
+of the preprocessing and genotyping by calling the function |
|
153 |
+\Rfunction{genotype}, as illustrated in the |
|
154 |
+\verb+AffymtrixPreprocessCN+ vignette. |
|
155 |
+ |
|
156 |
+<<instantiateToyExample,cache=TRUE>>= |
|
157 |
+exampleSet <- genotype(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6") |
|
156 | 158 |
@ |
157 | 159 |
|
158 |
-The files with \verb+.ff+ extension should not be removed or |
|
159 |
-relocated. The safest way to move these files if necessary is to |
|
160 |
-clone all of the \ff{} objects using the \Rfunction{clone}, followed |
|
161 |
-by the \Rfunction{delete} function to remove the original files on |
|
162 |
-disk. See the documentation in the \ff{} package for additional |
|
163 |
-details. |
|
160 |
+Several files with \verb+.ff+ extensions now appear in the directory |
|
161 |
+indicated by the \Rfunction{ldPath} function. |
|
162 |
+ |
|
163 |
+<<ldpath>>= |
|
164 |
+ldPath() |
|
165 |
+@ |
|
166 |
+ |
|
167 |
+One could also instantiate an object of class \Rclass{CNSet} without |
|
168 |
+preprocessing/genotyping by calling the non-exported function |
|
169 |
+\Rfunction{constructAffy} directly using the \Rfunction{:::} operator. |
|
170 |
+ |
|
171 |
+<<constructAffy,eval=FALSE>>= |
|
172 |
+tmp <- crlmm:::constructAffy(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6") |
|
173 |
+@ |
|
174 |
+ |
|
175 |
+The \Rfunction{show} method provides a concise summary of the |
|
176 |
+\Robject{exampleSet} object. Note the class of the elements in the |
|
177 |
+\verb+batchStatistics+ and \verb+assayData+ slots is indicated in the |
|
178 |
+first line of the summary. |
|
179 |
+ |
|
180 |
+<<show>>= |
|
181 |
+invisible(open(exampleSet)) |
|
182 |
+exampleSet |
|
183 |
+@ |
|
184 |
+ |
|
185 |
+% We briefly outline some of the unique aspects of the |
|
186 |
+% \Rclass{CNSet}-class using in \crlmm{} that may differ from the more |
|
187 |
+% standard extensions of the virtual class \Rclass{eSet} defined in |
|
188 |
+% the \Rpackage{Biobase} package. |
|
189 |
+ |
|
190 |
+As the \verb+assayData+ elements of the \Robject{exampleSet} object |
|
191 |
+are stored on disk rather than in memory, we inspect attributes of the |
|
192 |
+elements by first opening the file connection using the |
|
193 |
+\Rfunction{open}. For example, in the following code we extract the |
|
194 |
+normalized intensities for the A allele by first opening the object |
|
195 |
+returned by the \Rfunction{A} function. The name of the file where |
|
196 |
+the data is stored on disk is provided by the |
|
197 |
+\Rfunction{filename}. Finally, the normalized intensities can be |
|
198 |
+pulled from disk to memory by the `[' method. It can be useful to |
|
199 |
+wrap the `[' method by the \Rfunction{as.matrix} to ensure that the |
|
200 |
+output is the desired class. |
|
201 |
+ |
|
202 |
+<<approach1>>= |
|
203 |
+invisible(open(exampleSet)) |
|
204 |
+class(A(exampleSet)) |
|
205 |
+filename(A(exampleSet)) |
|
206 |
+as.matrix(A(exampleSet)[1:5, ]) |
|
207 |
+@ |
|
208 |
+ |
|
209 |
+\paragraph{Moving \texttt{*.ff} files} Ideally, the files with |
|
210 |
+\verb+.ff+ extension should not be moved. However, if this is not |
|
211 |
+possible, the safest way to move these files is to clone all of the |
|
212 |
+\ff{} objects using the \Rfunction{clone}, followed by the |
|
213 |
+\Rfunction{delete} function to remove the original files on disk. An |
|
214 |
+example of the \Rfunction{delete} function is included at the end of |
|
215 |
+the \verb+IlluminaPreprocessCN+ vignette. See the documentation for |
|
216 |
+the \Rfunction{clone} and \Rfunction{delete} functions in the \ff{} |
|
217 |
+package for additional details. |
|
218 |
+ |
|
164 | 219 |
|
165 | 220 |
\paragraph{Order of operations:} For \Rclass{CNSet}-instances derived |
166 | 221 |
by approach (1), users should be mindful of the substantial I/O when |
... | ... |
@@ -171,30 +226,49 @@ operation to emphasize the order of operations): |
171 | 226 |
|
172 | 227 |
<<orderOfOperations>>= |
173 | 228 |
##inefficient |
174 |
-A(cnSet[1:5, 1:5]) |
|
229 |
+##invisible(open(cnSet)) |
|
230 |
+A(exampleSet[1:5, ]) |
|
175 | 231 |
## preferred |
176 |
-(A(cnSet))[1:5, 1:5] |
|
232 |
+(A(exampleSet))[1:5, ] |
|
177 | 233 |
@ |
178 | 234 |
|
179 | 235 |
|
180 |
-\textbf{\texttt{featureData}} |
|
236 |
+\subsubsection{Approach 2: using `['} |
|
237 |
+ |
|
238 |
+Here we instantiate a new object of class \Rclass{CNSet} by applying the |
|
239 |
+`[' method to an existing object of class \Rclass{CNSet}. |
|
240 |
+ |
|
241 |
+<<approach2>>= |
|
242 |
+cnset.subset <- exampleSet[1:5, ] |
|
243 |
+@ |
|
244 |
+ |
|
245 |
+Note the class of the \verb+batchStatistics+ and \verb+assayData+ |
|
246 |
+elements of the \Robject{cnset.subset} object printed in the first |
|
247 |
+line of the summary. |
|
248 |
+ |
|
249 |
+<<show.subset>>= |
|
250 |
+show(cnset.subset) |
|
251 |
+@ |
|
252 |
+ |
|
253 |
+\subsection{Slots of class \Rclass{CNSet}} |
|
254 |
+\subsubsection{\texttt{featureData}} |
|
181 | 255 |
|
182 | 256 |
Information on physical position, chromosome, and whether the marker |
183 | 257 |
is a SNP can be accessed through accessors defined for the |
184 | 258 |
\verb+featureData+. |
185 | 259 |
|
186 | 260 |
<<featuredataAccessors>>= |
187 |
-fvarLabels(cnSet) |
|
188 |
-position(cnSet)[1:10] |
|
189 |
-chromosome(cnSet)[1:10] |
|
190 |
-is.snp <- isSnp(cnSet) |
|
261 |
+fvarLabels(exampleSet) |
|
262 |
+position(exampleSet)[1:10] |
|
263 |
+chromosome(exampleSet)[1:10] |
|
264 |
+is.snp <- isSnp(exampleSet) |
|
191 | 265 |
table(is.snp) |
192 | 266 |
snp.index <- which(is.snp) |
193 | 267 |
np.index <- which(!is.snp) |
194 |
-chr1.index <- which(chromosome(cnSet) == 1) |
|
268 |
+chr1.index <- which(chromosome(exampleSet) == 1) |
|
195 | 269 |
@ |
196 | 270 |
|
197 |
-\textbf{\texttt{assayData}} |
|
271 |
+\subsubsection{\texttt{assayData}} |
|
198 | 272 |
|
199 | 273 |
The \verb+assayData+ elements are of the class |
200 | 274 |
\Rclass{ff\_matrix}/\Rclass{ffdf} or \Rclass{matrix}, depending on how |
... | ... |
@@ -203,7 +277,7 @@ the \Rclass{CNSet} object was instantiated. Elements in the |
203 | 277 |
function. |
204 | 278 |
|
205 | 279 |
<<assayData>>= |
206 |
-ls(assayData(cnSet)) |
|
280 |
+ls(assayData(exampleSet)) |
|
207 | 281 |
@ |
208 | 282 |
|
209 | 283 |
The normalized intensities for the A and B alleles have names |
... | ... |
@@ -216,7 +290,7 @@ but can be tranlated to the probability scale using the \R{} function |
216 | 290 |
\Rfunction{i2p}. For example, |
217 | 291 |
|
218 | 292 |
<<i2p>>= |
219 |
-scores <- as.matrix(snpCallProbability(cnSet)[1:5, 1:2]) |
|
293 |
+scores <- as.matrix(snpCallProbability(exampleSet)[1:5, 1:2]) |
|
220 | 294 |
i2p(scores) |
221 | 295 |
@ |
222 | 296 |
|
... | ... |
@@ -227,15 +301,16 @@ platform. A consequence of keeping the rows of the assay data |
227 | 301 |
elements the same for all of the statistical summaries is that the |
228 | 302 |
matrix used to store genotype calls is larger than necessary. |
229 | 303 |
|
230 |
-Note that NA's are stored in the slot for normalized 'B' allele |
|
231 |
-intensities: |
|
304 |
+% only true for affy. for illumina, we have intensities stored in 'B' |
|
305 |
+%Note that NA's are stored in the slot for normalized 'B' allele |
|
306 |
+%intensities: |
|
232 | 307 |
|
233 |
-<<B.NAs>>= |
|
234 |
-np.index <- which(!is.snp) |
|
235 |
-stopifnot(all(is.na(B(cnSet)[np.index, ]))) |
|
236 |
-@ |
|
308 |
+%<<B.NAs>>= |
|
309 |
+%np.index <- which(!is.snp) |
|
310 |
+%stopifnot(all(is.na(B(cnSet)[np.index, ]))) |
|
311 |
+%@ |
|
237 | 312 |
|
238 |
-\textbf{\texttt{batch} and \texttt{batchStatistics}} |
|
313 |
+\subsubsection{\texttt{batch} and \texttt{batchStatistics}} |
|
239 | 314 |
|
240 | 315 |
As defined in Leek \textit{et al.} 2010, \textit{Batch effects are |
241 | 316 |
sub-groups of measurements that have qualitatively different |
... | ... |
@@ -260,7 +335,7 @@ to access the \verb+batch+ information on the samples as in the |
260 | 335 |
following example. |
261 | 336 |
|
262 | 337 |
<<batchAccessor>>= |
263 |
-table(batch(cnSet)) |
|
338 |
+batch(exampleSet) |
|
264 | 339 |
@ |
265 | 340 |
|
266 | 341 |
For the \verb+batchStatistics+ slot, the elements in the environment |
... | ... |
@@ -272,7 +347,7 @@ the elements in the environment can be list using the \R{} function |
272 | 347 |
\Rfunction{ls}. |
273 | 348 |
|
274 | 349 |
<<batchStatistics>>= |
275 |
-ls(batchStatistics(cnSet)) |
|
350 |
+ls(batchStatistics(exampleSet)) |
|
276 | 351 |
@ |
277 | 352 |
|
278 | 353 |
Currently, the batch-specific summaries are stored to allow some |
... | ... |
@@ -292,7 +367,7 @@ intended to be accessed directly by the user. |
292 | 367 |
%median variance (across markers) within a batch. ) The elements of the |
293 | 368 |
%slot can be listed as follows. |
294 | 369 |
|
295 |
-\textbf{\texttt{phenoData}} |
|
370 |
+\subsubsection{\texttt{phenoData}} |
|
296 | 371 |
|
297 | 372 |
Sample-level summaries obtained during the preprocessing/genotyping |
298 | 373 |
steps include skew (SKW), the signal to noise ratio (SNR), and gender |
... | ... |
@@ -303,26 +378,29 @@ approach (1), these elements are of the class |
303 | 378 |
\Rclass{ff\_vector}. |
304 | 379 |
|
305 | 380 |
<<phenodataAccessors>>= |
306 |
-varLabels(cnSet) |
|
307 |
-class(cnSet$gender) |
|
308 |
-invisible(open(cnSet$gender)) |
|
309 |
-cnSet$gender |
|
381 |
+varLabels(exampleSet) |
|
382 |
+class(exampleSet$gender) |
|
383 |
+invisible(open(exampleSet$gender)) |
|
384 |
+exampleSet$gender |
|
310 | 385 |
@ |
311 | 386 |
|
312 | 387 |
The `[' methods without arguments can be used to coerce to a vector. |
313 | 388 |
|
314 | 389 |
<<vector>>= |
315 |
-cnSet$gender[] |
|
316 |
-invisible(close(cnSet$gender)) |
|
390 |
+c("male", "female")[exampleSet$gender[]] |
|
391 |
+invisible(close(exampleSet$gender)) |
|
317 | 392 |
@ |
318 | 393 |
|
319 |
-\textbf{\texttt{protocolData}} |
|
394 |
+ |
|
395 |
+ |
|
396 |
+ |
|
397 |
+\subsubsection{\texttt{protocolData}} |
|
320 | 398 |
|
321 | 399 |
The scan date of the arrays are stored in the \verb+protocolData+. |
322 | 400 |
|
323 | 401 |
<<protocolData>>= |
324 |
-varLabels(protocolData(cnSet)) |
|
325 |
-protocolData(cnSet)$ScanDate[1:10] |
|
402 |
+varLabels(protocolData(exampleSet)) |
|
403 |
+protocolData(exampleSet)$ScanDate |
|
326 | 404 |
@ |
327 | 405 |
|
328 | 406 |
|
... | ... |
@@ -388,7 +466,28 @@ protocolData(cnSet)$ScanDate[1:10] |
388 | 466 |
%useful for assessing the biallelic genotype calls. This section of |
389 | 467 |
%the vignette is currently under development. |
390 | 468 |
|
391 |
-\section{Trouble shooting} |
|
469 |
+\section{Trouble shooting with a HapMap example} |
|
470 |
+ |
|
471 |
+This section uses an object of class \Rclass{CNSet} instantiated by |
|
472 |
+the \verb+AffymetrixPreprocessCN+ vignette and saved to a local path |
|
473 |
+on our computing cluster indicated by the object \Robject{outdir} |
|
474 |
+below. The \verb+copynumber+ vignette was used to fill out the |
|
475 |
+\verb+batchStatistics+ slot of the \Robject{cnSet} object. |
|
476 |
+ |
|
477 |
+<<path>>= |
|
478 |
+if(getRversion() < "2.13.0"){ |
|
479 |
+ rpath <- getRversion() |
|
480 |
+} else rpath <- "trunk" |
|
481 |
+outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="") |
|
482 |
+@ |
|
483 |
+ |
|
484 |
+Next, we load the \Robject{cnSet} object. |
|
485 |
+ |
|
486 |
+<<loadcnset>>= |
|
487 |
+if(!exists("cnSet")) load(file.path(outdir, "cnSet.rda")) |
|
488 |
+invisible(open(cnSet)) |
|
489 |
+@ |
|
490 |
+ |
|
392 | 491 |
|
393 | 492 |
\subsection{Missing values} |
394 | 493 |
|
... | ... |
@@ -406,12 +505,15 @@ protocolData(cnSet)$ScanDate[1:10] |
406 | 505 |
Most often, missing values occur when the genotype confidence scores |
407 | 506 |
for a SNP were below the threshold used by the |
408 | 507 |
\Robject{crlmmCopynumber} function. For the HapMap analysis, we used a |
409 |
-confidence threshold of 0.80 (the default). |
|
508 |
+confidence threshold of 0.80 (the default). In the following code, we |
|
509 |
+assess NA's appearing for the raw copy number estimates for the first |
|
510 |
+10 samples. |
|
410 | 511 |
|
411 | 512 |
<<NA>>= |
412 | 513 |
GT.CONF.THR <- 0.80 |
413 | 514 |
autosome.index <- which(isSnp(cnSet) & chromosome(cnSet) < 23) |
414 |
-sample.index <- which(batch(cnSet) == "C") |
|
515 |
+sample.index <- 1:10 |
|
516 |
+ct <- totalCopynumber(cnSet, i=autosome.index, j=sample.index) |
|
415 | 517 |
ca <- CA(cnSet, i=autosome.index, j=sample.index) |
416 | 518 |
cb <- CB(cnSet, i=autosome.index, j=sample.index) |
417 | 519 |
missing.ca <- is.na(ca) |
... | ... |
@@ -427,9 +529,9 @@ threshold specified in \Robject{crlmmCopynumber}. |
427 | 529 |
|
428 | 530 |
<<NAconfidenceScores>>= |
429 | 531 |
if(nmissing.ca > 0){ |
430 |
- invisible(open(snpCallProbability(cnSet))) |
|
532 |
+ ##invisible(open(snpCallProbability(cnSet))) |
|
431 | 533 |
gt.conf <- i2p(snpCallProbability(cnSet)[autosome.index, sample.index]) |
432 |
- invisible(close(snpCallProbability(cnSet))) |
|
534 |
+ ##invisible(close(snpCallProbability(cnSet))) |
|
433 | 535 |
below.thr <- gt.conf < GT.CONF.THR |
434 | 536 |
index.allbelow <- as.integer(which(rowSums(below.thr) == length(sample.index))) |
435 | 537 |
nmissingBecauseOfGtThr <- length(index.allbelow) * length(sample.index) |
... | ... |
@@ -451,13 +553,15 @@ missing values to the number of samples to check whether all of the |
451 | 553 |
estimates are missing for a given SNP. |
452 | 554 |
|
453 | 555 |
<<NAchromosomeX>>= |
556 |
+## start with first batch |
|
557 |
+sample.index <- 1:90 |
|
454 | 558 |
X.index <- which(isSnp(cnSet) & chromosome(cnSet) == 23) |
455 | 559 |
ca.X <- CA(cnSet, i=X.index, j=sample.index) |
456 | 560 |
missing.caX <- is.na(ca.X) |
457 | 561 |
(nmissing.caX <- sum(missing.caX)) |
458 | 562 |
missing.snp.index <- which(rowSums(missing.caX) == length(sample.index)) |
459 | 563 |
index <- which(rowSums(missing.caX) == length(sample.index)) |
460 |
-length(index)*length(sample.index)/nmissing.caX |
|
564 |
+##length(index)*length(sample.index)/nmissing.caX |
|
461 | 565 |
@ |
462 | 566 |
|
463 | 567 |
From the above codechunk, we see that \Sexpr{length(index)} SNPs have |
... | ... |
@@ -468,21 +572,15 @@ from SNPs in which either the men or the women had confidence scores |
468 | 572 |
that were all below the threshold. |
469 | 573 |
|
470 | 574 |
<<NAcrlmmConfidenceScore>>= |
471 |
-if(nmissing.caX > 0){ |
|
472 |
- invisible(open(snpCallProbability(cnSet))) |
|
473 |
- gt.conf <- i2p(snpCallProbability(cnSet)[X.index, sample.index]) |
|
474 |
- invisible(close(snpCallProbability(cnSet))) |
|
475 |
- below.thr <- gt.conf < GT.CONF.THR |
|
476 |
- F <- which(cnSet$gender[sample.index] == 2) |
|
477 |
- M <- which(cnSet$gender[sample.index] == 1) |
|
478 |
- ##nus <- nuA(cnSet)[X.index, "C"] |
|
479 |
- index.allbelowF <- as.integer(which(rowSums(below.thr[, F]) == length(F))) |
|
480 |
- index.allbelowM <- as.integer(which(rowSums(below.thr[, M]) == length(M))) |
|
481 |
- index.allbelow <- as.integer(unique(c(index.allbelowF, index.allbelowM))) |
|
482 |
- ##all.equal(index, index.allbelow) |
|
483 |
- ## or calculate the proportion of missing effected by low crlmm confidence |
|
484 |
- sum(index.allbelow %in% index)/length(index) |
|
485 |
-} |
|
575 |
+invisible(open(cnSet$gender)) |
|
576 |
+F <- which(cnSet$gender[sample.index] == 2) |
|
577 |
+M <- which(cnSet$gender[sample.index] == 1) |
|
578 |
+gt.conf <- i2p(snpCallProbability(cnSet)[X.index, sample.index]) |
|
579 |
+below.thr <- gt.conf < GT.CONF.THR |
|
580 |
+index.allbelowF <- as.integer(which(rowSums(below.thr[, F]) == length(F))) |
|
581 |
+index.allbelowM <- as.integer(which(rowSums(below.thr[, M]) == length(M))) |
|
582 |
+all.equal(index.allbelowF, index.allbelowM) |
|
583 |
+all.equal(as.integer(index.allbelowF), as.integer(index)) |
|
486 | 584 |
@ |
487 | 585 |
|
488 | 586 |
For nonpolymorphic loci, the genotype confidence scores are irrelevant |
... | ... |
@@ -519,8 +617,9 @@ detection and removal may be explored in the future. |
519 | 617 |
Copy number estimates for other chromosomes, such as mitochondrial and |
520 | 618 |
chromosome Y, are not currently available in \crlmm{}. |
521 | 619 |
|
522 |
-<<echo=FALSE>>= |
|
620 |
+<<close>>= |
|
523 | 621 |
invisible(close(cnSet)) |
622 |
+invisible(close(cnSet$gender)) |
|
524 | 623 |
@ |
525 | 624 |
|
526 | 625 |
|
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54326 bc3139a8-67e5-0310-9ffc-ced21a209358
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54171 bc3139a8-67e5-0310-9ffc-ced21a209358
1 | 1 |
new file mode 100644 |
... | ... |
@@ -0,0 +1,540 @@ |
1 |
+%\VignetteIndexEntry{Infrastructure for crlmm copy number Vignette} |
|
2 |
+%\VignetteDepends{crlmm, genomewidesnp6Crlmm} |
|
3 |
+%\VignetteKeywords{crlmm, SNP 6} |
|
4 |
+%\VignettePackage{crlmm} |
|
5 |
+\documentclass{article} |
|
6 |
+\usepackage{graphicx} |
|
7 |
+\usepackage{natbib} |
|
8 |
+\usepackage{url} |
|
9 |
+\newcommand{\Rfunction}[1]{{\texttt{#1}}} |
|
10 |
+\newcommand{\Rmethod}[1]{{\texttt{#1}}} |
|
11 |
+\newcommand{\Rcode}[1]{{\texttt{#1}}} |
|
12 |
+\newcommand{\Robject}[1]{{\texttt{#1}}} |
|
13 |
+\newcommand{\Rpackage}[1]{{\textsf{#1}}} |
|
14 |
+\newcommand{\Rclass}[1]{{\textit{#1}}} |
|
15 |
+\newcommand{\oligo}{\Rpackage{oligo }} |
|
16 |
+\newcommand{\R}{\textsf{R}} |
|
17 |
+\newcommand{\crlmm}{\Rpackage{crlmm}} |
|
18 |
+\newcommand{\ff}{\Rpackage{ff}} |
|
19 |
+\usepackage[margin=1in]{geometry} |
|
20 |
+ |
|
21 |
+\begin{document} |
|
22 |
+\title{Infrastructure for copy number analysis in \crlmm{}} |
|
23 |
+\date{\today} |
|
24 |
+\author{Rob Scharpf} |
|
25 |
+\maketitle |
|
26 |
+ |
|
27 |
+ |
|
28 |
+\section{Set up} |
|
29 |
+ |
|
30 |
+As in the previous vignettes, we load the required libraries and |
|
31 |
+specify a path for storing output files. |
|
32 |
+ |
|
33 |
+<<libraries,results=hide>>= |
|
34 |
+library(ff) |
|
35 |
+library(crlmm) |
|
36 |
+library(cacheSweave) |
|
37 |
+require(DNAcopy) |
|
38 |
+require(VanillaICE) |
|
39 |
+if(getRversion() < "2.13.0"){ |
|
40 |
+ rpath <- getRversion() |
|
41 |
+} else rpath <- "trunk" |
|
42 |
+outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="") |
|
43 |
+@ |
|
44 |
+ |
|
45 |
+<<ram>>= |
|
46 |
+ldPath(outdir) |
|
47 |
+setCacheDir(outdir) |
|
48 |
+@ |
|
49 |
+ |
|
50 |
+We begin by loading the \Robject{cnSet} object created by the |
|
51 |
+\verb+AffymetrixPreprocessCN+ vignette. |
|
52 |
+ |
|
53 |
+<<loadcnset>>= |
|
54 |
+if(!exists("cnSet")) load(file.path(outdir, "cnSet.rda")) |
|
55 |
+@ |
|
56 |
+ |
|
57 |
+\section{Supported platforms} |
|
58 |
+ |
|
59 |
+The supported Affymetrix and Infinium platforms are those for which a |
|
60 |
+corresponding annotation package is available. The annotation |
|
61 |
+packages contain information on the markers, such as physical position |
|
62 |
+and chromosome, as well as pre-computed parameters estimated from |
|
63 |
+HapMap used during the preprocessing and genotyping steps. For |
|
64 |
+Affymetrix, the 5.0 and 6.0 platforms are supported and the |
|
65 |
+corresponding annotation packages are \Rpackage{genomewidesnp5Crlmm} |
|
66 |
+and \Rpackage{genomewidesnp6Crlmm}. Supported Infinium platforms are |
|
67 |
+listed in the following code chunk. |
|
68 |
+ |
|
69 |
+<<supportedPlatforms>>= |
|
70 |
+pkgs <- annotationPackages() |
|
71 |
+crlmm.pkgs <- pkgs[grep("Crlmm", pkgs)] |
|
72 |
+crlmm.pkgs[grep("human", crlmm.pkgs)] |
|
73 |
+@ |
|
74 |
+ |
|
75 |
+\textbf{Large data:} In order to reduce \crlmm{}'s memory footprint |
|
76 |
+for copy number estimation, we require the \Rpackage{ff}. The |
|
77 |
+\Rpackage{ff} package provides infrastructure for accessing and |
|
78 |
+writing data to disk instead of keeping data in memory. As the |
|
79 |
+functions for preprocessing, genotyping, and copy number estimation do |
|
80 |
+not simultaneously require all samples and all probes in memory, |
|
81 |
+memory-usage by \crlmm{} can be fine-tuned by reading in and |
|
82 |
+processing subsets of the markers and/or samples. The functions |
|
83 |
+\Rfunction{ocSamples} and \Rfunction{ocProbesets} in the |
|
84 |
+\Rpackage{oligoClasses} package can be used to declare how many |
|
85 |
+markers and samples to read at once. In general, specifying smaller |
|
86 |
+values should reduce the RAM required for a particular job. In |
|
87 |
+general, smaller values will increase the run-time. In the following |
|
88 |
+code-chunk, we declare that \crlmm{} should process 150,000 markers at |
|
89 |
+a time (when possible) and 500 samples at a time. If our dataset |
|
90 |
+contained fewer than 500 samples, the \Rfunction{ocSamples} option |
|
91 |
+would not have any effect. One can view the current settings for |
|
92 |
+these commands, by typing the functions without an argument. |
|
93 |
+ |
|
94 |
+<<ram>>= |
|
95 |
+ocProbesets(50e3) |
|
96 |
+ocSamples(200) |
|
97 |
+@ |
|
98 |
+ |
|
99 |
+\section{The \Rclass{CNSet} container} |
|
100 |
+ |
|
101 |
+The \Rfunction{show} method provides a concise summary of the |
|
102 |
+\Robject{cnSet} object. |
|
103 |
+ |
|
104 |
+<<show>>= |
|
105 |
+cnSet |
|
106 |
+@ |
|
107 |
+ |
|
108 |
+We briefly outline some of the unique aspects of the |
|
109 |
+\Rclass{CNSet}-class using in \crlmm{} that may differ from the more |
|
110 |
+standard extensions of the virtual class \Rclass{eSet} defined in the |
|
111 |
+\Rpackage{Biobase} package. |
|
112 |
+ |
|
113 |
+\paragraph{Instantiating a \Rclass{CNSet}:} An object of class |
|
114 |
+\Rclass{CNSet} can be instantiated by one of two methods: |
|
115 |
+\begin{enumerate} |
|
116 |
+\item during the preprocessing of the raw intensities for Illumina and |
|
117 |
+ Affymetrix arrays by the the functions \Rfunction{constructInf} and |
|
118 |
+ \Rfunction{genotype}, respectively. |
|
119 |
+ |
|
120 |
+\item by subsetting an existing \Robject{CNSet} object. As per usual, |
|
121 |
+ the `[' method can be used to extract a subset of markers $i$ as in |
|
122 |
+ `[i, ]', a subset of samples $j$ as in `[, j]', or a subset of |
|
123 |
+ markers $i$ and samples $j$ as in `[i, j]'. |
|
124 |
+ |
|
125 |
+\end{enumerate} |
|
126 |
+ |
|
127 |
+There are important differences in the underlying data representation |
|
128 |
+depending on how the object was instantiated. In particular, objects |
|
129 |
+generated by the functions \Rfunction{constructInf} and |
|
130 |
+\Rfunction{genotype} store high-dimensional data on disk rather than |
|
131 |
+in memory through protocols defined in the \R{} package \ff{}. For |
|
132 |
+instance, the normalized intensities and genotype calls in a |
|
133 |
+\Rclass{CNSet}-instance from approach (1) are \ff{}-derived objects. |
|