git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54171 bc3139a8-67e5-0310-9ffc-ced21a209358
1 | 1 |
deleted file mode 100755 |
... | ... |
@@ -1,287 +0,0 @@ |
1 |
-%\VignetteIndexEntry{crlmm copy number Vignette for Illumina} |
|
2 |
-%\VignetteDepends{crlmm} |
|
3 |
-%\VignetteKeywords{crlmm, illumina} |
|
4 |
-%\VignettePackage{crlmm} |
|
5 |
-\documentclass{article} |
|
6 |
-\usepackage{graphicx} |
|
7 |
-\usepackage{natbib} |
|
8 |
-\usepackage[margin=1in]{geometry} |
|
9 |
-\newcommand{\crlmm}{\Rpackage{crlmm}} |
|
10 |
-\newcommand{\Rfunction}[1]{{\texttt{#1}}} |
|
11 |
-\newcommand{\Rmethod}[1]{{\texttt{#1}}} |
|
12 |
-\newcommand{\Rcode}[1]{{\texttt{#1}}} |
|
13 |
-\newcommand{\Robject}[1]{{\texttt{#1}}} |
|
14 |
-\newcommand{\Rpackage}[1]{{\textsf{#1}}} |
|
15 |
-\newcommand{\Rclass}[1]{{\textit{#1}}} |
|
16 |
-\newcommand{\oligo}{\Rpackage{oligo }} |
|
17 |
-\newcommand{\R}{\textsf{R}} |
|
18 |
-\newcommand{\ff}{\Rpackage{ff}} |
|
19 |
- |
|
20 |
-\begin{document} |
|
21 |
-\title{Using \crlmm{} for copy number estimation and genotype calling |
|
22 |
- with Illumina platforms} |
|
23 |
- |
|
24 |
-\date{\today} |
|
25 |
- |
|
26 |
-\author{Rob Scharpf} |
|
27 |
-\maketitle |
|
28 |
- |
|
29 |
- |
|
30 |
-\begin{abstract} |
|
31 |
- This vignette illustrates the steps required prior to copy number |
|
32 |
- analysis for Infinium platforms. Specifically, we require |
|
33 |
- construction of a container to store processed forms of the raw |
|
34 |
- data, preprocessing to normalize the arrays, and genotyping using |
|
35 |
- the CRLMM algorithm. After completing these steps, users can refer |
|
36 |
- to the copy number vignette. |
|
37 |
-\end{abstract} |
|
38 |
- |
|
39 |
-\section{Setup} |
|
40 |
- |
|
41 |
-<<crlmm, results=hide, echo=FALSE>>= |
|
42 |
-library(crlmm) |
|
43 |
-options(width=70) |
|
44 |
-options(continue=" ") |
|
45 |
-@ |
|
46 |
- |
|
47 |
-%\textbf{Supported platforms:} The supported Infinium platforms are |
|
48 |
-%those for which a corresponding annotation package is available. The |
|
49 |
-%annotation packages contain information on the markers, such as |
|
50 |
-%physical position and chromosome, as well as pre-computed parameters |
|
51 |
-%estimated from HapMap used during the preprocessing and genotyping |
|
52 |
-%steps. Currently supported Infinium platforms are listed in the |
|
53 |
-%following code chunk. |
|
54 |
-The following codechunk declares a directory for saving \Robject{ff} |
|
55 |
-files that will contain the normalized intensities and the genotype |
|
56 |
-calls. |
|
57 |
- |
|
58 |
-<<ldpath,results=hide>>= |
|
59 |
-library(ff) |
|
60 |
-if(getRversion() < "2.13.0"){ |
|
61 |
- rpath <- getRversion() |
|
62 |
-} else rpath <- "trunk" |
|
63 |
-outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/illumina_vignette", sep="") |
|
64 |
-ldPath(outdir) |
|
65 |
-dir.create(outdir, recursive=TRUE, showWarnings=FALSE) |
|
66 |
-@ |
|
67 |
- |
|
68 |
-We will also store cached computations in the directory \verb+outdir+. |
|
69 |
- |
|
70 |
-<<cacheSweave, echo=FALSE, results=hide>>= |
|
71 |
-library(cacheSweave) |
|
72 |
-setCacheDir(outdir) |
|
73 |
-@ |
|
74 |
- |
|
75 |
-We declare that \crlmm{} should process 150,000 markers at a time |
|
76 |
-(when possible) and/or 500 samples at a time. As our example dataset |
|
77 |
-in this vignette contains fewer than 500 samples, all samples will be |
|
78 |
-processed simultaneously. |
|
79 |
- |
|
80 |
-<<ram>>= |
|
81 |
-ocProbesets(150e3) |
|
82 |
-ocSamples(500) |
|
83 |
-@ |
|
84 |
- |
|
85 |
-\textbf{Limitations:} There is no minimum number of samples required |
|
86 |
-for preprocessing and genotyping. However, for copy number estimation |
|
87 |
-the \crlmm{} package currently requires at least 10 samples per batch. |
|
88 |
-The parameter estimates for copy number and the corresponding |
|
89 |
-estimates of raw copy number will tend to be more noisy for batches |
|
90 |
-with small sample sizes (e.g., $<$ 50). Chemistry plate or scan date |
|
91 |
-are often useful surrogates for batch. |
|
92 |
- |
|
93 |
-\section{Initializing a container for storing processed data} |
|
94 |
- |
|
95 |
-This section will initialize a container for storing processed forms |
|
96 |
-of the data, including the normalized intensities for the A and B |
|
97 |
-alleles and the CRLMM genotype calls and confidence scores. In |
|
98 |
-addition, the container will store information on the markers |
|
99 |
-(physical position, chromosome, and a SNP indicator), the batch, and |
|
100 |
-the samples (e.g., gender). To construct this container for Infinium |
|
101 |
-platforms, several steps are required. |
|
102 |
- |
|
103 |
-We begin by specifying the path containing the raw IDAT files for a |
|
104 |
-set of samples from the Infinium 370k platform. |
|
105 |
- |
|
106 |
-<<datadir>>= |
|
107 |
-datadir <- "/thumper/ctsa/snpmicroarray/illumina/IDATS/370k" |
|
108 |
-@ |
|
109 |
- |
|
110 |
-For Infinium platforms, an Illumina sample sheet containing |
|
111 |
-information for reading the raw IDAT files is required. Please refer |
|
112 |
-to the BeadStudio Genotyping guide, Appendix A, for additional |
|
113 |
-information. The following code reads in the samplesheet for the IDAT |
|
114 |
-files on our local server. For our dataset, the file extensions are |
|
115 |
-`Grn.dat' and `Red.idat'. We store the complete path to the filename |
|
116 |
-without the file extension in the \Robject{arrayNames} and check that |
|
117 |
-all of the green and red IDAT files exists. |
|
118 |
- |
|
119 |
-<<samplesheet>>= |
|
120 |
-samplesheet = read.csv(file.path(datadir, "HumanHap370Duo_Sample_Map.csv"), header=TRUE, as.is=TRUE) |
|
121 |
-samplesheet <- samplesheet[-c(28:46,61:75,78:79), ] |
|
122 |
-arrayNames <- file.path(datadir, unique(samplesheet[, "SentrixPosition"])) |
|
123 |
-all(file.exists(paste(arrayNames, "_Grn.idat", sep=""))) |
|
124 |
-all(file.exists(paste(arrayNames, "_Red.idat", sep=""))) |
|
125 |
-arrayInfo <- list(barcode=NULL, position="SentrixPosition") |
|
126 |
-@ |
|
127 |
- |
|
128 |
-As discussed previously, all supported platforms have a corresponding |
|
129 |
-annotation package. The appropriate annotation package is specified |
|
130 |
-by the platform identifier without the \verb+Crlmm+ postfix. |
|
131 |
- |
|
132 |
-<<cdfname>>= |
|
133 |
-cdfName <- "human370v1c" |
|
134 |
-@ |
|
135 |
- |
|
136 |
-Next, we construct a character vector that specifies the batch for |
|
137 |
-each of the \Sexpr{length(arrayNames)} arrays. Here, we have a small |
|
138 |
-dataset and process the samples in a single batch. Processing the |
|
139 |
-samples as a single batch is generally reasonable if the samples were |
|
140 |
-processed at similar times (e.g., within a few weeks). |
|
141 |
- |
|
142 |
-<<batch>>= |
|
143 |
-batch <- rep("1", nrow(samplesheet)) |
|
144 |
-@ |
|
145 |
- |
|
146 |
-Finally, we initialize an object of class \Robject{CNSet} using the |
|
147 |
-function \Rfunction{constructInf}. |
|
148 |
- |
|
149 |
-<<container,cache=TRUE>>= |
|
150 |
-cnSet <- constructInf(sampleSheet=samplesheet, |
|
151 |
- arrayNames=arrayNames, |
|
152 |
- batch=batch, |
|
153 |
- arrayInfoColNames=arrayInfo, |
|
154 |
- cdfName=cdfName, |
|
155 |
- verbose=TRUE, |
|
156 |
- saveDate=TRUE) |
|
157 |
-@ |
|
158 |
- |
|
159 |
-A concise summary of the object's contents can be viewed with the |
|
160 |
-\Rfunction{print} function. |
|
161 |
- |
|
162 |
-<<cnset>>= |
|
163 |
-print(cnSet) |
|
164 |
-@ |
|
165 |
- |
|
166 |
-Note that the above object does not yet contain any processed data |
|
167 |
-(only \verb+NA+'s) and several \verb+.ff+ files now appear in the |
|
168 |
-\verb+outdir+. Again, these files should not be removed. |
|
169 |
- |
|
170 |
-<<listff>>= |
|
171 |
-list.files(outdir, pattern=".ff")[1:5] |
|
172 |
-@ |
|
173 |
- |
|
174 |
-Finally, note that the elements of the \verb+assayData+ slot are |
|
175 |
-\Robject{ff} objects and not matrices. For the most part, the |
|
176 |
-\emph{appearance} that the data is stored in memory is preserved. |
|
177 |
- |
|
178 |
- |
|
179 |
-<<assaydataelements>>= |
|
180 |
-sapply(assayData(cnSet), function(x) class(x)[1]) |
|
181 |
-@ |
|
182 |
- |
|
183 |
-\section{Preprocessing} |
|
184 |
- |
|
185 |
-The raw intensities from the Infinium IDAT files are read and |
|
186 |
-normalized using the function \Rfunction{preprocessInf}. The function |
|
187 |
-\Rfunction{preprocessInf} returns a \Robject{ff} object containing the |
|
188 |
-parameters for the mixture model used by the CRLMM genotyping |
|
189 |
-algorithm. |
|
190 |
- |
|
191 |
-<<preprocess,cache=TRUE>>= |
|
192 |
-mixtureParams <- preprocessInf(cnSet=cnSet, sampleSheet=samplesheet, arrayNames=arrayNames, arrayInfoColNames=arrayInfo) |
|
193 |
-invisible(open(mixtureParams)) |
|
194 |
-str(mixtureParams[]) |
|
195 |
-invisible(close(mixtureParams)) |
|
196 |
-@ |
|
197 |
- |
|
198 |
-Note that the normalized intensities for the A and B alleles are no |
|
199 |
-longer \verb+NA+s and can now be accessed and inspected using the |
|
200 |
-methods \Rfunction{A} and \Rfunction{B}, respectively. |
|
201 |
- |
|
202 |
-<<intensities>>= |
|
203 |
-invisible(open(A(cnSet))) |
|
204 |
-invisible(open(B(cnSet))) |
|
205 |
-as.matrix(A(cnSet)[1:5, 1:5]) |
|
206 |
-as.matrix(B(cnSet)[1:5, 1:5]) |
|
207 |
-invisible(close(A(cnSet))) |
|
208 |
-invisible(close(B(cnSet))) |
|
209 |
-@ |
|
210 |
- |
|
211 |
- |
|
212 |
-\section{Genotyping} |
|
213 |
- |
|
214 |
- |
|
215 |
-CRLMM genotype calls and confidence scores are estimated using the |
|
216 |
-function \Rfunction{genotypeInf}. |
|
217 |
- |
|
218 |
-<<genotype,cache=TRUE>>= |
|
219 |
-updated <- genotypeInf(cnSet, mixtureParams=mixtureParams) |
|
220 |
-print(updated) |
|
221 |
-@ |
|
222 |
- |
|
223 |
-The posterior probabilities for the |
|
224 |
-genotype calls in the \verb+callProbability+ element of the |
|
225 |
-\verb+assayData+ are stored as integers to reduce the file size on |
|
226 |
-disk. However, the scores can be easily transformed back to the |
|
227 |
-probability scale using the \Rfunction{i2p} function as illustrated in |
|
228 |
-the following code chunk. |
|
229 |
- |
|
230 |
-<<callprobs>>= |
|
231 |
-invisible(open(snpCallProbability(cnSet))) |
|
232 |
-callProbs <- as.matrix(snpCallProbability(cnSet)[1:5, 1:5]) |
|
233 |
-i2p(callProbs) |
|
234 |
-invisible(close(snpCallProbability(cnSet))) |
|
235 |
-@ |
|
236 |
- |
|
237 |
-%As described in the \texttt{copynumber} vignette, copy number |
|
238 |
-%estimation in \crlmm{} works best when there are a sufficient number |
|
239 |
-%of samples such that AA, AB, and BB genotypes are observed at most |
|
240 |
-%loci. For small studies (e.g., fewer than 50 samples), there will be a |
|
241 |
-%large number of SNPs that are monomorphic. For monomorphic SNPs, the |
|
242 |
-%estimation problem becomes more difficult and alternative strategies |
|
243 |
-%that estimate the relative total copy number may be preferable. In |
|
244 |
-%addition to installing \crlmm{}, one must also install the appropriate |
|
245 |
-%annotation package for the Illumina platform. In the following code, |
|
246 |
-%we list the platforms for which annotation packages are currently |
|
247 |
-%available. Next we create a directory where output files will be |
|
248 |
-%stored and indicate the directory that contains the IDAT files that |
|
249 |
-%will be used in our analysis. |
|
250 |
- |
|
251 |
-\textbf{Wrapper:} The construction of a \Rclass{CNSet} object, |
|
252 |
-preprocessing, and genotype calling are wrapped into a convenience |
|
253 |
-function called \Rfunction{genotype.Illumina}. |
|
254 |
- |
|
255 |
-<<genotype.Illumina,cache=TRUE,results=hide>>= |
|
256 |
-cnSet2 <- genotype.Illumina(sampleSheet=samplesheet, |
|
257 |
- arrayNames=arrayNames, |
|
258 |
- arrayInfoColNames=arrayInfo, |
|
259 |
- cdfName="human370v1c", |
|
260 |
- batch=batch) |
|
261 |
-@ |
|
262 |
- |
|
263 |
-<<checkIdenticalCalls>>= |
|
264 |
-invisible(open(calls(cnSet))) |
|
265 |
-invisible(open(calls(cnSet2))) |
|
266 |
-snp.index <- which(isSnp(cnSet)) |
|
267 |
-identical(calls(cnSet)[snp.index, 1:20], calls(cnSet2)[snp.index, 1:20]) |
|
268 |
-invisible(close(calls(cnSet))) |
|
269 |
-invisible(close(calls(cnSet2))) |
|
270 |
-@ |
|
271 |
- |
|
272 |
-To fully remove the data associated with the \Robject{cnSet2} object, |
|
273 |
-one should use the \Rfunction{delete} function in the \Rpackage{ff} |
|
274 |
-package followed by the \Rfunction{rm} function. The following code |
|
275 |
-is not evaluated is it would change the results of the cached |
|
276 |
-computations in the previous code chunk. |
|
277 |
- |
|
278 |
-<<delete, eval=FALSE>>= |
|
279 |
-lapply(assayData(cnSet2), delete) |
|
280 |
-lapply(batchStatistics(cnSet2), delete) |
|
281 |
-delete(cnSet2$gender) |
|
282 |
-delete(cnSet2$SNR) |
|
283 |
-delete(cnSet2$SKW) |
|
284 |
-rm(cnSet2) |
|
285 |
-@ |
|
286 |
- |
|
287 |
-\end{document} |
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54170 bc3139a8-67e5-0310-9ffc-ced21a209358
... | ... |
@@ -36,7 +36,7 @@ |
36 | 36 |
to the copy number vignette. |
37 | 37 |
\end{abstract} |
38 | 38 |
|
39 |
-\section{Infrastructure} |
|
39 |
+\section{Setup} |
|
40 | 40 |
|
41 | 41 |
<<crlmm, results=hide, echo=FALSE>>= |
42 | 42 |
library(crlmm) |
... | ... |
@@ -44,29 +44,16 @@ options(width=70) |
44 | 44 |
options(continue=" ") |
45 | 45 |
@ |
46 | 46 |
|
47 |
-\textbf{Supported platforms:} The supported Infinium platforms are |
|
48 |
-those for which a corresponding annotation package is available. The |
|
49 |
-annotation packages contain information on the markers, such as |
|
50 |
-physical position and chromosome, as well as pre-computed parameters |
|
51 |
-estimated from HapMap used during the preprocessing and genotyping |
|
52 |
-steps. Currently supported Infinium platforms are listed in the |
|
53 |
-following code chunk. |
|
54 |
- |
|
55 |
-<<supportedPlatforms>>= |
|
56 |
-pkgs <- annotationPackages() |
|
57 |
-crlmm.pkgs <- pkgs[grep("Crlmm", pkgs)] |
|
58 |
-crlmm.pkgs[grep("human", crlmm.pkgs)] |
|
59 |
-@ |
|
60 |
- |
|
61 |
-\textbf{Large data:} In order to reduce \crlmm{}'s memory footprint, |
|
62 |
-we require the \Rpackage{ff} for copy number estimation. The |
|
63 |
-\Rpackage{ff} package provides infrastructure for accessing and |
|
64 |
-writing data to disk instead of keeping data in memory. Each element |
|
65 |
-of the \verb+assayData+ and \verb+batchStatistics+ slot of the |
|
66 |
-\Rclass{CNSet} class are \Robject{ff} objects. \Robject{ff} objects |
|
67 |
-in the \R{} workspace contain pointers to several files with the |
|
68 |
-\verb+.ff+ extension. The location of where the data is stored on |
|
69 |
-disk is specified by use of the \Rfunction{ldPath} function. |
|
47 |
+%\textbf{Supported platforms:} The supported Infinium platforms are |
|
48 |
+%those for which a corresponding annotation package is available. The |
|
49 |
+%annotation packages contain information on the markers, such as |
|
50 |
+%physical position and chromosome, as well as pre-computed parameters |
|
51 |
+%estimated from HapMap used during the preprocessing and genotyping |
|
52 |
+%steps. Currently supported Infinium platforms are listed in the |
|
53 |
+%following code chunk. |
|
54 |
+The following codechunk declares a directory for saving \Robject{ff} |
|
55 |
+files that will contain the normalized intensities and the genotype |
|
56 |
+calls. |
|
70 | 57 |
|
71 | 58 |
<<ldpath,results=hide>>= |
72 | 59 |
library(ff) |
... | ... |
@@ -78,41 +65,23 @@ ldPath(outdir) |
78 | 65 |
dir.create(outdir, recursive=TRUE, showWarnings=FALSE) |
79 | 66 |
@ |
80 | 67 |
|
81 |
-Users should not move or rename this directory. If only output files |
|
82 |
-are stored in \verb+outdir+, one can either remove the entire |
|
83 |
-directory prior to rerunning the analysis or all of the '.ff' files. |
|
84 |
-Otherwise, one would accumulate a large number of '.ff' files on disk |
|
85 |
-that are no longer in use. |
|
86 |
- |
|
87 |
-As none of the functions for preprocessing, genotyping, or copy number |
|
88 |
-estimation simultaneously require all samples and all probes in |
|
89 |
-memory, memory-usage by \crlmm{} can be fine tuned by reading in and |
|
90 |
-processing subsets of the markers and/or samples. The functions |
|
91 |
-\Rfunction{ocSamples} and \Rfunction{ocProbesets} in the |
|
92 |
-\Rpackage{oligoClasses} package can be used to declare how many |
|
93 |
-markers and samples to read at once. In general, specifying smaller |
|
94 |
-values should reduce the RAM required for a particular job, but would |
|
95 |
-be expected to have an increased run-time. In the following |
|
96 |
-code-chunk, we declare that \crlmm{} should process 150,000 markers at |
|
97 |
-a time (when possible) and 500 samples at a time. (As our dataset in |
|
98 |
-this vignette only contains 43 samples, the \Rfunction{ocSamples} |
|
99 |
-option would not have any effect.) One can view the current settings |
|
100 |
-for these commands, by typing the functions without an argument. |
|
101 |
- |
|
102 |
-<<ram>>= |
|
103 |
-ocSamples() |
|
104 |
-ocProbesets() |
|
105 |
-ocProbesets(150e3) |
|
106 |
-ocSamples(500) |
|
107 |
-ocSamples() |
|
108 |
-ocProbesets() |
|
109 |
-@ |
|
68 |
+We will also store cached computations in the directory \verb+outdir+. |
|
110 | 69 |
|
111 | 70 |
<<cacheSweave, echo=FALSE, results=hide>>= |
112 | 71 |
library(cacheSweave) |
113 | 72 |
setCacheDir(outdir) |
114 | 73 |
@ |
115 | 74 |
|
75 |
+We declare that \crlmm{} should process 150,000 markers at a time |
|
76 |
+(when possible) and/or 500 samples at a time. As our example dataset |
|
77 |
+in this vignette contains fewer than 500 samples, all samples will be |
|
78 |
+processed simultaneously. |
|
79 |
+ |
|
80 |
+<<ram>>= |
|
81 |
+ocProbesets(150e3) |
|
82 |
+ocSamples(500) |
|
83 |
+@ |
|
84 |
+ |
|
116 | 85 |
\textbf{Limitations:} There is no minimum number of samples required |
117 | 86 |
for preprocessing and genotyping. However, for copy number estimation |
118 | 87 |
the \crlmm{} package currently requires at least 10 samples per batch. |
... | ... |
@@ -134,7 +103,7 @@ platforms, several steps are required. |
134 | 103 |
We begin by specifying the path containing the raw IDAT files for a |
135 | 104 |
set of samples from the Infinium 370k platform. |
136 | 105 |
|
137 |
-<libraries>>= |
|
106 |
+<<datadir>>= |
|
138 | 107 |
datadir <- "/thumper/ctsa/snpmicroarray/illumina/IDATS/370k" |
139 | 108 |
@ |
140 | 109 |
|
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54164 bc3139a8-67e5-0310-9ffc-ced21a209358
... | ... |
@@ -15,6 +15,7 @@ |
15 | 15 |
\newcommand{\Rclass}[1]{{\textit{#1}}} |
16 | 16 |
\newcommand{\oligo}{\Rpackage{oligo }} |
17 | 17 |
\newcommand{\R}{\textsf{R}} |
18 |
+\newcommand{\ff}{\Rpackage{ff}} |
|
18 | 19 |
|
19 | 20 |
\begin{document} |
20 | 21 |
\title{Using \crlmm{} for copy number estimation and genotype calling |
... | ... |
@@ -27,619 +28,291 @@ |
27 | 28 |
|
28 | 29 |
|
29 | 30 |
\begin{abstract} |
30 |
- This vignette illustrates the steps necessary for obtaining |
|
31 |
- marker-level estimates of allele-specific copy number from the raw |
|
32 |
- Illumina IDAT files. Hidden markov models or segmentation |
|
33 |
- algorithms can be applied to the marker-level estimates. Examples |
|
34 |
- of both are illustrated. |
|
31 |
+ This vignette illustrates the steps required prior to copy number |
|
32 |
+ analysis for Infinium platforms. Specifically, we require |
|
33 |
+ construction of a container to store processed forms of the raw |
|
34 |
+ data, preprocessing to normalize the arrays, and genotyping using |
|
35 |
+ the CRLMM algorithm. After completing these steps, users can refer |
|
36 |
+ to the copy number vignette. |
|
35 | 37 |
\end{abstract} |
36 | 38 |
|
37 |
-\section{About this vignette} |
|
39 |
+\section{Infrastructure} |
|
40 |
+ |
|
38 | 41 |
<<crlmm, results=hide, echo=FALSE>>= |
39 | 42 |
library(crlmm) |
43 |
+options(width=70) |
|
44 |
+options(continue=" ") |
|
40 | 45 |
@ |
41 | 46 |
|
42 |
-%The vignette for copy number estimation for illumina platforms is |
|
43 |
-%under development. Please use the latest stable release of crlmm. |
|
44 |
- |
|
45 |
-%\end{document} |
|
47 |
+\textbf{Supported platforms:} The supported Infinium platforms are |
|
48 |
+those for which a corresponding annotation package is available. The |
|
49 |
+annotation packages contain information on the markers, such as |
|
50 |
+physical position and chromosome, as well as pre-computed parameters |
|
51 |
+estimated from HapMap used during the preprocessing and genotyping |
|
52 |
+steps. Currently supported Infinium platforms are listed in the |
|
53 |
+following code chunk. |
|
46 | 54 |
|
47 |
-Allele-specific copy number estimation in the crlmm package is |
|
48 |
-available for several Illumina platforms. As described in the |
|
49 |
-\texttt{copynumber} vignette, copy number estimation in \crlmm{} works |
|
50 |
-best when there are a sufficient number of samples such that AA, AB, |
|
51 |
-and BB genotypes are observed at most loci. For small studies (e.g., |
|
52 |
-fewer than 50 samples), there will be a large number of SNPs that are |
|
53 |
-monomorphic. For monomorphic SNPs, the estimation problem becomes more |
|
54 |
-difficult and alternative strategies that estimate the relative total |
|
55 |
-copy number may be preferable. In addition to installing \crlmm{}, |
|
56 |
-one must also install the appropriate annotation package for the |
|
57 |
-Illumina platform. In the following code, we list the platforms for |
|
58 |
-which annotation packages are currently available. Next we create a |
|
59 |
-directory where output files will be stored and indicate the directory |
|
60 |
-that contains the IDAT files that will be used in our analysis. |
|
61 |
- |
|
62 |
- |
|
63 |
-<<setup, echo=FALSE, results=hide>>= |
|
64 |
-options(width=70) |
|
65 |
-options(continue=" ") |
|
66 |
-library(cacheSweave) |
|
55 |
+<<supportedPlatforms>>= |
|
56 |
+pkgs <- annotationPackages() |
|
57 |
+crlmm.pkgs <- pkgs[grep("Crlmm", pkgs)] |
|
58 |
+crlmm.pkgs[grep("human", crlmm.pkgs)] |
|
67 | 59 |
@ |
68 | 60 |
|
69 |
-<<libraries>>= |
|
61 |
+\textbf{Large data:} In order to reduce \crlmm{}'s memory footprint, |
|
62 |
+we require the \Rpackage{ff} for copy number estimation. The |
|
63 |
+\Rpackage{ff} package provides infrastructure for accessing and |
|
64 |
+writing data to disk instead of keeping data in memory. Each element |
|
65 |
+of the \verb+assayData+ and \verb+batchStatistics+ slot of the |
|
66 |
+\Rclass{CNSet} class are \Robject{ff} objects. \Robject{ff} objects |
|
67 |
+in the \R{} workspace contain pointers to several files with the |
|
68 |
+\verb+.ff+ extension. The location of where the data is stored on |
|
69 |
+disk is specified by use of the \Rfunction{ldPath} function. |
|
70 |
+ |
|
71 |
+<<ldpath,results=hide>>= |
|
70 | 72 |
library(ff) |
71 |
-pkgs <- annotationPackages() |
|
72 |
-pkgs[grep("Crlmm", pkgs)] |
|
73 | 73 |
if(getRversion() < "2.13.0"){ |
74 | 74 |
rpath <- getRversion() |
75 | 75 |
} else rpath <- "trunk" |
76 | 76 |
outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/illumina_vignette", sep="") |
77 |
+ldPath(outdir) |
|
77 | 78 |
dir.create(outdir, recursive=TRUE, showWarnings=FALSE) |
78 |
-datadir <- "/thumper/ctsa/snpmicroarray/illumina/IDATS/370k" |
|
79 |
-setCacheDir(outdir) |
|
80 | 79 |
@ |
81 | 80 |
|
82 |
-%Options for controlling RAM with the \Rpackage{ff} package. |
|
83 |
- |
|
84 |
-<<ldOptions>>= |
|
85 |
-ldPath(outdir) |
|
81 |
+Users should not move or rename this directory. If only output files |
|
82 |
+are stored in \verb+outdir+, one can either remove the entire |
|
83 |
+directory prior to rerunning the analysis or all of the '.ff' files. |
|
84 |
+Otherwise, one would accumulate a large number of '.ff' files on disk |
|
85 |
+that are no longer in use. |
|
86 |
+ |
|
87 |
+As none of the functions for preprocessing, genotyping, or copy number |
|
88 |
+estimation simultaneously require all samples and all probes in |
|
89 |
+memory, memory-usage by \crlmm{} can be fine tuned by reading in and |
|
90 |
+processing subsets of the markers and/or samples. The functions |
|
91 |
+\Rfunction{ocSamples} and \Rfunction{ocProbesets} in the |
|
92 |
+\Rpackage{oligoClasses} package can be used to declare how many |
|
93 |
+markers and samples to read at once. In general, specifying smaller |
|
94 |
+values should reduce the RAM required for a particular job, but would |
|
95 |
+be expected to have an increased run-time. In the following |
|
96 |
+code-chunk, we declare that \crlmm{} should process 150,000 markers at |
|
97 |
+a time (when possible) and 500 samples at a time. (As our dataset in |
|
98 |
+this vignette only contains 43 samples, the \Rfunction{ocSamples} |
|
99 |
+option would not have any effect.) One can view the current settings |
|
100 |
+for these commands, by typing the functions without an argument. |
|
101 |
+ |
|
102 |
+<<ram>>= |
|
103 |
+ocSamples() |
|
104 |
+ocProbesets() |
|
86 | 105 |
ocProbesets(150e3) |
87 |
-ocSamples(200) |
|
106 |
+ocSamples(500) |
|
107 |
+ocSamples() |
|
108 |
+ocProbesets() |
|
88 | 109 |
@ |
89 | 110 |
|
90 |
-This vignette was created using Illumina IDAT files that are located |
|
91 |
-in a specific directory on my computer (\Robject{pathToCels}). Long |
|
92 |
-computations are saved in the output directory \Robject{outdir}. |
|
93 |
-Towards this end, we make repeated use of the \Rfunction{checkExists} |
|
94 |
-simply checks that a variable is in the workspace. If not, the |
|
95 |
-variable is loaded from the indicated path. If the file (with '.rda' |
|
96 |
-extension) does not exist in this path, the function indicated by the |
|
97 |
-.FUN argument is executed. Alternatively, if \Robject{.load.it} is |
|
98 |
-\Robject{FALSE} the function will be executed regardless of whether |
|
99 |
-the file exists. Users should see the \Rpackage{weaver} package for a |
|
100 |
-more 'correct' approach for cacheing long computations. The following |
|
101 |
-code chunk checks that that the variables specific to the directory |
|
102 |
-structure on my computer exist. Users should modify the |
|
103 |
-\Robject{outdir} and \Robject{datadir} variables as appropriate. |
|
104 |
- |
|
105 |
-<<checkSetup>>= |
|
106 |
-if(!file.exists(outdir)) stop("Please specify valid directory for storing output") |
|
107 |
-if(!file.exists(datadir)) stop("Please specify the correct path to the CEL files") |
|
111 |
+<<cacheSweave, echo=FALSE, results=hide>>= |
|
112 |
+library(cacheSweave) |
|
113 |
+setCacheDir(outdir) |
|
108 | 114 |
@ |
109 | 115 |
|
110 |
-\section{Preprocessing Illumina IDAT files, genotyping, and copy number |
|
111 |
- estimation} |
|
112 |
-To perform copy number analysis on the Illumina platform, several |
|
113 |
-steps are required. The first step is to read in the IDAT files and |
|
114 |
-create a container for storing the red and green intensities. These |
|
115 |
-intensities are quantile normalized in the function |
|
116 |
-\Rfunction{crlmmIllumina}, and then genotyped using the crlmm |
|
117 |
-algorithm. Details on the crlmm genotyping algorithm are described |
|
118 |
-elsewhere. We will make use of the normalized intensities when we |
|
119 |
-estimate copy number. The object returned by the |
|
120 |
-\Rfunction{genotype.Illumina} function is an instance of the |
|
121 |
-\Robject{CNSet} class, a container for storing the genotype calls, |
|
122 |
-genotype confidence scores, the normalized intensities for the A and B |
|
123 |
-channels, and summary statistics on the batches. The batch variable |
|
124 |
-can be the date on which the array was scanned or the 96 well |
|
125 |
-microarray chemistry plate on which the samples were processed. Both |
|
126 |
-date and chemistry plate are useful surrogates for experimental |
|
127 |
-factors that effect subgroups of the probe intensities. (Quantile |
|
128 |
-normalization does not remove batch effects.) The genotype confidence |
|
129 |
-scores are saved as an integer, and can be converted back to a $[0, |
|
130 |
-1]$ probability scale by the transformation $round(-1000*log2(1-p))$. |
|
116 |
+\textbf{Limitations:} There is no minimum number of samples required |
|
117 |
+for preprocessing and genotyping. However, for copy number estimation |
|
118 |
+the \crlmm{} package currently requires at least 10 samples per batch. |
|
119 |
+The parameter estimates for copy number and the corresponding |
|
120 |
+estimates of raw copy number will tend to be more noisy for batches |
|
121 |
+with small sample sizes (e.g., $<$ 50). Chemistry plate or scan date |
|
122 |
+are often useful surrogates for batch. |
|
123 |
+ |
|
124 |
+\section{Initializing a container for storing processed data} |
|
125 |
+ |
|
126 |
+This section will initialize a container for storing processed forms |
|
127 |
+of the data, including the normalized intensities for the A and B |
|
128 |
+alleles and the CRLMM genotype calls and confidence scores. In |
|
129 |
+addition, the container will store information on the markers |
|
130 |
+(physical position, chromosome, and a SNP indicator), the batch, and |
|
131 |
+the samples (e.g., gender). To construct this container for Infinium |
|
132 |
+platforms, several steps are required. |
|
133 |
+ |
|
134 |
+We begin by specifying the path containing the raw IDAT files for a |
|
135 |
+set of samples from the Infinium 370k platform. |
|
136 |
+ |
|
137 |
+<libraries>>= |
|
138 |
+datadir <- "/thumper/ctsa/snpmicroarray/illumina/IDATS/370k" |
|
139 |
+@ |
|
140 |
+ |
|
141 |
+For Infinium platforms, an Illumina sample sheet containing |
|
142 |
+information for reading the raw IDAT files is required. Please refer |
|
143 |
+to the BeadStudio Genotyping guide, Appendix A, for additional |
|
144 |
+information. The following code reads in the samplesheet for the IDAT |
|
145 |
+files on our local server. For our dataset, the file extensions are |
|
146 |
+`Grn.dat' and `Red.idat'. We store the complete path to the filename |
|
147 |
+without the file extension in the \Robject{arrayNames} and check that |
|
148 |
+all of the green and red IDAT files exists. |
|
131 | 149 |
|
132 | 150 |
<<samplesheet>>= |
133 | 151 |
samplesheet = read.csv(file.path(datadir, "HumanHap370Duo_Sample_Map.csv"), header=TRUE, as.is=TRUE) |
134 | 152 |
samplesheet <- samplesheet[-c(28:46,61:75,78:79), ] |
135 | 153 |
arrayNames <- file.path(datadir, unique(samplesheet[, "SentrixPosition"])) |
136 |
-grnfiles = all(file.exists(paste(arrayNames, "_Grn.idat", sep=""))) |
|
137 |
-redfiles = all(file.exists(paste(arrayNames, "_Red.idat", sep=""))) |
|
138 |
-load.it <- TRUE |
|
139 |
-if(!load.it){ |
|
140 |
- rmFiles <- list.files(outdir, pattern=".ff", full.names=TRUE) |
|
141 |
- unlink(rmFiles) |
|
142 |
-} |
|
143 |
-plate <- samplesheet$Plate |
|
144 |
-table(plate) |
|
145 |
-## too few samples to run the plates as separate batches |
|
146 |
-batch <- rep("1", nrow(samplesheet)) |
|
154 |
+all(file.exists(paste(arrayNames, "_Grn.idat", sep=""))) |
|
155 |
+all(file.exists(paste(arrayNames, "_Red.idat", sep=""))) |
|
156 |
+arrayInfo <- list(barcode=NULL, position="SentrixPosition") |
|
147 | 157 |
@ |
148 | 158 |
|
149 |
-<<genotype,cache=TRUE>>= |
|
150 |
-container <- genotype.Illumina(sampleSheet=samplesheet, |
|
151 |
- path=dirname(arrayNames[1]), |
|
152 |
- arrayInfoColNames=list(barcode=NULL, position="SentrixPosition"), |
|
153 |
- cdfName="human370v1c", |
|
154 |
- batch=batch) |
|
155 |
-@ |
|
159 |
+As discussed previously, all supported platforms have a corresponding |
|
160 |
+annotation package. The appropriate annotation package is specified |
|
161 |
+by the platform identifier without the \verb+Crlmm+ postfix. |
|
156 | 162 |
|
157 |
-<<copynumber,cache=TRUE>>= |
|
158 |
-GT.CONF.THR <- 0.8 |
|
159 |
-cnSet <- crlmmCopynumber(container, GT.CONF.THR=GT.CONF.THR) |
|
163 |
+<<cdfname>>= |
|
164 |
+cdfName <- "human370v1c" |
|
160 | 165 |
@ |
161 | 166 |
|
162 |
-The \Robject{cnSet} returned by the \Rfunction{crlmmCopynumber} |
|
163 |
-function contains all of the parameters used to compute |
|
164 |
-allele-specific copy number. In order to reduce I/O and speed |
|
165 |
-computation, the allele-specific copy number estimates are not written |
|
166 |
-to file nor saved in the \Robject{CNSet} object. Rather, |
|
167 |
-allele-specific estimates are computed on the fly by the functions |
|
168 |
-\Rfunction{CA}, \Rfunction{CB}, or \Rfunction{totalCopynumber}. The |
|
169 |
-following codechunks provide a few examples. |
|
170 |
- |
|
171 |
-Copy number for polymorphic markers on chromosomes 1 - 22. |
|
172 |
- |
|
173 |
-<<ca>>= |
|
174 |
-snp.index <- which(isSnp(cnSet) & !is.na(chromosome(cnSet)) & chromosome(cnSet) < 22) |
|
175 |
-ca <- CA(cnSet, i=snp.index, j=1:5) |
|
176 |
-cb <- CB(cnSet, i=snp.index, j=1:5) |
|
177 |
-ct <- ca+cb |
|
178 |
-@ |
|
167 |
+Next, we construct a character vector that specifies the batch for |
|
168 |
+each of the \Sexpr{length(arrayNames)} arrays. Here, we have a small |
|
169 |
+dataset and process the samples in a single batch. Processing the |
|
170 |
+samples as a single batch is generally reasonable if the samples were |
|
171 |
+processed at similar times (e.g., within a few weeks). |
|
179 | 172 |
|
180 |
-Alternatively, total copy number can be obtained by |
|
181 |
-<<totalCopynumber.snps>>= |
|
182 |
-ct2 <- totalCopynumber(cnSet, i=snp.index, j=1:5) |
|
183 |
-stopifnot(all.equal(ct, ct2)) |
|
173 |
+<<batch>>= |
|
174 |
+batch <- rep("1", nrow(samplesheet)) |
|
184 | 175 |
@ |
185 | 176 |
|
186 |
-Missing values can arise at polymorphic when the confidence score of |
|
187 |
-the genotype calls are below the threshold indicated by the threshold |
|
188 |
-\texttt{GT.CONF.THR} in \Robject{crlmmCopynumber}. See |
|
189 |
-\Robject{?crlmmCopynumber} for additional details. In the following |
|
190 |
-codechunk, we compute the number of samples that had confidence scores |
|
191 |
-below \Sexpr{GT.CONF.THR} at loci for which ${\hat CA}$ and ${\hat CB}$ |
|
192 |
-are missing. |
|
193 |
- |
|
194 |
-<<NAs.snps>>= |
|
195 |
-missing.copynumber <- which(rowSums(is.na(ct)) > 0) |
|
196 |
-invisible(open(snpCallProbability(cnSet))) |
|
197 |
-gt.confidence <- i2p(snpCallProbability(cnSet)[snp.index, ]) |
|
198 |
-n.below.threshold <- rowSums(gt.confidence < 0.95) |
|
199 |
-unique(n.below.threshold[missing.copynumber]) |
|
200 |
-@ |
|
201 |
-\noindent For loci with missing copy number, the confidence scores |
|
202 |
-were all below the threshold. |
|
203 |
- |
|
204 |
-At nonpolymorphic loci, either the \Rfunction{CA} or |
|
205 |
-\Rfunction{totalCopynumber} functions can be used to obtain estimates |
|
206 |
-of total copy number. |
|
207 |
- |
|
208 |
-<<nonpolymorphicAutosomes>>= |
|
209 |
-marker.index <- which(!isSnp(cnSet) & chromosome(cnSet) < 23) |
|
210 |
-ct <- CA(cnSet, i=marker.index, j=1:5) |
|
211 |
-## all zeros |
|
212 |
-stopifnot(all(CB(cnSet, i=marker.index, j=1:5) == 0)) |
|
213 |
-ct2 <- totalCopynumber(cnSet, i=marker.index, j=1:5) |
|
214 |
-stopifnot(all.equal(ct, ct2)) |
|
177 |
+Finally, we initialize an object of class \Robject{CNSet} using the |
|
178 |
+function \Rfunction{constructInf}. |
|
179 |
+ |
|
180 |
+<<container,cache=TRUE>>= |
|
181 |
+cnSet <- constructInf(sampleSheet=samplesheet, |
|
182 |
+ arrayNames=arrayNames, |
|
183 |
+ batch=batch, |
|
184 |
+ arrayInfoColNames=arrayInfo, |
|
185 |
+ cdfName=cdfName, |
|
186 |
+ verbose=TRUE, |
|
187 |
+ saveDate=TRUE) |
|
215 | 188 |
@ |
216 | 189 |
|
190 |
+A concise summary of the object's contents can be viewed with the |
|
191 |
+\Rfunction{print} function. |
|
217 | 192 |
|
218 |
-\begin{figure}[t!] |
|
219 |
- Nonpolymorphic markers on chromosome X: |
|
220 |
- \centering |
|
221 |
-<<nonpolymorphicX, fig=TRUE, width=8, height=4>>= |
|
222 |
-npx.index <- which(chromosome(cnSet)==23 & !isSnp(cnSet)) |
|
223 |
-M <- sample(which(cnSet$gender==1), 5) |
|
224 |
-F <- sample(which(cnSet$gender==2), 5) |
|
225 |
-cn.M <- CA(cnSet, i=npx.index, j=M) |
|
226 |
-cn.F <- CA(cnSet, i=npx.index, j=F) |
|
227 |
-boxplot(data.frame(cbind(cn.M, cn.F)), pch=".", col="grey60", outline=FALSE) |
|
228 |
-@ |
|
229 |
-\caption{Copy number estimates for nonpolymorphic loci on chromosome |
|
230 |
- X (5 men, 5 women). crlmm assumes that the median copy number |
|
231 |
- across samples at a given marker on X is 1 for men and 2 for women.} |
|
232 |
-\end{figure} |
|
233 |
- |
|
234 |
- |
|
235 |
- |
|
236 |
- |
|
237 |
-\begin{figure}[t!] |
|
238 |
- \centering |
|
239 |
-<<polymorphicX, fig=TRUE, width=8, height=4>>= |
|
240 |
-## copy number estimates on X for SNPs are biased towards small values. |
|
241 |
-X.markers <- which(isSnp(cnSet) & chromosome(cnSet) == 23) |
|
242 |
-ca.M <- CA(cnSet, i=X.markers, j=M) |
|
243 |
-cb.M <- CB(cnSet, i=X.markers, j=M) |
|
244 |
-ca.F <- CA(cnSet, i=X.markers, j=F) |
|
245 |
-cb.F <- CB(cnSet, i=X.markers, j=F) |
|
246 |
-cn.M <- ca.M+cb.M |
|
247 |
-cn.F <- ca.F+cb.F |
|
248 |
-boxplot(data.frame(cbind(cn.M, cn.F)), pch=".", outline=FALSE, col="grey60") |
|
249 |
-cn2 <- totalCopynumber(cnSet, i=X.markers, j=c(M, F)) |
|
250 |
-stopifnot(all.equal(cbind(cn.M, cn.F), cn2)) |
|
251 |
-@ |
|
252 |
-\caption{Copy number estimates for polymorphic markers on chromosome |
|
253 |
- X. } |
|
254 |
-\end{figure} |
|
255 |
- |
|
256 |
-Often it is of interest to smooth the total copy number estimates (the |
|
257 |
-sum of the allele-specific copy number) as a function of the physical |
|
258 |
-position. The following code chunk can be used to create an instance |
|
259 |
-of \Robject{CopyNumberSet} that contains the CA + CB at polymorphic |
|
260 |
-markers and 'CA' at nonpolymorphic markers. |
|
261 |
- |
|
262 |
-The \Robject{cnSet} returned by the \Rfunction{crlmmCopynumber} |
|
263 |
-function contains all of the parameters used to compute |
|
264 |
-allele-specific copy number. In order to reduce I/O and speed |
|
265 |
-computation, the allele-specific copy number estimates are not written |
|
266 |
-to file nor saved in the \Robject{CNSet} object. Rather, the |
|
267 |
-estimates are computed on the fly when the user calls one of the |
|
268 |
-helper functions that uses this information. For instance, often it is |
|
269 |
-of interest to smooth the total copy number estimates (the sum of the |
|
270 |
-allele-specific copy number) as a function of the physical position. |
|
271 |
-The following code chunk can be used to create an instance of |
|
272 |
-\Robject{CopyNumberSet} that contains the CA + CB at polymorphic |
|
273 |
-markers and 'CA' at nonpolymorphic markers. |
|
274 |
- |
|
275 |
-Alternatively, total copy number can be obtained by |
|
276 |
-<<totalCopynumber.snps>>= |
|
277 |
-ct2 <- totalCopynumber(cnSet, i=snp.index, j=1:5) |
|
278 |
-stopifnot(all.equal(ct, ct2)) |
|
193 |
+<<cnset>>= |
|
194 |
+print(cnSet) |
|
279 | 195 |
@ |
280 | 196 |
|
281 |
-Missing values can arise at polymorphic when the confidence score of |
|
282 |
-the genotype calls are below the threshold indicated by the threshold |
|
283 |
-\texttt{GT.CONF.THR} in \Robject{crlmmCopynumber}. See |
|
284 |
-\Robject{?crlmmCopynumber} for additional details. In the following |
|
285 |
-codechunk, we compute the number of samples that had confidence scores |
|
286 |
-below \Sexpr{GT.CONF.THR} at loci for which ${\hat CA}$ and ${\hat CB}$ |
|
287 |
-are missing. |
|
197 |
+Note that the above object does not yet contain any processed data |
|
198 |
+(only \verb+NA+'s) and several \verb+.ff+ files now appear in the |
|
199 |
+\verb+outdir+. Again, these files should not be removed. |
|
288 | 200 |
|
289 |
-<<NAs.snps>>= |
|
290 |
-missing.copynumber <- which(rowSums(is.na(ct)) > 0) |
|
291 |
-invisible(open(snpCallProbability(cnSet))) |
|
292 |
-gt.confidence <- i2p(snpCallProbability(cnSet)[snp.index, ]) |
|
293 |
-n.below.threshold <- rowSums(gt.confidence < 0.95) |
|
294 |
-unique(n.below.threshold[missing.copynumber]) |
|
201 |
+<<listff>>= |
|
202 |
+list.files(outdir, pattern=".ff")[1:5] |
|
295 | 203 |
@ |
296 |
-\noindent For loci with missing copy number, the confidence scores |
|
297 |
-were all below the threshold. |
|
298 |
- |
|
299 |
-At nonpolymorphic loci, either the \Rfunction{CA} or |
|
300 |
-\Rfunction{totalCopynumber} functions can be used to obtain estimates |
|
301 |
-of total copy number. |
|
302 |
- |
|
303 |
-<<nonpolymorphicAutosomes>>= |
|
304 |
-marker.index <- which(!isSnp(cnSet) & chromosome(cnSet) < 23) |
|
305 |
-ct <- CA(cnSet, i=marker.index, j=1:5) |
|
306 |
-## all zeros |
|
307 |
-stopifnot(all(CB(cnSet, i=marker.index, j=1:5) == 0)) |
|
308 |
-ct2 <- totalCopynumber(cnSet, i=marker.index, j=1:5) |
|
309 |
-stopifnot(all.equal(ct, ct2)) |
|
310 |
-@ |
|
311 |
- |
|
312 | 204 |
|
313 |
-\begin{figure}[t!] |
|
314 |
- Nonpolymorphic markers on chromosome X: |
|
315 |
- \centering |
|
316 |
-<<nonpolymorphicX, fig=TRUE, width=8, height=4>>= |
|
317 |
-npx.index <- which(chromosome(cnSet)==23 & !isSnp(cnSet)) |
|
318 |
-M <- sample(which(cnSet$gender==1), 5) |
|
319 |
-F <- sample(which(cnSet$gender==2), 5) |
|
320 |
-cn.M <- CA(cnSet, i=npx.index, j=M) |
|
321 |
-cn.F <- CA(cnSet, i=npx.index, j=F) |
|
322 |
-boxplot(data.frame(cbind(cn.M, cn.F)), pch=".", col="grey60", outline=FALSE) |
|
323 |
-@ |
|
324 |
-\caption{Copy number estimates for nonpolymorphic loci on chromosome |
|
325 |
- X (5 men, 5 women). crlmm assumes that the median copy number |
|
326 |
- across samples at a given marker on X is 1 for men and 2 for women.} |
|
327 |
-\end{figure} |
|
328 |
- |
|
329 |
- |
|
330 |
- |
|
331 |
- |
|
332 |
-\begin{figure}[t!] |
|
333 |
- \centering |
|
334 |
-<<polymorphicX, fig=TRUE, width=8, height=4>>= |
|
335 |
-## copy number estimates on X for SNPs are biased towards small values. |
|
336 |
-X.markers <- which(isSnp(cnSet) & chromosome(cnSet) == 23) |
|
337 |
-ca.M <- CA(cnSet, i=X.markers, j=M) |
|
338 |
-cb.M <- CB(cnSet, i=X.markers, j=M) |
|
339 |
-ca.F <- CA(cnSet, i=X.markers, j=F) |
|
340 |
-cb.F <- CB(cnSet, i=X.markers, j=F) |
|
341 |
-cn.M <- ca.M+cb.M |
|
342 |
-cn.F <- ca.F+cb.F |
|
343 |
-boxplot(data.frame(cbind(cn.M, cn.F)), pch=".", outline=FALSE, col="grey60") |
|
344 |
-cn2 <- totalCopynumber(cnSet, i=X.markers, j=c(M, F)) |
|
345 |
-stopifnot(all.equal(cbind(cn.M, cn.F), cn2)) |
|
346 |
-@ |
|
347 |
-\caption{Copy number estimates for polymorphic markers on chromosome |
|
348 |
- X. } |
|
349 |
-\end{figure} |
|
350 |
- |
|
351 |
-Often it is of interest to smooth the total copy number estimates (the |
|
352 |
-sum of the allele-specific copy number) as a function of the physical |
|
353 |
-position. The following code chunk can be used to create an instance |
|
354 |
-of \Robject{CopyNumberSet} that contains the CA + CB at polymorphic |
|
355 |
-markers and 'CA' at nonpolymorphic markers. |
|
356 |
- |
|
357 |
-<<copynumberObject>>= |
|
358 |
-marker.index <- which(chromosome(cnSet) <= 22)## & isSnp(cnSet)) |
|
359 |
-sample.index <- 1:5 ## first five samples |
|
360 |
-invisible(open(cnSet)) |
|
361 |
-copynumberSet <- as(cnSet[marker.index, 1:5], "CopyNumberSet") |
|
362 |
-invisible(close(cnSet)) |
|
363 |
-copynumberSet <- copynumberSet[order(chromosome(copynumberSet), position(copynumberSet)), ] |
|
364 |
-indices <- split(1:nrow(copynumberSet), chromosome(copynumberSet)) |
|
365 |
-dup.index <- unlist(sapply(indices, function(i, position) i[duplicated(position[i])], position=position(copynumberSet))) |
|
366 |
-if(length(dup.index) > 0) copynumberSet <- copynumberSet[-dup.index, ] |
|
367 |
-## exclude any rows that are all missing |
|
368 |
-missing.index <- which(rowSums(is.na(copyNumber(copynumberSet))) == ncol(copynumberSet)) |
|
369 |
-table(chromosome(copynumberSet)[missing.index]) |
|
370 |
-copynumberSet <- copynumberSet[-missing.index, ] |
|
371 |
-@ |
|
205 |
+Finally, note that the elements of the \verb+assayData+ slot are |
|
206 |
+\Robject{ff} objects and not matrices. For the most part, the |
|
207 |
+\emph{appearance} that the data is stored in memory is preserved. |
|
372 | 208 |
|
373 | 209 |
|
374 |
-We suggest plotting the total copy number as a function of the |
|
375 |
-physical position to evaluate whether a wave-correction is needed. |
|
376 |
- |
|
377 |
- |
|
378 |