Browse code

Add vignettes for copy number analysis. Add crlmmGT2.R file

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54171 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 30/03/2011 02:41:00
Showing 1 changed files
1 1
deleted file mode 100755
... ...
@@ -1,287 +0,0 @@
1
-%\VignetteIndexEntry{crlmm copy number Vignette for Illumina}
2
-%\VignetteDepends{crlmm}
3
-%\VignetteKeywords{crlmm, illumina}
4
-%\VignettePackage{crlmm}
5
-\documentclass{article}
6
-\usepackage{graphicx}
7
-\usepackage{natbib}
8
-\usepackage[margin=1in]{geometry}
9
-\newcommand{\crlmm}{\Rpackage{crlmm}}
10
-\newcommand{\Rfunction}[1]{{\texttt{#1}}}
11
-\newcommand{\Rmethod}[1]{{\texttt{#1}}}
12
-\newcommand{\Rcode}[1]{{\texttt{#1}}}
13
-\newcommand{\Robject}[1]{{\texttt{#1}}}
14
-\newcommand{\Rpackage}[1]{{\textsf{#1}}}
15
-\newcommand{\Rclass}[1]{{\textit{#1}}}
16
-\newcommand{\oligo}{\Rpackage{oligo }}
17
-\newcommand{\R}{\textsf{R}}
18
-\newcommand{\ff}{\Rpackage{ff}}
19
-
20
-\begin{document}
21
-\title{Using \crlmm{} for copy number estimation and genotype calling
22
-  with Illumina platforms}
23
-
24
-\date{\today}
25
-
26
-\author{Rob Scharpf}
27
-\maketitle
28
-
29
-
30
-\begin{abstract}
31
-  This vignette illustrates the steps required prior to copy number
32
-  analysis for Infinium platforms.  Specifically, we require
33
-  construction of a container to store processed forms of the raw
34
-  data, preprocessing to normalize the arrays, and genotyping using
35
-  the CRLMM algorithm.  After completing these steps, users can refer
36
-  to the copy number vignette.
37
-\end{abstract}
38
-
39
-\section{Setup}
40
-
41
-<<crlmm, results=hide, echo=FALSE>>=
42
-library(crlmm)
43
-options(width=70)
44
-options(continue=" ")
45
-@
46
-
47
-%\textbf{Supported platforms:} The supported Infinium platforms are
48
-%those for which a corresponding annotation package is available.  The
49
-%annotation packages contain information on the markers, such as
50
-%physical position and chromosome, as well as pre-computed parameters
51
-%estimated from HapMap used during the preprocessing and genotyping
52
-%steps. Currently supported Infinium platforms are listed in the
53
-%following code chunk.
54
-The following codechunk declares a directory for saving \Robject{ff}
55
-files that will contain the normalized intensities and the genotype
56
-calls.
57
-
58
-<<ldpath,results=hide>>=
59
-library(ff)
60
-if(getRversion() < "2.13.0"){
61
-	rpath <- getRversion()
62
-} else rpath <- "trunk"
63
-outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/illumina_vignette", sep="")
64
-ldPath(outdir)
65
-dir.create(outdir, recursive=TRUE, showWarnings=FALSE)
66
-@
67
-
68
-We will also store cached computations in the directory \verb+outdir+.
69
-
70
-<<cacheSweave, echo=FALSE, results=hide>>=
71
-library(cacheSweave)
72
-setCacheDir(outdir)
73
-@
74
-
75
-We declare that \crlmm{} should process 150,000 markers at a time
76
-(when possible) and/or 500 samples at a time.  As our example dataset
77
-in this vignette contains fewer than 500 samples, all samples will be
78
-processed simultaneously.
79
-
80
-<<ram>>=
81
-ocProbesets(150e3)
82
-ocSamples(500)
83
-@
84
-
85
-\textbf{Limitations:} There is no minimum number of samples required
86
-for preprocessing and genotyping.  However, for copy number estimation
87
-the \crlmm{} package currently requires at least 10 samples per batch.
88
-The parameter estimates for copy number and the corresponding
89
-estimates of raw copy number will tend to be more noisy for batches
90
-with small sample sizes (e.g., $<$ 50).  Chemistry plate or scan date
91
-are often useful surrogates for batch.
92
-
93
-\section{Initializing a container for storing processed data}
94
-
95
-This section will initialize a container for storing processed forms
96
-of the data, including the normalized intensities for the A and B
97
-alleles and the CRLMM genotype calls and confidence scores.  In
98
-addition, the container will store information on the markers
99
-(physical position, chromosome, and a SNP indicator), the batch, and
100
-the samples (e.g., gender).  To construct this container for Infinium
101
-platforms, several steps are required.
102
-
103
-We begin by specifying the path containing the raw IDAT files for a
104
-set of samples from the Infinium 370k platform.
105
-
106
-<<datadir>>=
107
-datadir <- "/thumper/ctsa/snpmicroarray/illumina/IDATS/370k"
108
-@
109
-
110
-For Infinium platforms, an Illumina sample sheet containing
111
-information for reading the raw IDAT files is required. Please refer
112
-to the BeadStudio Genotyping guide, Appendix A, for additional
113
-information.  The following code reads in the samplesheet for the IDAT
114
-files on our local server. For our dataset, the file extensions are
115
-`Grn.dat' and `Red.idat'.  We store the complete path to the filename
116
-without the file extension in the \Robject{arrayNames} and check that
117
-all of the green and red IDAT files exists.
118
-
119
-<<samplesheet>>=
120
-samplesheet = read.csv(file.path(datadir, "HumanHap370Duo_Sample_Map.csv"), header=TRUE, as.is=TRUE)
121
-samplesheet <- samplesheet[-c(28:46,61:75,78:79), ]
122
-arrayNames <- file.path(datadir, unique(samplesheet[, "SentrixPosition"]))
123
-all(file.exists(paste(arrayNames, "_Grn.idat", sep="")))
124
-all(file.exists(paste(arrayNames, "_Red.idat", sep="")))
125
-arrayInfo <- list(barcode=NULL, position="SentrixPosition")
126
-@
127
-
128
-As discussed previously, all supported platforms have a corresponding
129
-annotation package.  The appropriate annotation package is specified
130
-by the platform identifier without the \verb+Crlmm+ postfix.
131
-
132
-<<cdfname>>=
133
-cdfName <- "human370v1c"
134
-@
135
-
136
-Next, we construct a character vector that specifies the batch for
137
-each of the \Sexpr{length(arrayNames)} arrays.  Here, we have a small
138
-dataset and process the samples in a single batch. Processing the
139
-samples as a single batch is generally reasonable if the samples were
140
-processed at similar times (e.g., within a few weeks).
141
-
142
-<<batch>>=
143
-batch <- rep("1", nrow(samplesheet))
144
-@
145
-
146
-Finally, we initialize an object of class \Robject{CNSet} using the
147
-function \Rfunction{constructInf}.
148
-
149
-<<container,cache=TRUE>>=
150
-cnSet <- constructInf(sampleSheet=samplesheet,
151
-		      arrayNames=arrayNames,
152
-		      batch=batch,
153
-		      arrayInfoColNames=arrayInfo,
154
-		      cdfName=cdfName,
155
-		      verbose=TRUE,
156
-		      saveDate=TRUE)
157
-@
158
-
159
-A concise summary of the object's contents can be viewed with the
160
-\Rfunction{print} function.
161
-
162
-<<cnset>>=
163
-print(cnSet)
164
-@
165
-
166
-Note that the above object does not yet contain any processed data
167
-(only \verb+NA+'s) and several \verb+.ff+ files now appear in the
168
-\verb+outdir+.  Again, these files should not be removed.
169
-
170
-<<listff>>=
171
-list.files(outdir, pattern=".ff")[1:5]
172
-@
173
-
174
-Finally, note that the elements of the \verb+assayData+ slot are
175
-\Robject{ff} objects and not matrices. For the most part, the
176
-\emph{appearance} that the data is stored in memory is preserved.
177
-
178
-
179
-<<assaydataelements>>=
180
-sapply(assayData(cnSet), function(x) class(x)[1])
181
-@
182
-
183
-\section{Preprocessing}
184
-
185
-The raw intensities from the Infinium IDAT files are read and
186
-normalized using the function \Rfunction{preprocessInf}.  The function
187
-\Rfunction{preprocessInf} returns a \Robject{ff} object containing the
188
-parameters for the mixture model used by the CRLMM genotyping
189
-algorithm.
190
-
191
-<<preprocess,cache=TRUE>>=
192
-mixtureParams <- preprocessInf(cnSet=cnSet, sampleSheet=samplesheet, arrayNames=arrayNames, arrayInfoColNames=arrayInfo)
193
-invisible(open(mixtureParams))
194
-str(mixtureParams[])
195
-invisible(close(mixtureParams))
196
-@
197
-
198
-Note that the normalized intensities for the A and B alleles are no
199
-longer \verb+NA+s and can now be accessed and inspected using the
200
-methods \Rfunction{A} and \Rfunction{B}, respectively.
201
-
202
-<<intensities>>=
203
-invisible(open(A(cnSet)))
204
-invisible(open(B(cnSet)))
205
-as.matrix(A(cnSet)[1:5, 1:5])
206
-as.matrix(B(cnSet)[1:5, 1:5])
207
-invisible(close(A(cnSet)))
208
-invisible(close(B(cnSet)))
209
-@
210
-
211
-
212
-\section{Genotyping}
213
-
214
-
215
-CRLMM genotype calls and confidence scores are estimated using the
216
-function \Rfunction{genotypeInf}.
217
-
218
-<<genotype,cache=TRUE>>=
219
-updated <- genotypeInf(cnSet, mixtureParams=mixtureParams)
220
-print(updated)
221
-@
222
-
223
-The posterior probabilities for the
224
-genotype calls in the \verb+callProbability+ element of the
225
-\verb+assayData+ are stored as integers to reduce the file size on
226
-disk.  However, the scores can be easily transformed back to the
227
-probability scale using the \Rfunction{i2p} function as illustrated in
228
-the following code chunk.
229
-
230
-<<callprobs>>=
231
-invisible(open(snpCallProbability(cnSet)))
232
-callProbs <- as.matrix(snpCallProbability(cnSet)[1:5, 1:5])
233
-i2p(callProbs)
234
-invisible(close(snpCallProbability(cnSet)))
235
-@
236
-
237
-%As described in the \texttt{copynumber} vignette, copy number
238
-%estimation in \crlmm{} works best when there are a sufficient number
239
-%of samples such that AA, AB, and BB genotypes are observed at most
240
-%loci. For small studies (e.g., fewer than 50 samples), there will be a
241
-%large number of SNPs that are monomorphic. For monomorphic SNPs, the
242
-%estimation problem becomes more difficult and alternative strategies
243
-%that estimate the relative total copy number may be preferable.  In
244
-%addition to installing \crlmm{}, one must also install the appropriate
245
-%annotation package for the Illumina platform.  In the following code,
246
-%we list the platforms for which annotation packages are currently
247
-%available. Next we create a directory where output files will be
248
-%stored and indicate the directory that contains the IDAT files that
249
-%will be used in our analysis.
250
-
251
-\textbf{Wrapper:} The construction of a \Rclass{CNSet} object,
252
-preprocessing, and genotype calling are wrapped into a convenience
253
-function called \Rfunction{genotype.Illumina}.
254
-
255
-<<genotype.Illumina,cache=TRUE,results=hide>>=
256
-cnSet2 <- genotype.Illumina(sampleSheet=samplesheet,
257
-			    arrayNames=arrayNames,
258
-			    arrayInfoColNames=arrayInfo,
259
-			    cdfName="human370v1c",
260
-			    batch=batch)
261
-@
262
-
263
-<<checkIdenticalCalls>>=
264
-invisible(open(calls(cnSet)))
265
-invisible(open(calls(cnSet2)))
266
-snp.index <- which(isSnp(cnSet))
267
-identical(calls(cnSet)[snp.index, 1:20], calls(cnSet2)[snp.index, 1:20])
268
-invisible(close(calls(cnSet)))
269
-invisible(close(calls(cnSet2)))
270
-@
271
-
272
-To fully remove the data associated with the \Robject{cnSet2} object,
273
-one should use the \Rfunction{delete} function in the \Rpackage{ff}
274
-package followed by the \Rfunction{rm} function.  The following code
275
-is not evaluated is it would change the results of the cached
276
-computations in the previous code chunk.
277
-
278
-<<delete, eval=FALSE>>=
279
-lapply(assayData(cnSet2), delete)
280
-lapply(batchStatistics(cnSet2), delete)
281
-delete(cnSet2$gender)
282
-delete(cnSet2$SNR)
283
-delete(cnSet2$SKW)
284
-rm(cnSet2)
285
-@
286
-
287
-\end{document}
Browse code

make gender an ff object in genotype function for Affy

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54170 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 30/03/2011 02:40:42
Showing 1 changed files
... ...
@@ -36,7 +36,7 @@
36 36
   to the copy number vignette.
37 37
 \end{abstract}
38 38
 
39
-\section{Infrastructure}
39
+\section{Setup}
40 40
 
41 41
 <<crlmm, results=hide, echo=FALSE>>=
42 42
 library(crlmm)
... ...
@@ -44,29 +44,16 @@ options(width=70)
44 44
 options(continue=" ")
45 45
 @
46 46
 
47
-\textbf{Supported platforms:} The supported Infinium platforms are
48
-those for which a corresponding annotation package is available.  The
49
-annotation packages contain information on the markers, such as
50
-physical position and chromosome, as well as pre-computed parameters
51
-estimated from HapMap used during the preprocessing and genotyping
52
-steps. Currently supported Infinium platforms are listed in the
53
-following code chunk.
54
-
55
-<<supportedPlatforms>>=
56
-pkgs <- annotationPackages()
57
-crlmm.pkgs <- pkgs[grep("Crlmm", pkgs)]
58
-crlmm.pkgs[grep("human", crlmm.pkgs)]
59
-@
60
-
61
-\textbf{Large data:} In order to reduce \crlmm{}'s memory footprint,
62
-we require the \Rpackage{ff} for copy number estimation.  The
63
-\Rpackage{ff} package provides infrastructure for accessing and
64
-writing data to disk instead of keeping data in memory.  Each element
65
-of the \verb+assayData+ and \verb+batchStatistics+ slot of the
66
-\Rclass{CNSet} class are \Robject{ff} objects.  \Robject{ff} objects
67
-in the \R{} workspace contain pointers to several files with the
68
-\verb+.ff+ extension.  The location of where the data is stored on
69
-disk is specified by use of the \Rfunction{ldPath} function.
47
+%\textbf{Supported platforms:} The supported Infinium platforms are
48
+%those for which a corresponding annotation package is available.  The
49
+%annotation packages contain information on the markers, such as
50
+%physical position and chromosome, as well as pre-computed parameters
51
+%estimated from HapMap used during the preprocessing and genotyping
52
+%steps. Currently supported Infinium platforms are listed in the
53
+%following code chunk.
54
+The following codechunk declares a directory for saving \Robject{ff}
55
+files that will contain the normalized intensities and the genotype
56
+calls.
70 57
 
71 58
 <<ldpath,results=hide>>=
72 59
 library(ff)
... ...
@@ -78,41 +65,23 @@ ldPath(outdir)
78 65
 dir.create(outdir, recursive=TRUE, showWarnings=FALSE)
79 66
 @
80 67
 
81
-Users should not move or rename this directory.  If only output files
82
-are stored in \verb+outdir+, one can either remove the entire
83
-directory prior to rerunning the analysis or all of the '.ff' files.
84
-Otherwise, one would accumulate a large number of '.ff' files on disk
85
-that are no longer in use.
86
-
87
-As none of the functions for preprocessing, genotyping, or copy number
88
-estimation simultaneously require all samples and all probes in
89
-memory, memory-usage by \crlmm{} can be fine tuned by reading in and
90
-processing subsets of the markers and/or samples. The functions
91
-\Rfunction{ocSamples} and \Rfunction{ocProbesets} in the
92
-\Rpackage{oligoClasses} package can be used to declare how many
93
-markers and samples to read at once.  In general, specifying smaller
94
-values should reduce the RAM required for a particular job, but would
95
-be expected to have an increased run-time. In the following
96
-code-chunk, we declare that \crlmm{} should process 150,000 markers at
97
-a time (when possible) and 500 samples at a time.  (As our dataset in
98
-this vignette only contains 43 samples, the \Rfunction{ocSamples}
99
-option would not have any effect.)  One can view the current settings
100
-for these commands, by typing the functions without an argument.
101
-
102
-<<ram>>=
103
-ocSamples()
104
-ocProbesets()
105
-ocProbesets(150e3)
106
-ocSamples(500)
107
-ocSamples()
108
-ocProbesets()
109
-@
68
+We will also store cached computations in the directory \verb+outdir+.
110 69
 
111 70
 <<cacheSweave, echo=FALSE, results=hide>>=
112 71
 library(cacheSweave)
113 72
 setCacheDir(outdir)
114 73
 @
115 74
 
75
+We declare that \crlmm{} should process 150,000 markers at a time
76
+(when possible) and/or 500 samples at a time.  As our example dataset
77
+in this vignette contains fewer than 500 samples, all samples will be
78
+processed simultaneously.
79
+
80
+<<ram>>=
81
+ocProbesets(150e3)
82
+ocSamples(500)
83
+@
84
+
116 85
 \textbf{Limitations:} There is no minimum number of samples required
117 86
 for preprocessing and genotyping.  However, for copy number estimation
118 87
 the \crlmm{} package currently requires at least 10 samples per batch.
... ...
@@ -134,7 +103,7 @@ platforms, several steps are required.
134 103
 We begin by specifying the path containing the raw IDAT files for a
135 104
 set of samples from the Infinium 370k platform.
136 105
 
137
-<libraries>>=
106
+<<datadir>>=
138 107
 datadir <- "/thumper/ctsa/snpmicroarray/illumina/IDATS/370k"
139 108
 @
140 109
 
Browse code

Rewrote illumina_copynumber vignette. Add functions and docmentation for constructInf, preprocessInf, and genotypeInf.

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54164 bc3139a8-67e5-0310-9ffc-ced21a209358

Rob Scharp authored on 30/03/2011 02:40:07
Showing 1 changed files
... ...
@@ -15,6 +15,7 @@
15 15
 \newcommand{\Rclass}[1]{{\textit{#1}}}
16 16
 \newcommand{\oligo}{\Rpackage{oligo }}
17 17
 \newcommand{\R}{\textsf{R}}
18
+\newcommand{\ff}{\Rpackage{ff}}
18 19
 
19 20
 \begin{document}
20 21
 \title{Using \crlmm{} for copy number estimation and genotype calling
... ...
@@ -27,619 +28,291 @@
27 28
 
28 29
 
29 30
 \begin{abstract}
30
-  This vignette illustrates the steps necessary for obtaining
31
-  marker-level estimates of allele-specific copy number from the raw
32
-  Illumina IDAT files.  Hidden markov models or segmentation
33
-  algorithms can be applied to the marker-level estimates.  Examples
34
-  of both are illustrated.
31
+  This vignette illustrates the steps required prior to copy number
32
+  analysis for Infinium platforms.  Specifically, we require
33
+  construction of a container to store processed forms of the raw
34
+  data, preprocessing to normalize the arrays, and genotyping using
35
+  the CRLMM algorithm.  After completing these steps, users can refer
36
+  to the copy number vignette.
35 37
 \end{abstract}
36 38
 
37
-\section{About this vignette}
39
+\section{Infrastructure}
40
+
38 41
 <<crlmm, results=hide, echo=FALSE>>=
39 42
 library(crlmm)
43
+options(width=70)
44
+options(continue=" ")
40 45
 @
41 46
 
42
-%The vignette for copy number estimation for illumina platforms is
43
-%under development.  Please use the latest stable release of crlmm.
44
-
45
-%\end{document}
47
+\textbf{Supported platforms:} The supported Infinium platforms are
48
+those for which a corresponding annotation package is available.  The
49
+annotation packages contain information on the markers, such as
50
+physical position and chromosome, as well as pre-computed parameters
51
+estimated from HapMap used during the preprocessing and genotyping
52
+steps. Currently supported Infinium platforms are listed in the
53
+following code chunk.
46 54
 
47
-Allele-specific copy number estimation in the crlmm package is
48
-available for several Illumina platforms.  As described in the
49
-\texttt{copynumber} vignette, copy number estimation in \crlmm{} works
50
-best when there are a sufficient number of samples such that AA, AB,
51
-and BB genotypes are observed at most loci. For small studies (e.g.,
52
-fewer than 50 samples), there will be a large number of SNPs that are
53
-monomorphic. For monomorphic SNPs, the estimation problem becomes more
54
-difficult and alternative strategies that estimate the relative total
55
-copy number may be preferable.  In addition to installing \crlmm{},
56
-one must also install the appropriate annotation package for the
57
-Illumina platform.  In the following code, we list the platforms for
58
-which annotation packages are currently available. Next we create a
59
-directory where output files will be stored and indicate the directory
60
-that contains the IDAT files that will be used in our analysis.
61
-
62
-
63
-<<setup, echo=FALSE, results=hide>>=
64
-options(width=70)
65
-options(continue=" ")
66
-library(cacheSweave)
55
+<<supportedPlatforms>>=
56
+pkgs <- annotationPackages()
57
+crlmm.pkgs <- pkgs[grep("Crlmm", pkgs)]
58
+crlmm.pkgs[grep("human", crlmm.pkgs)]
67 59
 @
68 60
 
69
-<<libraries>>=
61
+\textbf{Large data:} In order to reduce \crlmm{}'s memory footprint,
62
+we require the \Rpackage{ff} for copy number estimation.  The
63
+\Rpackage{ff} package provides infrastructure for accessing and
64
+writing data to disk instead of keeping data in memory.  Each element
65
+of the \verb+assayData+ and \verb+batchStatistics+ slot of the
66
+\Rclass{CNSet} class are \Robject{ff} objects.  \Robject{ff} objects
67
+in the \R{} workspace contain pointers to several files with the
68
+\verb+.ff+ extension.  The location of where the data is stored on
69
+disk is specified by use of the \Rfunction{ldPath} function.
70
+
71
+<<ldpath,results=hide>>=
70 72
 library(ff)
71
-pkgs <- annotationPackages()
72
-pkgs[grep("Crlmm", pkgs)]
73 73
 if(getRversion() < "2.13.0"){
74 74
 	rpath <- getRversion()
75 75
 } else rpath <- "trunk"
76 76
 outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/illumina_vignette", sep="")
77
+ldPath(outdir)
77 78
 dir.create(outdir, recursive=TRUE, showWarnings=FALSE)
78
-datadir <- "/thumper/ctsa/snpmicroarray/illumina/IDATS/370k"
79
-setCacheDir(outdir)
80 79
 @
81 80
 
82
-%Options for controlling RAM with the \Rpackage{ff} package.
83
-
84
-<<ldOptions>>=
85
-ldPath(outdir)
81
+Users should not move or rename this directory.  If only output files
82
+are stored in \verb+outdir+, one can either remove the entire
83
+directory prior to rerunning the analysis or all of the '.ff' files.
84
+Otherwise, one would accumulate a large number of '.ff' files on disk
85
+that are no longer in use.
86
+
87
+As none of the functions for preprocessing, genotyping, or copy number
88
+estimation simultaneously require all samples and all probes in
89
+memory, memory-usage by \crlmm{} can be fine tuned by reading in and
90
+processing subsets of the markers and/or samples. The functions
91
+\Rfunction{ocSamples} and \Rfunction{ocProbesets} in the
92
+\Rpackage{oligoClasses} package can be used to declare how many
93
+markers and samples to read at once.  In general, specifying smaller
94
+values should reduce the RAM required for a particular job, but would
95
+be expected to have an increased run-time. In the following
96
+code-chunk, we declare that \crlmm{} should process 150,000 markers at
97
+a time (when possible) and 500 samples at a time.  (As our dataset in
98
+this vignette only contains 43 samples, the \Rfunction{ocSamples}
99
+option would not have any effect.)  One can view the current settings
100
+for these commands, by typing the functions without an argument.
101
+
102
+<<ram>>=
103
+ocSamples()
104
+ocProbesets()
86 105
 ocProbesets(150e3)
87
-ocSamples(200)
106
+ocSamples(500)
107
+ocSamples()
108
+ocProbesets()
88 109
 @
89 110
 
90
-This vignette was created using Illumina IDAT files that are located
91
-in a specific directory on my computer (\Robject{pathToCels}).  Long
92
-computations are saved in the output directory \Robject{outdir}.
93
-Towards this end, we make repeated use of the \Rfunction{checkExists}
94
-simply checks that a variable is in the workspace.  If not, the
95
-variable is loaded from the indicated path.  If the file (with '.rda'
96
-extension) does not exist in this path, the function indicated by the
97
-.FUN argument is executed.  Alternatively, if \Robject{.load.it} is
98
-\Robject{FALSE} the function will be executed regardless of whether
99
-the file exists.  Users should see the \Rpackage{weaver} package for a
100
-more 'correct' approach for cacheing long computations. The following
101
-code chunk checks that that the variables specific to the directory
102
-structure on my computer exist.  Users should modify the
103
-\Robject{outdir} and \Robject{datadir} variables as appropriate.
104
-
105
-<<checkSetup>>=
106
-if(!file.exists(outdir)) stop("Please specify valid directory for storing output")
107
-if(!file.exists(datadir)) stop("Please specify the correct path to the CEL files")
111
+<<cacheSweave, echo=FALSE, results=hide>>=
112
+library(cacheSweave)
113
+setCacheDir(outdir)
108 114
 @
109 115
 
110
-\section{Preprocessing Illumina IDAT files, genotyping, and copy number
111
-  estimation}
112
-To perform copy number analysis on the Illumina platform, several
113
-steps are required.  The first step is to read in the IDAT files and
114
-create a container for storing the red and green intensities. These
115
-intensities are quantile normalized in the function
116
-\Rfunction{crlmmIllumina}, and then genotyped using the crlmm
117
-algorithm.  Details on the crlmm genotyping algorithm are described
118
-elsewhere.  We will make use of the normalized intensities when we
119
-estimate copy number.  The object returned by the
120
-\Rfunction{genotype.Illumina} function is an instance of the
121
-\Robject{CNSet} class, a container for storing the genotype calls,
122
-genotype confidence scores, the normalized intensities for the A and B
123
-channels, and summary statistics on the batches.  The batch variable
124
-can be the date on which the array was scanned or the 96 well
125
-microarray chemistry plate on which the samples were processed. Both
126
-date and chemistry plate are useful surrogates for experimental
127
-factors that effect subgroups of the probe intensities.  (Quantile
128
-normalization does not remove batch effects.)  The genotype confidence
129
-scores are saved as an integer, and can be converted back to a $[0,
130
-1]$ probability scale by the transformation $round(-1000*log2(1-p))$.
116
+\textbf{Limitations:} There is no minimum number of samples required
117
+for preprocessing and genotyping.  However, for copy number estimation
118
+the \crlmm{} package currently requires at least 10 samples per batch.
119
+The parameter estimates for copy number and the corresponding
120
+estimates of raw copy number will tend to be more noisy for batches
121
+with small sample sizes (e.g., $<$ 50).  Chemistry plate or scan date
122
+are often useful surrogates for batch.
123
+
124
+\section{Initializing a container for storing processed data}
125
+
126
+This section will initialize a container for storing processed forms
127
+of the data, including the normalized intensities for the A and B
128
+alleles and the CRLMM genotype calls and confidence scores.  In
129
+addition, the container will store information on the markers
130
+(physical position, chromosome, and a SNP indicator), the batch, and
131
+the samples (e.g., gender).  To construct this container for Infinium
132
+platforms, several steps are required.
133
+
134
+We begin by specifying the path containing the raw IDAT files for a
135
+set of samples from the Infinium 370k platform.
136
+
137
+<libraries>>=
138
+datadir <- "/thumper/ctsa/snpmicroarray/illumina/IDATS/370k"
139
+@
140
+
141
+For Infinium platforms, an Illumina sample sheet containing
142
+information for reading the raw IDAT files is required. Please refer
143
+to the BeadStudio Genotyping guide, Appendix A, for additional
144
+information.  The following code reads in the samplesheet for the IDAT
145
+files on our local server. For our dataset, the file extensions are
146
+`Grn.dat' and `Red.idat'.  We store the complete path to the filename
147
+without the file extension in the \Robject{arrayNames} and check that
148
+all of the green and red IDAT files exists.
131 149
 
132 150
 <<samplesheet>>=
133 151
 samplesheet = read.csv(file.path(datadir, "HumanHap370Duo_Sample_Map.csv"), header=TRUE, as.is=TRUE)
134 152
 samplesheet <- samplesheet[-c(28:46,61:75,78:79), ]
135 153
 arrayNames <- file.path(datadir, unique(samplesheet[, "SentrixPosition"]))
136
-grnfiles = all(file.exists(paste(arrayNames, "_Grn.idat", sep="")))
137
-redfiles = all(file.exists(paste(arrayNames, "_Red.idat", sep="")))
138
-load.it <- TRUE
139
-if(!load.it){
140
-	rmFiles <- list.files(outdir, pattern=".ff", full.names=TRUE)
141
-	unlink(rmFiles)
142
-}
143
-plate <- samplesheet$Plate
144
-table(plate)
145
-## too few samples to run the plates as separate batches
146
-batch <- rep("1", nrow(samplesheet))
154
+all(file.exists(paste(arrayNames, "_Grn.idat", sep="")))
155
+all(file.exists(paste(arrayNames, "_Red.idat", sep="")))
156
+arrayInfo <- list(barcode=NULL, position="SentrixPosition")
147 157
 @
148 158
 
149
-<<genotype,cache=TRUE>>=
150
-container <- genotype.Illumina(sampleSheet=samplesheet,
151
-			       path=dirname(arrayNames[1]),
152
-			       arrayInfoColNames=list(barcode=NULL, position="SentrixPosition"),
153
-			       cdfName="human370v1c",
154
-			       batch=batch)
155
-@
159
+As discussed previously, all supported platforms have a corresponding
160
+annotation package.  The appropriate annotation package is specified
161
+by the platform identifier without the \verb+Crlmm+ postfix.
156 162
 
157
-<<copynumber,cache=TRUE>>=
158
-GT.CONF.THR <- 0.8
159
-cnSet <- crlmmCopynumber(container, GT.CONF.THR=GT.CONF.THR)
163
+<<cdfname>>=
164
+cdfName <- "human370v1c"
160 165
 @
161 166
 
162
-The \Robject{cnSet} returned by the \Rfunction{crlmmCopynumber}
163
-function contains all of the parameters used to compute
164
-allele-specific copy number.  In order to reduce I/O and speed
165
-computation, the allele-specific copy number estimates are not written
166
-to file nor saved in the \Robject{CNSet} object.  Rather,
167
-allele-specific estimates are computed on the fly by the functions
168
-\Rfunction{CA}, \Rfunction{CB}, or \Rfunction{totalCopynumber}.  The
169
-following codechunks provide a few examples.
170
-
171
-Copy number for polymorphic markers on chromosomes 1 - 22.
172
-
173
-<<ca>>=
174
-snp.index <- which(isSnp(cnSet) & !is.na(chromosome(cnSet)) & chromosome(cnSet) < 22)
175
-ca <- CA(cnSet, i=snp.index, j=1:5)
176
-cb <- CB(cnSet, i=snp.index, j=1:5)
177
-ct <- ca+cb
178
-@
167
+Next, we construct a character vector that specifies the batch for
168
+each of the \Sexpr{length(arrayNames)} arrays.  Here, we have a small
169
+dataset and process the samples in a single batch. Processing the
170
+samples as a single batch is generally reasonable if the samples were
171
+processed at similar times (e.g., within a few weeks).
179 172
 
180
-Alternatively, total copy number can be obtained by
181
-<<totalCopynumber.snps>>=
182
-ct2 <- totalCopynumber(cnSet, i=snp.index, j=1:5)
183
-stopifnot(all.equal(ct, ct2))
173
+<<batch>>=
174
+batch <- rep("1", nrow(samplesheet))
184 175
 @
185 176
 
186
-Missing values can arise at polymorphic when the confidence score of
187
-the genotype calls are below the threshold indicated by the threshold
188
-\texttt{GT.CONF.THR} in \Robject{crlmmCopynumber}. See
189
-\Robject{?crlmmCopynumber} for additional details.  In the following
190
-codechunk, we compute the number of samples that had confidence scores
191
-below \Sexpr{GT.CONF.THR} at loci for which ${\hat CA}$ and ${\hat CB}$
192
-are missing.
193
-
194
-<<NAs.snps>>=
195
-missing.copynumber <- which(rowSums(is.na(ct)) > 0)
196
-invisible(open(snpCallProbability(cnSet)))
197
-gt.confidence <- i2p(snpCallProbability(cnSet)[snp.index, ])
198
-n.below.threshold <- rowSums(gt.confidence < 0.95)
199
-unique(n.below.threshold[missing.copynumber])
200
-@
201
-\noindent For loci with missing copy number, the confidence scores
202
-were all below the threshold.
203
-
204
-At nonpolymorphic loci, either the \Rfunction{CA} or
205
-\Rfunction{totalCopynumber} functions can be used to obtain estimates
206
-of total copy number.
207
-
208
-<<nonpolymorphicAutosomes>>=
209
-marker.index <- which(!isSnp(cnSet) & chromosome(cnSet) < 23)
210
-ct <- CA(cnSet, i=marker.index, j=1:5)
211
-## all zeros
212
-stopifnot(all(CB(cnSet, i=marker.index, j=1:5) == 0))
213
-ct2 <- totalCopynumber(cnSet, i=marker.index, j=1:5)
214
-stopifnot(all.equal(ct, ct2))
177
+Finally, we initialize an object of class \Robject{CNSet} using the
178
+function \Rfunction{constructInf}.
179
+
180
+<<container,cache=TRUE>>=
181
+cnSet <- constructInf(sampleSheet=samplesheet,
182
+		      arrayNames=arrayNames,
183
+		      batch=batch,
184
+		      arrayInfoColNames=arrayInfo,
185
+		      cdfName=cdfName,
186
+		      verbose=TRUE,
187
+		      saveDate=TRUE)
215 188
 @
216 189
 
190
+A concise summary of the object's contents can be viewed with the
191
+\Rfunction{print} function.
217 192
 
218
-\begin{figure}[t!]
219
-  Nonpolymorphic markers on chromosome X:
220
-  \centering
221
-<<nonpolymorphicX, fig=TRUE, width=8, height=4>>=
222
-npx.index <- which(chromosome(cnSet)==23 & !isSnp(cnSet))
223
-M <- sample(which(cnSet$gender==1), 5)
224
-F <- sample(which(cnSet$gender==2), 5)
225
-cn.M <- CA(cnSet, i=npx.index, j=M)
226
-cn.F <- CA(cnSet, i=npx.index, j=F)
227
-boxplot(data.frame(cbind(cn.M, cn.F)), pch=".", col="grey60", outline=FALSE)
228
-@
229
-\caption{Copy number estimates for nonpolymorphic loci on chromosome
230
-  X (5 men, 5 women).  crlmm assumes that the median copy number
231
-  across samples at a given marker on X is 1 for men and 2 for women.}
232
-\end{figure}
233
-
234
-
235
-
236
-
237
-\begin{figure}[t!]
238
- \centering
239
-<<polymorphicX, fig=TRUE, width=8, height=4>>=
240
-## copy number estimates on X for SNPs are biased towards small values.
241
-X.markers <- which(isSnp(cnSet) & chromosome(cnSet) == 23)
242
-ca.M <- CA(cnSet, i=X.markers, j=M)
243
-cb.M <- CB(cnSet, i=X.markers, j=M)
244
-ca.F <- CA(cnSet, i=X.markers, j=F)
245
-cb.F <- CB(cnSet, i=X.markers, j=F)
246
-cn.M <- ca.M+cb.M
247
-cn.F <- ca.F+cb.F
248
-boxplot(data.frame(cbind(cn.M, cn.F)), pch=".", outline=FALSE, col="grey60")
249
-cn2 <- totalCopynumber(cnSet, i=X.markers, j=c(M, F))
250
-stopifnot(all.equal(cbind(cn.M, cn.F), cn2))
251
-@
252
-\caption{Copy number estimates for polymorphic markers on chromosome
253
-  X. }
254
-\end{figure}
255
-
256
-Often it is of interest to smooth the total copy number estimates (the
257
-sum of the allele-specific copy number) as a function of the physical
258
-position.  The following code chunk can be used to create an instance
259
-of \Robject{CopyNumberSet} that contains the CA + CB at polymorphic
260
-markers and 'CA' at nonpolymorphic markers.
261
-
262
-The \Robject{cnSet} returned by the \Rfunction{crlmmCopynumber}
263
-function contains all of the parameters used to compute
264
-allele-specific copy number.  In order to reduce I/O and speed
265
-computation, the allele-specific copy number estimates are not written
266
-to file nor saved in the \Robject{CNSet} object.  Rather, the
267
-estimates are computed on the fly when the user calls one of the
268
-helper functions that uses this information. For instance, often it is
269
-of interest to smooth the total copy number estimates (the sum of the
270
-allele-specific copy number) as a function of the physical position.
271
-The following code chunk can be used to create an instance of
272
-\Robject{CopyNumberSet} that contains the CA + CB at polymorphic
273
-markers and 'CA' at nonpolymorphic markers.
274
-
275
-Alternatively, total copy number can be obtained by
276
-<<totalCopynumber.snps>>=
277
-ct2 <- totalCopynumber(cnSet, i=snp.index, j=1:5)
278
-stopifnot(all.equal(ct, ct2))
193
+<<cnset>>=
194
+print(cnSet)
279 195
 @
280 196
 
281
-Missing values can arise at polymorphic when the confidence score of
282
-the genotype calls are below the threshold indicated by the threshold
283
-\texttt{GT.CONF.THR} in \Robject{crlmmCopynumber}. See
284
-\Robject{?crlmmCopynumber} for additional details.  In the following
285
-codechunk, we compute the number of samples that had confidence scores
286
-below \Sexpr{GT.CONF.THR} at loci for which ${\hat CA}$ and ${\hat CB}$
287
-are missing.
197
+Note that the above object does not yet contain any processed data
198
+(only \verb+NA+'s) and several \verb+.ff+ files now appear in the
199
+\verb+outdir+.  Again, these files should not be removed.
288 200
 
289
-<<NAs.snps>>=
290
-missing.copynumber <- which(rowSums(is.na(ct)) > 0)
291
-invisible(open(snpCallProbability(cnSet)))
292
-gt.confidence <- i2p(snpCallProbability(cnSet)[snp.index, ])
293
-n.below.threshold <- rowSums(gt.confidence < 0.95)
294
-unique(n.below.threshold[missing.copynumber])
201
+<<listff>>=
202
+list.files(outdir, pattern=".ff")[1:5]
295 203
 @
296
-\noindent For loci with missing copy number, the confidence scores
297
-were all below the threshold.
298
-
299
-At nonpolymorphic loci, either the \Rfunction{CA} or
300
-\Rfunction{totalCopynumber} functions can be used to obtain estimates
301
-of total copy number.
302
-
303
-<<nonpolymorphicAutosomes>>=
304
-marker.index <- which(!isSnp(cnSet) & chromosome(cnSet) < 23)
305
-ct <- CA(cnSet, i=marker.index, j=1:5)
306
-## all zeros
307
-stopifnot(all(CB(cnSet, i=marker.index, j=1:5) == 0))
308
-ct2 <- totalCopynumber(cnSet, i=marker.index, j=1:5)
309
-stopifnot(all.equal(ct, ct2))
310
-@
311
-
312 204
 
313
-\begin{figure}[t!]
314
-  Nonpolymorphic markers on chromosome X:
315
-  \centering
316
-<<nonpolymorphicX, fig=TRUE, width=8, height=4>>=
317
-npx.index <- which(chromosome(cnSet)==23 & !isSnp(cnSet))
318
-M <- sample(which(cnSet$gender==1), 5)
319
-F <- sample(which(cnSet$gender==2), 5)
320
-cn.M <- CA(cnSet, i=npx.index, j=M)
321
-cn.F <- CA(cnSet, i=npx.index, j=F)
322
-boxplot(data.frame(cbind(cn.M, cn.F)), pch=".", col="grey60", outline=FALSE)
323
-@
324
-\caption{Copy number estimates for nonpolymorphic loci on chromosome
325
-  X (5 men, 5 women).  crlmm assumes that the median copy number
326
-  across samples at a given marker on X is 1 for men and 2 for women.}
327
-\end{figure}
328
-
329
-
330
-
331
-
332
-\begin{figure}[t!]
333
- \centering
334
-<<polymorphicX, fig=TRUE, width=8, height=4>>=
335
-## copy number estimates on X for SNPs are biased towards small values.
336
-X.markers <- which(isSnp(cnSet) & chromosome(cnSet) == 23)
337
-ca.M <- CA(cnSet, i=X.markers, j=M)
338
-cb.M <- CB(cnSet, i=X.markers, j=M)
339
-ca.F <- CA(cnSet, i=X.markers, j=F)
340
-cb.F <- CB(cnSet, i=X.markers, j=F)
341
-cn.M <- ca.M+cb.M
342
-cn.F <- ca.F+cb.F
343
-boxplot(data.frame(cbind(cn.M, cn.F)), pch=".", outline=FALSE, col="grey60")
344
-cn2 <- totalCopynumber(cnSet, i=X.markers, j=c(M, F))
345
-stopifnot(all.equal(cbind(cn.M, cn.F), cn2))
346
-@
347
-\caption{Copy number estimates for polymorphic markers on chromosome
348
-  X. }
349
-\end{figure}
350
-
351
-Often it is of interest to smooth the total copy number estimates (the
352
-sum of the allele-specific copy number) as a function of the physical
353
-position.  The following code chunk can be used to create an instance
354
-of \Robject{CopyNumberSet} that contains the CA + CB at polymorphic
355
-markers and 'CA' at nonpolymorphic markers.
356
-
357
-<<copynumberObject>>=
358
-marker.index <- which(chromosome(cnSet) <= 22)## & isSnp(cnSet))
359
-sample.index <- 1:5 ## first five samples
360
-invisible(open(cnSet))
361
-copynumberSet <- as(cnSet[marker.index, 1:5], "CopyNumberSet")
362
-invisible(close(cnSet))
363
-copynumberSet <- copynumberSet[order(chromosome(copynumberSet), position(copynumberSet)), ]
364
-indices <- split(1:nrow(copynumberSet), chromosome(copynumberSet))
365
-dup.index <- unlist(sapply(indices, function(i, position)  i[duplicated(position[i])], position=position(copynumberSet)))
366
-if(length(dup.index) > 0) copynumberSet <- copynumberSet[-dup.index, ]
367
-## exclude any rows that are all missing
368
-missing.index <- which(rowSums(is.na(copyNumber(copynumberSet))) == ncol(copynumberSet))
369
-table(chromosome(copynumberSet)[missing.index])
370
-copynumberSet <- copynumberSet[-missing.index, ]
371
-@
205
+Finally, note that the elements of the \verb+assayData+ slot are
206
+\Robject{ff} objects and not matrices. For the most part, the
207
+\emph{appearance} that the data is stored in memory is preserved.
372 208
 
373 209
 
374
-We suggest plotting the total copy number as a function of the
375
-physical position to evaluate whether a wave-correction is needed.
376
-
377
-
378