Commit information:
Commit id: 736588e3d47fd58358f61ce2241bb2c85b4d9a0c
update to vignette to process previously unevaluated codechunks
Committed by: Rob Scharpf
Author Name: Rob Scharpf
Commit date: 2014-09-19 10:14:04 -0400
Author date: 2014-09-19 10:14:04 -0400
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@94285 bc3139a8-67e5-0310-9ffc-ced21a209358
... | ... |
@@ -200,8 +200,7 @@ Estimation of log R ratios and B allele frequencies from an object of class \Rcl |
200 | 200 |
crlmmCopynumber(cnSet, fit.linearModel=FALSE) |
201 | 201 |
@ |
202 | 202 |
|
203 |
-<<oligoSnpSet,eval=FALSE>>= |
|
204 |
-library(VanillaICE) |
|
203 |
+<<oligoSnpSet>>= |
|
205 | 204 |
open(cnSet) |
206 | 205 |
oligoSetList <- BafLrrSetList(cnSet) |
207 | 206 |
close(cnSet) |
... | ... |
@@ -214,7 +213,7 @@ oligoSetList[[1]] |
214 | 213 |
\noindent Log R ratios and B allele frequences can be retrieved by the |
215 | 214 |
accessors \Rfunction{lrr} and \Rfunction{baf}, respectively. |
216 | 215 |
|
217 |
-<<testEqual,eval=FALSE>>= |
|
216 |
+<<testEqual>>= |
|
218 | 217 |
lrrList <- lrr(oligoSetList) |
219 | 218 |
class(lrrList) |
220 | 219 |
dim(lrrList[[1]]) ## log R ratios for chromosome 1. |
Commit information:
Commit id: 7a7b52509b1e09bf8d3f517244f90f9045d102ca
Commit message: Set enough.men and enough.women to FALSE by default in fit.lm3 function.
Committed by: Rob Scharpf
Author Name: Rob Scharpf
Commit date: 2014-08-25 21:44:19 -0400
Author date: 2014-08-25 21:44:19 -0400
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@93599 bc3139a8-67e5-0310-9ffc-ced21a209358
* collab:
add warning in vignette about NAs with BafLrrSetList function
Added Human Omni Express Exome 8 v1.1b as a supported chip
updated version number of pacakge and man pages to reflect these changes
skeleton for krlmm capability added. genotype.Illumina() can now take and XY object as input
update copynumber.Rnw to use BafLrrSetList
updates to vignettes
update namespace
# Please enter a commit message to explain why this merge is necessary,
# especially if it merges an updated upstream into a topic branch.
#
# Lines starting with '#' will be ignored, and an empty message aborts
# the commit.
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@79138 bc3139a8-67e5-0310-9ffc-ced21a209358
... | ... |
@@ -53,15 +53,24 @@ print(histogram(~snr, |
53 | 53 |
|
54 | 54 |
\section{Copy number estimation} |
55 | 55 |
|
56 |
-As described in \cite{Scharpf2011}, the CRLMM-CopyNumber algorithm |
|
57 |
-fits a linear model to the normalized intensities stratified by the |
|
58 |
-diallic genotype call. The intercept and slope from the linear model |
|
59 |
-are both SNP- and batch-specific. The implementation in the \crlmm{} |
|
60 |
-package is encapsulated by the function \Rfunction{crlmmCopynumber} |
|
61 |
-that, using the default settings, can be called by passing a single |
|
62 |
-object of class \Rclass{CNSet}. See the appropriate |
|
63 |
-preprocessing/genotyping vignette for the construction of an object of |
|
64 |
-class \Rclass{CNSet}. |
|
56 |
+There are two ways to obtain marker-level estimates of copy number |
|
57 |
+that are supported by \Rpackage{crlmm}. One approach is to fit a |
|
58 |
+linear model to the normalized intensities stratified by the diallelic |
|
59 |
+genotype call at each SNP, as described in \cite{Scharpf2011}. |
|
60 |
+Another alternative is to compute log R ratio and B allele |
|
61 |
+frequencies. The latter is often better supported by downstream hidden |
|
62 |
+Markov models such as those in the \Rpackage{VanillaICE} package. We |
|
63 |
+describe each approach in the following two sections. |
|
64 |
+ |
|
65 |
+\subsection{Linear model for normalized intensities} |
|
66 |
+ |
|
67 |
+In this section, we fit a linear model to the normalized intensities |
|
68 |
+stratified by the diallic genotype call. The intercept and slope from |
|
69 |
+the linear model are both SNP- and batch-specific. The implementation |
|
70 |
+in the \crlmm{} package is encapsulated by the function |
|
71 |
+\Rfunction{crlmmCopynumber} that, using the default settings, can be |
|
72 |
+called by passing a single object of class \Rclass{CNSet}. |
|
73 |
+ |
|
65 | 74 |
<<LDS_copynumber,cache=TRUE>>= |
66 | 75 |
crlmmCopynumber(cnSet) |
67 | 76 |
@ |
... | ... |
@@ -105,12 +114,7 @@ described in the \R{} package \Rpackage{ff}. The value returned by |
105 | 114 |
\Rfunction{crlmmCopynumber} is \verb+TRUE+, indicating that the files |
106 | 115 |
on disk have been successfully updated. Note that while the |
107 | 116 |
\Robject{cnSet} object is unchanged, the values on disk are different. |
108 |
-On the other hand, subsetting the \Robject{cnSet} with the `[' method |
|
109 |
-coerces all of the elements to class \Rclass{matrix}. The |
|
110 |
-batch-specific summaries are now ordinary matrices stored in RAM. The |
|
111 |
-object returned by \Robject{crlmmCopynumber} is an object of class |
|
112 |
-\Rclass{CNSet} with the matrices in the \verb+batchStatistics+ slot |
|
113 |
-updated. |
|
117 |
+ |
|
114 | 118 |
|
115 | 119 |
<<chr1index>>= |
116 | 120 |
chr1.index <- which(chromosome(cnSet) == 1) |
... | ... |
@@ -127,20 +131,10 @@ for(i in seq_along(nms)) cls[i] <- class(batchStatistics(cnSet2)[[nms[i]]])[1] |
127 | 131 |
all(cls == "matrix") |
128 | 132 |
@ |
129 | 133 |
|
130 |
-<<matrix_copynumber,cache=TRUE>>= |
|
131 |
-cnSet3 <- crlmmCopynumber(cnSet2) |
|
132 |
-class(cnSet3) |
|
133 |
-@ |
|
134 |
- |
|
135 | 134 |
<<clean, echo=FALSE, results=hide>>= |
136 | 135 |
rm(cnSet2); gc() |
137 | 136 |
@ |
138 | 137 |
|
139 |
-\subsection{Marker-specific estimates} |
|
140 |
-%\subsection{Raw copy number} |
|
141 |
- |
|
142 |
-%\paragraph{Log R ratios and B allele frequencies.} |
|
143 |
- |
|
144 | 138 |
\paragraph{Raw total copy number.} |
145 | 139 |
Several functions are available that will compute relatively quickly |
146 | 140 |
the allele-specific, \emph{raw} copy number estimates. At allele $k$, |
... | ... |
@@ -186,38 +180,49 @@ cb <- CB(cnSet, i=snp.index, j=1:2) |
186 | 180 |
|
187 | 181 |
|
188 | 182 |
\subsection{Container for log R ratios and B allele frequencies} |
183 |
+\label{sec:lrrBaf} |
|
189 | 184 |
|
190 | 185 |
A useful container for storing the \crlmm{} genotypes, genotype |
191 | 186 |
confidence scores, and the total or relative copy number at each |
192 |
-marker is the \Rclass{oligoSetList} class. Coercion of a |
|
193 |
-\Rclass{CNSet} object to a \Rfunction{oligoSnpSet} object can be |
|
194 |
-acheived by using the function \Rfunction{constructOligoSetFrom} as |
|
187 |
+marker is the \Rclass{BafLrrSetList} class. Coercion of a |
|
188 |
+\Rclass{CNSet} object to a \Rfunction{BafLrrSetList} object can be |
|
189 |
+acheived by the function \Rfunction{BafLrrSetList} as |
|
195 | 190 |
illustrated below. Users should note that if the \verb+assayData+ |
196 | 191 |
elements in the \Rclass{CNSet} instance are \Rpackage{ff} objects, the |
197 |
-\verb+assayData+ elements of each element in the \Rclass{oligoSetList} |
|
192 |
+\verb+assayData+ elements of each element in the \Rclass{BafLrrSetList} |
|
198 | 193 |
object will be \Rpackage{ff}-dervied objects (a new |
199 | 194 |
\verb+total_cn*.ff+ file will be created in the \verb+ldPath()+ |
200 |
-directory). |
|
195 |
+directory). The following code-chunk is not evalutated. |
|
196 |
+ |
|
197 |
+Estimation of log R ratios and B allele frequencies from an object of class \Rclass{CNSet} instantiated from genotyping Affymetrix CEL files or Illumina iDat files requires running the \R{} function \Rfunction{crlmmCopynumber} as a preliminary step. Specifying \texttt{fit.linearModel=FALSE} will speed up computation as this function will by default compute summary statistics unnecessary for computing BAFs and log R ratios. |
|
201 | 198 |
|
202 |
-<<oligoSnpSet>>= |
|
199 |
+<<callingCrlmmCopynumber,eval=FALSE>>= |
|
200 |
+crlmmCopynumber(cnSet, fit.linearModel=FALSE) |
|
201 |
+@ |
|
202 |
+ |
|
203 |
+<<oligoSnpSet,eval=FALSE>>= |
|
203 | 204 |
library(VanillaICE) |
204 |
-open(cnSet3) |
|
205 |
-oligoSetList <- constructOligoSetListFrom(cnSet3, batch.name=batch(cnSet3)[1]) |
|
206 |
-close(cnSet3) |
|
205 |
+open(cnSet) |
|
206 |
+oligoSetList <- BafLrrSetList(cnSet) |
|
207 |
+close(cnSet) |
|
207 | 208 |
show(oligoSetList) |
208 | 209 |
class(oligoSetList) |
209 | 210 |
## oligoSnpSet of first chromosome |
210 | 211 |
oligoSetList[[1]] |
211 | 212 |
@ |
212 | 213 |
|
213 |
-\noindent Note that log R ratios stored in the \Robject{oligoSnpSet} |
|
214 |
-object can be retrieved by the \Rfunction{copyNumber} accessor. B |
|
215 |
-allele frequences are retrieved by the \Rfunction{baf} accessor. |
|
214 |
+\noindent Log R ratios and B allele frequences can be retrieved by the |
|
215 |
+accessors \Rfunction{lrr} and \Rfunction{baf}, respectively. |
|
216 | 216 |
|
217 |
-<<testEqual>>= |
|
218 |
-lrrList <- copyNumber(oligoSetList) |
|
217 |
+<<testEqual,eval=FALSE>>= |
|
218 |
+lrrList <- lrr(oligoSetList) |
|
219 | 219 |
class(lrrList) |
220 | 220 |
dim(lrrList[[1]]) ## log R ratios for chromosome 1. |
221 | 221 |
bafList <- baf(oligoSetList) |
222 | 222 |
dim(bafList[[1]]) ## B allele frequencies for chromosome 1 |
223 | 223 |
@ |
224 |
+ |
|
225 |
+If the \Rfunction{crlmmCopynumber} function is not run prior to the |
|
226 |
+\Rfunction{BafLrrSetList} construction, the log R ratios and BAFs will |
|
227 |
+be initialized as NAs. |
|
228 |
+ |
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@71295 bc3139a8-67e5-0310-9ffc-ced21a209358
... | ... |
@@ -2,7 +2,7 @@ |
2 | 2 |
Affymetrix 5.0 and 6.0 platforms, as well as several Illumina |
3 | 3 |
platforms. This vignette assumes that the arrays have already been |
4 | 4 |
successfully preprocessed and genotyped as per the instructions in |
5 |
- the \verb+AffymetrixPreprocessCN+ and \verb+IlluminaPreprocessCN+ |
|
5 |
+ the \verb+AffyGW+ and \verb+IlluminaPreprocessCN+ |
|
6 | 6 |
vignettes for the Affymetrix and Illumina platforms, |
7 | 7 |
respectively. While this vignette uses Affymetrix 6.0 arrays for |
8 | 8 |
illustration, the steps at this point are identical for both |
* collab:
Update AffyGW.pdf and IlluminaPreprocessCN.pdf
crlmmIlluminaV2 calls crlmmGT. comment call to crlmmGT2
Update AffyGW vignette to step through construction of CNSet, normalization, and genotyping
v 1.15.27: Fix bug in crlmmGT2. clean up comments in crlmmIllumina.
update getFeatureData
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@69561 bc3139a8-67e5-0310-9ffc-ced21a209358
... | ... |
@@ -63,7 +63,7 @@ object of class \Rclass{CNSet}. See the appropriate |
63 | 63 |
preprocessing/genotyping vignette for the construction of an object of |
64 | 64 |
class \Rclass{CNSet}. |
65 | 65 |
<<LDS_copynumber,cache=TRUE>>= |
66 |
-(cnSet.updated <- crlmmCopynumber(cnSet)) |
|
66 |
+crlmmCopynumber(cnSet) |
|
67 | 67 |
@ |
68 | 68 |
|
69 | 69 |
The following steps were performed by the \Rfunction{crlmmCopynumber} |
* collab: (34 commits)
revert change to IlluminaPreprocessCN
fix bug in isValidCdfName
print warning when all features in a batch of probes are flagged, but allow processing to continue
add utility cleancdfnames
Add validCdfNames.Rd
export validCdfNames
imputeGender fix when chromosome Y not available
Use splitIndicesByLength(index, ocSamples/getDoParWorkers())
Can not allocate vector of size XG with genotype.Illumina. Use splitIndicesByNode() only if the length of the list is greater than the split from splitIndicesByLength(). Otherwise, split by length using ocSamples()
update .gitignore
Add make.unique for sampleSheet$Sample_ID in readIdatFiles
bug in description
ensure sample ids stored in samplesheet are unique when constructing cnSet object
update oligoClasses dependency
update unit test for genotype.Illumina
revert change in constructInf call from genotype.Illumina
Update genotype.Rd
edit ACN function
1.15.6 use make.unique(basename(arrayNames)) to allow processing of Illumina samples with duplicated barcodes
check that sample identifies are unique in crlmm function
...
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@67435 bc3139a8-67e5-0310-9ffc-ced21a209358
... | ... |
@@ -154,7 +154,7 @@ formally, \newcommand{\A}{A} \newcommand{\B}{B} |
154 | 154 |
\phi}_{k,ip}}\left(I_{k,ijp}-{\hat \nu}_{k,ip}\right), ~0\right\} |
155 | 155 |
\mbox{~for~} k \in \{\A, \B\}. |
156 | 156 |
\end{eqnarray} |
157 |
-\noindent See \cite{Scharpf2010} for details. |
|
157 |
+\noindent See \cite{Scharpf2011} for details. |
|
158 | 158 |
|
159 | 159 |
The function \Rfunction{totalCopynumber} translates the normalized |
160 | 160 |
intensities to an estimate of raw copy number by adding the |
... | ... |
@@ -175,42 +175,49 @@ dim(tmp2) |
175 | 175 |
|
176 | 176 |
Alternatively, the functions \Rfunction{CA} and \Rfunction{CB} compute |
177 | 177 |
the allele-specific copy number. For instance, the following code |
178 |
-chunk computes the allele-specific summaries at all polymorphic loci. |
|
178 |
+chunk computes the allele-specific summaries at all polymorphic loci |
|
179 |
+for the first 2 samples. |
|
179 | 180 |
|
180 | 181 |
<<ca>>= |
181 | 182 |
snp.index <- which(isSnp(cnSet) & !is.na(chromosome(cnSet))) |
182 |
-ca <- CA(cnSet, i=snp.index, j=seq_len(ncol(cnSet))) |
|
183 |
-cb <- CB(cnSet, i=snp.index, j=seq_len(ncol(cnSet))) |
|
183 |
+ca <- CA(cnSet, i=snp.index, j=1:2) |
|
184 |
+cb <- CB(cnSet, i=snp.index, j=1:2) |
|
184 | 185 |
@ |
185 | 186 |
|
186 | 187 |
|
187 |
-\subsection{Container for raw copy number} |
|
188 |
+\subsection{Container for log R ratios and B allele frequencies} |
|
188 | 189 |
|
189 | 190 |
A useful container for storing the \crlmm{} genotypes, genotype |
190 |
-confidence scores, and the total copy number at each marker is the |
|
191 |
-\Rclass{oligoSnpSet} class. Coercion of a \Rclass{CNSet} object to a |
|
192 |
-\Rfunction{oligoSnpSet} object can be acheived by using the method |
|
193 |
-\Rfunction{as} (as illustrated below). Users should note that if the |
|
194 |
-\verb+assayData+ elements in the \Rclass{CNSet} instance are |
|
195 |
-\Rpackage{ff} objects, the \verb+assayData+ elements of the |
|
196 |
-instantiated \Rclass{oligoSnpSet} will also be \Rpackage{ff}-dervied |
|
197 |
-objects (a new \verb+total_cn*.ff+ file will be created in the |
|
198 |
-\verb+ldPath()+ directory). |
|
191 |
+confidence scores, and the total or relative copy number at each |
|
192 |
+marker is the \Rclass{oligoSetList} class. Coercion of a |
|
193 |
+\Rclass{CNSet} object to a \Rfunction{oligoSnpSet} object can be |
|
194 |
+acheived by using the function \Rfunction{constructOligoSetFrom} as |
|
195 |
+illustrated below. Users should note that if the \verb+assayData+ |
|
196 |
+elements in the \Rclass{CNSet} instance are \Rpackage{ff} objects, the |
|
197 |
+\verb+assayData+ elements of each element in the \Rclass{oligoSetList} |
|
198 |
+object will be \Rpackage{ff}-dervied objects (a new |
|
199 |
+\verb+total_cn*.ff+ file will be created in the \verb+ldPath()+ |
|
200 |
+directory). |
|
199 | 201 |
|
200 | 202 |
<<oligoSnpSet>>= |
203 |
+library(VanillaICE) |
|
201 | 204 |
open(cnSet3) |
202 |
-oligoSet <- as(cnSet3, "oligoSnpSet") |
|
205 |
+oligoSetList <- constructOligoSetListFrom(cnSet3, batch.name=batch(cnSet3)[1]) |
|
203 | 206 |
close(cnSet3) |
204 |
-class(copyNumber(oligoSet)) |
|
207 |
+show(oligoSetList) |
|
208 |
+class(oligoSetList) |
|
209 |
+## oligoSnpSet of first chromosome |
|
210 |
+oligoSetList[[1]] |
|
205 | 211 |
@ |
206 | 212 |
|
207 |
-\noindent Note that the raw copy number estimates stored in the |
|
208 |
-\Robject{oligoSnpSet} object can be retrieved by the |
|
209 |
-\Rfunction{copyNumber} accessor and is equivalent to that returned by |
|
210 |
-the \Rfunction{totalCopynumber} function defined over the same row and |
|
211 |
-column indices. |
|
213 |
+\noindent Note that log R ratios stored in the \Robject{oligoSnpSet} |
|
214 |
+object can be retrieved by the \Rfunction{copyNumber} accessor. B |
|
215 |
+allele frequences are retrieved by the \Rfunction{baf} accessor. |
|
212 | 216 |
|
213 | 217 |
<<testEqual>>= |
214 |
-total.cn3 <- totalCopynumber(cnSet3, i=1:nrow(cnSet3), j=seq_len(ncol(cnSet3))) |
|
215 |
-all.equal(copyNumber(oligoSet), total.cn3) |
|
216 |
-@ |
|
217 | 218 |
\ No newline at end of file |
219 |
+lrrList <- copyNumber(oligoSetList) |
|
220 |
+class(lrrList) |
|
221 |
+dim(lrrList[[1]]) ## log R ratios for chromosome 1. |
|
222 |
+bafList <- baf(oligoSetList) |
|
223 |
+dim(bafList[[1]]) ## B allele frequencies for chromosome 1 |
|
224 |
+@ |
* mymac:
add AffyGW.pdf
update vignettes in inst/scripts
Change argument of validCEL to celfiles
Update constructInf to accommodate GenomeDataFrame class for featureData
bump version to 1.13.7
Add doRUnit.R
Add celfile-utils.Rd
Streamlne some of the Rd files
add validCEL function that checks whether all celfiles can be read
getFeatureData returns GenomeAnnotatedDataFrame
Remove imports from methods. Remove pdf of illumina_copynumber.pdf (large file) and copynumber.pdf
getFeatureDAta returns GenomeAnnotatedDataFrame
Remove separate vignette for copy number in inst/scripts. Include copynumber section in both affy and illumina pipelines.
update documentation files for genotype.Illumina, preprocessInf, and genotypeInf (cdfName added as argument. Indicate that 'batch' should be a character string)
pass cdfName to genotypeInf and preprocessInf
add unitTests and cn-functions for 'simple usage'
Combine AffyPreprocess and copynumber. Combine IlluminaPreprocess and copynumber
remove depency on ff to allow installation on my mac
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@62108 bc3139a8-67e5-0310-9ffc-ced21a209358
... | ... |
@@ -1,35 +1,3 @@ |
1 |
-%\VignetteIndexEntry{crlmm copy number Vignette} |
|
2 |
-%\VignetteDepends{crlmm, genomewidesnp6Crlmm} |
|
3 |
-%\VignetteKeywords{crlmm, SNP 6} |
|
4 |
-%\VignettePackage{crlmm} |
|
5 |
-\documentclass{article} |
|
6 |
-\usepackage{graphicx} |
|
7 |
-\usepackage{natbib} |
|
8 |
-\usepackage{url} |
|
9 |
-\usepackage{amsmath,amssymb} |
|
10 |
-\newcommand{\Rfunction}[1]{{\texttt{#1}}} |
|
11 |
-\newcommand{\Rmethod}[1]{{\texttt{#1}}} |
|
12 |
-\newcommand{\Rcode}[1]{{\texttt{#1}}} |
|
13 |
-\newcommand{\Robject}[1]{{\texttt{#1}}} |
|
14 |
-\newcommand{\Rpackage}[1]{{\textsf{#1}}} |
|
15 |
-\newcommand{\Rclass}[1]{{\textit{#1}}} |
|
16 |
-\newcommand{\oligo}{\Rpackage{oligo }} |
|
17 |
-\newcommand{\R}{\textsf{R}} |
|
18 |
-\newcommand{\crlmm}{\Rpackage{crlmm}} |
|
19 |
-\usepackage[margin=1in]{geometry} |
|
20 |
- |
|
21 |
-\begin{document} |
|
22 |
-\title{Copy number estimation with \Rpackage{crlmm}} |
|
23 |
-\date{\today} |
|
24 |
-\author{Rob Scharpf} |
|
25 |
-\maketitle |
|
26 |
- |
|
27 |
-<<setup, echo=FALSE, results=hide>>= |
|
28 |
-options(continue=" ", width=70) |
|
29 |
-@ |
|
30 |
- |
|
31 |
-\begin{abstract} |
|
32 |
- |
|
33 | 1 |
Copy number routines in the \crlmm{} package are available for |
34 | 2 |
Affymetrix 5.0 and 6.0 platforms, as well as several Illumina |
35 | 3 |
platforms. This vignette assumes that the arrays have already been |
... | ... |
@@ -44,35 +12,6 @@ options(continue=" ", width=70) |
44 | 12 |
\crlmm{} package is available from the author's website: |
45 | 13 |
\url{http://www.biostat.jhsph.edu/~rscharpf/crlmmCompendium/index.html}. |
46 | 14 |
|
47 |
-\end{abstract} |
|
48 |
- |
|
49 |
-\section{Set up} |
|
50 |
- |
|
51 |
-<<libraries,results=hide>>= |
|
52 |
-library(ff) |
|
53 |
-library(crlmm) |
|
54 |
-library(lattice) |
|
55 |
-library(cacheSweave) |
|
56 |
-if(getRversion() < "2.13.0"){ |
|
57 |
- rpath <- getRversion() |
|
58 |
-} else rpath <- "trunk" |
|
59 |
-outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="") |
|
60 |
-@ |
|
61 |
- |
|
62 |
-<<ram>>= |
|
63 |
-ldPath(outdir) |
|
64 |
-setCacheDir(outdir) |
|
65 |
-ocProbesets(150e3) |
|
66 |
-ocSamples(200) |
|
67 |
-@ |
|
68 |
- |
|
69 |
-We begin by loading the \Robject{cnSet} object created by the |
|
70 |
-\verb+AffymetrixPreprocessCN+ vignette. |
|
71 |
- |
|
72 |
-<<loadcnset>>= |
|
73 |
-if(!exists("cnSet")) load(file.path(outdir, "cnSet.rda")) |
|
74 |
-@ |
|
75 |
- |
|
76 | 15 |
|
77 | 16 |
\textbf{Limitations:} While a minimum number of samples is not |
78 | 17 |
required for preprocessing and genotyping, copy number estimation in |
... | ... |
@@ -100,24 +39,16 @@ quality. The following code chunk makes a histogram of the SNR values |
100 | 39 |
for the HapMap samples. |
101 | 40 |
|
102 | 41 |
<<snr,fig=TRUE,include=FALSE,width=6, height=4>>= |
42 |
+library(lattice) |
|
103 | 43 |
invisible(open(cnSet$SNR)) |
104 | 44 |
snr <- cnSet$SNR[] |
105 | 45 |
close(cnSet$SNR) |
106 |
-print(histogram(~snr, panel=function(...){ |
|
107 |
- panel.histogram(...) |
|
108 |
- panel.abline(v=5, col="grey",lty=2) |
|
109 |
-}, |
|
110 |
- breaks=25,xlim=c(4.5,10), xlab="SNR")) |
|
46 |
+print(histogram(~snr, |
|
47 |
+ panel=function(...){ |
|
48 |
+ panel.histogram(...)}, |
|
49 |
+ breaks=25, xlim=range(snr), xlab="SNR")) |
|
111 | 50 |
@ |
112 |
-\begin{figure}[t!] |
|
113 |
- \begin{center} |
|
114 |
- \includegraphics[width=0.6\textwidth]{copynumber-snr.pdf} |
|
115 |
- \caption{The signal to noise ratio (SNR) for 180 HapMap samples. For |
|
116 |
- Affymetrix platforms, SNR values below 5 can indicate possible |
|
117 |
- problems with sample quality. In some circumstances, it may be |
|
118 |
- more helpful to exclude samples with poor DNA quality.} |
|
119 |
-\end{center} |
|
120 |
-\end{figure} |
|
51 |
+ |
|
121 | 52 |
|
122 | 53 |
|
123 | 54 |
\section{Copy number estimation} |
... | ... |
@@ -149,29 +80,17 @@ computed for subsets of the markers to reduce the required RAM. Note |
149 | 80 |
that the value returned by the \Rfunction{crlmmCopynumber} function in |
150 | 81 |
the above example is \verb+TRUE+. The reason the function returns |
151 | 82 |
\verb+TRUE+ in the above example is that the elements of the |
152 |
-\verb+batchStatistics+ slot have the class \Rclass{ff\_matrix}. Rather |
|
153 |
-than keep the statistical summaries in memory, the summaries are |
|
154 |
-written to files on disk using protocols described in the |
|
83 |
+\verb+batchStatistics+ slot have the class \Rclass{ff\_matrix}. |
|
84 |
+Rather than keep the statistical summaries in memory, the summaries |
|
85 |
+are written to files on disk using protocols described in the |
|
155 | 86 |
\Rpackage{ff} package. Hence, while the \Robject{cnSet} object itself |
156 | 87 |
is unchanged as a result of the \Rfunction{crlmmCopynumber} function, |
157 | 88 |
the data on disk is updated accordingly. Users that are interested in |
158 | 89 |
accessing these low-level summaries can refer to the |
159 |
-\verb+Infrastructure+ vignette. Computation of the raw copy number |
|
160 |
-estimates for each allele is described in the following section. |
|
161 |
- |
|
162 |
-For users that are interested in the analysis of a specific chromosome |
|
163 |
-(subset of markers) or a s |
|
164 |
- |
|
165 |
-pointers to files on disk, are stored in the \verb+batchStatistics+ |
|
166 |
-slot of the class \Rclass{CNSet}. Using the default settings for the |
|
167 |
-\Rfunction{crlmmCopynumber} function, only an object of class |
|
168 |
-\Rclass{CNSet} is required. %(See \verb+AffyPreprocessCN+ or |
|
169 |
-%\verb+IlluminaPreprocessCN+ vignettes for details.) |
|
170 |
- |
|
171 |
-Note that depends on whether the elements of the \verb+batchStatistics+ |
|
172 |
-slot are \Robject{ff} objects or ordinary matrices. In this example, |
|
173 |
-the elements of \verb+batchStatistics+ have the class |
|
174 |
-\Rclass{ff\_matrix}. |
|
90 |
+\verb+Infrastructure+ vignette. Note that the data structure depends |
|
91 |
+on whether the elements of the \verb+batchStatistics+ slot are |
|
92 |
+\Robject{ff} objects or ordinary matrices. In this example, the |
|
93 |
+elements of \verb+batchStatistics+ have the class \Rclass{ff\_matrix}. |
|
175 | 94 |
|
176 | 95 |
<<classes>>= |
177 | 96 |
nms <- ls(batchStatistics(cnSet)) |
... | ... |
@@ -186,10 +105,8 @@ described in the \R{} package \Rpackage{ff}. The value returned by |
186 | 105 |
\Rfunction{crlmmCopynumber} is \verb+TRUE+, indicating that the files |
187 | 106 |
on disk have been successfully updated. Note that while the |
188 | 107 |
\Robject{cnSet} object is unchanged, the values on disk are different. |
189 |
- |
|
190 |
- |
|
191 |
-\noindent On the other hand, subsetting the \Robject{cnSet} with the |
|
192 |
-`[' method coerces all of the elements to class \Rclass{matrix}. The |
|
108 |
+On the other hand, subsetting the \Robject{cnSet} with the `[' method |
|
109 |
+coerces all of the elements to class \Rclass{matrix}. The |
|
193 | 110 |
batch-specific summaries are now ordinary matrices stored in RAM. The |
194 | 111 |
object returned by \Robject{crlmmCopynumber} is an object of class |
195 | 112 |
\Rclass{CNSet} with the matrices in the \verb+batchStatistics+ slot |
... | ... |
@@ -219,14 +136,18 @@ class(cnSet3) |
219 | 136 |
rm(cnSet2); gc() |
220 | 137 |
@ |
221 | 138 |
|
222 |
-\subsection{Raw copy number} |
|
139 |
+\subsection{Marker-specific estimates} |
|
140 |
+%\subsection{Raw copy number} |
|
141 |
+ |
|
142 |
+%\paragraph{Log R ratios and B allele frequencies.} |
|
223 | 143 |
|
144 |
+\paragraph{Raw total copy number.} |
|
224 | 145 |
Several functions are available that will compute relatively quickly |
225 | 146 |
the allele-specific, \emph{raw} copy number estimates. At allele $k$, |
226 | 147 |
marker $i$, sample $j$, and batch $p$, the estimate of allele-specific |
227 | 148 |
copy number is computed by subtracting the estimated background from |
228 |
-the normalized intensity and scaling by the slope coefficient. More formally, |
|
229 |
-\newcommand{\A}{A} \newcommand{\B}{B} |
|
149 |
+the normalized intensity and scaling by the slope coefficient. More |
|
150 |
+formally, \newcommand{\A}{A} \newcommand{\B}{B} |
|
230 | 151 |
\begin{eqnarray} |
231 | 152 |
\label{eq:cnK} |
232 | 153 |
{\hat c}_{k,ijp} &=& \mbox{max}\left\{\frac{1}{{\hat |
... | ... |
@@ -246,9 +167,9 @@ at all markers for the first 2 samples, and the total copy number for |
246 | 167 |
chromosome 20 for the first 50 samples. |
247 | 168 |
|
248 | 169 |
<<totalCopynumber>>= |
249 |
-tmp <- totalCopynumber(cnSet, i=1:nrow(cnSet), j=1:2) |
|
170 |
+tmp <- totalCopynumber(cnSet, i=seq_len(nrow(cnSet)), j=1:2) |
|
250 | 171 |
dim(tmp) |
251 |
-tmp2 <- totalCopynumber(cnSet, i=which(chromosome(cnSet) == 20), j=1:50) |
|
172 |
+tmp2 <- totalCopynumber(cnSet, i=which(chromosome(cnSet) == 20), j=seq_len(ncol(cnSet))) |
|
252 | 173 |
dim(tmp2) |
253 | 174 |
@ |
254 | 175 |
|
... | ... |
@@ -258,98 +179,12 @@ chunk computes the allele-specific summaries at all polymorphic loci. |
258 | 179 |
|
259 | 180 |
<<ca>>= |
260 | 181 |
snp.index <- which(isSnp(cnSet) & !is.na(chromosome(cnSet))) |
261 |
-ca <- CA(cnSet, i=snp.index, j=1:5) |
|
262 |
-cb <- CB(cnSet, i=snp.index, j=1:5) |
|
263 |
-@ |
|
264 |
- |
|
265 |
-\noindent Note the equivalence of the following calculations. |
|
266 |
- |
|
267 |
-<<totalCopynumber>>= |
|
268 |
-ct <- ca+cb |
|
269 |
-ct2 <- totalCopynumber(cnSet, i=snp.index, j=1:5) |
|
270 |
-stopifnot(all.equal(ct, ct2)) |
|
271 |
-@ |
|
272 |
- |
|
273 |
-At nonpolymorphic loci, \Rfunction{CA} function returns the total copy |
|
274 |
-number and, by construction, the \Rfunction{CB} function returns 0. |
|
275 |
- |
|
276 |
-<<nonpolymorphicAutosomes>>= |
|
277 |
-marker.index <- which(!isSnp(cnSet)) |
|
278 |
-ct <- CA(cnSet, i=marker.index, j=1:5) |
|
279 |
-## all zeros |
|
280 |
-stopifnot(all(CB(cnSet, i=marker.index, j=1:5) == 0)) |
|
281 |
-ct2 <- totalCopynumber(cnSet, i=marker.index, j=1:5) |
|
282 |
-stopifnot(all.equal(ct, ct2)) |
|
283 |
-@ |
|
284 |
- |
|
285 |
-In the following code chunk, we extract estimates of the total copy |
|
286 |
-number at nonpolymorphic markers on chromosome X. |
|
287 |
- |
|
288 |
-<<npx>>= |
|
289 |
-set.seed(123) |
|
290 |
-npx.index <- which(chromosome(cnSet)==23 & !isSnp(cnSet)) |
|
291 |
-M <- sample(which(cnSet$gender[]==1), 5) |
|
292 |
-F <- sample(which(cnSet$gender[]==2), 5) |
|
293 |
-cn.M <- CA(cnSet, i=npx.index, j=M) |
|
294 |
-cn.F <- CA(cnSet, i=npx.index, j=F) |
|
295 |
-@ |
|
296 |
- |
|
297 |
-\noindent Again, the function \Rfunction{totalCopynumber} is |
|
298 |
-equivalent. |
|
299 |
- |
|
300 |
-<<npx2>>= |
|
301 |
-cnX <- cbind(cn.M, cn.F) |
|
302 |
-cnX2 <- totalCopynumber(cnSet, i=npx.index, j=c(M, F)) |
|
303 |
-stopifnot(all.equal(cnX, cnX2)) |
|
182 |
+ca <- CA(cnSet, i=snp.index, j=seq_len(ncol(cnSet))) |
|
183 |
+cb <- CB(cnSet, i=snp.index, j=seq_len(ncol(cnSet))) |
|
304 | 184 |
@ |
305 | 185 |
|
306 |
-<<nonpolymorphicX,fig=TRUE,include=FALSE, echo=FALSE, eval=FALSE>>= |
|
307 |
-df <- data.frame(cn=as.numeric(cnX), id=factor(rep(sampleNames(cnSet)[c(M,F)], each=nrow(cnX)), levels=sampleNames(cnSet)[c(M,F)], ordered=T)) |
|
308 |
-library(RColorBrewer) |
|
309 |
-cols <- brewer.pal(8, "Accent")[c(1, 5)] |
|
310 |
-print(bwplot(cn~id,df, panel=function(x,y,...){ |
|
311 |
- panel.grid(v=-10,h=0) |
|
312 |
- panel.bwplot(x,y, ...) |
|
313 |
- panel.abline(h=1:2, col="grey70",lwd=2) |
|
314 |
-}, |
|
315 |
- scales=list(x=list(rot=90)), cex=0.5, ylab="total copy number", main="Chr X SNPs", |
|
316 |
- fill=cols[cnSet$gender[c(M,F)]], key=mykey)) |
|
317 |
-@ |
|
318 |
- |
|
319 |
-%\begin{figure}[t!] |
|
320 |
-% \centering |
|
321 |
-% \includegraphics[width=0.5\textwidth]{copynumber-nonpolymorphicX} |
|
322 |
-% \caption{Copy number estimates for nonpolymorphic loci on chromosome |
|
323 |
-% X (5 men, 5 women). We assume that the median copy number across |
|
324 |
-% samples at a given marker on X is 1 for men and 2 for women.} |
|
325 |
-%\end{figure} |
|
326 |
- |
|
327 |
-Polymorphic markers on chromosome X: |
|
328 | 186 |
|
329 |
-<<polymorphicX,fig=TRUE,include=FALSE>>= |
|
330 |
-library(RColorBrewer) |
|
331 |
-cols <- brewer.pal(8, "Accent")[c(1, 5)] |
|
332 |
-X.markers <- which(isSnp(cnSet) & chromosome(cnSet) == 23) |
|
333 |
-cnX <- totalCopynumber(cnSet, i=X.markers, j=c(M,F)) |
|
334 |
-df <- data.frame(cn=as.numeric(cnX), id=factor(rep(sampleNames(cnSet)[c(M,F)], each=length(X.markers)), levels=sampleNames(cnSet)[c(M,F)], ordered=T)) |
|
335 |
-mykey <- simpleKey(c("male", "female"), points=FALSE,col=cols) |
|
336 |
-print(bwplot(cn~id,df, panel=function(x,y,...){ |
|
337 |
- panel.grid(v=-10,h=0) |
|
338 |
- panel.bwplot(x,y, ...) |
|
339 |
- panel.abline(h=1:2, col="grey70",lwd=2) |
|
340 |
-}, |
|
341 |
- scales=list(x=list(rot=90)), cex=0.5, ylab="total copy number", main="Chr X SNPs", |
|
342 |
- fill=cols[cnSet$gender[c(M,F)]], key=mykey)) |
|
343 |
-@ |
|
344 |
-\begin{figure}[t!] |
|
345 |
- \centering |
|
346 |
- \includegraphics[width=0.5\textwidth]{copynumber-polymorphicX} |
|
347 |
- \caption{Copy number estimates for polymorphic markers on chromosome |
|
348 |
- X. crlmm assumes that the median copy number across samples at a |
|
349 |
- given marker on X is 1 for men and 2 for women.} |
|
350 |
-\end{figure} |
|
351 |
- |
|
352 |
-\subsection{A container for raw copy number} |
|
187 |
+\subsection{Container for raw copy number} |
|
353 | 188 |
|
354 | 189 |
A useful container for storing the \crlmm{} genotypes, genotype |
355 | 190 |
confidence scores, and the total copy number at each marker is the |
... | ... |
@@ -362,28 +197,6 @@ instantiated \Rclass{oligoSnpSet} will also be \Rpackage{ff}-dervied |
362 | 197 |
objects (a new \verb+total_cn*.ff+ file will be created in the |
363 | 198 |
\verb+ldPath()+ directory). |
364 | 199 |
|
365 |
-%The advantage of this approach is that the raw copy number estimates |
|
366 |
-%can be computed for all markers and all samples simultaneously without |
|
367 |
-%excessive RAM. The disadvantage is that such a step creates |
|
368 |
-%additional I/O for reading and writing the raw copy number estimates |
|
369 |
-%from disk. For very large data sets (1000+ samples), the time |
|
370 |
-%required for I/O may outweigh the benefits. The alternative is to |
|
371 |
-%perform downstream processing (such as smoothing) on the total copy |
|
372 |
-%number for subsets of markers and/or samples. As indicated |
|
373 |
-%previously, the subset `[' method applied to an object of class |
|
374 |
-%\Robject{CNSet} will return an object of the same class with simple |
|
375 |
-%matrices instead of \Rpackage{ff} objects. Therefore, one can |
|
376 |
-%instantiate a \Rpackage{oligoSnpSet} object from a \Robject{CNSet} |
|
377 |
-%object without the I/O overhead for storing the total copy number |
|
378 |
-%estimates by first subsetting the \Rclass{CNSet} object. One could |
|
379 |
-%then smooth the raw copy number estimates and store a summary copy |
|
380 |
-%number for the interval, thereby substantially reducing the |
|
381 |
-%dimensionality of the copy number estimates. Examples of smoothing |
|
382 |
-%applications such as circular binary segmentation implemented in the |
|
383 |
-%\Rpackage{DNAcopy} package or a hidden Markov model implemented in the |
|
384 |
-%\Rpackage{VanillaICE} package are provided in the \verb+SmoothingCN+ |
|
385 |
-%vignette. |
|
386 |
- |
|
387 | 200 |
<<oligoSnpSet>>= |
388 | 201 |
open(cnSet3) |
389 | 202 |
oligoSet <- as(cnSet3, "oligoSnpSet") |
... | ... |
@@ -398,19 +211,6 @@ the \Rfunction{totalCopynumber} function defined over the same row and |
398 | 211 |
column indices. |
399 | 212 |
|
400 | 213 |
<<testEqual>>= |
401 |
-total.cn3 <- totalCopynumber(cnSet3, i=1:nrow(cnSet3), j=1:ncol(cnSet3)) |
|
214 |
+total.cn3 <- totalCopynumber(cnSet3, i=1:nrow(cnSet3), j=seq_len(ncol(cnSet3))) |
|
402 | 215 |
all.equal(copyNumber(oligoSet), total.cn3) |
403 |
-@ |
|
404 |
- |
|
405 |
- |
|
406 |
-\section{Session information} |
|
407 |
-<<sessionInfo, results=tex>>= |
|
408 |
-toLatex(sessionInfo()) |
|
409 |
-@ |
|
410 |
- |
|
411 |
-%\begin{bibliography} |
|
412 |
- \bibliographystyle{plain} |
|
413 |
- \bibliography{refs} |
|
414 |
-%\end{bibliography} |
|
415 |
- |
|
416 |
-\end{document} |
|
216 |
+@ |
|
417 | 217 |
\ No newline at end of file |
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54327 bc3139a8-67e5-0310-9ffc-ced21a209358
... | ... |
@@ -38,7 +38,7 @@ options(continue=" ", width=70) |
38 | 38 |
vignettes for the Affymetrix and Illumina platforms, |
39 | 39 |
respectively. While this vignette uses Affymetrix 6.0 arrays for |
40 | 40 |
illustration, the steps at this point are identical for both |
41 |
- platforms. See \citep{Scharpf2010} for details regarding the |
|
41 |
+ platforms. See \citep{Scharpf2011} for details regarding the |
|
42 | 42 |
methodology implemented in \crlmm{} for copy number analysis. In |
43 | 43 |
addition, a compendium describing copy number analysis using the |
44 | 44 |
\crlmm{} package is available from the author's website: |
... | ... |
@@ -80,7 +80,9 @@ the \crlmm{} package currently requires at least 10 samples per batch. |
80 | 80 |
The parameter estimates for copy number and the corresponding |
81 | 81 |
estimates of raw copy number will tend to be more noisy for batches |
82 | 82 |
with small sample sizes (e.g., $<$ 50). Chemistry plate or scan date |
83 |
-are often useful surrogates for batch. |
|
83 |
+are often useful surrogates for batch. Samples that were processed at |
|
84 |
+similar times (e.g., in the same month) can be grouped together in the |
|
85 |
+same batch. |
|
84 | 86 |
|
85 | 87 |
\section{Quality control} |
86 | 88 |
|
... | ... |
@@ -98,7 +100,7 @@ quality. The following code chunk makes a histogram of the SNR values |
98 | 100 |
for the HapMap samples. |
99 | 101 |
|
100 | 102 |
<<snr,fig=TRUE,include=FALSE,width=6, height=4>>= |
101 |
-open(cnSet$SNR) |
|
103 |
+invisible(open(cnSet$SNR)) |
|
102 | 104 |
snr <- cnSet$SNR[] |
103 | 105 |
close(cnSet$SNR) |
104 | 106 |
print(histogram(~snr, panel=function(...){ |
... | ... |
@@ -120,13 +122,13 @@ print(histogram(~snr, panel=function(...){ |
120 | 122 |
|
121 | 123 |
\section{Copy number estimation} |
122 | 124 |
|
123 |
-As described in \cite{Scharpf2010}, the CRLMM-CopyNumber algorithm |
|
125 |
+As described in \cite{Scharpf2011}, the CRLMM-CopyNumber algorithm |
|
124 | 126 |
fits a linear model to the normalized intensities stratified by the |
125 | 127 |
diallic genotype call. The intercept and slope from the linear model |
126 | 128 |
are both SNP- and batch-specific. The implementation in the \crlmm{} |
127 | 129 |
package is encapsulated by the function \Rfunction{crlmmCopynumber} |
128 | 130 |
that, using the default settings, can be called by passing a single |
129 |
-object of class \code{CNSet}. See the appropriate |
|
131 |
+object of class \Rclass{CNSet}. See the appropriate |
|
130 | 132 |
preprocessing/genotyping vignette for the construction of an object of |
131 | 133 |
class \Rclass{CNSet}. |
132 | 134 |
<<LDS_copynumber,cache=TRUE>>= |
... | ... |
@@ -147,7 +149,7 @@ computed for subsets of the markers to reduce the required RAM. Note |
147 | 149 |
that the value returned by the \Rfunction{crlmmCopynumber} function in |
148 | 150 |
the above example is \verb+TRUE+. The reason the function returns |
149 | 151 |
\verb+TRUE+ in the above example is that the elements of the |
150 |
-\verb+batchStatistics+ slot have the class \Rclass{ff_matrix}. Rather |
|
152 |
+\verb+batchStatistics+ slot have the class \Rclass{ff\_matrix}. Rather |
|
151 | 153 |
than keep the statistical summaries in memory, the summaries are |
152 | 154 |
written to files on disk using protocols described in the |
153 | 155 |
\Rpackage{ff} package. Hence, while the \Robject{cnSet} object itself |
... | ... |
@@ -169,9 +171,9 @@ slot of the class \Rclass{CNSet}. Using the default settings for the |
169 | 171 |
Note that depends on whether the elements of the \verb+batchStatistics+ |
170 | 172 |
slot are \Robject{ff} objects or ordinary matrices. In this example, |
171 | 173 |
the elements of \verb+batchStatistics+ have the class |
172 |
-\Rclass{ff_matrix}. |
|
174 |
+\Rclass{ff\_matrix}. |
|
173 | 175 |
|
174 |
-<<LDS_copynumber,cache=TRUE>>= |
|
176 |
+<<classes>>= |
|
175 | 177 |
nms <- ls(batchStatistics(cnSet)) |
176 | 178 |
cls <- rep(NA, length(nms)) |
177 | 179 |
for(i in seq_along(nms)) cls[i] <- class(batchStatistics(cnSet)[[nms[i]]])[1] |
... | ... |
@@ -284,6 +286,7 @@ In the following code chunk, we extract estimates of the total copy |
284 | 286 |
number at nonpolymorphic markers on chromosome X. |
285 | 287 |
|
286 | 288 |
<<npx>>= |
289 |
+set.seed(123) |
|
287 | 290 |
npx.index <- which(chromosome(cnSet)==23 & !isSnp(cnSet)) |
288 | 291 |
M <- sample(which(cnSet$gender[]==1), 5) |
289 | 292 |
F <- sample(which(cnSet$gender[]==2), 5) |
... | ... |
@@ -341,9 +344,9 @@ print(bwplot(cn~id,df, panel=function(x,y,...){ |
341 | 344 |
\begin{figure}[t!] |
342 | 345 |
\centering |
343 | 346 |
\includegraphics[width=0.5\textwidth]{copynumber-polymorphicX} |
344 |
-\caption{Copy number estimates for polymorphic markers on chromosome |
|
345 |
- X. crlmm assumes that the median copy number across samples at a |
|
346 |
- given marker on X is 1 for men and 2 for women.} |
|
347 |
+ \caption{Copy number estimates for polymorphic markers on chromosome |
|
348 |
+ X. crlmm assumes that the median copy number across samples at a |
|
349 |
+ given marker on X is 1 for men and 2 for women.} |
|
347 | 350 |
\end{figure} |
348 | 351 |
|
349 | 352 |
\subsection{A container for raw copy number} |
... | ... |
@@ -355,29 +358,31 @@ confidence scores, and the total copy number at each marker is the |
355 | 358 |
\Rfunction{as} (as illustrated below). Users should note that if the |
356 | 359 |
\verb+assayData+ elements in the \Rclass{CNSet} instance are |
357 | 360 |
\Rpackage{ff} objects, the \verb+assayData+ elements of the |
358 |
-instantiated \Rclass{oligoSnpSet} will also be \Rpackage{ff} object (a |
|
359 |
-new \verb+total_cn*.ff+ file will be created in the \verb+ldPath()+ |
|
360 |
-directory). The advantage of this approach is that the raw copy number |
|
361 |
-estimates can be computed for all markers and all samples |
|
362 |
-simultaneously without excessive RAM. The disadvantage is that such a |
|
363 |
-step creates additional I/O for reading and writing the raw copy |
|
364 |
-number estimates from disk. For very large data sets (1000+ samples), |
|
365 |
-the time required for I/O may outweigh the benefits. The alternative |
|
366 |
-is to perform downstream processing (such as smoothing) on the total |
|
367 |
-copy number for subsets of markers and/or samples. As indicated |
|
368 |
-previously, the subset `[' method applied to an object of class |
|
369 |
-\Robject{CNSet} will return an object of the same class with simple |
|
370 |
-matrices instead of \Rpackage{ff} objects. Therefore, one can |
|
371 |
-instantiate a \Rpackage{oligoSnpSet} object from a \Robject{CNSet} |
|
372 |
-object without the I/O overhead for storing the total copy number |
|
373 |
-estimates by first subsetting the \Rclass{CNSet} object. One could |
|
374 |
-then smooth the raw copy number estimates and store a summary copy |
|
375 |
-number for the interval, thereby substantially reducing the |
|
376 |
-dimensionality of the copy number estimates. Examples of smoothing |
|
377 |
-applications such as circular binary segmentation implemented in the |
|
378 |
-\Rpackage{DNAcopy} package or a hidden Markov model implemented in the |
|
379 |
-\Rpackage{VanillaICE} package are provided in the \verb+SmoothingCN+ |
|
380 |
-vignette. |
|
361 |
+instantiated \Rclass{oligoSnpSet} will also be \Rpackage{ff}-dervied |
|
362 |
+objects (a new \verb+total_cn*.ff+ file will be created in the |
|
363 |
+\verb+ldPath()+ directory). |
|
364 |
+ |
|
365 |
+%The advantage of this approach is that the raw copy number estimates |
|
366 |
+%can be computed for all markers and all samples simultaneously without |
|
367 |
+%excessive RAM. The disadvantage is that such a step creates |
|
368 |
+%additional I/O for reading and writing the raw copy number estimates |
|
369 |
+%from disk. For very large data sets (1000+ samples), the time |
|
370 |
+%required for I/O may outweigh the benefits. The alternative is to |
|
371 |
+%perform downstream processing (such as smoothing) on the total copy |
|
372 |
+%number for subsets of markers and/or samples. As indicated |
|
373 |
+%previously, the subset `[' method applied to an object of class |
|
374 |
+%\Robject{CNSet} will return an object of the same class with simple |
|
375 |
+%matrices instead of \Rpackage{ff} objects. Therefore, one can |
|
376 |
+%instantiate a \Rpackage{oligoSnpSet} object from a \Robject{CNSet} |
|
377 |
+%object without the I/O overhead for storing the total copy number |
|
378 |
+%estimates by first subsetting the \Rclass{CNSet} object. One could |
|
379 |
+%then smooth the raw copy number estimates and store a summary copy |
|
380 |
+%number for the interval, thereby substantially reducing the |
|
381 |
+%dimensionality of the copy number estimates. Examples of smoothing |
|
382 |
+%applications such as circular binary segmentation implemented in the |
|
383 |
+%\Rpackage{DNAcopy} package or a hidden Markov model implemented in the |
|
384 |
+%\Rpackage{VanillaICE} package are provided in the \verb+SmoothingCN+ |
|
385 |
+%vignette. |
|
381 | 386 |
|
382 | 387 |
<<oligoSnpSet>>= |
383 | 388 |
open(cnSet3) |
... | ... |
@@ -388,7 +393,7 @@ class(copyNumber(oligoSet)) |
388 | 393 |
|
389 | 394 |
\noindent Note that the raw copy number estimates stored in the |
390 | 395 |
\Robject{oligoSnpSet} object can be retrieved by the |
391 |
-\Rfunction{copyNumber} accessor and is equivalanet to that returned by |
|
396 |
+\Rfunction{copyNumber} accessor and is equivalent to that returned by |
|
392 | 397 |
the \Rfunction{totalCopynumber} function defined over the same row and |
393 | 398 |
column indices. |
394 | 399 |
|
git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/crlmm@54171 bc3139a8-67e5-0310-9ffc-ced21a209358
... | ... |
@@ -6,6 +6,7 @@ |
6 | 6 |
\usepackage{graphicx} |
7 | 7 |
\usepackage{natbib} |
8 | 8 |
\usepackage{url} |
9 |
+\usepackage{amsmath,amssymb} |
|
9 | 10 |
\newcommand{\Rfunction}[1]{{\texttt{#1}}} |
10 | 11 |
\newcommand{\Rmethod}[1]{{\texttt{#1}}} |
11 | 12 |
\newcommand{\Rcode}[1]{{\texttt{#1}}} |
... | ... |
@@ -18,7 +19,7 @@ |
18 | 19 |
\usepackage[margin=1in]{geometry} |
19 | 20 |
|
20 | 21 |
\begin{document} |
21 |
-\title{Copy number estimation and genotype calling with \Rpackage{crlmm}} |
|
22 |
+\title{Copy number estimation with \Rpackage{crlmm}} |
|
22 | 23 |
\date{\today} |
23 | 24 |
\author{Rob Scharpf} |
24 | 25 |
\maketitle |
... | ... |
@@ -27,178 +28,202 @@ |
27 | 28 |
options(continue=" ", width=70) |
28 | 29 |
@ |
29 | 30 |
|
30 |
-%\section{Estimating copy number} |
|
31 |
- |
|
32 |
-%At present, software for copy number estimation is provided only for the |
|
33 |
-%Affymetrix 6.0 platform. |
|
34 |
- |
|
35 | 31 |
\begin{abstract} |
36 |
- This vignette estimates copy number for HapMap samples on the |
|
37 |
- Affymetrix 6.0 platform. See \citep{Scharpf2010} for details |
|
38 |
- regarding the methodology implemented in \crlmm{}. In addition, a |
|
39 |
- compendium describing copy number analysis using the \crlmm{} |
|
40 |
- package is available from the author's website: |
|
32 |
+ |
|
33 |
+ Copy number routines in the \crlmm{} package are available for |
|
34 |
+ Affymetrix 5.0 and 6.0 platforms, as well as several Illumina |
|
35 |
+ platforms. This vignette assumes that the arrays have already been |
|
36 |
+ successfully preprocessed and genotyped as per the instructions in |
|
37 |
+ the \verb+AffymetrixPreprocessCN+ and \verb+IlluminaPreprocessCN+ |
|
38 |
+ vignettes for the Affymetrix and Illumina platforms, |
|
39 |
+ respectively. While this vignette uses Affymetrix 6.0 arrays for |
|
40 |
+ illustration, the steps at this point are identical for both |
|
41 |
+ platforms. See \citep{Scharpf2010} for details regarding the |
|
42 |
+ methodology implemented in \crlmm{} for copy number analysis. In |
|
43 |
+ addition, a compendium describing copy number analysis using the |
|
44 |
+ \crlmm{} package is available from the author's website: |
|
41 | 45 |
\url{http://www.biostat.jhsph.edu/~rscharpf/crlmmCompendium/index.html}. |
42 | 46 |
|
43 | 47 |
\end{abstract} |
44 | 48 |
|
45 |
-\section{Copy number estimation} |
|
46 |
- |
|
47 |
-\subsection{Set up} |
|
49 |
+\section{Set up} |
|
48 | 50 |
|
49 |
-<<cdfname>>= |
|
51 |
+<<libraries,results=hide>>= |
|
52 |
+library(ff) |
|
50 | 53 |
library(crlmm) |
54 |
+library(lattice) |
|
51 | 55 |
library(cacheSweave) |
52 |
-@ |
|
53 |
- |
|
54 |
-Several genotyping platforms are currently supported. Supported |
|
55 |
-platforms must have a corresponding annotation package (installed |
|
56 |
-separately) that are listed below. |
|
57 |
- |
|
58 |
-<<annotationPackages>>= |
|
59 |
-pkgs <- annotationPackages() |
|
60 |
-pkgs <- pkgs[grep("Crlmm", pkgs)] |
|
61 |
-pkgs |
|
62 |
-@ |
|
63 |
- |
|
64 |
-Note that 'pd.genomewidesnp6' and 'genomewidesnp6Crlmm' are both |
|
65 |
-annotation packages for the Affymetrix 6.0 platform available on |
|
66 |
-Bioconductor, but the 'genomewidesnp6Crlmm' annotation package must be |
|
67 |
-used for copy number estimation. The annotation package is specified |
|
68 |
-through the 'cdfName' -- the identifier without the 'Crlmm' postfix. |
|
69 |
-In the following code, we specify the cdf name for Affymetrix 6.0, |
|
70 |
-provide the complete path to the CEL files, and indicate where |
|
71 |
-intermediate files from the copy number estimation are to be saved. |
|
72 |
- |
|
73 |
-<<setup>>= |
|
74 |
-cdfName <- "genomewidesnp6" |
|
75 |
-pathToCels <- "/thumper/ctsa/snpmicroarray/hapmap/raw/affy/1m" |
|
76 | 56 |
if(getRversion() < "2.13.0"){ |
77 | 57 |
rpath <- getRversion() |
78 | 58 |
} else rpath <- "trunk" |
79 | 59 |
outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="") |
80 |
-dir.create(outdir, recursive=TRUE, showWarnings=FALSE) |
|
60 |
+@ |
|
61 |
+ |
|
62 |
+<<ram>>= |
|
63 |
+ldPath(outdir) |
|
81 | 64 |
setCacheDir(outdir) |
65 |
+ocProbesets(150e3) |
|
66 |
+ocSamples(200) |
|
82 | 67 |
@ |
83 | 68 |
|
84 |
-All long computations are saved in the output directory |
|
85 |
-\Robject{outdir}. Users should change these variables as appropriate. |
|
86 |
-The following code chunk should fail unless the above arguments have |
|
87 |
-been set appropriately. |
|
69 |
+We begin by loading the \Robject{cnSet} object created by the |
|
70 |
+\verb+AffymetrixPreprocessCN+ vignette. |
|
88 | 71 |
|
89 |
-<<checkSetup>>= |
|
90 |
-if(!file.exists(outdir)) stop("Please specify a valid directory for storing output") |
|
91 |
-if(!file.exists(pathToCels)) stop("Please specify the correct path to the CEL files") |
|
72 |
+<<loadcnset>>= |
|
73 |
+if(!exists("cnSet")) load(file.path(outdir, "cnSet.rda")) |
|
92 | 74 |
@ |
93 | 75 |
|
94 |
-Processed data from codechunks requiring long computations are saved |
|
95 |
-to disk by wrapping function calls in the \Robject{checkExists} |
|
96 |
-function. After running this vignette as a batch job, subsequent |
|
97 |
-calls to \verb@Sweave@ will load the saved computations from disk. See |
|
98 |
-the \Rfunction{checkExists} help file for additional details. |
|
99 |
- |
|
100 |
-\subsection{Preprocessing and genotyping.} |
|
101 |
- |
|
102 |
-In the following code chunk, we provide the complete path to the |
|
103 |
-Affymetrix CEL files and define a 'batch' variable. The |
|
104 |
-\Robject{batch} variable will be used to initialize a container for |
|
105 |
-storing the normalized intensities, the genotype calls, and the |
|
106 |
-parameter estimates for copy number. Often the chemistry plate or the |
|
107 |
-scan date of the array is a useful surrogate for batch. For the HapMap |
|
108 |
-CEL files in our analysis, the CEPH (C) and Yoruban (Y) samples were |
|
109 |
-prepared on separate chemistry plates. In the following code chunk, |
|
110 |
-we extract the population identifier from the CEL file names and |
|
111 |
-assign these identifiers to the variable \Robject{batch}. |
|
112 |
- |
|
113 |
-<<celfiles>>= |
|
114 |
-celFiles <- list.celfiles(pathToCels, full.names=TRUE, pattern=".CEL") |
|
115 |
-celFiles <- celFiles[substr(basename(celFiles), 13, 13) %in% c("C", "Y")] |
|
116 |
-plates <- substr(basename(celFiles), 13, 13) |
|
117 |
-@ |
|
118 | 76 |
|
119 |
-The preprocessing steps for copy number estimation includes quantile |
|
120 |
-normalization of the raw intensities for each probe and a step that |
|
121 |
-summarizes the intensities of multiple probes at a single locus. For |
|
122 |
-example, the Affymetrix 6.0 platform has 3 or 4 identical probes at |
|
123 |