Browse code

Updated vignette.

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/GSVA@65908 bc3139a8-67e5-0310-9ffc-ced21a209358

Robert Castelo authored on 15/05/2012 10:17:32
Showing 6 changed files

... ...
@@ -1,13 +1,13 @@
1 1
 Package: GSVA
2
-Version: 1.5.1
3
-Date: 2012-03-23
4
-Title: Gene Set Variation Analysis
2
+Version: 1.5.2
3
+Date: 2012-05-15
4
+Title: Gene Set Variation Analysis for microarray and RNA-seq data
5 5
 Author: Justin Guinney <justin.guinney@sagebase.org> (with contributions from Robert Castelo <robert.castelo@upf.edu> and Sonja Haenzelmann <shanzelmann@imim.es)
6 6
 Maintainer: Justin Guinney <justin.guinney@sagebase.org>
7 7
 Depends: R (>= 2.13.0), methods, GSEABase (>= 1.17.4)
8 8
 Imports: methods, Biobase, GSEABase
9 9
 Enhances: snow, parallel
10
-Suggests: limma, qpgraph, RBGL, graph, Rgraphviz, RColorBrewer, genefilter, GSVAdata
10
+Suggests: limma, qpgraph, RBGL, graph, Rgraphviz, RColorBrewer, genefilter, mclust, edgeR, GSVAdata
11 11
 Description: Gene Set Variation Analysis (GSVA) is a non-parametric, unsupervised method for estimating variation of gene set enrichment through the samples of a expression data set. GSVA performs a change in coordinate systems, transforming the data from a gene by sample matrix to a gene-set by sample matrix, thereby allowing the evaluation of pathway enrichment for each sample. This new matrix of GSVA enrichment scores facilitates applying standard analytical methods like functional enrichment, survival analysis, clustering, CNV-pathway analysis or cross-tissue pathway analysis, in a pathway-centric manner.
12 12
 License: GPL (>= 2)
13 13
 LazyLoad: yes
... ...
@@ -8,6 +8,6 @@ citEntry(entry="Article",
8 8
   journal = "submitted",
9 9
   year = "2011",
10 10
   textVersion = paste("Hanzelmann, S., Castelo, R. and Guinney, A.",
11
-                      "GSVA: Gene Set Variation Analysis",
12
-                      "submitted, 2011.")
11
+                      "GSVA: Gene Set Variation Analysis for microarray and RNA-seq data",
12
+                      "submitted, 2012.")
13 13
 )
... ...
@@ -15,8 +15,8 @@
15 15
 \newcommand{\Rpackage}[1]{{\textsf{#1}}}
16 16
 \newcommand{\Rclass}[1]{{\textit{#1}}}
17 17
 
18
-\title{GSVA - The Gene Set Variation Analysis Package}
19
-\author{Sonja H\"anzelmann, Robert Castelo and Justin Guinney}
18
+\title{GSVA: The Gene Set Variation Analysis package \\ for microarray and RNA-seq data}
19
+\author{Sonja H\"anzelmann$^1$, Robert Castelo$^1$ and Justin Guinney$^2$}
20 20
 
21 21
 \begin{document}
22 22
 
... ...
@@ -24,6 +24,14 @@
24 24
 
25 25
 \maketitle
26 26
 
27
+\begin{quote}
28
+{\scriptsize
29
+1. Research Program on Biomedical Informatics (GRIB), Hospital del Mar Research Institute (IMIM) and Universitat Pompeu Fabra, Parc de Recerca Biom\`edica de Barcelona, Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain
30
+
31
+2. Sage Bionetworks, 1100 Fairview Ave N., Seattle, Washington, 98109 USA
32
+}
33
+\end{quote}
34
+
27 35
 \begin{abstract}
28 36
 The \Rpackage{GSVA} package implements a non-parametric unsupervised method,
29 37
 called Gene Set Variation Analysis (GSVA), for assessing gene set enrichment
... ...
@@ -85,102 +93,172 @@ GSVA, as well as for citing GSVA if you use it in your own work.
85 93
 
86 94
 \section{GSVA enrichment scores}
87 95
 
88
-GSVA enrichment scores are calculated from two main inputs: a matrix
89
-$X=\{x_{ij}\}_{p\times n}$ of expression values for $p$ genes through $n$
90
-samples, where typically $p\gg n$, and a collection of gene sets
91
-$\Gamma = \{\gamma_1, \dots, \gamma_m\}$. We shall denote by $x_i$ the
92
-expression profile of the $i$-th gene, by $x_{ij}$ the specific expression
93
-value of the $i$-th gene in the $j$-th sample, and by $\gamma_k$ the subset
94
-of row indices in $X$ such that $\gamma_k \subset \{1,\ldots\,p\}$ defines a
95
-set of genes forming a pathway or some other functional unit. Let $|\gamma_k |$
96
-be the number of genes in the gene set.
97
-
98
-The first step in the calculation consists of evaluating whether a gene $i$ is
99
-highly or lowly expressed in sample $j$ in the context of the sample population
100
-distribution, with an expression-level statistic calculated as follows. First,
101
-for each gene expression profile $x_i=\{x_{i1},\dots,x_{in}\}$, a non-parametric
102
-kernel estimation of its cumulative density function is performed using a
103
-Gaussian kernel \citep[pg.~148]{silverman_density_1986}:
96
+A schematic overview of the GSVA method is provided in Figure \ref{methods}.
97
+Shown are the two required main inputs: a matrix $X=\{x_{ij}\}_{p\times n}$ of
98
+normalized expression values (see Methods for details on the pre-processing steps)
99
+for $p$ genes by $n$ samples, where typically $p\gg n$, and a collection of gene
100
+sets $\Gamma = \{\gamma_1, \dots, \gamma_m\}$. We shall denote by $x_i$ the
101
+expression profile of the $i$-th gene, by $x_{ij}$ the specific expression value
102
+of the $i$-th gene in the $j$-th sample, and by $\gamma_k$ the subset of row
103
+indices in $X$ such that $\gamma_k \subset \{1,\ldots\,p\}$ defines a set of genes
104
+forming a pathway or some other functional unit. Let $|\gamma_k |$ be the number
105
+of genes in $\gamma_k$.
106
+
107
+\begin{figure}[ht]
108
+\centerline{\includegraphics[width=\textwidth]{methods}}
109
+\caption{{\bf GSVA methods outline.}
110
+The input for the GSVA algorithm are a gene expression matrix in the form of log2
111
+microarray expression values or RNA-seq counts and a database of gene sets.
112
+1. Kernel estimation of the cumulative density function (kcdf). The two plots
113
+show two simulated expression profiles mimicking 6 samples from microarray and
114
+RNA-seq. The $x$-axis corresponds to expression values where each gene is lowly
115
+expressed in the four samples with lower values and highly expressed in the other
116
+two. The scale of the kcdf is on the left $y$-axis and the scale of the Gaussian
117
+and Poisson kernels is on the right $y$-axis.
118
+2. The expression-level statistic is rank ordered for each sample. 
119
+3. For every gene set, the Kolmogorov-Smirnov-like rank statistic is calculated.
120
+The plot illustrates a gene set consisting of 3 genes out of a total number of 10
121
+with the sample-wise calculation of genes inside and outside of the gene set.
122
+4. The GSVA enrichment score is either the difference between the two sums or the
123
+maximum deviation from zero. The two plots show two simulations of the resulting
124
+scores under the null hypothesis of no gene expression change (see main text).
125
+The output of the algorithm is matrix containing pathway enrichment profiles for
126
+each gene set and sample. }
127
+\label{methods}
128
+\end{figure}
129
+
130
+GSVA starts by evaluating whether a gene $i$ is highly or lowly expressed in
131
+sample $j$ in the context of the sample population distribution. Probe effects
132
+can alter hybridization intensities in microarray data such that expression
133
+values can greatly differ between two non-expressed genes \citep{zilliox_gene_2007}.
134
+Analogous gene-specific biases, such as GC content or gene length have been
135
+described in RNA-seq data \citep{hansen_removing_2012}. To bring distinct expression
136
+profiles to a common scale, an expression-level statistic is calculated as follows.
137
+For each gene expression profile $x_i=\{x_{i1},\dots,x_{in}\}$, a non-parametric
138
+kernel estimation of its cumulative density function is performed using a Gaussian
139
+kernel \cite[pg.~148]{silverman_density_1986} in the case of microarray data:
104 140
 
105 141
 \begin{equation}
106 142
 \label{density}
107
-\hat{F}_{h_i}(x_{ij})=\frac{1}{n}\sum_{k=1}^n\int_{-\infty}^{\frac{x_{ij}-x_{ik}}{h_i}}\frac{1}{\sqrt{2\pi}}e^{-\frac{t^2}{2}}dt\,.
143
+\hat{F}_{h_i}(x_{ij})=\frac{1}{n}\sum_{k=1}^n\int_{-\infty}^{\frac{x_{ij}-x_{ik}}{h_i}}\frac{1}{\sqrt{2\pi}}e^{-\frac{t^2}{2}}dt\,,
108 144
 \end{equation}
109
-where $h_i$ is the gene-specific bandwidth parameter that controls the
110
-resolution of the kernel estimation and is taken as $h_i=s_i/4$, where $s_i$ is
111
-the sample standard deviation of the $i$-th gene. Second, the expression-level
112
-statistic is calculated as logarithm of the relative likelihood, or odds, that
113
-the $i$-th gene is expressed in sample $j$:
145
+where $h_i$ is the gene-specific bandwidth parameter that controls the resolution
146
+of the kernel estimation, which is set to $h_i=s_i/4$, where $s_i$ is the sample
147
+standard deviation of the $i$-th gene (Figure \ref{methods}, Step 1). In the case
148
+of RNA-seq data, a discrete Poisson kernel \citep{canale_bayesian_2011} is employed:
114 149
 
115 150
 \begin{equation}
116
-z_ij = \log\left(\frac{\hat{F}_{h_i}(x_{ij})}{1-\hat{F}_{h_i}(x_{ij})}\right)\,.
151
+\hat{F}_r(x_{ij}) = \frac{1}{n} \sum_{k=1}^n \sum_{y=0}^{x_{ij}} \frac{e^{-(x_{ik}+r)}(x_{ik}+r)^y}{y!}\,,
117 152
 \end{equation}
118
-
119
-The clearest context for differential gene expression arises under a bi-modal
120
-distribution of the data.  Here, $z_{ij}$'s corresponding to the lower mode will
121
-have a large negative statistic, higher modes will have a large positive
122
-statistic, and samples between the two modes will have $z_{ij}$'s at or near
123
-zero. Genes with uniform or unimodal population distributions do not preclude
124
-large negative or positive expression-level statistics, i.e. $x_{ij}$ lying in
125
-the tail of a distribution. To dampen the effect of potential outliers, the
126
-expression-level statistics $z_{ij}$ are first converted to ranks $z_{(i)j}$ for
127
-each $j$-th sample and then these ranks are further normalized into a rank-order
128
-statistic $r_{ij}=|p - z_{(i)j}+1-p/2|$ to be symmetric around zero, before evaluating
129
-the enrichment scores.
130
-
131
-We assess the enrichment score similarity to the GSEA method
132
-\citep{subramanian_gene_2005} using a Kolmogorov-Smirnov (K-S)
133
-like random walk statistic:
153
+where $r=0.5$ in order to set the mode of the Poisson kernel at each $x_{ik}$ since,
154
+in general, the mode of a Poisson distribution with an integer mean $\lambda$ occurs
155
+at $\lambda$ and $\lambda-1$ and at the largest integer smaller than $\lambda$ when
156
+$\lambda$ is continuous.
157
+
158
+Let $z_{ij}$ denote the previous expression-level statistic $\hat{F}_{h_i}(x_{ij})$,
159
+or $\hat{F}_r(x_{ij})$, depending on whether $x_{ij}$ are continuous microarray, or
160
+discrete count RNA-seq values, respectively. The following step condenses
161
+expression-level statistics into gene sets by calculating sample-wise enrichment
162
+scores. In order to reduce the influence of potential outliers on that step, we convert
163
+first the $z_{ij}$ into ranks $z_{(i)j}$ for each $j$-th sample and further normalize
164
+them into $r_{ij}=|p/2-z_{(i)j}|$ to make them symmetric around zero before evaluating
165
+the enrichment scores (Figure~\ref{methods}, Step 2).
166
+
167
+We assess the enrichment score similar to the GSEA and ASSESS methods
168
+\citep{subramanian_gene_2005,edelman_analysis_2006} using the Kolmogorov-Smirnov (KS)
169
+like random walk statistic (Figure~\ref{methods}, Step 3):
134 170
 \begin{equation}
135 171
 \label{walk}
136 172
 \nu_{jk}(\ell) = \frac{\sum_{i=1}^\ell |r_{ij}|^{\tau} I(g_{(i)} \in
137 173
 \gamma_k)}{\sum_{i=1}^p |r_{ij}|^{\tau}I(g_{(i)} \in \gamma_k)} -
138 174
 \frac{\sum_{i=1}^\ell I(g_{(i)} \not\in \gamma_k)}{p-|\gamma_k|},
139 175
 \end{equation}
140
-where $\tau$ is a parameter describing the weight of the tail in the
141
-random walk (default $\tau = 1$), $\gamma_k$ is the $k$-th gene set,
142
-$I(g_{(i)} \in \gamma_k)$ is the indicator function on whether
143
-the $i$-th gene (the gene corresponding to the $i$-th ranked
144
-expression-level statistic) is in gene set $\gamma_k$, $|\gamma_k|$ is
145
-the number of genes in the $k$-th gene set, and $p$ is the number of
146
-genes in the data set. The first term in this equation corresponds to
147
-the step cumulative distribution function of the $r_{ij}$ rank-order
148
-statistics of the gene forming gene set $\gamma_k$ throughout the
149
-ranking $z_{(1)j},\dots,z_{(p)j}$ in sample $j$, while the second term
150
-corresponds to the step cumulative distribution function of all other
151
-genes outside $\gamma_k$ throughout this same ranking.
152
-
153
-Two approaches are possible for turning the K-S random walk statistic
154
-into an enrichment score (ES). The first approach is the previously described
155
-maximum deviation method \citep{subramanian_gene_2005} where the ES for the
156
-$j$-th sample with respect to the $k$-th gene set is the maximum deviation
157
-of the random walk from zero:
176
+where $\tau$ is a parameter describing the weight of the tail in the random walk
177
+(default $\tau = 1$), $\gamma_k$ is the $k$-th gene set, $I(g_{(i)} \in \gamma_k)$ is
178
+the indicator function on whether the $i$-th gene (the gene corresponding to the $i$-th
179
+ranked expression-level statistic) is in gene set $\gamma_k$, $|\gamma_k|$ is the number
180
+of genes in the $k$-th gene set, and $p$ is the number of genes in the data set. 
181
+
182
+We offer two approaches for turning the KS random walk statistic into an enrichment
183
+statistic (ES). The first approach is the previously described maximum deviation method
184
+\citep{subramanian_gene_2005,edelman_analysis_2006,verhaak_integrated_2010} where the ES
185
+for the $j$-th sample with respect to the $k$-th gene set is the maximum deviation from
186
+zero of the random walk:
187
+
158 188
 \begin{equation}
159 189
 \label{escore}
160
-ES_{jk}^{\mbox{\tiny{max}}} = \nu_{jk}[\arg \max_{\ell=1,..,p} \left|\nu_{jk}(\ell)\right|].
190
+ES^{\mbox{\tiny{max}}}_{jk} = \nu_{jk}[\arg \max_{\ell=1,\dots,p} \left|\nu_{jk}(\ell)\right|].
161 191
 \end{equation}
162
-For each gene set $k$, this approach produces a distribution of enrichment
163
-scores that is bi-modal. Within the supervised paradigm, a ``normalized''
164
-enrichment score could be obtained via a permutation of the phenotypic labels;
165
-since we are operating without labels, we propose a second approach that
166
-produces an ES distribution that is unimodal:
192
+For each gene set $k$, this approach produces a distribution of enrichment scores that
193
+is bimodal (Figure~\ref{methods}, Step 4, Top panel). This is an intrinsic property of
194
+the KS-like random walk, which generates non-zero maximum deviations under the null
195
+distribution. In GSEA \citep{subramanian_gene_2005} it is also observed that the empirical
196
+null distribution obtained by permuting phenotypes is bimodal and, for this reason,
197
+significance is determined independently using the positive and negative sides of the null
198
+distribution. In our case, we would like to provide a standard Gaussian distribution of
199
+enrichment scores under the null hypothesis of no change in pathway activity throughout the
200
+sample population. For this purpose, we propose the following alternative score that produces
201
+an ES distribution approximating this requirement (Figure~\ref{methods}, Step 4, Bottom panel):
167 202
 
168 203
 \begin{equation}
169 204
 \label{escore_2}
170
-ES_{jk}^{\mbox{\tiny{diff}}} = \left|ES^{+}_{jk}\right| - \left|ES^{-}_{jk}\right|=\max_{\ell=1,\dots,p}(0,\nu_{jk}(\ell)) - \min_{\ell=1,\dots,p}(0,\nu_{jk}(\ell))\,,
205
+ES^{\mbox{\tiny{diff}}}_{jk} = \left|ES^{+}_{jk}\right| - \left|ES^{-}_{jk}\right|=\max_{\ell=1,\dots,p}(0,\nu_{jk}(\ell)) - \min_{\ell=1,\dots,p}(0,\nu_{jk}(\ell))\,.
171 206
 \end{equation}
172
-where $ES^{+}_{jk}$ and $ES^{-}_{jk}$ are the largest positive and negative
173
-random walk deviations from zero, respectively, for sample $j$ and gene set $k$.
174
-This approach takes the magnitude difference between the largest positive and
175
-negative random walk deviations, and has the effect of dampening out large
176
-enrichment scores if there is both a large positive and negative deviation in
177
-the random walk. This approach emphasizes genes in pathways that are concordantly
178
-activated in one direction only, i.e. over-expressed or under-expressed relative
179
-to the overall population. For pathways with genes strongly acting in both
180
-directions, the deviations will cancel each other out and show negligible enrichment.
181
-Also, because this statistic is unimodal (and through our own experimental
182
-observations, approximately normal), this statistic lends itself to downstream
183
-analyses which may impose distributional assumptions on the data.
207
+where $ES_{jk}^{+}$ and $ES_{jk}^{-}$ are the largest positive and negative random walk
208
+deviations from zero, respectively, for sample $j$ and gene set $k$.  This statistic may
209
+be compared to the Kuiper test statistic \citep{pearson_comparison_1963}, which sums the
210
+maximum and minimum deviations to make the test statistic more sensitive in the tails.
211
+In contrast, our test statistic penalizes deviations that are large in both tails, and
212
+provides a ``normalization'' of the enrichment score by subtracting potential noise. There
213
+is a clear biological interpretation of this statistic, it emphasizes genes in pathways that
214
+are concordantly activated in one direction only, either over-expressed or under-expressed
215
+relative to the overall population.  For pathways containing genes having very different
216
+expression patterns, the deviations will cancel each other out and show little or no enrichment.
217
+Because this statistic is unimodal and approximately normal downstream analyses which may
218
+impose distributional assumptions on the data are possible.
219
+
220
+Step 4 in Figure~\ref{methods} shows a simple simulation which we reproduce with the code
221
+below, where standard Gaussian deviates are independenty sampled from $p=20,000$ genes and
222
+$n=30$ samples, thus mimicking a null distribution of no change in gene expression. One
223
+hundred gene sets are uniformly sampled at random from the $p$ genes with sizes ranging
224
+between 10 and 100 genes. Using these two inputs, GSVA enrichment scores are calculated
225
+under the two approaches and their distribution is depicted in the two panels of this step
226
+in Figure~\ref{methods} and the larger figure below, illustrating the previous description.
227
+
228
+<<>>=
229
+library(GSVA)
230
+
231
+p <- 20000    ## number of genes
232
+n <- 30       ## number of samples
233
+nGS <- 100    ## number of gene sets
234
+min.sz <- 10  ## minimum gene set size
235
+max.sz <- 100 ## maximum gene set size
236
+X <- matrix(rnorm(p*n), nrow=p, dimnames=list(1:p, 1:n))
237
+dim(X)
238
+gs <- as.list(sample(min.sz:max.sz, size=nGS, replace=TRUE)) ## sample gene set sizes
239
+gs <- lapply(gs, function(n, p) sample(1:p, size=n, replace=FALSE), p) ## sample gene sets
240
+es.max <- gsva(X, gs, mx.diff=FALSE, verbose=FALSE)$es.obs
241
+es.dif <- gsva(X, gs, mx.diff=TRUE, verbose=FALSE)$es.obs
242
+@
243
+
244
+\begin{center}
245
+<<maxvsdif, fig=TRUE, png=TRUE, echo=TRUE, height=5, width=8>>=
246
+par(mfrow=c(1,2), mar=c(4, 4, 4, 1))
247
+plot(density(as.vector(es.max)), main="Maximum deviation from zero",
248
+     xlab="GSVA score", lwd=2, las=1, xaxt="n", xlim=c(-0.75, 0.75), cex.axis=0.8)
249
+axis(1, at=seq(-0.75, 0.75, by=0.25), labels=seq(-0.75, 0.75, by=0.25), cex.axis=0.8)
250
+plot(density(as.vector(es.dif)), main="Difference between largest\npositive and negative deviations",
251
+     xlab="GSVA score", lwd=2, las=1, xaxt="n", xlim=c(-0.75, 0.75), cex.axis=0.8)
252
+axis(1, at=seq(-0.75, 0.75, by=0.25), labels=seq(-0.75, 0.75, by=0.25), cex.axis=0.8)
253
+@
254
+\end{center}
255
+
256
+\bigskip
257
+Although the GSVA algorithm itself does not evaluate statistical significance for the
258
+enrichment of gene sets, significance with respect to one or more phenotypes can be easily
259
+evaluated using conventional statistical models. Likewise, false discovery rates can be
260
+estimated by permuting the sample labels (Methods).  Examples of these techniques are
261
+provided in the following section.
184 262
 
185 263
 \section{Overview of the package}
186 264
 
... ...
@@ -236,7 +314,7 @@ the parallelization of the calculations through a cluster of computers. In order
236 314
 to activate this option we should specify in the argument \Robject{parallel.sz}
237 315
 the number of processors we want to use (default is zero which means no
238 316
 parallelization is going to be employed). The other is possibility is loading
239
-the library \Rpackage{multicore} and then the \Rfunction{gsva()} function will
317
+the library \Rpackage{parallel} and then the \Rfunction{gsva()} function will
240 318
 use the core processors of the computer where R is running. If we want to
241 319
 limit \Rfunction{gsva()} in the number of core processors that is should use
242 320
 we can do it by specifying such a value in the \Robject{parallel.sz} argument.
... ...
@@ -247,6 +325,14 @@ serve to explicitly filter out gene sets by size and by pairwise overlap,
247 325
 respectively. Note that the size filter can be also applied within the
248 326
 \Rfunction{gsva()} function call.
249 327
 
328
+The \Rfunction{gsva()} function offers also two other different unsupervised
329
+GSE methods that yield single sample enrichment scores and can be selected
330
+through the \Robject{method} argument. One of them is a combined z-score
331
+approach \citep{lee_inferring_2008} (\Robject{method="zscore"}) and the other
332
+is the single-sample GSEA approach \citep{barbie_systematic_2009}
333
+(\Robject{method="ssgsea"}). By default \Robject{method="gsva"} and the
334
+Rfunction{gsva()} function uses the GSVA algorithm described before.
335
+
250 336
 \section{Applications}
251 337
 
252 338
 In this section we illustrate the following applications of \Rpackage{GSVA}:
... ...
@@ -512,7 +598,6 @@ dev.off()
512 598
 \label{leukemiaHeatmapGenes}
513 599
 \end{figure}
514 600
 
515
-
516 601
 \subsection{Molecular signature identification}
517 602
 
518 603
 In \citep{verhaak_integrated_2010} four subtypes of Glioblastoma multiforme
... ...
@@ -862,7 +947,7 @@ gHubNet <- layoutGraph(gHubNet, layoutType="fdp")
862 947
 renderGraph(gHubNet)
863 948
 @
864 949
 \begin{figure}[p]
865
-\centerline{\includegraphics[height=\textheight]{GSVA-selectedGGMhighDeg}}
950
+\centerline{\includegraphics[height=0.9\textheight]{GSVA-selectedGGMhighDeg}}
866 951
 \caption{Network of cross-talk associations between Broad C2 canonical pathways
867 952
          of the leukemia data set obtained by a Gaussian graphical modeling approach
868 953
          by which edges are included in the graph with a minimum absolute PCC > 0.7
... ...
@@ -889,6 +974,288 @@ methyltransferase {\it MTR} gene that encodes an enzyme that catalyzes the final
889 974
 biosynthesis and has been associated to an increased risk of ALL which was most pronounced for cases
890 975
 with the \textit{MLL} translocation \citep{lightfoot_genetic_2010}.
891 976
 
977
+\section{Comparison with other methods}
978
+
979
+In this section we compare the performance of GSVA with the two other methods available
980
+through the argument \Robject{method} of the function \Rfunction{gsva()} to condense
981
+gene-level expression into samplewise pathway-level enrichment scores. We start by
982
+calculating fold-changes using limma:
983
+
984
+<<>>=
985
+design <- model.matrix(~ factor(leukemia_eset$subtype))
986
+fitGenes <- lmFit(leukemia_eset, design)
987
+fitGenes <- eBayes(fitGenes)
988
+ttgenes <- topTable(fitGenes, coef=2, n=Inf)
989
+@
990
+Next, we split all \Sexpr{nrow(ttgenes)} genes into three equally sized fractions
991
+of the ranking by fold change:
992
+
993
+<<>>=
994
+abslogfc <- abs(ttgenes$logFC)
995
+names(abslogfc) <- ttgenes$ID
996
+qtlsabslogfc <- quantile(abslogfc, p=seq(0, 1, by=1/3))
997
+abslogfccutoff <- cut(abslogfc, qtlsabslogfc, include.lowest=TRUE)
998
+genesbyabslogfccutoff <- split(ttgenes$ID, abslogfccutoff)
999
+@
1000
+We want compare the performance of GSVA in producing enrichment scores that enable
1001
+the signature identification of a dichotomous phenotype under an scenario of subtle
1002
+changes. For this purpose we will use the genes in the second tercile of fold-changes
1003
+to form gene sets and calculate enrichment scores for them with each of the three
1004
+methods. Using these enrichment scores and the package \Rpackage{limma} we will select
1005
+the top five differentially expressed gene sets:
1006
+
1007
+<<>>=
1008
+ntopgenesets <- 5
1009
+@
1010
+and use them to build a hierarchical clustering of the samples whose two main branches
1011
+are taken as the predicted sample classification. The accuracy of this classification
1012
+is then measured in terms of the adjusted rand index (ARI); see \citet{hubert_ari_1985}.
1013
+We will apply this procedure through data sets of 10 samples per condition bootstrapped
1014
+from the original leukemia data set. This entire strategy is coded in the following function:
1015
+
1016
+<<>>=
1017
+computeARI <- function(geneSets, eset, selGenes, ntopgenesets) {
1018
+  eset <- eset[selGenes, c(sample(which(eset$subtype == "ALL"), size=10),
1019
+                           sample(which(eset$subtype == "MLL"), size=10))]
1020
+  e <- exprs(eset)
1021
+  design <- model.matrix(~ factor(eset$subtype))
1022
+
1023
+  es.gsva <- gsva(e, geneSets, min.sz=10, max.sz=500, verbose=FALSE)$es.obs
1024
+  fit <- lmFit(es.gsva, design)
1025
+  fit <- eBayes(fit)
1026
+  topgeneset <- topTable(fit, coef=2)$ID[1:ntopgenesets]
1027
+
1028
+  sampleClustering <- hclust(as.dist(1-cor(es.gsva[topgeneset, ], method="spearman")), method="complete")
1029
+  ari.gsva <- mclust::adjustedRandIndex(as.vector(cutree(sampleClustering, k=2)),
1030
+                                        as.integer(factor(eset$subtype)))
1031
+
1032
+  es.ss <- gsva(e, geneSets, method="ssgsea", min.sz=10, max.sz=500, verbose=FALSE)
1033
+  fit <- lmFit(es.ss, design)
1034
+  fit <- eBayes(fit)
1035
+  topgeneset <- topTable(fit, coef=2)$ID[1:ntopgenesets]
1036
+
1037
+  sampleClustering <- hclust(as.dist(1-cor(es.ss[topgeneset, ], method="spearman")), method="complete")
1038
+  ari.ss <- mclust::adjustedRandIndex(as.vector(cutree(sampleClustering, k=2)),
1039
+                                      as.integer(factor(eset$subtype)))
1040
+
1041
+  es.z <- gsva(e, geneSets, method="zscore", min.sz=10, max.sz=500, verbose=FALSE)
1042
+  fit <- lmFit(es.z, design)
1043
+  fit <- eBayes(fit)
1044
+  topgeneset <- topTable(fit, coef=2)$ID[1:ntopgenesets]
1045
+
1046
+  sampleClustering <- hclust(as.dist(1-cor(es.z[topgeneset, ], method="spearman")), method="complete")
1047
+  ari.z <- mclust::adjustedRandIndex(as.vector(cutree(sampleClustering, k=2)),
1048
+                                     as.integer(factor(eset$subtype)))
1049
+
1050
+  c(ari.gsva, ari.ss, ari.z)
1051
+}
1052
+@
1053
+
1054
+In order to make computations feasible for this vignette we consider only gene sets in the C2 Broad
1055
+Collection whose name includes either the word \verb+LEUKEMIA+ or \verb+MLL+ which should makes them
1056
+more relevant to the ALL/MLL leukemia data we are using:
1057
+
1058
+<<>>=
1059
+mappedGeneSets <- geneIds(mapIdentifiers(c2BroadSets[grep("LEUKEMIA|MLL", names(c2BroadSets))],
1060
+                                         GSEABase::AnnoOrEntrezIdentifier(annotation(leukemia_eset))))
1061
+length(mappedGeneSets)
1062
+names(mappedGeneSets)
1063
+@
1064
+Finally, we carry out the simulation on 50 bootstrapped data sets as follows:
1065
+
1066
+<<>>=
1067
+set.seed(123)
1068
+ari <- replicate(50, computeARI(mappedGeneSets, leukemia_eset,
1069
+                                genesbyabslogfccutoff[[2]], ntopgenesets))
1070
+@
1071
+where we have set the seed on the random number generator to enable reproducing the exact result.
1072
+In Figure~\ref{comp_methods} we can see the 50 ARI values per method which are higher in GSVA than with
1073
+ssGSEA or with the combined z-score approach. On top of each of the boxplots for these two compared
1074
+ethods we show the $P$ values of the Kruskal-Wallis rank sum test of the null that the location
1075
+arameters of the GSVA ARI values are the same than those from ssGSEA and the combined z-score,
1076
+respectively. Both $P$ values indicate that GSVA provides larger power than ssGSEA and z-score to
1077
+detect subtle changes in pathway activity.
1078
+
1079
+<<perf, echo=FALSE, results=hide>>=
1080
+png(filename="GSVA-perf.png", width=1000, height=500, res=150)
1081
+par(mfrow=c(1,2), mar=c(4, 4, 1, 1))
1082
+design <- model.matrix(~ factor(leukemia_eset$subtype))
1083
+fitGenes <- lmFit(leukemia_eset, design)
1084
+fitGenes <- eBayes(fitGenes)
1085
+ttgenes_leu <- topTable(fitGenes, coef=2, n=Inf)
1086
+abslogfc_leu <- abs(ttgenes_leu$logFC)
1087
+names(abslogfc_leu) <- ttgenes_leu$ID
1088
+qtlsabslogfc_leu <- quantile(abslogfc_leu, p=seq(0, 1, by=1/3))
1089
+abslogfccutoff_leu <- cut(abslogfc_leu, qtlsabslogfc_leu, include.lowest=TRUE)
1090
+genesbyabslogfccutoff_leu <- split(ttgenes_leu$ID, abslogfccutoff_leu)
1091
+M <- ttgenes_leu$logFC
1092
+P <- -log10(ttgenes_leu$P.Value)
1093
+names(M) <- names(P) <- ttgenes_leu$ID
1094
+plot(M, P, pch=".", cex=1, type="n",
1095
+     xlim=c(-2, 2), ylim=c(0, 15), xlab=expression(paste(log[2], " ALL", -log[2], " MLL")),
1096
+     ylab=expression(paste(-log[10], " P-value")), main="", las=1)
1097
+colrs <- c("blue", "violet", "red")
1098
+for (i in length(genesbyabslogfccutoff_leu):1) {
1099
+  points(M[genesbyabslogfccutoff_leu[[i]]], P[genesbyabslogfccutoff_leu[[i]]], pch=".", cex=1, col=colrs[i])
1100
+}
1101
+boxplot(t(ari), names=c("GSVA", "ssGSEA", "zscore"), las=1, xlab="Method", ylab="Adjusted Rand Index (ARI)")
1102
+points(1:3+0.25, rowMeans(ari), pch=23, col="black", bg="black")
1103
+pp <- c(kruskal.test(list(ari[1,], ari[2,]))$p.value, kruskal.test(list(ari[1,], ari[3,]))$p.value)
1104
+pp <- sprintf("P=%.3g", pp)
1105
+text(1:3, 0.9*rep(max(ari), 3), c(" ", paste("Kruskal-Wallis", pp, sep="\n")), cex=0.5)
1106
+dev.off()
1107
+@
1108
+\begin{figure}[ht]
1109
+\centerline{\includegraphics[width=\textwidth]{GSVA-perf}}
1110
+\caption{{\bf Comparison of GSVA with single sample GSEA (ssGSEA) and combined z-score (zscore).}
1111
+         On the left a volcano plot is shown where genes highlighted in red form the first tercile
1112
+         of largest absolute fold-changes, violet indicates the second tercile and blue the third
1113
+         tercile. On the right, boxplots of the adjusted rand index (ARI) values are shown for each
1114
+         method using gene sets formed from genes belonging to the second tercile of fold-change
1115
+         magnitude (violet in the volcano plot). Diamonds indicate mean value.}
1116
+\label{comp_methods}
1117
+\end{figure}
1118
+
1119
+\section{GSVA for RNA-Seq data}
1120
+
1121
+In this section we illustrate how to use GSVA with RNA-seq data and, more importantly, how the
1122
+method provides pathway activity profiles analogous to the ones obtained from microarray data by
1123
+using samples of lymphoblastoid cell lines (LCL) from HapMap individuals which have been profiled
1124
+using both technologies \cite{huang_genome-wide_2007, pickrell_understanding_2010}. These data
1125
+form part of the experimental package \Rpackage{GSVAdata} and the corresponding help pages contain
1126
+details on how the data were processed. We start loading these data and verifying that
1127
+they indeed contain expression data for the same genes and samples, as follows:
1128
+
1129
+<<>>=
1130
+data(commonPickrellHuang)
1131
+
1132
+stopifnot(identical(featureNames(huangArrayRMAnoBatchCommon_eset),
1133
+                    featureNames(pickrellCountsArgonneCQNcommon_eset)))
1134
+stopifnot(identical(sampleNames(huangArrayRMAnoBatchCommon_eset),
1135
+                    sampleNames(pickrellCountsArgonneCQNcommon_eset)))
1136
+@
1137
+Next, for the purpose of the current analysis we are going to use the previously defined collection
1138
+of canonical C2 Broad Sets enlarged with two additional ones corresponding to genes with sex-specific
1139
+expression:
1140
+
1141
+<<<>>=
1142
+data(genderGenesEntrez)
1143
+
1144
+MSY <- GeneSet(msYgenesEntrez, geneIdType=EntrezIdentifier(),
1145
+               collectionType=BroadCollection(category="c2"), setName="MSY")
1146
+MSY
1147
+XiE <- GeneSet(XiEgenesEntrez, geneIdType=EntrezIdentifier(),
1148
+               collectionType=BroadCollection(category="c2"), setName="XiE")
1149
+XiE
1150
+
1151
+
1152
+canonicalC2BroadSets <- GeneSetCollection(c(canonicalC2BroadSets, MSY, XiE))
1153
+canonicalC2BroadSets
1154
+@
1155
+We calculate now GSVA enrichment scores for these gene sets using first the microarray
1156
+data and then the RNA-seq data. Note that the only requirement to do the latter is to
1157
+set the argument \Robject{rnaseq=TRUE} which is \Robject{FALSE} by default.
1158
+
1159
+<<<>>=
1160
+esmicro <- gsva(huangArrayRMAnoBatchCommon_eset, canonicalC2BroadSets, min.sz=5, max.sz=500,
1161
+                mx.diff=TRUE, verbose=FALSE, rnaseq=FALSE)$es.obs
1162
+dim(esmicro)
1163
+esrnaseq <- gsva(pickrellCountsArgonneCQNcommon_eset, canonicalC2BroadSets, min.sz=5, max.sz=500,
1164
+                 mx.diff=TRUE, verbose=FALSE, rnaseq=TRUE)$es.obs
1165
+dim(esrnaseq)
1166
+@
1167
+In order to compare expression values from both technologies we are going to transform
1168
+the RNA-seq read counts into RPKM values. For this purpose we need gene length and G+C content
1169
+information also stored in the \Rpackage{GSVAdata} package and use the \Rfunction{cpm()}
1170
+function from the \Rpackage{edgeR} package. Note that RPKMs can only be calculated for those
1171
+genes for which the gene length and G+C content information is available:
1172
+
1173
+<<>>=
1174
+data(annotEntrez220212)
1175
+head(annotEntrez220212)
1176
+
1177
+cpm <- edgeR::cpm(exprs(pickrellCountsArgonneCQNcommon_eset))
1178
+dim(cpm)
1179
+
1180
+common <- intersect(rownames(cpm), rownames(annotEntrez220212))
1181
+length(common)
1182
+
1183
+rpkm <- sweep(cpm[common, ], 1, annotEntrez220212[common, "Length"] / 10^3, FUN="/")
1184
+dim(rpkm)
1185
+
1186
+dim(huangArrayRMAnoBatchCommon_eset[rownames(rpkm), ])
1187
+@
1188
+We finally calculate Spearman correlations between gene and gene-level expression values
1189
+and gene-set level GSVA enrichment scores produced from data obtained by microarray and
1190
+RNA-seq technologies:
1191
+
1192
+<<>>=
1193
+corsrowsgene <- sapply(1:nrow(huangArrayRMAnoBatchCommon_eset[rownames(rpkm), ]),
1194
+                       function(i, expmicro, exprnaseq) cor(expmicro[i, ], exprnaseq[i, ], method="pearson"),
1195
+                       exprs(huangArrayRMAnoBatchCommon_eset[rownames(rpkm), ]), log2(rpkm+0.1))
1196
+names(corsrowsgene) <- rownames(rpkm)
1197
+
1198
+corsrowsgs <- sapply(1:nrow(esmicro),
1199
+                     function(i, esmicro, esrnaseq) cor(esmicro[i, ], esrnaseq[i, ], method="spearman"),
1200
+                     exprs(esmicro), exprs(esrnaseq))
1201
+names(corsrowsgs) <- rownames(esmicro)
1202
+@
1203
+In panels A and B of Figure~\ref{rnaseqcomp} we can see the distribution of these correlations at
1204
+gene and gene-set level and they show that GSVA enrichment scores correlate similarly to gene
1205
+expression levels produced by both profiling technologies.
1206
+
1207
+We have also examined in detail two gene sets containing gender specific genes: those that escape
1208
+X-inactivation in female samples \cite{carrel_x-inactivation_2005} and those that are located on
1209
+the male-specific region of the Y chrosomome \cite{skaletsky_male-specific_2003}. In panels C and D
1210
+of Figure~\ref{rnaseqcomp} we can see how microarray and RNA-seq enrichment scores correlate very
1211
+well in these gene sets, with $\rho=0.82$ for the male-specific gene set and $\rho=0.78$ for the
1212
+female-specific gene set. Male and female samples show higher GSVA enrichment scores in their
1213
+corresponding gene set. This demonstrates the flexibility of GSVA to enable analogous unsupervised
1214
+and single sample GSE analyses in data coming from both, microarray and RNA-seq technologies.
1215
+
1216
+<<RNAseqComp, echo=FALSE, results=hide>>=
1217
+png(filename="GSVA-RNAseqComp.png", width=1100, height=1100, res=150)
1218
+par(mfrow=c(2,2), mar=c(4, 5, 3, 2))
1219
+hist(corsrowsgene, xlab="Spearman correlation", main="Gene level\n(RNA-seq RPKM vs Microarray RMA)",
1220
+     xlim=c(-1, 1), col="grey", las=1)
1221
+par(xpd=TRUE)
1222
+text(par("usr")[1]*1.5, par("usr")[4]*1.1, "A", font=2, cex=2)
1223
+par(xpd=FALSE)
1224
+hist(corsrowsgs, xlab="Spearman correlation", main="Gene set level\n(GSVA enrichment scores)",
1225
+     xlim=c(-1, 1), col="grey", las=1)
1226
+par(xpd=TRUE)
1227
+text(par("usr")[1]*1.5, par("usr")[4]*1.1, "B", font=2, cex=2)
1228
+par(xpd=FALSE)
1229
+plot(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ], xlab="GSVA scores RNA-seq", ylab="GSVA scores microarray",
1230
+     main=sprintf("MSY R=%.2f", cor(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ])), las=1, type="n")
1231
+     sprintf("MSY R=%.2f", cor(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ]))
1232
+abline(lm(exprs(esmicro)["MSY", ] ~ exprs(esrnaseq)["MSY", ]), lwd=2, lty=2, col="grey")
1233
+points(exprs(esrnaseq)["MSY", pickrellCountsArgonneCQNcommon_eset$Gender == "Female"],
1234
+       exprs(esmicro)["MSY", huangArrayRMAnoBatchCommon_eset$Gender == "Female"], col="red", pch=21, bg="red", cex=1)
1235
+points(exprs(esrnaseq)["MSY", pickrellCountsArgonneCQNcommon_eset$Gender == "Male"],
1236
+       exprs(esmicro)["MSY", huangArrayRMAnoBatchCommon_eset$Gender == "Male"], col="blue", pch=21, bg="blue", cex=1)
1237
+par(xpd=TRUE)
1238
+text(par("usr")[1]*1.5, par("usr")[4]*1.1, "C", font=2, cex=2)
1239
+par(xpd=FALSE)
1240
+plot(exprs(esrnaseq)["XiE", ], exprs(esmicro)["XiE", ], xlab="GSVA scores RNA-seq", ylab="GSVA scores microarray",
1241
+     main=sprintf("XiE R=%.2f", cor(exprs(esrnaseq)["XiE", ], exprs(esmicro)["XiE", ])), las=1, type="n")
1242
+abline(lm(exprs(esmicro)["XiE", ] ~ exprs(esrnaseq)["XiE", ]), lwd=2, lty=2, col="grey")
1243
+points(exprs(esrnaseq["XiE", pickrellCountsArgonneCQNcommon_eset$Gender == "Female"]),
1244
+       exprs(esmicro)["XiE", huangArrayRMAnoBatchCommon_eset$Gender == "Female"], col="red", pch=21, bg="red", cex=1)
1245
+points(exprs(esrnaseq)["XiE", pickrellCountsArgonneCQNcommon_eset$Gender == "Male"],
1246
+       exprs(esmicro)["XiE", huangArrayRMAnoBatchCommon_eset$Gender == "Male"], col="blue", pch=21, bg="blue", cex=1)
1247
+par(xpd=TRUE)
1248
+text(par("usr")[1]*1.5, par("usr")[4]*1.1, "D", font=2, cex=2)
1249
+par(xpd=FALSE)
1250
+dev.off()
1251
+@
1252
+\begin{figure}[p]
1253
+\centerline{\includegraphics[height=0.5\texheight]{GSVA-RNAseqComp}}
1254
+\caption{{\bf GSVA for RNA-seq (Argonne).} A. Distribution of Spearman correlation values between gene expression profiles of RNA-seq and microarray data. B. Distribution of Spearman correlation values between GSVA enrichment scores of gene sets calculated from RNA-seq and microarray data. C and D. Comparison of GSVA enrichment scores obtained from microarray and RNA-seq data for two gene sets containing genes with sex-specific expression: MSY formed by genes from the male-specific region of the Y chromosome, thus male-specific, and XiE formed by genes that escape X-inactivation in females, thus female-specific. Red and blue dots represent female and male samples, respectively. In both cases GSVA-scores show very high correlation between the two profiling technologies where female samples show higher enrichment scores in the female-specific gene set and male samples show higher enrichment scores in the male-specific gene set.}
1255
+\label{rnaseqcomp}
1256
+\end{figure}
1257
+
1258
+
892 1259
 \section{Session Information}
893 1260
 
894 1261
 <<info, results=tex>>=
... ...
@@ -1,8 +1,8 @@
1 1
 @article{GSVA_submitted,
2 2
 	Author = {Sonja H\"anzelmann and Robert Castelo and Justin Guinney},
3 3
 	Journal = {submitted},
4
-	Title = {{GSVA}: Gene Set Variation Analysis},
5
-	Year = {2011}}
4
+	Title = {{GSVA}: Gene Set Variation Analysis for microarray and RNA-Seq data},
5
+	Year = {2012}}
6 6
 
7 7
 
8 8
 @article{cahoy_transcriptome_2008,
... ...
@@ -75,7 +75,7 @@
75 75
 	Issn = {1532-4435},
76 76
 	Journal = {J Mach Learn Res},
77 77
 	Pages = {2621--2650},
78
-	Title = {A robust procedure for Gaussian graphical model search from microarray data with p larger than n},
78
+	Title = {A robust procedure for {G}aussian graphical model search from microarray data with p larger than n},
79 79
 	Volume = {7},
80 80
 	Year = {2006}}
81 81
 
... ...
@@ -110,7 +110,7 @@
110 110
 	Author = {Roverato, A. and Whittaker, J.},
111 111
 	Journal = {Stat Comput},
112 112
 	Pages = {297--302},
113
-	Title = {Standard errors for the parameters of graphical Gaussian models},
113
+	Title = {Standard errors for the parameters of graphical {G}aussian models},
114 114
 	Volume = {6},
115 115
 	Year = {1996}}
116 116
 
... ...
@@ -301,3 +301,120 @@
301 301
   Address = {Helsinki},
302 302
 	Year = {2010},
303 303
 }
304
+
305
+@article{zilliox_gene_2007,
306
+  title = {A gene expression bar code for microarray data},
307
+  volume = {4},
308
+  url = {http://dx.doi.org/10.1038/nmeth1102},
309
+  number = {11},
310
+  journal = {Nat Meth},
311
+  author = {Zilliox, Michael J and Irizarry, Rafael A},
312
+  year = {2007},
313
+  pages = {911--913}
314
+}
315
+
316
+@article{hansen_removing_2012,
317
+  title = {Removing technical variability in {RNA-seq} data using conditional quantile normalization},
318
+  journal = {Biostatistics},
319
+  author = {Hansen, Kasper D. and Irizarry, Rafael A. and Wu, Zhijin},
320
+  year = {2012}
321
+}
322
+
323
+@article{canale_bayesian_2011,
324
+  title = {Bayesian kernel mixtures for counts},
325
+  volume = {106},
326
+  number = {496},
327
+  journal = {Journal of the American Statistical Association},
328
+  author = {Canale, Antonio and Dunson, David B.},
329
+  year = {2011},
330
+  pages = {1528--1539}
331
+}
332
+
333
+@article{hubert_ari_1985,
334
+  title = {Comparing partitions},
335
+  volume = {2},
336
+  number = {1},
337
+  journal = {Journal of Classification},
338
+  author = {Hubert, Lawrence and Arabie, Phipps},
339
+  year = {1985},
340
+  pages = {193--218}
341
+}
342
+
343
+@article{edelman_analysis_2006,
344
+  title = {Analysis of sample set enrichment scores: assaying the enrichment of sets of genes for individual samples in genome-wide expression profiles},
345
+  volume = {22},
346
+  number = {14},
347
+  journal = {Bioinformatics},
348
+  author = {Edelman, Elena and Porrello, Alessandro and Guinney, Justin and Balakumaran, Bala and Bild, Andrea and Febbo, Phillip G and Mukherjee, Sayan},
349
+  year = {2006},
350
+  pages = {e108--e116}
351
+}
352
+
353
+@article{pearson_comparison_1963,
354
+  title = {Comparison of tests for randomness of points on a line},
355
+  volume = {50},
356
+  journal = {Biometrika},
357
+  author = {Pearson, {E.S.}},
358
+  year = {1963},
359
+  pages = {315--325}
360
+}
361
+
362
+@article{lee_inferring_2008,
363
+  title = {Inferring Pathway Activity toward Precise Disease Classification},
364
+  volume = {4},
365
+  number = {11},
366
+  journal = {{PLoS} Comput Biol},
367
+  author = {Lee, Eunjung and Chuang, {Han-Yu} and Kim, {Jong-Won} and Ideker, Trey and Lee, Doheon},
368
+  year = {2008},
369
+  pages = {e1000217}
370
+}
371
+
372
+@article{barbie_systematic_2009,
373
+  title = {Systematic {RNA} interference reveals that oncogenic {KRAS-driven} cancers require {TBK1}},
374
+  volume = {462},
375
+  number = {7269},
376
+  journal = {Nature},
377
+  author = {Barbie, David A. and Tamayo, Pablo and Boehm, Jesse S. and Kim, So Young and Moody, Susan E. and Dunn, Ian F. and Schinzel, Anna C. and Sandy, Peter and Meylan, Etienne and Scholl, Claudia and Fr{\textbackslash}textbackslashtextbar[ouml]{\textbackslash}textbackslashtextbarhling, Stefan and Chan, Edmond M. and Sos, Martin L. and Michel, Kathrin and Mermel, Craig and Silver, Serena J. and Weir, Barbara A. and Reiling, Jan H. and Sheng, Qing and Gupta, Piyush B. and Wadlow, Raymond C. and Le, Hanh and Hoersch, Sebastian and Wittner, Ben S. and Ramaswamy, Sridhar and Livingston, David M. and Sabatini, David M. and Meyerson, Matthew and Thomas, Roman K. and Lander, Eric S. and Mesirov, Jill P. and Root, David E. and Gilliland, D. Gary and Jacks, Tyler and Hahn, William C.},
378
+  year = {2009},
379
+  pages = {108--112}
380
+}
381
+
382
+@article{huang_genome-wide_2007,
383
+  title = {A genome-wide approach to identify genetic variants that contribute to etoposide-induced cytotoxicity},
384
+  volume = {104},
385
+  author = {Huang, R Stephanie and Duan, Shiwei and Bleibel, Wasim K and Kistner, Emily O and Zhang, Wei and Clark, Tyson A and Chen, Tina X and Schweitzer, Anthony C and Blume, John E and Cox, Nancy J and Dolan, M Eileen},
386
+  number = {23},
387
+  journal = {Proceedings of the National Academy of Sciences of the United States of America},
388
+  year = {2007},
389
+  pages = {9758--9763}
390
+}
391
+
392
+@article{pickrell_understanding_2010,
393
+  title = {Understanding mechanisms underlying human gene expression variation with {RNA} sequencing},
394
+  volume = {464},
395
+  number = {7289},
396
+  journal = {Nature},
397
+  author = {Pickrell, Joseph K. and Marioni, John C. and Pai, Athma A. and Degner, Jacob F. and Engelhardt, Barbara E. and Nkadori, Everlyne and Veyrieras, {Jean-Baptiste} and Stephens, Matthew and Gilad, Yoav and Pritchard, Jonathan K.},
398
+  year = {2010},
399
+  pages = {768--772}
400
+}
401
+
402
+@article{carrel_x-inactivation_2005,
403
+  title = {X-inactivation profile reveals extensive variability in X-linked gene expression in females},
404
+  volume = {434},
405
+  number = {7031},
406
+  journal = {Nature},
407
+  author = {Carrel, Laura and Willard, Huntington F},
408
+  year = {2005},
409
+  pages = {400--404}
410
+}
411
+
412
+@article{skaletsky_male-specific_2003,
413
+  title = {The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes},
414
+  volume = {423},
415
+  number = {6942},
416
+  journal = {Nature},
417
+  author = {Skaletsky, Helen and {Kuroda-Kawaguchi}, Tomoko and Minx, Patrick J. and Cordum, Holland S. and Hillier, {LaDeana} and Brown, Laura G. and Repping, Sjoerd and Pyntikova, Tatyana and Ali, Johar and Bieri, Tamberlyn and Chinwalla, Asif and Delehaunty, Andrew and Delehaunty, Kim and Du, Hui and Fewell, Ginger and Fulton, Lucinda and Fulton, Robert and Graves, Tina and Hou, {Shun-Fang} and Latrielle, Philip and Leonard, Shawn and Mardis, Elaine and Maupin, Rachel and {McPherson}, John and Miner, Tracie and Nash, William and Nguyen, Christine and Ozersky, Philip and Pepin, Kymberlie and Rock, Susan and Rohlfing, Tracy and Scott, Kelsi and Schultz, Brian and Strong, Cindy and {Tin-Wollam}, Aye and Yang, {Shiaw-Pyng} and Waterston, Robert H. and Wilson, Richard K. and Rozen, Steve and Page, David C.},
418
+  year = {2003},
419
+  pages = {825--837}
420
+}
304 421
new file mode 100644
305 422
Binary files /dev/null and b/inst/doc/methods.png differ
... ...
@@ -72,7 +72,8 @@ void row_d(double* x, double* y, double* r, int size_density_n, int size_test_n,
72 72
 			left_tail += rnaseq ? ppois(y[j], x[i]+bw, TRUE, FALSE) : precomputedCdf(y[j]-x[i], bw);
73 73
 		}
74 74
 		left_tail = left_tail / size_density_n;
75
-		r[j] = -1.0 * log((1.0-left_tail)/left_tail);
75
+		/* r[j] = -1.0 * log((1.0-left_tail)/left_tail); */ /* log-odds not necessary */
76
+    r[j] = left_tail;
76 77
 	}
77 78
 }
78 79