git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/GSVA@65908 bc3139a8-67e5-0310-9ffc-ced21a209358
... | ... |
@@ -1,13 +1,13 @@ |
1 | 1 |
Package: GSVA |
2 |
-Version: 1.5.1 |
|
3 |
-Date: 2012-03-23 |
|
4 |
-Title: Gene Set Variation Analysis |
|
2 |
+Version: 1.5.2 |
|
3 |
+Date: 2012-05-15 |
|
4 |
+Title: Gene Set Variation Analysis for microarray and RNA-seq data |
|
5 | 5 |
Author: Justin Guinney <justin.guinney@sagebase.org> (with contributions from Robert Castelo <robert.castelo@upf.edu> and Sonja Haenzelmann <shanzelmann@imim.es) |
6 | 6 |
Maintainer: Justin Guinney <justin.guinney@sagebase.org> |
7 | 7 |
Depends: R (>= 2.13.0), methods, GSEABase (>= 1.17.4) |
8 | 8 |
Imports: methods, Biobase, GSEABase |
9 | 9 |
Enhances: snow, parallel |
10 |
-Suggests: limma, qpgraph, RBGL, graph, Rgraphviz, RColorBrewer, genefilter, GSVAdata |
|
10 |
+Suggests: limma, qpgraph, RBGL, graph, Rgraphviz, RColorBrewer, genefilter, mclust, edgeR, GSVAdata |
|
11 | 11 |
Description: Gene Set Variation Analysis (GSVA) is a non-parametric, unsupervised method for estimating variation of gene set enrichment through the samples of a expression data set. GSVA performs a change in coordinate systems, transforming the data from a gene by sample matrix to a gene-set by sample matrix, thereby allowing the evaluation of pathway enrichment for each sample. This new matrix of GSVA enrichment scores facilitates applying standard analytical methods like functional enrichment, survival analysis, clustering, CNV-pathway analysis or cross-tissue pathway analysis, in a pathway-centric manner. |
12 | 12 |
License: GPL (>= 2) |
13 | 13 |
LazyLoad: yes |
... | ... |
@@ -8,6 +8,6 @@ citEntry(entry="Article", |
8 | 8 |
journal = "submitted", |
9 | 9 |
year = "2011", |
10 | 10 |
textVersion = paste("Hanzelmann, S., Castelo, R. and Guinney, A.", |
11 |
- "GSVA: Gene Set Variation Analysis", |
|
12 |
- "submitted, 2011.") |
|
11 |
+ "GSVA: Gene Set Variation Analysis for microarray and RNA-seq data", |
|
12 |
+ "submitted, 2012.") |
|
13 | 13 |
) |
... | ... |
@@ -15,8 +15,8 @@ |
15 | 15 |
\newcommand{\Rpackage}[1]{{\textsf{#1}}} |
16 | 16 |
\newcommand{\Rclass}[1]{{\textit{#1}}} |
17 | 17 |
|
18 |
-\title{GSVA - The Gene Set Variation Analysis Package} |
|
19 |
-\author{Sonja H\"anzelmann, Robert Castelo and Justin Guinney} |
|
18 |
+\title{GSVA: The Gene Set Variation Analysis package \\ for microarray and RNA-seq data} |
|
19 |
+\author{Sonja H\"anzelmann$^1$, Robert Castelo$^1$ and Justin Guinney$^2$} |
|
20 | 20 |
|
21 | 21 |
\begin{document} |
22 | 22 |
|
... | ... |
@@ -24,6 +24,14 @@ |
24 | 24 |
|
25 | 25 |
\maketitle |
26 | 26 |
|
27 |
+\begin{quote} |
|
28 |
+{\scriptsize |
|
29 |
+1. Research Program on Biomedical Informatics (GRIB), Hospital del Mar Research Institute (IMIM) and Universitat Pompeu Fabra, Parc de Recerca Biom\`edica de Barcelona, Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain |
|
30 |
+ |
|
31 |
+2. Sage Bionetworks, 1100 Fairview Ave N., Seattle, Washington, 98109 USA |
|
32 |
+} |
|
33 |
+\end{quote} |
|
34 |
+ |
|
27 | 35 |
\begin{abstract} |
28 | 36 |
The \Rpackage{GSVA} package implements a non-parametric unsupervised method, |
29 | 37 |
called Gene Set Variation Analysis (GSVA), for assessing gene set enrichment |
... | ... |
@@ -85,102 +93,172 @@ GSVA, as well as for citing GSVA if you use it in your own work. |
85 | 93 |
|
86 | 94 |
\section{GSVA enrichment scores} |
87 | 95 |
|
88 |
-GSVA enrichment scores are calculated from two main inputs: a matrix |
|
89 |
-$X=\{x_{ij}\}_{p\times n}$ of expression values for $p$ genes through $n$ |
|
90 |
-samples, where typically $p\gg n$, and a collection of gene sets |
|
91 |
-$\Gamma = \{\gamma_1, \dots, \gamma_m\}$. We shall denote by $x_i$ the |
|
92 |
-expression profile of the $i$-th gene, by $x_{ij}$ the specific expression |
|
93 |
-value of the $i$-th gene in the $j$-th sample, and by $\gamma_k$ the subset |
|
94 |
-of row indices in $X$ such that $\gamma_k \subset \{1,\ldots\,p\}$ defines a |
|
95 |
-set of genes forming a pathway or some other functional unit. Let $|\gamma_k |$ |
|
96 |
-be the number of genes in the gene set. |
|
97 |
- |
|
98 |
-The first step in the calculation consists of evaluating whether a gene $i$ is |
|
99 |
-highly or lowly expressed in sample $j$ in the context of the sample population |
|
100 |
-distribution, with an expression-level statistic calculated as follows. First, |
|
101 |
-for each gene expression profile $x_i=\{x_{i1},\dots,x_{in}\}$, a non-parametric |
|
102 |
-kernel estimation of its cumulative density function is performed using a |
|
103 |
-Gaussian kernel \citep[pg.~148]{silverman_density_1986}: |
|
96 |
+A schematic overview of the GSVA method is provided in Figure \ref{methods}. |
|
97 |
+Shown are the two required main inputs: a matrix $X=\{x_{ij}\}_{p\times n}$ of |
|
98 |
+normalized expression values (see Methods for details on the pre-processing steps) |
|
99 |
+for $p$ genes by $n$ samples, where typically $p\gg n$, and a collection of gene |
|
100 |
+sets $\Gamma = \{\gamma_1, \dots, \gamma_m\}$. We shall denote by $x_i$ the |
|
101 |
+expression profile of the $i$-th gene, by $x_{ij}$ the specific expression value |
|
102 |
+of the $i$-th gene in the $j$-th sample, and by $\gamma_k$ the subset of row |
|
103 |
+indices in $X$ such that $\gamma_k \subset \{1,\ldots\,p\}$ defines a set of genes |
|
104 |
+forming a pathway or some other functional unit. Let $|\gamma_k |$ be the number |
|
105 |
+of genes in $\gamma_k$. |
|
106 |
+ |
|
107 |
+\begin{figure}[ht] |
|
108 |
+\centerline{\includegraphics[width=\textwidth]{methods}} |
|
109 |
+\caption{{\bf GSVA methods outline.} |
|
110 |
+The input for the GSVA algorithm are a gene expression matrix in the form of log2 |
|
111 |
+microarray expression values or RNA-seq counts and a database of gene sets. |
|
112 |
+1. Kernel estimation of the cumulative density function (kcdf). The two plots |
|
113 |
+show two simulated expression profiles mimicking 6 samples from microarray and |
|
114 |
+RNA-seq. The $x$-axis corresponds to expression values where each gene is lowly |
|
115 |
+expressed in the four samples with lower values and highly expressed in the other |
|
116 |
+two. The scale of the kcdf is on the left $y$-axis and the scale of the Gaussian |
|
117 |
+and Poisson kernels is on the right $y$-axis. |
|
118 |
+2. The expression-level statistic is rank ordered for each sample. |
|
119 |
+3. For every gene set, the Kolmogorov-Smirnov-like rank statistic is calculated. |
|
120 |
+The plot illustrates a gene set consisting of 3 genes out of a total number of 10 |
|
121 |
+with the sample-wise calculation of genes inside and outside of the gene set. |
|
122 |
+4. The GSVA enrichment score is either the difference between the two sums or the |
|
123 |
+maximum deviation from zero. The two plots show two simulations of the resulting |
|
124 |
+scores under the null hypothesis of no gene expression change (see main text). |
|
125 |
+The output of the algorithm is matrix containing pathway enrichment profiles for |
|
126 |
+each gene set and sample. } |
|
127 |
+\label{methods} |
|
128 |
+\end{figure} |
|
129 |
+ |
|
130 |
+GSVA starts by evaluating whether a gene $i$ is highly or lowly expressed in |
|
131 |
+sample $j$ in the context of the sample population distribution. Probe effects |
|
132 |
+can alter hybridization intensities in microarray data such that expression |
|
133 |
+values can greatly differ between two non-expressed genes \citep{zilliox_gene_2007}. |
|
134 |
+Analogous gene-specific biases, such as GC content or gene length have been |
|
135 |
+described in RNA-seq data \citep{hansen_removing_2012}. To bring distinct expression |
|
136 |
+profiles to a common scale, an expression-level statistic is calculated as follows. |
|
137 |
+For each gene expression profile $x_i=\{x_{i1},\dots,x_{in}\}$, a non-parametric |
|
138 |
+kernel estimation of its cumulative density function is performed using a Gaussian |
|
139 |
+kernel \cite[pg.~148]{silverman_density_1986} in the case of microarray data: |
|
104 | 140 |
|
105 | 141 |
\begin{equation} |
106 | 142 |
\label{density} |
107 |
-\hat{F}_{h_i}(x_{ij})=\frac{1}{n}\sum_{k=1}^n\int_{-\infty}^{\frac{x_{ij}-x_{ik}}{h_i}}\frac{1}{\sqrt{2\pi}}e^{-\frac{t^2}{2}}dt\,. |
|
143 |
+\hat{F}_{h_i}(x_{ij})=\frac{1}{n}\sum_{k=1}^n\int_{-\infty}^{\frac{x_{ij}-x_{ik}}{h_i}}\frac{1}{\sqrt{2\pi}}e^{-\frac{t^2}{2}}dt\,, |
|
108 | 144 |
\end{equation} |
109 |
-where $h_i$ is the gene-specific bandwidth parameter that controls the |
|
110 |
-resolution of the kernel estimation and is taken as $h_i=s_i/4$, where $s_i$ is |
|
111 |
-the sample standard deviation of the $i$-th gene. Second, the expression-level |
|
112 |
-statistic is calculated as logarithm of the relative likelihood, or odds, that |
|
113 |
-the $i$-th gene is expressed in sample $j$: |
|
145 |
+where $h_i$ is the gene-specific bandwidth parameter that controls the resolution |
|
146 |
+of the kernel estimation, which is set to $h_i=s_i/4$, where $s_i$ is the sample |
|
147 |
+standard deviation of the $i$-th gene (Figure \ref{methods}, Step 1). In the case |
|
148 |
+of RNA-seq data, a discrete Poisson kernel \citep{canale_bayesian_2011} is employed: |
|
114 | 149 |
|
115 | 150 |
\begin{equation} |
116 |
-z_ij = \log\left(\frac{\hat{F}_{h_i}(x_{ij})}{1-\hat{F}_{h_i}(x_{ij})}\right)\,. |
|
151 |
+\hat{F}_r(x_{ij}) = \frac{1}{n} \sum_{k=1}^n \sum_{y=0}^{x_{ij}} \frac{e^{-(x_{ik}+r)}(x_{ik}+r)^y}{y!}\,, |
|
117 | 152 |
\end{equation} |
118 |
- |
|
119 |
-The clearest context for differential gene expression arises under a bi-modal |
|
120 |
-distribution of the data. Here, $z_{ij}$'s corresponding to the lower mode will |
|
121 |
-have a large negative statistic, higher modes will have a large positive |
|
122 |
-statistic, and samples between the two modes will have $z_{ij}$'s at or near |
|
123 |
-zero. Genes with uniform or unimodal population distributions do not preclude |
|
124 |
-large negative or positive expression-level statistics, i.e. $x_{ij}$ lying in |
|
125 |
-the tail of a distribution. To dampen the effect of potential outliers, the |
|
126 |
-expression-level statistics $z_{ij}$ are first converted to ranks $z_{(i)j}$ for |
|
127 |
-each $j$-th sample and then these ranks are further normalized into a rank-order |
|
128 |
-statistic $r_{ij}=|p - z_{(i)j}+1-p/2|$ to be symmetric around zero, before evaluating |
|
129 |
-the enrichment scores. |
|
130 |
- |
|
131 |
-We assess the enrichment score similarity to the GSEA method |
|
132 |
-\citep{subramanian_gene_2005} using a Kolmogorov-Smirnov (K-S) |
|
133 |
-like random walk statistic: |
|
153 |
+where $r=0.5$ in order to set the mode of the Poisson kernel at each $x_{ik}$ since, |
|
154 |
+in general, the mode of a Poisson distribution with an integer mean $\lambda$ occurs |
|
155 |
+at $\lambda$ and $\lambda-1$ and at the largest integer smaller than $\lambda$ when |
|
156 |
+$\lambda$ is continuous. |
|
157 |
+ |
|
158 |
+Let $z_{ij}$ denote the previous expression-level statistic $\hat{F}_{h_i}(x_{ij})$, |
|
159 |
+or $\hat{F}_r(x_{ij})$, depending on whether $x_{ij}$ are continuous microarray, or |
|
160 |
+discrete count RNA-seq values, respectively. The following step condenses |
|
161 |
+expression-level statistics into gene sets by calculating sample-wise enrichment |
|
162 |
+scores. In order to reduce the influence of potential outliers on that step, we convert |
|
163 |
+first the $z_{ij}$ into ranks $z_{(i)j}$ for each $j$-th sample and further normalize |
|
164 |
+them into $r_{ij}=|p/2-z_{(i)j}|$ to make them symmetric around zero before evaluating |
|
165 |
+the enrichment scores (Figure~\ref{methods}, Step 2). |
|
166 |
+ |
|
167 |
+We assess the enrichment score similar to the GSEA and ASSESS methods |
|
168 |
+\citep{subramanian_gene_2005,edelman_analysis_2006} using the Kolmogorov-Smirnov (KS) |
|
169 |
+like random walk statistic (Figure~\ref{methods}, Step 3): |
|
134 | 170 |
\begin{equation} |
135 | 171 |
\label{walk} |
136 | 172 |
\nu_{jk}(\ell) = \frac{\sum_{i=1}^\ell |r_{ij}|^{\tau} I(g_{(i)} \in |
137 | 173 |
\gamma_k)}{\sum_{i=1}^p |r_{ij}|^{\tau}I(g_{(i)} \in \gamma_k)} - |
138 | 174 |
\frac{\sum_{i=1}^\ell I(g_{(i)} \not\in \gamma_k)}{p-|\gamma_k|}, |
139 | 175 |
\end{equation} |
140 |
-where $\tau$ is a parameter describing the weight of the tail in the |
|
141 |
-random walk (default $\tau = 1$), $\gamma_k$ is the $k$-th gene set, |
|
142 |
-$I(g_{(i)} \in \gamma_k)$ is the indicator function on whether |
|
143 |
-the $i$-th gene (the gene corresponding to the $i$-th ranked |
|
144 |
-expression-level statistic) is in gene set $\gamma_k$, $|\gamma_k|$ is |
|
145 |
-the number of genes in the $k$-th gene set, and $p$ is the number of |
|
146 |
-genes in the data set. The first term in this equation corresponds to |
|
147 |
-the step cumulative distribution function of the $r_{ij}$ rank-order |
|
148 |
-statistics of the gene forming gene set $\gamma_k$ throughout the |
|
149 |
-ranking $z_{(1)j},\dots,z_{(p)j}$ in sample $j$, while the second term |
|
150 |
-corresponds to the step cumulative distribution function of all other |
|
151 |
-genes outside $\gamma_k$ throughout this same ranking. |
|
152 |
- |
|
153 |
-Two approaches are possible for turning the K-S random walk statistic |
|
154 |
-into an enrichment score (ES). The first approach is the previously described |
|
155 |
-maximum deviation method \citep{subramanian_gene_2005} where the ES for the |
|
156 |
-$j$-th sample with respect to the $k$-th gene set is the maximum deviation |
|
157 |
-of the random walk from zero: |
|
176 |
+where $\tau$ is a parameter describing the weight of the tail in the random walk |
|
177 |
+(default $\tau = 1$), $\gamma_k$ is the $k$-th gene set, $I(g_{(i)} \in \gamma_k)$ is |
|
178 |
+the indicator function on whether the $i$-th gene (the gene corresponding to the $i$-th |
|
179 |
+ranked expression-level statistic) is in gene set $\gamma_k$, $|\gamma_k|$ is the number |
|
180 |
+of genes in the $k$-th gene set, and $p$ is the number of genes in the data set. |
|
181 |
+ |
|
182 |
+We offer two approaches for turning the KS random walk statistic into an enrichment |
|
183 |
+statistic (ES). The first approach is the previously described maximum deviation method |
|
184 |
+\citep{subramanian_gene_2005,edelman_analysis_2006,verhaak_integrated_2010} where the ES |
|
185 |
+for the $j$-th sample with respect to the $k$-th gene set is the maximum deviation from |
|
186 |
+zero of the random walk: |
|
187 |
+ |
|
158 | 188 |
\begin{equation} |
159 | 189 |
\label{escore} |
160 |
-ES_{jk}^{\mbox{\tiny{max}}} = \nu_{jk}[\arg \max_{\ell=1,..,p} \left|\nu_{jk}(\ell)\right|]. |
|
190 |
+ES^{\mbox{\tiny{max}}}_{jk} = \nu_{jk}[\arg \max_{\ell=1,\dots,p} \left|\nu_{jk}(\ell)\right|]. |
|
161 | 191 |
\end{equation} |
162 |
-For each gene set $k$, this approach produces a distribution of enrichment |
|
163 |
-scores that is bi-modal. Within the supervised paradigm, a ``normalized'' |
|
164 |
-enrichment score could be obtained via a permutation of the phenotypic labels; |
|
165 |
-since we are operating without labels, we propose a second approach that |
|
166 |
-produces an ES distribution that is unimodal: |
|
192 |
+For each gene set $k$, this approach produces a distribution of enrichment scores that |
|
193 |
+is bimodal (Figure~\ref{methods}, Step 4, Top panel). This is an intrinsic property of |
|
194 |
+the KS-like random walk, which generates non-zero maximum deviations under the null |
|
195 |
+distribution. In GSEA \citep{subramanian_gene_2005} it is also observed that the empirical |
|
196 |
+null distribution obtained by permuting phenotypes is bimodal and, for this reason, |
|
197 |
+significance is determined independently using the positive and negative sides of the null |
|
198 |
+distribution. In our case, we would like to provide a standard Gaussian distribution of |
|
199 |
+enrichment scores under the null hypothesis of no change in pathway activity throughout the |
|
200 |
+sample population. For this purpose, we propose the following alternative score that produces |
|
201 |
+an ES distribution approximating this requirement (Figure~\ref{methods}, Step 4, Bottom panel): |
|
167 | 202 |
|
168 | 203 |
\begin{equation} |
169 | 204 |
\label{escore_2} |
170 |
-ES_{jk}^{\mbox{\tiny{diff}}} = \left|ES^{+}_{jk}\right| - \left|ES^{-}_{jk}\right|=\max_{\ell=1,\dots,p}(0,\nu_{jk}(\ell)) - \min_{\ell=1,\dots,p}(0,\nu_{jk}(\ell))\,, |
|
205 |
+ES^{\mbox{\tiny{diff}}}_{jk} = \left|ES^{+}_{jk}\right| - \left|ES^{-}_{jk}\right|=\max_{\ell=1,\dots,p}(0,\nu_{jk}(\ell)) - \min_{\ell=1,\dots,p}(0,\nu_{jk}(\ell))\,. |
|
171 | 206 |
\end{equation} |
172 |
-where $ES^{+}_{jk}$ and $ES^{-}_{jk}$ are the largest positive and negative |
|
173 |
-random walk deviations from zero, respectively, for sample $j$ and gene set $k$. |
|
174 |
-This approach takes the magnitude difference between the largest positive and |
|
175 |
-negative random walk deviations, and has the effect of dampening out large |
|
176 |
-enrichment scores if there is both a large positive and negative deviation in |
|
177 |
-the random walk. This approach emphasizes genes in pathways that are concordantly |
|
178 |
-activated in one direction only, i.e. over-expressed or under-expressed relative |
|
179 |
-to the overall population. For pathways with genes strongly acting in both |
|
180 |
-directions, the deviations will cancel each other out and show negligible enrichment. |
|
181 |
-Also, because this statistic is unimodal (and through our own experimental |
|
182 |
-observations, approximately normal), this statistic lends itself to downstream |
|
183 |
-analyses which may impose distributional assumptions on the data. |
|
207 |
+where $ES_{jk}^{+}$ and $ES_{jk}^{-}$ are the largest positive and negative random walk |
|
208 |
+deviations from zero, respectively, for sample $j$ and gene set $k$. This statistic may |
|
209 |
+be compared to the Kuiper test statistic \citep{pearson_comparison_1963}, which sums the |
|
210 |
+maximum and minimum deviations to make the test statistic more sensitive in the tails. |
|
211 |
+In contrast, our test statistic penalizes deviations that are large in both tails, and |
|
212 |
+provides a ``normalization'' of the enrichment score by subtracting potential noise. There |
|
213 |
+is a clear biological interpretation of this statistic, it emphasizes genes in pathways that |
|
214 |
+are concordantly activated in one direction only, either over-expressed or under-expressed |
|
215 |
+relative to the overall population. For pathways containing genes having very different |
|
216 |
+expression patterns, the deviations will cancel each other out and show little or no enrichment. |
|
217 |
+Because this statistic is unimodal and approximately normal downstream analyses which may |
|
218 |
+impose distributional assumptions on the data are possible. |
|
219 |
+ |
|
220 |
+Step 4 in Figure~\ref{methods} shows a simple simulation which we reproduce with the code |
|
221 |
+below, where standard Gaussian deviates are independenty sampled from $p=20,000$ genes and |
|
222 |
+$n=30$ samples, thus mimicking a null distribution of no change in gene expression. One |
|
223 |
+hundred gene sets are uniformly sampled at random from the $p$ genes with sizes ranging |
|
224 |
+between 10 and 100 genes. Using these two inputs, GSVA enrichment scores are calculated |
|
225 |
+under the two approaches and their distribution is depicted in the two panels of this step |
|
226 |
+in Figure~\ref{methods} and the larger figure below, illustrating the previous description. |
|
227 |
+ |
|
228 |
+<<>>= |
|
229 |
+library(GSVA) |
|
230 |
+ |
|
231 |
+p <- 20000 ## number of genes |
|
232 |
+n <- 30 ## number of samples |
|
233 |
+nGS <- 100 ## number of gene sets |
|
234 |
+min.sz <- 10 ## minimum gene set size |
|
235 |
+max.sz <- 100 ## maximum gene set size |
|
236 |
+X <- matrix(rnorm(p*n), nrow=p, dimnames=list(1:p, 1:n)) |
|
237 |
+dim(X) |
|
238 |
+gs <- as.list(sample(min.sz:max.sz, size=nGS, replace=TRUE)) ## sample gene set sizes |
|
239 |
+gs <- lapply(gs, function(n, p) sample(1:p, size=n, replace=FALSE), p) ## sample gene sets |
|
240 |
+es.max <- gsva(X, gs, mx.diff=FALSE, verbose=FALSE)$es.obs |
|
241 |
+es.dif <- gsva(X, gs, mx.diff=TRUE, verbose=FALSE)$es.obs |
|
242 |
+@ |
|
243 |
+ |
|
244 |
+\begin{center} |
|
245 |
+<<maxvsdif, fig=TRUE, png=TRUE, echo=TRUE, height=5, width=8>>= |
|
246 |
+par(mfrow=c(1,2), mar=c(4, 4, 4, 1)) |
|
247 |
+plot(density(as.vector(es.max)), main="Maximum deviation from zero", |
|
248 |
+ xlab="GSVA score", lwd=2, las=1, xaxt="n", xlim=c(-0.75, 0.75), cex.axis=0.8) |
|
249 |
+axis(1, at=seq(-0.75, 0.75, by=0.25), labels=seq(-0.75, 0.75, by=0.25), cex.axis=0.8) |
|
250 |
+plot(density(as.vector(es.dif)), main="Difference between largest\npositive and negative deviations", |
|
251 |
+ xlab="GSVA score", lwd=2, las=1, xaxt="n", xlim=c(-0.75, 0.75), cex.axis=0.8) |
|
252 |
+axis(1, at=seq(-0.75, 0.75, by=0.25), labels=seq(-0.75, 0.75, by=0.25), cex.axis=0.8) |
|
253 |
+@ |
|
254 |
+\end{center} |
|
255 |
+ |
|
256 |
+\bigskip |
|
257 |
+Although the GSVA algorithm itself does not evaluate statistical significance for the |
|
258 |
+enrichment of gene sets, significance with respect to one or more phenotypes can be easily |
|
259 |
+evaluated using conventional statistical models. Likewise, false discovery rates can be |
|
260 |
+estimated by permuting the sample labels (Methods). Examples of these techniques are |
|
261 |
+provided in the following section. |
|
184 | 262 |
|
185 | 263 |
\section{Overview of the package} |
186 | 264 |
|
... | ... |
@@ -236,7 +314,7 @@ the parallelization of the calculations through a cluster of computers. In order |
236 | 314 |
to activate this option we should specify in the argument \Robject{parallel.sz} |
237 | 315 |
the number of processors we want to use (default is zero which means no |
238 | 316 |
parallelization is going to be employed). The other is possibility is loading |
239 |
-the library \Rpackage{multicore} and then the \Rfunction{gsva()} function will |
|
317 |
+the library \Rpackage{parallel} and then the \Rfunction{gsva()} function will |
|
240 | 318 |
use the core processors of the computer where R is running. If we want to |
241 | 319 |
limit \Rfunction{gsva()} in the number of core processors that is should use |
242 | 320 |
we can do it by specifying such a value in the \Robject{parallel.sz} argument. |
... | ... |
@@ -247,6 +325,14 @@ serve to explicitly filter out gene sets by size and by pairwise overlap, |
247 | 325 |
respectively. Note that the size filter can be also applied within the |
248 | 326 |
\Rfunction{gsva()} function call. |
249 | 327 |
|
328 |
+The \Rfunction{gsva()} function offers also two other different unsupervised |
|
329 |
+GSE methods that yield single sample enrichment scores and can be selected |
|
330 |
+through the \Robject{method} argument. One of them is a combined z-score |
|
331 |
+approach \citep{lee_inferring_2008} (\Robject{method="zscore"}) and the other |
|
332 |
+is the single-sample GSEA approach \citep{barbie_systematic_2009} |
|
333 |
+(\Robject{method="ssgsea"}). By default \Robject{method="gsva"} and the |
|
334 |
+Rfunction{gsva()} function uses the GSVA algorithm described before. |
|
335 |
+ |
|
250 | 336 |
\section{Applications} |
251 | 337 |
|
252 | 338 |
In this section we illustrate the following applications of \Rpackage{GSVA}: |
... | ... |
@@ -512,7 +598,6 @@ dev.off() |
512 | 598 |
\label{leukemiaHeatmapGenes} |
513 | 599 |
\end{figure} |
514 | 600 |
|
515 |
- |
|
516 | 601 |
\subsection{Molecular signature identification} |
517 | 602 |
|
518 | 603 |
In \citep{verhaak_integrated_2010} four subtypes of Glioblastoma multiforme |
... | ... |
@@ -862,7 +947,7 @@ gHubNet <- layoutGraph(gHubNet, layoutType="fdp") |
862 | 947 |
renderGraph(gHubNet) |
863 | 948 |
@ |
864 | 949 |
\begin{figure}[p] |
865 |
-\centerline{\includegraphics[height=\textheight]{GSVA-selectedGGMhighDeg}} |
|
950 |
+\centerline{\includegraphics[height=0.9\textheight]{GSVA-selectedGGMhighDeg}} |
|
866 | 951 |
\caption{Network of cross-talk associations between Broad C2 canonical pathways |
867 | 952 |
of the leukemia data set obtained by a Gaussian graphical modeling approach |
868 | 953 |
by which edges are included in the graph with a minimum absolute PCC > 0.7 |
... | ... |
@@ -889,6 +974,288 @@ methyltransferase {\it MTR} gene that encodes an enzyme that catalyzes the final |
889 | 974 |
biosynthesis and has been associated to an increased risk of ALL which was most pronounced for cases |
890 | 975 |
with the \textit{MLL} translocation \citep{lightfoot_genetic_2010}. |
891 | 976 |
|
977 |
+\section{Comparison with other methods} |
|
978 |
+ |
|
979 |
+In this section we compare the performance of GSVA with the two other methods available |
|
980 |
+through the argument \Robject{method} of the function \Rfunction{gsva()} to condense |
|
981 |
+gene-level expression into samplewise pathway-level enrichment scores. We start by |
|
982 |
+calculating fold-changes using limma: |
|
983 |
+ |
|
984 |
+<<>>= |
|
985 |
+design <- model.matrix(~ factor(leukemia_eset$subtype)) |
|
986 |
+fitGenes <- lmFit(leukemia_eset, design) |
|
987 |
+fitGenes <- eBayes(fitGenes) |
|
988 |
+ttgenes <- topTable(fitGenes, coef=2, n=Inf) |
|
989 |
+@ |
|
990 |
+Next, we split all \Sexpr{nrow(ttgenes)} genes into three equally sized fractions |
|
991 |
+of the ranking by fold change: |
|
992 |
+ |
|
993 |
+<<>>= |
|
994 |
+abslogfc <- abs(ttgenes$logFC) |
|
995 |
+names(abslogfc) <- ttgenes$ID |
|
996 |
+qtlsabslogfc <- quantile(abslogfc, p=seq(0, 1, by=1/3)) |
|
997 |
+abslogfccutoff <- cut(abslogfc, qtlsabslogfc, include.lowest=TRUE) |
|
998 |
+genesbyabslogfccutoff <- split(ttgenes$ID, abslogfccutoff) |
|
999 |
+@ |
|
1000 |
+We want compare the performance of GSVA in producing enrichment scores that enable |
|
1001 |
+the signature identification of a dichotomous phenotype under an scenario of subtle |
|
1002 |
+changes. For this purpose we will use the genes in the second tercile of fold-changes |
|
1003 |
+to form gene sets and calculate enrichment scores for them with each of the three |
|
1004 |
+methods. Using these enrichment scores and the package \Rpackage{limma} we will select |
|
1005 |
+the top five differentially expressed gene sets: |
|
1006 |
+ |
|
1007 |
+<<>>= |
|
1008 |
+ntopgenesets <- 5 |
|
1009 |
+@ |
|
1010 |
+and use them to build a hierarchical clustering of the samples whose two main branches |
|
1011 |
+are taken as the predicted sample classification. The accuracy of this classification |
|
1012 |
+is then measured in terms of the adjusted rand index (ARI); see \citet{hubert_ari_1985}. |
|
1013 |
+We will apply this procedure through data sets of 10 samples per condition bootstrapped |
|
1014 |
+from the original leukemia data set. This entire strategy is coded in the following function: |
|
1015 |
+ |
|
1016 |
+<<>>= |
|
1017 |
+computeARI <- function(geneSets, eset, selGenes, ntopgenesets) { |
|
1018 |
+ eset <- eset[selGenes, c(sample(which(eset$subtype == "ALL"), size=10), |
|
1019 |
+ sample(which(eset$subtype == "MLL"), size=10))] |
|
1020 |
+ e <- exprs(eset) |
|
1021 |
+ design <- model.matrix(~ factor(eset$subtype)) |
|
1022 |
+ |
|
1023 |
+ es.gsva <- gsva(e, geneSets, min.sz=10, max.sz=500, verbose=FALSE)$es.obs |
|
1024 |
+ fit <- lmFit(es.gsva, design) |
|
1025 |
+ fit <- eBayes(fit) |
|
1026 |
+ topgeneset <- topTable(fit, coef=2)$ID[1:ntopgenesets] |
|
1027 |
+ |
|
1028 |
+ sampleClustering <- hclust(as.dist(1-cor(es.gsva[topgeneset, ], method="spearman")), method="complete") |
|
1029 |
+ ari.gsva <- mclust::adjustedRandIndex(as.vector(cutree(sampleClustering, k=2)), |
|
1030 |
+ as.integer(factor(eset$subtype))) |
|
1031 |
+ |
|
1032 |
+ es.ss <- gsva(e, geneSets, method="ssgsea", min.sz=10, max.sz=500, verbose=FALSE) |
|
1033 |
+ fit <- lmFit(es.ss, design) |
|
1034 |
+ fit <- eBayes(fit) |
|
1035 |
+ topgeneset <- topTable(fit, coef=2)$ID[1:ntopgenesets] |
|
1036 |
+ |
|
1037 |
+ sampleClustering <- hclust(as.dist(1-cor(es.ss[topgeneset, ], method="spearman")), method="complete") |
|
1038 |
+ ari.ss <- mclust::adjustedRandIndex(as.vector(cutree(sampleClustering, k=2)), |
|
1039 |
+ as.integer(factor(eset$subtype))) |
|
1040 |
+ |
|
1041 |
+ es.z <- gsva(e, geneSets, method="zscore", min.sz=10, max.sz=500, verbose=FALSE) |
|
1042 |
+ fit <- lmFit(es.z, design) |
|
1043 |
+ fit <- eBayes(fit) |
|
1044 |
+ topgeneset <- topTable(fit, coef=2)$ID[1:ntopgenesets] |
|
1045 |
+ |
|
1046 |
+ sampleClustering <- hclust(as.dist(1-cor(es.z[topgeneset, ], method="spearman")), method="complete") |
|
1047 |
+ ari.z <- mclust::adjustedRandIndex(as.vector(cutree(sampleClustering, k=2)), |
|
1048 |
+ as.integer(factor(eset$subtype))) |
|
1049 |
+ |
|
1050 |
+ c(ari.gsva, ari.ss, ari.z) |
|
1051 |
+} |
|
1052 |
+@ |
|
1053 |
+ |
|
1054 |
+In order to make computations feasible for this vignette we consider only gene sets in the C2 Broad |
|
1055 |
+Collection whose name includes either the word \verb+LEUKEMIA+ or \verb+MLL+ which should makes them |
|
1056 |
+more relevant to the ALL/MLL leukemia data we are using: |
|
1057 |
+ |
|
1058 |
+<<>>= |
|
1059 |
+mappedGeneSets <- geneIds(mapIdentifiers(c2BroadSets[grep("LEUKEMIA|MLL", names(c2BroadSets))], |
|
1060 |
+ GSEABase::AnnoOrEntrezIdentifier(annotation(leukemia_eset)))) |
|
1061 |
+length(mappedGeneSets) |
|
1062 |
+names(mappedGeneSets) |
|
1063 |
+@ |
|
1064 |
+Finally, we carry out the simulation on 50 bootstrapped data sets as follows: |
|
1065 |
+ |
|
1066 |
+<<>>= |
|
1067 |
+set.seed(123) |
|
1068 |
+ari <- replicate(50, computeARI(mappedGeneSets, leukemia_eset, |
|
1069 |
+ genesbyabslogfccutoff[[2]], ntopgenesets)) |
|
1070 |
+@ |
|
1071 |
+where we have set the seed on the random number generator to enable reproducing the exact result. |
|
1072 |
+In Figure~\ref{comp_methods} we can see the 50 ARI values per method which are higher in GSVA than with |
|
1073 |
+ssGSEA or with the combined z-score approach. On top of each of the boxplots for these two compared |
|
1074 |
+ethods we show the $P$ values of the Kruskal-Wallis rank sum test of the null that the location |
|
1075 |
+arameters of the GSVA ARI values are the same than those from ssGSEA and the combined z-score, |
|
1076 |
+respectively. Both $P$ values indicate that GSVA provides larger power than ssGSEA and z-score to |
|
1077 |
+detect subtle changes in pathway activity. |
|
1078 |
+ |
|
1079 |
+<<perf, echo=FALSE, results=hide>>= |
|
1080 |
+png(filename="GSVA-perf.png", width=1000, height=500, res=150) |
|
1081 |
+par(mfrow=c(1,2), mar=c(4, 4, 1, 1)) |
|
1082 |
+design <- model.matrix(~ factor(leukemia_eset$subtype)) |
|
1083 |
+fitGenes <- lmFit(leukemia_eset, design) |
|
1084 |
+fitGenes <- eBayes(fitGenes) |
|
1085 |
+ttgenes_leu <- topTable(fitGenes, coef=2, n=Inf) |
|
1086 |
+abslogfc_leu <- abs(ttgenes_leu$logFC) |
|
1087 |
+names(abslogfc_leu) <- ttgenes_leu$ID |
|
1088 |
+qtlsabslogfc_leu <- quantile(abslogfc_leu, p=seq(0, 1, by=1/3)) |
|
1089 |
+abslogfccutoff_leu <- cut(abslogfc_leu, qtlsabslogfc_leu, include.lowest=TRUE) |
|
1090 |
+genesbyabslogfccutoff_leu <- split(ttgenes_leu$ID, abslogfccutoff_leu) |
|
1091 |
+M <- ttgenes_leu$logFC |
|
1092 |
+P <- -log10(ttgenes_leu$P.Value) |
|
1093 |
+names(M) <- names(P) <- ttgenes_leu$ID |
|
1094 |
+plot(M, P, pch=".", cex=1, type="n", |
|
1095 |
+ xlim=c(-2, 2), ylim=c(0, 15), xlab=expression(paste(log[2], " ALL", -log[2], " MLL")), |
|
1096 |
+ ylab=expression(paste(-log[10], " P-value")), main="", las=1) |
|
1097 |
+colrs <- c("blue", "violet", "red") |
|
1098 |
+for (i in length(genesbyabslogfccutoff_leu):1) { |
|
1099 |
+ points(M[genesbyabslogfccutoff_leu[[i]]], P[genesbyabslogfccutoff_leu[[i]]], pch=".", cex=1, col=colrs[i]) |
|
1100 |
+} |
|
1101 |
+boxplot(t(ari), names=c("GSVA", "ssGSEA", "zscore"), las=1, xlab="Method", ylab="Adjusted Rand Index (ARI)") |
|
1102 |
+points(1:3+0.25, rowMeans(ari), pch=23, col="black", bg="black") |
|
1103 |
+pp <- c(kruskal.test(list(ari[1,], ari[2,]))$p.value, kruskal.test(list(ari[1,], ari[3,]))$p.value) |
|
1104 |
+pp <- sprintf("P=%.3g", pp) |
|
1105 |
+text(1:3, 0.9*rep(max(ari), 3), c(" ", paste("Kruskal-Wallis", pp, sep="\n")), cex=0.5) |
|
1106 |
+dev.off() |
|
1107 |
+@ |
|
1108 |
+\begin{figure}[ht] |
|
1109 |
+\centerline{\includegraphics[width=\textwidth]{GSVA-perf}} |
|
1110 |
+\caption{{\bf Comparison of GSVA with single sample GSEA (ssGSEA) and combined z-score (zscore).} |
|
1111 |
+ On the left a volcano plot is shown where genes highlighted in red form the first tercile |
|
1112 |
+ of largest absolute fold-changes, violet indicates the second tercile and blue the third |
|
1113 |
+ tercile. On the right, boxplots of the adjusted rand index (ARI) values are shown for each |
|
1114 |
+ method using gene sets formed from genes belonging to the second tercile of fold-change |
|
1115 |
+ magnitude (violet in the volcano plot). Diamonds indicate mean value.} |
|
1116 |
+\label{comp_methods} |
|
1117 |
+\end{figure} |
|
1118 |
+ |
|
1119 |
+\section{GSVA for RNA-Seq data} |
|
1120 |
+ |
|
1121 |
+In this section we illustrate how to use GSVA with RNA-seq data and, more importantly, how the |
|
1122 |
+method provides pathway activity profiles analogous to the ones obtained from microarray data by |
|
1123 |
+using samples of lymphoblastoid cell lines (LCL) from HapMap individuals which have been profiled |
|
1124 |
+using both technologies \cite{huang_genome-wide_2007, pickrell_understanding_2010}. These data |
|
1125 |
+form part of the experimental package \Rpackage{GSVAdata} and the corresponding help pages contain |
|
1126 |
+details on how the data were processed. We start loading these data and verifying that |
|
1127 |
+they indeed contain expression data for the same genes and samples, as follows: |
|
1128 |
+ |
|
1129 |
+<<>>= |
|
1130 |
+data(commonPickrellHuang) |
|
1131 |
+ |
|
1132 |
+stopifnot(identical(featureNames(huangArrayRMAnoBatchCommon_eset), |
|
1133 |
+ featureNames(pickrellCountsArgonneCQNcommon_eset))) |
|
1134 |
+stopifnot(identical(sampleNames(huangArrayRMAnoBatchCommon_eset), |
|
1135 |
+ sampleNames(pickrellCountsArgonneCQNcommon_eset))) |
|
1136 |
+@ |
|
1137 |
+Next, for the purpose of the current analysis we are going to use the previously defined collection |
|
1138 |
+of canonical C2 Broad Sets enlarged with two additional ones corresponding to genes with sex-specific |
|
1139 |
+expression: |
|
1140 |
+ |
|
1141 |
+<<<>>= |
|
1142 |
+data(genderGenesEntrez) |
|
1143 |
+ |
|
1144 |
+MSY <- GeneSet(msYgenesEntrez, geneIdType=EntrezIdentifier(), |
|
1145 |
+ collectionType=BroadCollection(category="c2"), setName="MSY") |
|
1146 |
+MSY |
|
1147 |
+XiE <- GeneSet(XiEgenesEntrez, geneIdType=EntrezIdentifier(), |
|
1148 |
+ collectionType=BroadCollection(category="c2"), setName="XiE") |
|
1149 |
+XiE |
|
1150 |
+ |
|
1151 |
+ |
|
1152 |
+canonicalC2BroadSets <- GeneSetCollection(c(canonicalC2BroadSets, MSY, XiE)) |
|
1153 |
+canonicalC2BroadSets |
|
1154 |
+@ |
|
1155 |
+We calculate now GSVA enrichment scores for these gene sets using first the microarray |
|
1156 |
+data and then the RNA-seq data. Note that the only requirement to do the latter is to |
|
1157 |
+set the argument \Robject{rnaseq=TRUE} which is \Robject{FALSE} by default. |
|
1158 |
+ |
|
1159 |
+<<<>>= |
|
1160 |
+esmicro <- gsva(huangArrayRMAnoBatchCommon_eset, canonicalC2BroadSets, min.sz=5, max.sz=500, |
|
1161 |
+ mx.diff=TRUE, verbose=FALSE, rnaseq=FALSE)$es.obs |
|
1162 |
+dim(esmicro) |
|
1163 |
+esrnaseq <- gsva(pickrellCountsArgonneCQNcommon_eset, canonicalC2BroadSets, min.sz=5, max.sz=500, |
|
1164 |
+ mx.diff=TRUE, verbose=FALSE, rnaseq=TRUE)$es.obs |
|
1165 |
+dim(esrnaseq) |
|
1166 |
+@ |
|
1167 |
+In order to compare expression values from both technologies we are going to transform |
|
1168 |
+the RNA-seq read counts into RPKM values. For this purpose we need gene length and G+C content |
|
1169 |
+information also stored in the \Rpackage{GSVAdata} package and use the \Rfunction{cpm()} |
|
1170 |
+function from the \Rpackage{edgeR} package. Note that RPKMs can only be calculated for those |
|
1171 |
+genes for which the gene length and G+C content information is available: |
|
1172 |
+ |
|
1173 |
+<<>>= |
|
1174 |
+data(annotEntrez220212) |
|
1175 |
+head(annotEntrez220212) |
|
1176 |
+ |
|
1177 |
+cpm <- edgeR::cpm(exprs(pickrellCountsArgonneCQNcommon_eset)) |
|
1178 |
+dim(cpm) |
|
1179 |
+ |
|
1180 |
+common <- intersect(rownames(cpm), rownames(annotEntrez220212)) |
|
1181 |
+length(common) |
|
1182 |
+ |
|
1183 |
+rpkm <- sweep(cpm[common, ], 1, annotEntrez220212[common, "Length"] / 10^3, FUN="/") |
|
1184 |
+dim(rpkm) |
|
1185 |
+ |
|
1186 |
+dim(huangArrayRMAnoBatchCommon_eset[rownames(rpkm), ]) |
|
1187 |
+@ |
|
1188 |
+We finally calculate Spearman correlations between gene and gene-level expression values |
|
1189 |
+and gene-set level GSVA enrichment scores produced from data obtained by microarray and |
|
1190 |
+RNA-seq technologies: |
|
1191 |
+ |
|
1192 |
+<<>>= |
|
1193 |
+corsrowsgene <- sapply(1:nrow(huangArrayRMAnoBatchCommon_eset[rownames(rpkm), ]), |
|
1194 |
+ function(i, expmicro, exprnaseq) cor(expmicro[i, ], exprnaseq[i, ], method="pearson"), |
|
1195 |
+ exprs(huangArrayRMAnoBatchCommon_eset[rownames(rpkm), ]), log2(rpkm+0.1)) |
|
1196 |
+names(corsrowsgene) <- rownames(rpkm) |
|
1197 |
+ |
|
1198 |
+corsrowsgs <- sapply(1:nrow(esmicro), |
|
1199 |
+ function(i, esmicro, esrnaseq) cor(esmicro[i, ], esrnaseq[i, ], method="spearman"), |
|
1200 |
+ exprs(esmicro), exprs(esrnaseq)) |
|
1201 |
+names(corsrowsgs) <- rownames(esmicro) |
|
1202 |
+@ |
|
1203 |
+In panels A and B of Figure~\ref{rnaseqcomp} we can see the distribution of these correlations at |
|
1204 |
+gene and gene-set level and they show that GSVA enrichment scores correlate similarly to gene |
|
1205 |
+expression levels produced by both profiling technologies. |
|
1206 |
+ |
|
1207 |
+We have also examined in detail two gene sets containing gender specific genes: those that escape |
|
1208 |
+X-inactivation in female samples \cite{carrel_x-inactivation_2005} and those that are located on |
|
1209 |
+the male-specific region of the Y chrosomome \cite{skaletsky_male-specific_2003}. In panels C and D |
|
1210 |
+of Figure~\ref{rnaseqcomp} we can see how microarray and RNA-seq enrichment scores correlate very |
|
1211 |
+well in these gene sets, with $\rho=0.82$ for the male-specific gene set and $\rho=0.78$ for the |
|
1212 |
+female-specific gene set. Male and female samples show higher GSVA enrichment scores in their |
|
1213 |
+corresponding gene set. This demonstrates the flexibility of GSVA to enable analogous unsupervised |
|
1214 |
+and single sample GSE analyses in data coming from both, microarray and RNA-seq technologies. |
|
1215 |
+ |
|
1216 |
+<<RNAseqComp, echo=FALSE, results=hide>>= |
|
1217 |
+png(filename="GSVA-RNAseqComp.png", width=1100, height=1100, res=150) |
|
1218 |
+par(mfrow=c(2,2), mar=c(4, 5, 3, 2)) |
|
1219 |
+hist(corsrowsgene, xlab="Spearman correlation", main="Gene level\n(RNA-seq RPKM vs Microarray RMA)", |
|
1220 |
+ xlim=c(-1, 1), col="grey", las=1) |
|
1221 |
+par(xpd=TRUE) |
|
1222 |
+text(par("usr")[1]*1.5, par("usr")[4]*1.1, "A", font=2, cex=2) |
|
1223 |
+par(xpd=FALSE) |
|
1224 |
+hist(corsrowsgs, xlab="Spearman correlation", main="Gene set level\n(GSVA enrichment scores)", |
|
1225 |
+ xlim=c(-1, 1), col="grey", las=1) |
|
1226 |
+par(xpd=TRUE) |
|
1227 |
+text(par("usr")[1]*1.5, par("usr")[4]*1.1, "B", font=2, cex=2) |
|
1228 |
+par(xpd=FALSE) |
|
1229 |
+plot(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ], xlab="GSVA scores RNA-seq", ylab="GSVA scores microarray", |
|
1230 |
+ main=sprintf("MSY R=%.2f", cor(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ])), las=1, type="n") |
|
1231 |
+ sprintf("MSY R=%.2f", cor(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ])) |
|
1232 |
+abline(lm(exprs(esmicro)["MSY", ] ~ exprs(esrnaseq)["MSY", ]), lwd=2, lty=2, col="grey") |
|
1233 |
+points(exprs(esrnaseq)["MSY", pickrellCountsArgonneCQNcommon_eset$Gender == "Female"], |
|
1234 |
+ exprs(esmicro)["MSY", huangArrayRMAnoBatchCommon_eset$Gender == "Female"], col="red", pch=21, bg="red", cex=1) |
|
1235 |
+points(exprs(esrnaseq)["MSY", pickrellCountsArgonneCQNcommon_eset$Gender == "Male"], |
|
1236 |
+ exprs(esmicro)["MSY", huangArrayRMAnoBatchCommon_eset$Gender == "Male"], col="blue", pch=21, bg="blue", cex=1) |
|
1237 |
+par(xpd=TRUE) |
|
1238 |
+text(par("usr")[1]*1.5, par("usr")[4]*1.1, "C", font=2, cex=2) |
|
1239 |
+par(xpd=FALSE) |
|
1240 |
+plot(exprs(esrnaseq)["XiE", ], exprs(esmicro)["XiE", ], xlab="GSVA scores RNA-seq", ylab="GSVA scores microarray", |
|
1241 |
+ main=sprintf("XiE R=%.2f", cor(exprs(esrnaseq)["XiE", ], exprs(esmicro)["XiE", ])), las=1, type="n") |
|
1242 |
+abline(lm(exprs(esmicro)["XiE", ] ~ exprs(esrnaseq)["XiE", ]), lwd=2, lty=2, col="grey") |
|
1243 |
+points(exprs(esrnaseq["XiE", pickrellCountsArgonneCQNcommon_eset$Gender == "Female"]), |
|
1244 |
+ exprs(esmicro)["XiE", huangArrayRMAnoBatchCommon_eset$Gender == "Female"], col="red", pch=21, bg="red", cex=1) |
|
1245 |
+points(exprs(esrnaseq)["XiE", pickrellCountsArgonneCQNcommon_eset$Gender == "Male"], |
|
1246 |
+ exprs(esmicro)["XiE", huangArrayRMAnoBatchCommon_eset$Gender == "Male"], col="blue", pch=21, bg="blue", cex=1) |
|
1247 |
+par(xpd=TRUE) |
|
1248 |
+text(par("usr")[1]*1.5, par("usr")[4]*1.1, "D", font=2, cex=2) |
|
1249 |
+par(xpd=FALSE) |
|
1250 |
+dev.off() |
|
1251 |
+@ |
|
1252 |
+\begin{figure}[p] |
|
1253 |
+\centerline{\includegraphics[height=0.5\texheight]{GSVA-RNAseqComp}} |
|
1254 |
+\caption{{\bf GSVA for RNA-seq (Argonne).} A. Distribution of Spearman correlation values between gene expression profiles of RNA-seq and microarray data. B. Distribution of Spearman correlation values between GSVA enrichment scores of gene sets calculated from RNA-seq and microarray data. C and D. Comparison of GSVA enrichment scores obtained from microarray and RNA-seq data for two gene sets containing genes with sex-specific expression: MSY formed by genes from the male-specific region of the Y chromosome, thus male-specific, and XiE formed by genes that escape X-inactivation in females, thus female-specific. Red and blue dots represent female and male samples, respectively. In both cases GSVA-scores show very high correlation between the two profiling technologies where female samples show higher enrichment scores in the female-specific gene set and male samples show higher enrichment scores in the male-specific gene set.} |
|
1255 |
+\label{rnaseqcomp} |
|
1256 |
+\end{figure} |
|
1257 |
+ |
|
1258 |
+ |
|
892 | 1259 |
\section{Session Information} |
893 | 1260 |
|
894 | 1261 |
<<info, results=tex>>= |
... | ... |
@@ -1,8 +1,8 @@ |
1 | 1 |
@article{GSVA_submitted, |
2 | 2 |
Author = {Sonja H\"anzelmann and Robert Castelo and Justin Guinney}, |
3 | 3 |
Journal = {submitted}, |
4 |
- Title = {{GSVA}: Gene Set Variation Analysis}, |
|
5 |
- Year = {2011}} |
|
4 |
+ Title = {{GSVA}: Gene Set Variation Analysis for microarray and RNA-Seq data}, |
|
5 |
+ Year = {2012}} |
|
6 | 6 |
|
7 | 7 |
|
8 | 8 |
@article{cahoy_transcriptome_2008, |
... | ... |
@@ -75,7 +75,7 @@ |
75 | 75 |
Issn = {1532-4435}, |
76 | 76 |
Journal = {J Mach Learn Res}, |
77 | 77 |
Pages = {2621--2650}, |
78 |
- Title = {A robust procedure for Gaussian graphical model search from microarray data with p larger than n}, |
|
78 |
+ Title = {A robust procedure for {G}aussian graphical model search from microarray data with p larger than n}, |
|
79 | 79 |
Volume = {7}, |
80 | 80 |
Year = {2006}} |
81 | 81 |
|
... | ... |
@@ -110,7 +110,7 @@ |
110 | 110 |
Author = {Roverato, A. and Whittaker, J.}, |
111 | 111 |
Journal = {Stat Comput}, |
112 | 112 |
Pages = {297--302}, |
113 |
- Title = {Standard errors for the parameters of graphical Gaussian models}, |
|
113 |
+ Title = {Standard errors for the parameters of graphical {G}aussian models}, |
|
114 | 114 |
Volume = {6}, |
115 | 115 |
Year = {1996}} |
116 | 116 |
|
... | ... |
@@ -301,3 +301,120 @@ |
301 | 301 |
Address = {Helsinki}, |
302 | 302 |
Year = {2010}, |
303 | 303 |
} |
304 |
+ |
|
305 |
+@article{zilliox_gene_2007, |
|
306 |
+ title = {A gene expression bar code for microarray data}, |
|
307 |
+ volume = {4}, |
|
308 |
+ url = {http://dx.doi.org/10.1038/nmeth1102}, |
|
309 |
+ number = {11}, |
|
310 |
+ journal = {Nat Meth}, |
|
311 |
+ author = {Zilliox, Michael J and Irizarry, Rafael A}, |
|
312 |
+ year = {2007}, |
|
313 |
+ pages = {911--913} |
|
314 |
+} |
|
315 |
+ |
|
316 |
+@article{hansen_removing_2012, |
|
317 |
+ title = {Removing technical variability in {RNA-seq} data using conditional quantile normalization}, |
|
318 |
+ journal = {Biostatistics}, |
|
319 |
+ author = {Hansen, Kasper D. and Irizarry, Rafael A. and Wu, Zhijin}, |
|
320 |
+ year = {2012} |
|
321 |
+} |
|
322 |
+ |
|
323 |
+@article{canale_bayesian_2011, |
|
324 |
+ title = {Bayesian kernel mixtures for counts}, |
|
325 |
+ volume = {106}, |
|
326 |
+ number = {496}, |
|
327 |
+ journal = {Journal of the American Statistical Association}, |
|
328 |
+ author = {Canale, Antonio and Dunson, David B.}, |
|
329 |
+ year = {2011}, |
|
330 |
+ pages = {1528--1539} |
|
331 |
+} |
|
332 |
+ |
|
333 |
+@article{hubert_ari_1985, |
|
334 |
+ title = {Comparing partitions}, |
|
335 |
+ volume = {2}, |
|
336 |
+ number = {1}, |
|
337 |
+ journal = {Journal of Classification}, |
|
338 |
+ author = {Hubert, Lawrence and Arabie, Phipps}, |
|
339 |
+ year = {1985}, |
|
340 |
+ pages = {193--218} |
|
341 |
+} |
|
342 |
+ |
|
343 |
+@article{edelman_analysis_2006, |
|
344 |
+ title = {Analysis of sample set enrichment scores: assaying the enrichment of sets of genes for individual samples in genome-wide expression profiles}, |
|
345 |
+ volume = {22}, |
|
346 |
+ number = {14}, |
|
347 |
+ journal = {Bioinformatics}, |
|
348 |
+ author = {Edelman, Elena and Porrello, Alessandro and Guinney, Justin and Balakumaran, Bala and Bild, Andrea and Febbo, Phillip G and Mukherjee, Sayan}, |
|
349 |
+ year = {2006}, |
|
350 |
+ pages = {e108--e116} |
|
351 |
+} |
|
352 |
+ |
|
353 |
+@article{pearson_comparison_1963, |
|
354 |
+ title = {Comparison of tests for randomness of points on a line}, |
|
355 |
+ volume = {50}, |
|
356 |
+ journal = {Biometrika}, |
|
357 |
+ author = {Pearson, {E.S.}}, |
|
358 |
+ year = {1963}, |
|
359 |
+ pages = {315--325} |
|
360 |
+} |
|
361 |
+ |
|
362 |
+@article{lee_inferring_2008, |
|
363 |
+ title = {Inferring Pathway Activity toward Precise Disease Classification}, |
|
364 |
+ volume = {4}, |
|
365 |
+ number = {11}, |
|
366 |
+ journal = {{PLoS} Comput Biol}, |
|
367 |
+ author = {Lee, Eunjung and Chuang, {Han-Yu} and Kim, {Jong-Won} and Ideker, Trey and Lee, Doheon}, |
|
368 |
+ year = {2008}, |
|
369 |
+ pages = {e1000217} |
|
370 |
+} |
|
371 |
+ |
|
372 |
+@article{barbie_systematic_2009, |
|
373 |
+ title = {Systematic {RNA} interference reveals that oncogenic {KRAS-driven} cancers require {TBK1}}, |
|
374 |
+ volume = {462}, |
|
375 |
+ number = {7269}, |
|
376 |
+ journal = {Nature}, |
|
377 |
+ author = {Barbie, David A. and Tamayo, Pablo and Boehm, Jesse S. and Kim, So Young and Moody, Susan E. and Dunn, Ian F. and Schinzel, Anna C. and Sandy, Peter and Meylan, Etienne and Scholl, Claudia and Fr{\textbackslash}textbackslashtextbar[ouml]{\textbackslash}textbackslashtextbarhling, Stefan and Chan, Edmond M. and Sos, Martin L. and Michel, Kathrin and Mermel, Craig and Silver, Serena J. and Weir, Barbara A. and Reiling, Jan H. and Sheng, Qing and Gupta, Piyush B. and Wadlow, Raymond C. and Le, Hanh and Hoersch, Sebastian and Wittner, Ben S. and Ramaswamy, Sridhar and Livingston, David M. and Sabatini, David M. and Meyerson, Matthew and Thomas, Roman K. and Lander, Eric S. and Mesirov, Jill P. and Root, David E. and Gilliland, D. Gary and Jacks, Tyler and Hahn, William C.}, |
|
378 |
+ year = {2009}, |
|
379 |
+ pages = {108--112} |
|
380 |
+} |
|
381 |
+ |
|
382 |
+@article{huang_genome-wide_2007, |
|
383 |
+ title = {A genome-wide approach to identify genetic variants that contribute to etoposide-induced cytotoxicity}, |
|
384 |
+ volume = {104}, |
|
385 |
+ author = {Huang, R Stephanie and Duan, Shiwei and Bleibel, Wasim K and Kistner, Emily O and Zhang, Wei and Clark, Tyson A and Chen, Tina X and Schweitzer, Anthony C and Blume, John E and Cox, Nancy J and Dolan, M Eileen}, |
|
386 |
+ number = {23}, |
|
387 |
+ journal = {Proceedings of the National Academy of Sciences of the United States of America}, |
|
388 |
+ year = {2007}, |
|
389 |
+ pages = {9758--9763} |
|
390 |
+} |
|
391 |
+ |
|
392 |
+@article{pickrell_understanding_2010, |
|
393 |
+ title = {Understanding mechanisms underlying human gene expression variation with {RNA} sequencing}, |
|
394 |
+ volume = {464}, |
|
395 |
+ number = {7289}, |
|
396 |
+ journal = {Nature}, |
|
397 |
+ author = {Pickrell, Joseph K. and Marioni, John C. and Pai, Athma A. and Degner, Jacob F. and Engelhardt, Barbara E. and Nkadori, Everlyne and Veyrieras, {Jean-Baptiste} and Stephens, Matthew and Gilad, Yoav and Pritchard, Jonathan K.}, |
|
398 |
+ year = {2010}, |
|
399 |
+ pages = {768--772} |
|
400 |
+} |
|
401 |
+ |
|
402 |
+@article{carrel_x-inactivation_2005, |
|
403 |
+ title = {X-inactivation profile reveals extensive variability in X-linked gene expression in females}, |
|
404 |
+ volume = {434}, |
|
405 |
+ number = {7031}, |
|
406 |
+ journal = {Nature}, |
|
407 |
+ author = {Carrel, Laura and Willard, Huntington F}, |
|
408 |
+ year = {2005}, |
|
409 |
+ pages = {400--404} |
|
410 |
+} |
|
411 |
+ |
|
412 |
+@article{skaletsky_male-specific_2003, |
|
413 |
+ title = {The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes}, |
|
414 |
+ volume = {423}, |
|
415 |
+ number = {6942}, |
|
416 |
+ journal = {Nature}, |
|
417 |
+ author = {Skaletsky, Helen and {Kuroda-Kawaguchi}, Tomoko and Minx, Patrick J. and Cordum, Holland S. and Hillier, {LaDeana} and Brown, Laura G. and Repping, Sjoerd and Pyntikova, Tatyana and Ali, Johar and Bieri, Tamberlyn and Chinwalla, Asif and Delehaunty, Andrew and Delehaunty, Kim and Du, Hui and Fewell, Ginger and Fulton, Lucinda and Fulton, Robert and Graves, Tina and Hou, {Shun-Fang} and Latrielle, Philip and Leonard, Shawn and Mardis, Elaine and Maupin, Rachel and {McPherson}, John and Miner, Tracie and Nash, William and Nguyen, Christine and Ozersky, Philip and Pepin, Kymberlie and Rock, Susan and Rohlfing, Tracy and Scott, Kelsi and Schultz, Brian and Strong, Cindy and {Tin-Wollam}, Aye and Yang, {Shiaw-Pyng} and Waterston, Robert H. and Wilson, Richard K. and Rozen, Steve and Page, David C.}, |
|
418 |
+ year = {2003}, |
|
419 |
+ pages = {825--837} |
|
420 |
+} |
... | ... |
@@ -72,7 +72,8 @@ void row_d(double* x, double* y, double* r, int size_density_n, int size_test_n, |
72 | 72 |
left_tail += rnaseq ? ppois(y[j], x[i]+bw, TRUE, FALSE) : precomputedCdf(y[j]-x[i], bw); |
73 | 73 |
} |
74 | 74 |
left_tail = left_tail / size_density_n; |
75 |
- r[j] = -1.0 * log((1.0-left_tail)/left_tail); |
|
75 |
+ /* r[j] = -1.0 * log((1.0-left_tail)/left_tail); */ /* log-odds not necessary */ |
|
76 |
+ r[j] = left_tail; |
|
76 | 77 |
} |
77 | 78 |
} |
78 | 79 |
|