Browse code

Started updating the vignette.

Robert Castelo authored on 05/02/2021 18:07:16
Showing 5 changed files

... ...
@@ -1,24 +1,21 @@
1 1
 Package: GSVA
2
-Version: 1.39.15
2
+Version: 1.39.16
3 3
 Title: Gene Set Variation Analysis for microarray and RNA-seq data
4 4
 Authors@R: c(person("Justin", "Guinney", role=c("aut", "cre"), email="justin.guinney@sagebase.org"),
5 5
              person("Robert", "Castelo", role="aut", email="robert.castelo@upf.edu"),
6 6
              person("Alexey", "Sergushichev", role="ctb", email="alsergbox@gmail.com"),
7 7
              person("Pablo Sebastian", "Rodriguez", role="ctb", email="pablosebastian.rodriguez@upf.edu"))
8 8
 Depends: R (>= 3.5.0)
9
-Imports: methods, stats, utils, graphics, S4Vectors, IRanges,
10
-         Biobase, SummarizedExperiment, GSEABase, Matrix,
11
-         parallel, BiocParallel, SingleCellExperiment, 
12
-         sparseMatrixStats, DelayedArray, DelayedMatrixStats,
13
-         HDF5Array, BiocSingular
14
-Suggests: 
15
-    RUnit, BiocGenerics, limma, RColorBrewer, genefilter, edgeR, GSVAdata,
16
-    shiny, shinythemes, ggplot2, data.table, plotly
9
+Imports: methods, stats, utils, graphics, BiocGenerics, S4Vectors, IRanges,
10
+         Biobase, SummarizedExperiment, GSEABase, Matrix, parallel,
11
+         BiocParallel, SingleCellExperiment, sparseMatrixStats, DelayedArray,
12
+         DelayedMatrixStats, HDF5Array, BiocSingular
13
+Suggests: RUnit, BiocStyle, knitr, markdown, limma, RColorBrewer, genefilter,
14
+          edgeR, GSVAdata, shiny, shinythemes, ggplot2, data.table, plotly
17 15
 Description: Gene Set Variation Analysis (GSVA) is a non-parametric, unsupervised method for estimating variation of gene set enrichment through the samples of a expression data set. GSVA performs a change in coordinate systems, transforming the data from a gene by sample matrix to a gene-set by sample matrix, thereby allowing the evaluation of pathway enrichment for each sample. This new matrix of GSVA enrichment scores facilitates applying standard analytical methods like functional enrichment, survival analysis, clustering, CNV-pathway analysis or cross-tissue pathway analysis, in a pathway-centric manner.
18 16
 License: GPL (>= 2)
19
-LazyLoad: yes
17
+VignetteBuilder: knitr
20 18
 URL: https://github.com/rcastelo/GSVA
21 19
 BugReports: https://github.com/rcastelo/GSVA/issues
22 20
 Encoding: latin1
23
-biocViews: Microarray, Pathways, GeneSetEnrichment
24
-Config/testthat/edition: 3
21
+biocViews: FunctionalGenomics, Microarray, RNASeq, Pathways, GeneSetEnrichment
... ...
@@ -978,14 +978,13 @@ setGeneric("filterGeneSets", function(gSets, ...) standardGeneric("filterGeneSet
978 978
 
979 979
 setMethod("filterGeneSets", signature(gSets="list"),
980 980
           function(gSets, min.sz=1, max.sz=Inf) {
981
-	gSetsLen <- sapply(gSets,length)
981
+	gSetsLen <- lengths(gSets)
982 982
 	return (gSets[gSetsLen >= min.sz & gSetsLen <= max.sz])	
983 983
 })
984 984
 
985 985
 setMethod("filterGeneSets", signature(gSets="GeneSetCollection"),
986 986
           function(gSets, min.sz=1, max.sz=Inf) {
987
-	gSetsLen <- sapply(geneIds(gSets),length)
988
-	return (gSets[gSetsLen >= min.sz & gSetsLen <= max.sz])	
987
+  filterGeneSets(geneIds(gSets), min.sz., max.sz)
989 988
 })
990 989
 
991 990
 
992 991
new file mode 100644
... ...
@@ -0,0 +1,340 @@
1
+---
2
+title: "GSVA: gene set variation analysis for bulk expression data"
3
+author:
4
+- name: Robert Castelo
5
+  affiliation:
6
+  - &idupf Dept. of Experimental and Health Sciences, Universitat Pompeu Fabra, Barcelona, Spain
7
+  email: robert.castelo@upf.edu
8
+- name: Pablo Sebastian Rodriguez
9
+  affiliation: *idupf
10
+  email: pablosebastian.rodriguez@upf.edu
11
+- name: Justin Guinney 
12
+  affiliation: 
13
+  - Sage Bionetworks
14
+  email: justin.guinney@sagebase.org
15
+abstract: >
16
+  The GSVA package provides the implementation of four single-sample gene set
17
+  enrichment methods, concretely _zscore_, _plage_, _ssGSEA_ and its own called
18
+  _GSVA_. These methods transform an input gene-by-sample expression data matrix
19
+  into a gene-set-by-sample expression data matrix. Thereby enabling the
20
+  estimation of pathway activity for each sample and facilitating pathway-centric
21
+  analyses of gene expression data. In this vignette we illustrate how to use
22
+  the GSVA package with bulk microarray and RNA-seq expression data.
23
+date: "`r BiocStyle::doc_date()`"
24
+package: "`r pkg_ver('GSVA')`"
25
+vignette: >
26
+  %\VignetteIndexEntry{GSVA for bulk expression data}
27
+  %\VignetteEngine{knitr::rmarkdown}
28
+  %\VignetteEncoding{UTF-8}
29
+  %\VignetteKeywords{GeneExpression, Microarray, RNAseq, GeneSetEnrichment, Pathway}
30
+output:
31
+  BiocStyle::html_document:
32
+    toc: true
33
+    toc_float: true
34
+    number_sections: true
35
+bibliography: GSVA.bib
36
+---
37
+
38
+**License**: `r packageDescription("GSVA")[["License"]]`
39
+
40
+```{r setup, include=FALSE}
41
+options(width=80)
42
+knitr::opts_chunk$set(collapse=TRUE,
43
+                      message=FALSE)
44
+```
45
+
46
+# Quick start
47
+
48
+`r Biocpkg("GSVA")` is an R package distributed as part of the
49
+[Bioconductor](https://bioconductor.org) project. To install the package, start
50
+R and enter:
51
+
52
+```{r library_install, message=FALSE, cache=FALSE, eval=FALSE}
53
+install.packages("BiocManager")
54
+BiocManager::install("GSVA")
55
+```
56
+
57
+Once `r Biocpkg("GSVA")` is installed, it can be loaded with the following command.
58
+
59
+```{r load_library, message=FALSE, warning=FALSE, cache=FALSE}
60
+library(GSVA)
61
+```
62
+
63
+Given a gene expression data matrix with rows corresponding to genes and columns
64
+to samples, such as this one simulated from random Gaussian data:
65
+
66
+```{r}
67
+p <- 10000 ## number of genes
68
+n <- 30    ## number of samples
69
+## simulate expression values from a standard Gaussian distribution
70
+X <- matrix(rnorm(p*n), nrow=p,
71
+            dimnames=list(paste0("g", 1:p), paste0("s", 1:n)))
72
+X[1:5, 1:5]
73
+```
74
+
75
+Given a collection of gene sets stored, for instance, in a `list` object such as
76
+this one with gene sampled uniformly at random without replacement into the gene sets:
77
+
78
+```{r}
79
+## sample gene set sizes
80
+gs <- as.list(sample(10:100, size=100, replace=TRUE))
81
+## sample gene sets
82
+gs <- lapply(gs, function(n, p)
83
+                   paste0("g", sample(1:p, size=n, replace=FALSE)), p)
84
+names(gs) <- paste0("gs", 1:length(gs))
85
+```
86
+
87
+We can calculate GSVA enrichment scores as follows:
88
+
89
+```{r}
90
+gsva.es <- gsva(X, gs, verbose=FALSE)
91
+dim(gsva.es)
92
+gsva.es[1:5, 1:5]
93
+```
94
+
95
+So, the first argument to the `gsva()` function is the gene expression data matrix
96
+and the second the collection of gene sets. The `gsva()` function can take the
97
+expression data and gene sets using different specialized containers that facilitate
98
+the access and manipulation of molecular and phenotype data, as well as their associated
99
+metadata. Another advanced features include the use of on-disk and parallel backends to
100
+enable using GSVA on large molecular data sets and speed up computing time. You will
101
+find information on all these features in this vignette.
102
+
103
+# Introduction
104
+
105
+Gene set variation analysis (GSVA) provides an estimate of pathway activity
106
+by transforming an input gene-by-sample expression data matrix
107
+into a gene-set-by-sample one. This resulting expression data matrix can be
108
+then used with classical analytical methods such as differential expression,
109
+classification, survival analysis, clustering or correlation analysis in a
110
+pathway-centric manner. One can also perform sample-wise comparisons between
111
+pathways and other molecular data types such as microRNA expression or binding
112
+data, copy-number variation (CNV) data or single nucleotide polymorphisms (SNPs).
113
+
114
+The GSVA package provides an implementation of this approach for the following
115
+methods:
116
+
117
+* _plage_ [@tomfohr_pathway_2005]. Pathway level analysis of gene expression
118
+  (PLAGE) standardizes expression profiles over the samples and then, for each
119
+  gene set, it performs a singular value decomposition (SVD) over its genes.
120
+  The coefficients of the first right-singular vector are returned as the
121
+  estimates of pathway activity over the samples. Note that, because of how
122
+  SVD is calculated, the sign of its singular vectors is arbitrary.
123
+
124
+* _zscore_ [@lee_inferring_2008]. The z-score method standardizes expression
125
+  profiles over the samples and then, for each gene set, combines the
126
+  standardized values as follows. Given a gene set $\gamma=\{1,\dots,k\}$
127
+  with standardized values $z_1,\dots,z_k$ for each gene in a specific sample,
128
+  the combined z-score $Z_\gamma$ for the gene set $\gamma$ is defined as:
129
+  $$
130
+  Z_\gamma = \frac{\sum_{i=1}^k z_i}{\sqrt{k}}\,.
131
+  $$
132
+
133
+* _ssgsea_ [@barbie_systematic_2009]. Single sample GSEA (ssGSEA) is a
134
+  non-parametric method that calculates a gene set enrichment score per sample
135
+  as the normalized difference in empirical cumulative distribution functions
136
+  (CDFs) of gene expression ranks inside and outside the gene set. By default,
137
+  the implementation in the GSVA package follows the last step described in
138
+  [@barbie_systematic_2009, online methods, pg. 2] by which pathway scores are
139
+  normalized, dividing them by the range of calculated values. This normalization
140
+  step may be switched off using the argument `ssgsea.norm` in the call to the
141
+  `gsva()` function; see below.
142
+
143
+* _gsva_ [@haenzelmann_castelo_guinney_2013]. This is the default method of
144
+  the package and similarly to ssGSEA, is a non-parametric method that
145
+  uses the empirical CDFs of gene expression ranks inside and outside the gene
146
+  set, but it starts by calculating an expression-level statistic that brings
147
+  gene expression profiles with a different dynamic range to a common scale.
148
+
149
+The interested user may find full technical details about how these methods
150
+work in their corresponding article cited above. If you use any of them in a
151
+publication, please cite it with the given bibliographic reference.
152
+
153
+# Overview of the GSVA functionality
154
+
155
+The workhorse of the GSVA package is the function `gsva()`, which requires
156
+the following two input arguments:
157
+
158
+1. A normalized gene expression dataset, which can be provided in one of the
159
+   following containers:
160
+   * A `matrix` of expression values with genes corresponding to rows and samples
161
+     corresponding to columns.
162
+   * An `ExpressionSet` object, see package `r Biocpkg("Biobase")`.
163
+   * A `SummarizedExperiment` object, see package
164
+     `r Biocpkg("SummarizedExperiment")`.
165
+2. A collection of gene sets, which can be provided in one of the following
166
+   containers:
167
+   * A `list` object where each element corresponds to a gene set defined by a
168
+     vector of gene identifiers, and the element names correspond to the names of
169
+     the gene sets.
170
+   * A `GeneSetCollection` object, see package `r Biocpkg("GSEABase")`.
171
+
172
+One advantage of providing the input data using specialized containers such as
173
+`ExpressionSet`, `SummarizedExperiment` and `GeneSetCollection` is that the
174
+`gsva()` function will automatically map the gene identifiers between the
175
+expression data and the gene sets (internally calling the function
176
+`mapIdentifiers()` from the package `r Biocpkg("GSEABase")`), when they come
177
+from different standard nomenclatures, i.e., _Ensembl_ versus _Entrez_, provided
178
+the input objects contain the appropriate metadata; see next section.
179
+
180
+If either the input gene expression data is provided as a `matrix` object or
181
+the gene sets are provided in a `list` object, or both, it is then the
182
+responsibility of the user to ensure that both objects contain gene identifiers
183
+following the same standard nomenclature.
184
+
185
+Before the actual calculations take place, the `gsva()` function will apply
186
+the following filters:
187
+
188
+1. Discard genes in the input expression data matrix with constant expression.
189
+
190
+2. Discard genes in the input gene sets that do not map to a gene in the input
191
+   gene expression data matrix.
192
+
193
+3. Discard gene sets that, after applying the previous filter, do not meet a
194
+   minimum and maximum size, which by default is one for the minimum size and
195
+   has no limit in the maximum size.
196
+
197
+If as a result of this filter either no genes or gene sets are left, the
198
+`gsva()` function will prompt an error. A common cause for an error at this
199
+stage is that gene identifiers between the expression data matrix and the gene
200
+sets do not belong to the same standard nomenclature and could not be mapped.
201
+
202
+By default, the `gsva()` function employs the method described by
203
+@haenzelmann_castelo_guinney_2013 but this can be changed using the argument
204
+`method`, which can take values `gsva` (default), `zscore`, `plage` or
205
+`ssgsea`, corresponding to the methods briefly described in the introduction.
206
+
207
+When `method="gsva"` (default), the user can additionally tune the following
208
+parameters:
209
+
210
+* `kcdf`: The first step of the GSVA algorithm brings gene expression
211
+  profiles to a common scale by calculating an expression statistic through
212
+  a non-parametric estimation of the CDF across samples. Such a non-parametric
213
+  estimation employs a _kernel function_ and the `kcdf` parameter allows the
214
+  user to specify three possible values for that function: (1) `"Gaussian"`,
215
+  the default value, which is suitable for continuous expression data, such as
216
+  microarray fluorescent units in logarithmic scale and RNA-seq log-CPMs,
217
+  log-RPKMs or log-TPMs units of expression; (2) `"Poisson"`, which is
218
+  suitable for integer counts, such as those derived from RNA-seq alignments; (3)
219
+  `"none"`, which will enforce a direct estimation of the CDF without a kernel
220
+  function.
221
+
222
+* `mx.diff`: The last step of the GSVA algorithm calculates the gene set
223
+  enrichment score from two Kolmogorov-Smirnov random walk statistics. This
224
+  parameter is a logical flag that allows the user to specify two possible ways
225
+  to do such calculation: (1) `TRUE`, the default value, where the enrichment
226
+  score is calculated as the magnitude difference between the largest positive
227
+  and negative random walk deviations; (2) `FALSE`, where the enrichment score
228
+  is calculated as the maximum distance of the random walk from zero.
229
+
230
+* `abs.ranking`: Logical flag used only when `mx.diff=TRUE`. By default,
231
+  `abs.ranking=FALSE` and it implies that a modified Kuiper statistic is used
232
+  to calculate enrichment scores, taking the magnitude difference between the
233
+  largest positive and negative random walk deviations. When `abs.ranking=TRUE`
234
+  the original Kuiper statistic is used, by whih the largest positive and
235
+  negative random walk devations add added together. In this case, gene sets
236
+  with genes enriched on either extreme (high or low) will be regarded as
237
+  highly activated.
238
+
239
+* `tau`: Exponent defining the weight of the tail in the random walk. By
240
+  default `tau=1`. When `method="ssgsea"`, this parameter is also used and its
241
+  default value becomes then `tau=0.25`.
242
+
243
+In general, the default values for the previous parameters are suitable for
244
+most analysis settings, which usually consist of normalized continuous
245
+expression values.
246
+
247
+# Gene sets definitions and mapping to gene identifiers
248
+
249
+# Example applications
250
+
251
+## Molecular signature identification
252
+
253
+In [@verhaak_integrated_2010] four subtypes of glioblastoma multiforme (GBM)
254
+-proneural, classical, neural and mesenchymal- were identified by the
255
+characterization of distinct gene-level expression patterns. Using four
256
+gene set signatures specific to brain cell types (astrocytes, oligodendrocytes,
257
+neurons and cultured astroglial cells), derived from murine models by
258
+@cahoy_transcriptome_2008, we replicate the analysis of @verhaak_integrated_2010
259
+by using GSVA to transform the gene expression measurements into enrichment
260
+scores for these four gene sets, without taking the sample subtype grouping
261
+into account. We start by having a quick glance to the data, which forms part of
262
+the `r Biocpkg("GSVAdata") package:
263
+
264
+```{r}
265
+library(GSVAdata)
266
+
267
+data(gbm_VerhaakEtAl)
268
+gbm_eset
269
+head(featureNames(gbm_eset))
270
+table(gbm_eset$subtype)
271
+data(brainTxDbSets)
272
+lengths(brainTxDbSets)
273
+lapply(brainTxDbSets, head)
274
+```
275
+
276
+GSVA enrichment scores for the gene sets contained in `brainTxDbSets`
277
+are calculated, in this case using `mx.diff=FALSE`,  as follows:
278
+
279
+```{r}
280
+gbm_es <- gsva(gbm_eset, brainTxDbSets, mx.diff=FALSE, verbose=FALSE)
281
+```
282
+
283
+Figure \@ref(fig:gbmSignature) shows the GSVA enrichment scores obtained for the
284
+up-regulated gene sets across the samples of the four GBM subtypes. As expected,
285
+the _neural_ class is associated with the neural gene set and the astrocytic
286
+gene sets. The _mesenchymal_ subtype is characterized by the expression of
287
+mesenchymal and microglial markers, thus we expect it to correlate with the
288
+astroglial gene set. The _proneural_ subtype shows high expression of
289
+oligodendrocytic development genes, thus it is not surprising that the
290
+oligodendrocytic gene set is highly enriched for ths group. Interestingly, the
291
+_classical_ group correlates highly with the astrocytic gene set. In
292
+summary, the resulting GSVA enrichment scores recapitulate accurately the
293
+molecular signatures from @verhaak_integrated_2010.
294
+
295
+```{r gbmSignature, height=500, width=700, fig.cap="Heatmap of GSVA scores for cell-type brain signatures from murine models (y-axis) across GBM samples grouped by GBM subtype."}
296
+library(RColorBrewer)
297
+subtypeOrder <- c("Proneural", "Neural", "Classical", "Mesenchymal")
298
+sampleOrderBySubtype <- sort(match(gbm_es$subtype, subtypeOrder),
299
+                             index.return=TRUE)$ix
300
+subtypeXtable <- table(gbm_es$subtype)
301
+subtypeColorLegend <- c(Proneural="red", Neural="green",
302
+                        Classical="blue", Mesenchymal="orange")
303
+geneSetOrder <- c("astroglia_up", "astrocytic_up", "neuronal_up",
304
+                  "oligodendrocytic_up")
305
+geneSetLabels <- gsub("_", " ", geneSetOrder)
306
+hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
307
+hmcol <- hmcol[length(hmcol):1]
308
+
309
+heatmap(exprs(gbm_es)[geneSetOrder, sampleOrderBySubtype], Rowv=NA,
310
+        Colv=NA, scale="row", margins=c(3,5), col=hmcol,
311
+        ColSideColors=rep(subtypeColorLegend[subtypeOrder],
312
+                          times=subtypeXtable[subtypeOrder]),
313
+        labCol="", gbm_es$subtype[sampleOrderBySubtype],
314
+        labRow=paste(toupper(substring(geneSetLabels, 1,1)),
315
+                     substring(geneSetLabels, 2), sep=""),
316
+        cexRow=2, main=" \n ")
317
+par(xpd=TRUE)
318
+text(0.23,1.21, "Proneural", col="red", cex=1.2)
319
+text(0.36,1.21, "Neural", col="green", cex=1.2)
320
+text(0.47,1.21, "Classical", col="blue", cex=1.2)
321
+text(0.62,1.21, "Mesenchymal", col="orange", cex=1.2)
322
+mtext("Gene sets", side=4, line=0, cex=1.5)
323
+mtext("Samples          ", side=1, line=4, cex=1.5)
324
+```
325
+
326
+
327
+# Parallel calculations
328
+
329
+# Frequently asked questions
330
+
331
+# Session information {.unnumbered}
332
+
333
+Here is the output of `sessionInfo()` on the system on which this document was
334
+compiled running pandoc `r rmarkdown::pandoc_version()`:
335
+
336
+```{r session_info, cache=FALSE}
337
+sessionInfo()
338
+```
339
+
340
+# References
0 341
deleted file mode 100644
... ...
@@ -1,960 +0,0 @@
1
-%\VignetteIndexEntry{Gene Set Variation Analysis}
2
-%\VignetteDepends(Biobase, methods, gplots, glmnet}
3
-%\VignetteKeywords{GSVA, GSEA, Expression, Microarray, Pathway}
4
-%\VignettePackage{GSVA}
5
-\documentclass[a4paper]{article}
6
-\usepackage{url}
7
-\usepackage{graphicx}
8
-\usepackage{longtable}
9
-\usepackage{hyperref}
10
-\usepackage{natbib}
11
-\usepackage{fullpage}
12
-
13
-\newcommand{\Rfunction}[1]{{\texttt{#1}}}
14
-\newcommand{\Robject}[1]{{\texttt{#1}}}
15
-\newcommand{\Rpackage}[1]{{\textsf{#1}}}
16
-\newcommand{\Rclass}[1]{{\textit{#1}}}
17
-
18
-\title{GSVA: The Gene Set Variation Analysis package \\ for microarray and RNA-seq data}
19
-\author{Sonja H\"anzelmann$^1$, Robert Castelo$^1$ and Justin Guinney$^2$}
20
-
21
-\begin{document}
22
-
23
-\SweaveOpts{eps=FALSE}
24
-
25
-\maketitle
26
-
27
-\begin{quote}
28
-{\scriptsize
29
-1. Research Program on Biomedical Informatics (GRIB), Hospital del Mar Research Institute (IMIM) and Universitat Pompeu Fabra, Parc de Recerca Biom\`edica de Barcelona, Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain
30
-
31
-2. Sage Bionetworks, 1100 Fairview Ave N., Seattle, Washington, 98109 USA
32
-}
33
-\end{quote}
34
-
35
-\begin{abstract}
36
-The \Rpackage{GSVA} package implements a non-parametric unsupervised method,
37
-called Gene Set Variation Analysis (GSVA), for assessing gene set enrichment
38
-(GSE) in gene expression microarray and RNA-seq data. In contrast to most
39
-GSE methods, GSVA performs a change in coordinate systems, transforming the
40
-data from a gene by sample matrix to a gene set by sample matrix. Thereby
41
-allowing for the evaluation of pathway enrichment for each sample. This
42
-transformation is done without the use of a phenotype, thus facilitating very
43
-powerful and open-ended analyses in a now pathway centric manner. In this
44
-vignette we illustrate how to use the \Rpackage{GSVA} package to perform some
45
-of these analyses using published microarray and RNA-seq data already
46
-pre-processed and stored in the companion experimental data package
47
-\Rpackage{GSVAdata}.
48
-\end{abstract}
49
-
50
-<<options, echo=FALSE>>=
51
-options(width=60)
52
-pdf.options(useDingbats=FALSE)
53
-@
54
-
55
-\section{Introduction}
56
-
57
-Gene set enrichment analysis (GSEA)
58
-\citep[see][]{mootha_pgc_1alpha_responsive_2003, subramanian_gene_2005} is a
59
-method designed to assess the concerted behavior of functionally related genes
60
-forming a set, between two well-defined groups of samples. Because it does not
61
-rely on a ``gene list'' of interest but on the entire ranking of genes, GSEA
62
-has been shown to provide greater sensitivity to find gene expression changes
63
-of small magnitude that operate coordinately in specific sets of functionally
64
-related genes. However, due to the reduced costs in genome-wide gene-expression
65
-assays, data is being produced under more complex experimental designs that
66
-involve multiple RNA sources enriched with a wide spectrum of phenotypic and/or
67
-clinical information. The Cancer Genome Atlas (TCGA) project
68
-(see \url{http://cancergenome.nih.gov}) and the data deposited on it constitute
69
-a canonical example of this situation.
70
-
71
-To facilitate the functional enrichment analysis of this kind of data, we
72
-developed Gene Set Variation Analysis (GSVA) which allows the assessment of the
73
-underlying pathway activity variation by transforming the gene by sample matrix
74
-into a gene set by sample matrix without the \textit{a priori} knowledge of the
75
-experimental design. The method is both non-parametric and unsupervised, and
76
-bypasses the conventional approach of explicitly modeling phenotypes within
77
-enrichment scoring algorithms. Focus is therefore placed on the
78
-\textit{relative} enrichment of pathways across the sample space rather than
79
-the \textit{absolute} enrichment with respect to a phenotype. The value
80
-of this approach is that it permits the use of traditional analytical methods
81
-such as classification, survival analysis, clustering, and correlation analysis
82
-in a pathway focused manner. It also facilitates sample-wise comparisons between
83
-pathways and other complex data types such as microRNA expression or binding
84
-data, copy-number variation (CNV) data, or single nucleotide polymorphisms
85
-(SNPs). However, for case-control experiments, or data with a moderate to small
86
-sample size ($<30$), other GSE methods that explicitly include the phenotype in
87
-their model are more likely to provide greater statistical power to detect
88
-functional enrichment.
89
-
90
-In the rest of this vignette we describe briefly the methodology behind GSVA,
91
-give an overview of the functions implemented in the package and show a few
92
-applications. The interested reader is referred to \citep{haenzelmann_castelo_guinney_2013} for
93
-more comprehensive explanations and more complete data analysis examples with
94
-GSVA, as well as for citing GSVA if you use it in your own work.
95
-
96
-\section{GSVA enrichment scores}
97
-
98
-A schematic overview of the GSVA method is provided in Figure \ref{methods}, which shows the two main required inputs: a matrix $X=\{x_{ij}\}_{p\times n}$ of normalized expression values (see Methods for details on the pre-processing steps) for $p$ genes by $n$ samples, where typically $p\gg n$, and a collection of gene sets $\Gamma = \{\gamma_1, \dots, \gamma_m\}$. We shall denote by $x_i$ the expression profile of the $i$-th gene, by $x_{ij}$ the specific expression value of the $i$-th gene in the $j$-th sample, and by $\gamma_k$ the subset of row indices in $X$ such that $\gamma_k \subset \{1,\ldots\,p\}$ defines a set of genes forming a pathway or some other functional unit. Let $|\gamma_k |$ be the number of genes in $\gamma_k$.
99
-
100
-\begin{figure}[ht]
101
-\centerline{\includegraphics[width=\textwidth]{methods}}
102
-\caption{{\bf GSVA methods outline.}
103
-The input for the GSVA algorithm are a gene expression matrix in the form of log2
104
-microarray expression values or RNA-seq counts and a database of gene sets.
105
-1. Kernel estimation of the cumulative density function (kcdf). The two plots
106
-show two simulated expression profiles mimicking 6 samples from microarray and
107
-RNA-seq. The $x$-axis corresponds to expression values where each gene is lowly
108
-expressed in the four samples with lower values and highly expressed in the other
109
-two. The scale of the kcdf is on the left $y$-axis and the scale of the Gaussian
110
-and Poisson kernels is on the right $y$-axis.
111
-2. The expression-level statistic is rank ordered for each sample.
112
-3. For every gene set, the Kolmogorov-Smirnov-like rank statistic is calculated.
113
-The plot illustrates a gene set consisting of 3 genes out of a total number of 10
114
-with the sample-wise calculation of genes inside and outside of the gene set.
115
-4. The GSVA enrichment score is either the difference between the two sums or the
116
-maximum deviation from zero. The two plots show two simulations of the resulting
117
-scores under the null hypothesis of no gene expression change (see main text).
118
-The output of the algorithm is matrix containing pathway enrichment profiles for
119
-each gene set and sample. }
120
-\label{methods}
121
-\end{figure}
122
-
123
-GSVA starts by evaluating whether a gene $i$ is highly or lowly expressed in sample $j$ in the context of the sample population distribution. Probe effects can alter hybridization intensities in microarray data such that expression values can greatly differ between two non-expressed genes\cite{zilliox_gene_2007}. Analogous gene-specific biases, such as GC content or gene length have been described in RNA-seq data\cite{hansen_removing_2012}. To bring distinct expression profiles to a common scale, an expression-level statistic is calculated as follows. For each gene expression profile $x_i=\{x_{i1},\dots,x_{in}\}$, a non-parametric kernel estimation of its cumulative density function is performed using a Gaussian kernel \cite[pg.~148]{silverman_density_1986} in the case of microarray data:
124
-
125
-\begin{equation}
126
-\label{density}
127
-\hat{F}_{h_i}(x_{ij})=\frac{1}{n}\sum_{k=1}^n\int_{-\infty}^{\frac{x_{ij}-x_{ik}}{h_i}}\frac{1}{\sqrt{2\pi}}e^{-\frac{t^2}{2}}dt\,,
128
-\end{equation}
129
-where $h_i$ is the gene-specific bandwidth parameter that controls the resolution of the kernel estimation, which is set to $h_i=s_i/4$, where $s_i$ is the sample standard deviation of the $i$-th gene (Figure \ref{methods}, step 1). In the case of RNA-seq data, a discrete Poisson kernel \cite{canale_bayesian_2011} is employed:
130
-
131
-\begin{equation}
132
-\hat{F}_r(x_{ij}) = \frac{1}{n} \sum_{k=1}^n \sum_{y=0}^{x_{ij}} \frac{e^{-(x_{ik}+r)}(x_{ik}+r)^y}{y!}\,,
133
-\end{equation}
134
-where $r=0.5$ in order to set the mode of the Poisson kernel at each $x_{ik}$, because the mode of a Poisson distribution with an integer mean $\lambda$ occurs at $\lambda$ and $\lambda-1$ and at the largest integer smaller than $\lambda$ when $\lambda$ is continuous.
135
-
136
-Let $z_{ij}$ denote the previous expression-level statistic $\hat{F}_{h_i}(x_{ij})$, or $\hat{F}_r(x_{ij})$, depending on whether $x_{ij}$ are continuous microarray, or discrete count RNA-seq values, respectively. The following step condenses expression-level statistics into gene sets by calculating sample-wise enrichment scores. To reduce the influence of potential outliers, we first convert $z_{ij}$ to ranks $z_{(i)j}$ for each sample $j$ and normalize further $r_{ij}=|p/2-z_{(i)j}|$ to make the ranks symmetric around zero (Figure~\ref{methods}, step 2). This is done to up-weight the two tails of the rank distribution when computing the final enrichment score.
137
-
138
-We assess the enrichment score similar to the GSEA and ASSESS methods \cite{subramanian_gene_2005,edelman_analysis_2006} using the Kolmogorov-Smirnov (KS) like random walk statistic (Figure~\ref{methods}, step 3):
139
-\begin{equation}
140
-\label{walk}
141
-\nu_{jk}(\ell) = \frac{\sum_{i=1}^\ell |r_{ij}|^{\tau} I(g_{(i)} \in
142
-\gamma_k)}{\sum_{i=1}^p |r_{ij}|^{\tau}I(g_{(i)} \in \gamma_k)} -
143
-\frac{\sum_{i=1}^\ell I(g_{(i)} \not\in \gamma_k)}{p-|\gamma_k|},
144
-\end{equation}
145
-where $\tau$ is a parameter describing the weight of the tail in the random walk (default $\tau = 1$), $\gamma_k$ is the $k$-th gene set, $I(g_{(i)} \in \gamma_k)$ is the indicator function on whether the $i$-th gene (the gene corresponding to the $i$-th ranked expression-level statistic) belongs to gene set $\gamma_k$, $|\gamma_k|$ is the number of genes in the $k$-th gene set, and $p$ is the number of genes in the data set. Conceptually, Eq.~\ref{walk} produces a distribution over the genes to assess if the genes in the gene set are more likely to be found at either tail of the rank distribution (see \cite{subramanian_gene_2005,edelman_analysis_2006} for a more detailed description).
146
-
147
-We offer two approaches for turning the KS like random walk statistic into an enrichment statistic (ES) (also called GSVA score), the classical maximum deviation method \cite{subramanian_gene_2005,edelman_analysis_2006,verhaak_integrated_2010} and a normalized ES. The first ES is the maximum deviation from zero of the random walk of the $j$-th sample with respect to the $k$-th gene set :
148
-
149
-\begin{equation}
150
-\label{escore}
151
-ES^{\mbox{\tiny{max}}}_{jk} = \nu_{jk}[\arg \max_{\ell=1,\dots,p} \left|\nu_{jk}(\ell)\right|].
152
-\end{equation}
153
-For each gene set $k$, this approach produces a distribution of enrichment scores that is bimodal (Figure~\ref{methods}, step 4, top panel). This is an intrinsic property of the KS like random walk, which generates non-zero maximum deviations under the null distribution. In GSEA \cite{subramanian_gene_2005} it is also observed that the empirical null distribution obtained by permuting phenotypes is bimodal and, for this reason, significance is determined independently using the positive and negative sides of the null distribution. In our case, we would like to provide a standard Gaussian distribution of enrichment scores under the null hypothesis of no change in pathway activity throughout the sample population. For this purpose we propose a second, alternative score that produces an ES distribution approximating this requirement (Figure~\ref{methods}, step 4, bottom panel):
154
-
155
-\begin{equation}
156
-\label{escore_2}
157
-ES^{\mbox{\tiny{diff}}}_{jk} = \left|ES^{+}_{jk}\right| - \left|ES^{-}_{jk}\right|=\max_{\ell=1,\dots,p}(0,\nu_{jk}(\ell)) - \min_{\ell=1,\dots,p}(0,\nu_{jk}(\ell))\,,
158
-\end{equation}
159
-where $ES_{jk}^{+}$ and $ES_{jk}^{-}$ are the largest positive and negative random walk deviations from zero, respectively, for sample $j$ and gene set $k$.  This statistic may be compared to the Kuiper test statistic \cite{pearson_comparison_1963}, which sums the maximum and minimum deviations to make the test statistic more sensitive in the tails.  In contrast, our test statistic penalizes deviations that are large in both tails, and provides a ``normalization'' of the enrichment score by subtracting potential noise. There is a clear biological interpretation of this statistic, it emphasizes genes in pathways that are concordantly activated in one direction only, either over-expressed or under-expressed relative to the overall population. For pathways containing genes strongly acting in both directions, the deviations will cancel each other out and show little or no enrichment. Because this statistic is unimodal and approximately normal, downstream analyses which may impose distributional assumptions on the data are possible.
160
-
161
-Figure~\ref{methods}, step 4 shows a simple simulation where standard Gaussian deviates are independenty sampled from $p=20,000$ genes and $n=30$ samples, thus mimicking a null distribution of no change in gene expression. One hundred gene sets are uniformly sampled at random from the $p$ genes with sizes ranging from 10 to 100 genes. Using these two inputs, we calculate the maximum deviation ES and the normalized ES. The resulting distributions are depicted in Figure~\ref{methods}, step 4 and in the larger figure below, illustrating the previous description.
162
-
163
-<<>>=
164
-library(GSVA)
165
-
166
-p <- 20000    ## number of genes
167
-n <- 30       ## number of samples
168
-nGS <- 100    ## number of gene sets
169
-min.sz <- 10  ## minimum gene set size
170
-max.sz <- 100 ## maximum gene set size
171
-X <- matrix(rnorm(p*n), nrow=p, dimnames=list(1:p, 1:n))
172
-dim(X)
173
-gs <- as.list(sample(min.sz:max.sz, size=nGS, replace=TRUE)) ## sample gene set sizes
174
-gs <- lapply(gs, function(n, p) sample(1:p, size=n, replace=FALSE), p) ## sample gene sets
175
-es.max <- gsva(X, gs, mx.diff=FALSE, verbose=FALSE, parallel.sz=1)
176
-es.dif <- gsva(X, gs, mx.diff=TRUE, verbose=FALSE, parallel.sz=1)
177
-@
178
-
179
-\begin{center}
180
-<<maxvsdif, fig=TRUE, png=TRUE, echo=TRUE, height=5, width=8>>=
181
-par(mfrow=c(1,2), mar=c(4, 4, 4, 1))
182
-plot(density(as.vector(es.max)), main="Maximum deviation from zero",
183
-     xlab="GSVA score", lwd=2, las=1, xaxt="n", xlim=c(-0.75, 0.75), cex.axis=0.8)
184
-axis(1, at=seq(-0.75, 0.75, by=0.25), labels=seq(-0.75, 0.75, by=0.25), cex.axis=0.8)
185
-plot(density(as.vector(es.dif)), main="Difference between largest\npositive and negative deviations",
186
-     xlab="GSVA score", lwd=2, las=1, xaxt="n", xlim=c(-0.75, 0.75), cex.axis=0.8)
187
-axis(1, at=seq(-0.75, 0.75, by=0.25), labels=seq(-0.75, 0.75, by=0.25), cex.axis=0.8)
188
-@
189
-\end{center}
190
-
191
-\bigskip
192
-Although the GSVA algorithm itself does not evaluate statistical significance for the
193
-enrichment of gene sets, significance with respect to one or more phenotypes can be easily
194
-evaluated using conventional statistical models. Likewise, false discovery rates can be
195
-estimated by permuting the sample labels (Methods).  Examples of these techniques are
196
-provided in the following section.
197
-
198
-\section{Overview of the package}
199
-
200
-The \Rpackage{GSVA} package implements the methodology described in the previous
201
-section in the function \Rfunction{gsva()} which requires two main input
202
-arguments: the gene expression data and a collection of gene sets. The
203
-expression data can be provided either as a \Rclass{matrix} object of genes
204
-(rows) by sample (columns) expression values, or as an \Rclass{ExpressionSet}
205
-object. The collection of gene sets can be provided either as a \Rclass{list}
206
-object with names identifying gene sets and each entry of the list containing
207
-the gene identifiers of the genes forming the corresponding set, or as a
208
-\Rclass{GeneSetCollection} object as defined in the \Rpackage{GSEABase} package.
209
-
210
-When the two main arguments are an \Rclass{ExpressionSet} object and a
211
-\Rclass{GeneSetCollection} object, the \Rfunction{gsva()} function will first
212
-translate the gene identifiers used in the \Rclass{GeneSetCollection} object
213
-into the corresponding feature identifiers of the \Rclass{ExpressionSet} object,
214
-according to its corresponding annotation package. This translation is carried
215
-out by an internal call to the function \Rfunction{mapIdentifiers()} from the
216
-\Rpackage{GSEABase} package. This means that both input arguments may specify
217
-features with different types of identifiers, such as Entrez IDs and probeset IDs,
218
-and the \Rpackage{GSEABase} package will take care of mapping them to each other.
219
-
220
-A second filtering step is applied that removes genes without matching features
221
-in the \Rclass{ExpressionSet} object. If the expression data is given as a
222
-\Rclass{matrix} object then only the latter filtering step will be applied by the
223
-\Rfunction{gsva()} function and, therefore, it will be the responsibility of the
224
-user to have the same type of identifiers in both the expression data and the
225
-gene sets.
226
-
227
-After these automatic filtering steps, we may additionally filter out gene sets
228
-that do not meet a minimum and/or maximum size specified by the optional
229
-arguments \Robject{min.sz} and \Robject{max.sz} which are set by default to 1
230
-and \Robject{Inf}, respectively. Finally, the \Rfunction{gsva()} function will
231
-carry out the calculations specified in the previous section and return a
232
-gene set by sample matrix of GSVA enrichment scores in the form of a
233
-\Rclass{matrix} object when this was the class of the input expression data
234
-object. Otherwise, it will return an \Rclass{ExpressionSet} object inheriting
235
-all the corresponding phenotypic information from the input data.
236
-
237
-An important argument of the \Rfunction{gsva()} function is the flag
238
-\Robject{mx.diff} which is set to \Robject{TRUE} by default. Under this default
239
-setting, GSVA enrichment scores are calculated using Equation~\ref{escore_2}, and
240
-therefore, are more amenable by analysis techniques that assume the data to be
241
-normally distributed. When setting \Robject{mx.diff=FALSE}, then
242
-Equation~\ref{escore} is employed, calculating enrichment in an analogous way to
243
-classical GSEA which typically provides a bimodal distribution of GSVA enrichment
244
-scores for each gene.
245
-
246
-Since the calculations for each gene set are independent from each other, the
247
-\Rfunction{gsva()} function offers two possibilities to perform them in
248
-parallel. One consists of loading the library \Rpackage{snow}, which will enable
249
-the parallelization of the calculations through a cluster of computers. In order
250
-to activate this option we should specify in the argument \Robject{parallel.sz}
251
-the number of processors we want to use (default is zero which means no
252
-parallelization is going to be employed). The other is loading the library
253
-\Rpackage{parallel} and then the \Rfunction{gsva()} function will use the core
254
-processors of the computer where R is running. If we want to limit
255
-\Rfunction{gsva()} in the number of core processors that should be used, we can
256
-do it by specifying the number of cores in the \Robject{parallel.sz} argument.
257
-
258
-The other two functions of the \Rpackage{GSVA} package are
259
-\Rfunction{filterGeneSets()} and \Rfunction{computeGeneSetsOverlaps()} that
260
-serve to explicitly filter gene sets by size and by pair-wise overlap,
261
-respectively. Note that the size filter can also be applied within the
262
-\Rfunction{gsva()} function call.
263
-
264
-The \Rfunction{gsva()} function also offers the following three other unsupervised
265
-GSE methods that calculate single sample pathway summaries of expression and which
266
-can be selected through the \Robject{method} argument:
267
-
268
-\begin{itemize}
269
-  \item \Robject{method="plage"} \citep{tomfohr_pathway_2005}. Pathway level analysis
270
-        of gene expression (PLAGE) standardizes first expression profiles into z-scores
271
-        over the samples and then calculates the singular value decomposition
272
-        $Z_\gamma=UDV'$ on the z-scores of the genes in the gene set. The coefficients
273
-        of the first right-singular vector (first column of $V$) are taken as the gene
274
-        set summaries of expression over the samples.
275
-  \item \Robject{method="zscore"} \citep{lee_inferring_2008}. The combined z-score method
276
-        also, as PLAGE, standardizes first expression profiles into z-scores over the samples,
277
-        but combines them together for each gene set at each individual sample as follows.
278
-        Given a gene set $\gamma=\{1,\dots,k\}$ with z-scores $Z_1,\dots,Z_k$ for each gene,
279
-        the combined z-score $Z_\gamma$ for the gene set $\gamma$ is defined as:
280
-        \begin{equation}
281
-        Z_\gamma = \frac{\sum_{i=1}^k Z_i}{\sqrt{k}}\,.
282
-        \end{equation}
283
-  \item \Robject{method="ssgsea"} \citep{barbie_systematic_2009}. Single sample GSEA (ssGSEA)
284
-        calculates a gene set enrichment score per sample as the normalized difference in
285
-        empirical cumulative distribution functions of gene expression ranks inside and
286
-        outside the gene set.
287
-\end{itemize}
288
-
289
-By default \Robject{method="gsva"} and the \Rfunction{gsva()} function uses the GSVA algorithm.
290
-
291
-\section{Applications}
292
-
293
-In this section we illustrate the following applications of \Rpackage{GSVA}:
294
-
295
-\begin{itemize}
296
-  \item Functional enrichment between two subtypes of leukemia.
297
-  \item Identification of molecular signatures in distinct glioblastoma subtypes.
298
-\end{itemize}
299
-
300
-Throughout this vignette we will use the C2 collection of curated gene sets that
301
-form part of the Molecular Signatures Database (MSigDB) version 3.0. This
302
-particular collection of gene sets is provided as a \Rclass{GeneSetCollection}
303
-object called \Robject{c2BroadSets} in the accompanying experimental data package
304
-\Rpackage{GSVAdata}, which stores these and other data employed in this vignette.
305
-These data can be loaded as follows:
306
-
307
-<<results=hide>>=
308
-library(GSEABase)
309
-library(GSVAdata)
310
-
311
-data(c2BroadSets)
312
-c2BroadSets
313
-@
314
-where we observe that \Robject{c2BroadSets} contains \Sexpr{length(c2BroadSets)}
315
-gene sets. We also need to load the following additional libraries:
316
-
317
-<<results=hide>>=
318
-library(Biobase)
319
-library(genefilter)
320
-library(limma)
321
-library(RColorBrewer)
322
-library(GSVA)
323
-@
324
-As a final setup step for this vignette, we will employ the
325
-\Rfunction{cache()} function from the \Rpackage{Biobase} package in order to
326
-load some pre-computed results and speed up the building time of the vignette:
327
-
328
-<<>>=
329
-cacheDir <- system.file("extdata", package="GSVA")
330
-cachePrefix <- "cache4vignette_"
331
-@
332
-In order to enforce re-calculating everything, either the call to the
333
-\Rfunction{cache()} function should be replaced by its first argument, or the
334
-following command should be written in the R console at this point:
335
-
336
-<<eval=FALSE>>=
337
-file.remove(paste(cacheDir, list.files(cacheDir, pattern=cachePrefix), sep="/"))
338
-@
339
-
340
-\subsection{Functional enrichment}
341
-
342
-In this section we illustrate how to identify functionally enriched gene sets
343
-between two phenotypes. As in most of the applications we start by calculating
344
-GSVA enrichment scores and afterwards, we will employ the linear modeling
345
-techniques implemented in the \Rpackage{limma} package to find the enriched gene
346
-sets.
347
-
348
-The data set we use in this section corresponds to the microarray data from
349
-\citep{armstrong_mll_2002} which consists of 37 different individuals
350
-with human acute leukemia, where 20 of them have conventional childhood acute
351
-lymphoblastic leukemia (ALL) and the other 17 are affected with the MLL
352
-(mixed-lineage leukemia gene) translocation. This leukemia data set is stored as
353
-an \verb+ExpressionSet+ object called \Robject{leukemia} in the
354
-\Rpackage{GSVAdata} package and details on how the data was pre-processed can be
355
-found in the corresponding help page. Enclosed with the RMA expression values we
356
-provide some metadata including the main phenotype corresponding to the leukemia
357
-sample subtype.
358
-
359
-<<>>=
360
-data(leukemia)
361
-leukemia_eset
362
-head(pData(leukemia_eset))
363
-table(leukemia_eset$subtype)
364
-@
365
-Let's examine the variability of the expression profiles across samples by
366
-plotting the cumulative distribution of IQR values as shown in Figure~\ref{figIQR}.
367
-About 50\% of the probesets show very limited variability across samples
368
-and, therefore, in the following non-specific filtering step we remove this
369
-fraction from further analysis.
370
-
371
-<<figIQR, echo=FALSE, results=hide>>=
372
-png(filename="GSVA-figIQR.png", width=500, height=500, res=150)
373
-IQRs <- esApply(leukemia_eset, 1, IQR)
374
-plot.ecdf(IQRs, pch=".", xlab="Interquartile range (IQR)", main="Leukemia data")
375
-abline(v=quantile(IQRs, prob=0.5), lwd=2, col="red")
376
-dev.off()
377
-@
378
-\begin{figure}[ht]
379
-\centerline{\includegraphics[width=0.5\textwidth]{GSVA-figIQR}}
380
-\caption{Empirical cumulative distribution of the interquartile range (IQR) of
381
-expression values in the leukemia data. The vertical red bar is located at the
382
-50\% quantile value of the cumulative distribution.}
383
-\label{figIQR}
384
-\end{figure}
385
-
386
-We carry out a non-specific filtering step by discarding the 50\% of the
387
-probesets with smaller variability, probesets without Entrez ID annotation,
388
-probesets whose associated Entrez ID is duplicated in the annotation, and
389
-Affymetrix quality control probes:
390
-
391
-<<>>=
392
-filtered_eset <- nsFilter(leukemia_eset, require.entrez=TRUE, remove.dupEntrez=TRUE,
393
-                          var.func=IQR, var.filter=TRUE, var.cutoff=0.5, filterByQuantile=TRUE,
394
-                          feature.exclude="^AFFX")
395
-filtered_eset
396
-leukemia_filtered_eset <- filtered_eset$eset
397
-@
398
-
399
-The calculation of GSVA enrichment scores is performed in one single call to the
400
-\verb+gsva()+ function. However, one should take into account that this function
401
-performs further non-specific filtering steps prior to the actual calculations.
402
-On the one hand, it matches gene identifiers between gene sets and gene
403
-expression values. On the other hand, it discards gene sets that do not meet
404
-minimum and maximum gene set size requirements specified with the arguments
405
-\verb+min.sz+ and \verb+max.sz+, respectively, which, in the call below, are
406
-set to 10 and 500 genes. Because we want to use \Rpackage{limma} on the resulting
407
-GSVA enrichment scores, we leave deliberately unchanged the default argument
408
-\Robject{mx.diff=TRUE} to obtain approximately normally distributed ES.
409
-
410
-<<>>=
411
-cache(leukemia_es <- gsva(leukemia_filtered_eset, c2BroadSets,
412
-                           min.sz=10, max.sz=500, verbose=TRUE),
413
-                           dir=cacheDir, prefix=cachePrefix)
414
-@
415
-We test whether there is a difference between the GSVA enrichment scores from each
416
-pair of phenotypes using a simple linear model and moderated t-statistics computed
417
-by the \verb+limma+ package using an empirical Bayes shrinkage method
418
-\citep[see][]{Smyth_2004}. We are going to examine both, changes at gene level
419
-and changes at pathway level and since, as we shall see below, there are plenty
420
-of them, we are going to employ the following stringent cut-offs to attain a high
421
-level of statistical and biological significance:
422
-
423
-<<>>=
424
-adjPvalueCutoff <- 0.001
425
-logFCcutoff <- log2(2)
426
-@
427
-where we will use the latter only for the gene-level differential expression
428
-analysis.
429
-
430
-<<>>=
431
-design <- model.matrix(~ factor(leukemia_es$subtype))
432
-colnames(design) <- c("ALL", "MLLvsALL")
433
-fit <- lmFit(leukemia_es, design)
434
-fit <- eBayes(fit)
435
-allGeneSets <- topTable(fit, coef="MLLvsALL", number=Inf)
436
-DEgeneSets <- topTable(fit, coef="MLLvsALL", number=Inf,
437
-                       p.value=adjPvalueCutoff, adjust="BH")
438
-res <- decideTests(fit, p.value=adjPvalueCutoff)
439
-summary(res)
440
-@
441
-Thus, there are \Sexpr{sum(res[, "MLLvsALL"]!=0)} MSigDB C2 curated pathways that
442
-are differentially activated between MLL and ALL at \Sexpr{adjPvalueCutoff*100}\%
443
-FDR. When we carry out the corresponding differential expression analysis at gene level:
444
-
445
-<<>>=
446
-logFCcutoff <- log2(2)
447
-design <- model.matrix(~ factor(leukemia_eset$subtype))
448
-colnames(design) <- c("ALL", "MLLvsALL")
449
-fit <- lmFit(leukemia_filtered_eset, design)
450
-fit <- eBayes(fit)
451
-allGenes <- topTable(fit, coef="MLLvsALL", number=Inf)
452
-DEgenes <- topTable(fit, coef="MLLvsALL", number=Inf,
453
-                    p.value=adjPvalueCutoff, adjust="BH", lfc=logFCcutoff)
454
-res <- decideTests(fit, p.value=adjPvalueCutoff, lfc=logFCcutoff)
455
-summary(res)
456
-@
457
-Here, \Sexpr{sum(res[, "MLLvsALL"]!=0)} genes show up as being differentially
458
-expressed with a minimum fold-change of \Sexpr{2^logFCcutoff} at \Sexpr{adjPvalueCutoff*100}\%
459
-FDR. We illustrate the genes and pathways that are changing by means of volcano
460
-plots (Fig.~\ref{leukemiaVolcano}.
461
-
462
-<<leukemiaVolcano, echo=FALSE, results=hide>>=
463
-png(filename="GSVA-leukemiaVolcano.png", width=800, height=500)
464
-par(mfrow=c(1,2))
465
-plot(allGeneSets$logFC, -log10(allGeneSets$P.Value), pch=".", cex=4, col=grey(0.75),
466
-     main="Gene sets", xlab="GSVA enrichment score difference", ylab=expression(-log[10]~~~Raw~P-value))
467
-abline(h=-log10(max(allGeneSets$P.Value[allGeneSets$adj.P.Val <= adjPvalueCutoff])),
468
-       col=grey(0.5), lwd=1, lty=2)
469
-points(allGeneSets$logFC[match(rownames(DEgeneSets), rownames(allGeneSets))],
470
-       -log10(allGeneSets$P.Value[match(rownames(DEgeneSets), rownames(allGeneSets))]), pch=".",
471
-       cex=4, col="red")
472
-text(max(allGeneSets$logFC)*0.85,
473
-         -log10(max(allGeneSets$P.Value[allGeneSets$adj.P.Val <= adjPvalueCutoff])),
474
-         sprintf("%.1f%% FDR", 100*adjPvalueCutoff), pos=1)
475
-
476
-plot(allGenes$logFC, -log10(allGenes$P.Value), pch=".", cex=4, col=grey(0.75),
477
-     main="Genes", xlab="Log fold-change", ylab=expression(-log[10]~~~Raw~P-value))
478
-abline(h=-log10(max(allGenes$P.Value[allGenes$adj.P.Val <= adjPvalueCutoff])),
479
-       col=grey(0.5), lwd=1, lty=2)
480
-abline(v=c(-logFCcutoff, logFCcutoff), col=grey(0.5), lwd=1, lty=2)
481
-points(allGenes$logFC[match(rownames(DEgenes), rownames(allGenes))],
482
-       -log10(allGenes$P.Value[match(rownames(DEgenes), rownames(allGenes))]), pch=".",
483
-       cex=4, col="red")
484
-text(max(allGenes$logFC)*0.85,
485
-         -log10(max(allGenes$P.Value[allGenes$adj.P.Val <= adjPvalueCutoff])),
486
-         sprintf("%.1f%% FDR", 100*adjPvalueCutoff), pos=1)
487
-dev.off()
488
-@
489
-\begin{figure}
490
-\centerline{\includegraphics[width=0.8\textwidth]{GSVA-leukemiaVolcano}}
491
-\caption{Volcano plots for differential pathway activation (left) and differential
492
-         gene expression (right) in the leukemia data set.}
493
-\label{leukemiaVolcano}
494
-\end{figure}
495
-
496
-The signatures of both, the differentially activated pathways reported by the
497
-GSVA analysis and of the differentially expressed genes are shown in Figures
498
-\ref{leukemiaHeatmapGeneSets} and \ref{leukemiaHeatmapGenes}, respectively.
499
-Many of the gene sets and pathways reported in Figure~\ref{leukemiaHeatmapGeneSets}
500
-are directly related to ALL and MLL.
501
-
502
-<<leukemiaHeatmapGeneSets, echo=FALSE, results=hide>>=
503
-png(filename="GSVA-leukemiaHeatmapGeneSets.png", width=500, height=500)
504
-GSVAsco <- exprs(leukemia_es[rownames(DEgeneSets), ])
505
-colorLegend <- c("darkred", "darkblue")
506
-names(colorLegend) <- c("ALL", "MLL")
507
-sample.color.map <- colorLegend[pData(leukemia_es)[, "subtype"]]
508
-names(sample.color.map) <- colnames(GSVAsco)
509
-sampleClustering <- hclust(as.dist(1-cor(GSVAsco, method="spearman")), method="complete")
510
-geneSetClustering <- hclust(as.dist(1-cor(t(GSVAsco), method="pearson")), method="complete")
511
-heatmap(GSVAsco, ColSideColors=sample.color.map, xlab="samples",
512
-        ylab="Gene sets and pathways", margins=c(2, 20),
513
-        labRow=substr(gsub("_", " ", gsub("^KEGG_|^REACTOME_|^BIOCARTA_", "", rownames(GSVAsco))), 1, 35),
514
-        labCol="", scale="row",
515
-        Colv=as.dendrogram(sampleClustering), Rowv=as.dendrogram(geneSetClustering))
516
-legend("topleft", names(colorLegend), fill=colorLegend, inset=0.01, bg="white")
517
-dev.off()
518
-@
519
-\begin{figure}[ht]
520
-\centerline{\includegraphics[width=0.7\textwidth]{GSVA-leukemiaHeatmapGeneSets}}
521
-\caption{Heatmap of differentially activated pathways at \Sexpr{adjPvalueCutoff*100}\% FDR
522
-in the Leukemia data set.}
523
-\label{leukemiaHeatmapGeneSets}
524
-\end{figure}
525
-
526
-
527
-<<leukemiaHeatmapGenes, echo=FALSE, results=hide>>=
528
-png(filename="GSVA-leukemiaHeatmapGenes.png", width=500, height=500)
529
-exps <- exprs(leukemia_eset[rownames(DEgenes), ])
530
-colorLegend <- c("darkred", "darkblue")
531
-names(colorLegend) <- c("ALL", "MLL")
532
-sample.color.map <- colorLegend[pData(leukemia_eset)[, "subtype"]]
533
-names(sample.color.map) <- colnames(exps)
534
-sampleClustering <- hclust(as.dist(1-cor(exps, method="spearman")), method="complete")
535
-geneClustering <- hclust(as.dist(1-cor(t(exps), method="pearson")), method="complete")
536
-heatmap(exps, ColSideColors=sample.color.map, xlab="samples", ylab="Genes",
537
-        labRow="", labCol="", scale="row", Colv=as.dendrogram(sampleClustering),
538
-         Rowv=as.dendrogram(geneClustering), margins=c(2,2))
539
-legend("topleft", names(colorLegend), fill=colorLegend, inset=0.01, bg="white")
540
-dev.off()
541
-@
542
-\begin{figure}[ht]
543
-\centerline{\includegraphics[width=0.5\textwidth]{GSVA-leukemiaHeatmapGenes}}
544
-\caption{Heatmap of differentially expressed genes with a minimum fold-change of
545
-\Sexpr{2^logFCcutoff} at \Sexpr{adjPvalueCutoff*100}\% FDR in the leukemia data set.}
546
-\label{leukemiaHeatmapGenes}
547
-\end{figure}
548
-
549
-\subsection{Molecular signature identification}
550
-
551
-In \citep{verhaak_integrated_2010} four subtypes of Glioblastoma multiforme
552
-(GBM) - proneural, classical, neural and mesenchymal - were identified by
553
-the characterization of distinct gene-level expression patterns. Using eight
554
-gene set signatures specific to brain cell types - astrocytes, oligodendrocytes,
555
-neurons and cultured astroglial cells - derived from murine models by
556
-\citep{cahoy_transcriptome_2008}, we replicate the analysis of
557
-\citep{verhaak_integrated_2010} by employing GSVA to transform the gene
558
-expression measurements into enrichment scores for these eight gene sets, without
559
-taking the sample subtype grouping into account. We start by loading and have
560
-a quick glance to the data which forms part of the \verb+GSVAdata+ package:
561
-
562
-<<>>=
563
-data(gbm_VerhaakEtAl)
564
-gbm_eset
565
-head(featureNames(gbm_eset))
566
-table(gbm_eset$subtype)
567
-data(brainTxDbSets)
568
-sapply(brainTxDbSets, length)
569
-lapply(brainTxDbSets, head)
570
-@
571
-GSVA enrichment scores for the gene sets contained in \Robject{brainTxDbSets}
572
-are calculated, in this case using \Robject{mx.diff=FALSE},  as follows:
573
-
574
-<<>>=
575
-gbm_es <- gsva(gbm_eset, brainTxDbSets, mx.diff=FALSE, verbose=FALSE, parallel.sz=1)
576
-@
577
-Figure \ref{gbmSignature} shows the GSVA enrichment scores obtained for the
578
-up-regulated gene sets across the samples of the four GBM subtypes. As expected,
579
-the \emph{neural} class is associated with the neural gene set and the astrocytic
580
-gene sets. The \emph{mesenchymal} subtype is characterized by the expression of
581
-mesenchymal and microglial markers, thus we expect it to correlate with the
582
-astroglial gene set. The \emph{proneural} subtype shows high expression of
583
-oligodendrocytic development genes, thus it is not surprising that the
584
-oligodendrocytic gene set is highly enriched for ths group. Interestingly, the
585
-\emph{classical} group correlates highly with the astrocytic gene set. In
586
-summary, the resulting GSVA enrichment scores recapitulate accurately the
587
-molecular signatures from \citet{verhaak_integrated_2010}.
588
-
589
-<<gbmSignature, echo=FALSE, results=hide>>=
590
-png(filename="GSVA-gbmSignature.png", width=700, height=500)
591
-subtypeOrder <- c("Proneural", "Neural", "Classical", "Mesenchymal")
592
-sampleOrderBySubtype <- sort(match(gbm_es$subtype, subtypeOrder), index.return=TRUE)$ix
593
-subtypeXtable <- table(gbm_es$subtype)
594
-subtypeColorLegend <- c(Proneural="red", Neural="green", Classical="blue", Mesenchymal="orange")
595
-geneSetOrder <- c("astroglia_up", "astrocytic_up", "neuronal_up", "oligodendrocytic_up")
596
-geneSetLabels <- gsub("_", " ", geneSetOrder)
597
-hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
598
-hmcol <- hmcol[length(hmcol):1]
599
-
600
-heatmap(exprs(gbm_es)[geneSetOrder, sampleOrderBySubtype], Rowv=NA, Colv=NA,
601
-        scale="row", margins=c(3,5), col=hmcol,
602
-		    ColSideColors=rep(subtypeColorLegend[subtypeOrder], times=subtypeXtable[subtypeOrder]),
603
-				labCol="", gbm_es$subtype[sampleOrderBySubtype],
604
-        labRow=paste(toupper(substring(geneSetLabels, 1,1)), substring(geneSetLabels, 2), sep=""),
605
-        cexRow=2, main=" \n ")
606
-par(xpd=TRUE)
607
-text(0.22,1.11, "Proneural", col="red", cex=1.2)
608
-text(0.36,1.11, "Neural", col="green", cex=1.2)
609
-text(0.48,1.11, "Classical", col="blue", cex=1.2)
610
-text(0.66,1.11, "Mesenchymal", col="orange", cex=1.2)
611
-mtext("Gene sets", side=4, line=0, cex=1.5)
612
-mtext("Samples          ", side=1, line=4, cex=1.5)
613
-dev.off()
614
-@
615
-\begin{figure}
616
-\centerline{\includegraphics[width=0.6\textwidth]{GSVA-gbmSignature}}
617
-\caption{Heatmap of GSVA scores for cell-type brain signatures from murine models (y-axis)
618
-across GBM samples grouped by GBM subtype.}
619
-\label{gbmSignature}
620
-\end{figure}
621
-
622
-\section{Comparison with other methods}
623
-
624
-In this section we compare with simulated data the performance of GSVA with other methods
625
-producing pathway summaries of gene expression, concretely, PLAGE, the combined z-score and
626
-ssGSEA which are available through the argument \Robject{method} of the function
627
-\Rfunction{gsva()}. We employ the following simple linear additive model for simulating
628
-normalized microarray data on $p$ genes and $n$ samples divided in two groups representing
629
-a case-control scenario:
630
-
631
-\begin{equation}
632
-y_{ij} = \alpha_i + \beta_j + \epsilon_{ij}\,,
633
-\end{equation}
634
-where $\alpha_i\sim\mathcal{N}(0, 1)$ is a gene-specific effect, such as a probe-effect,
635
-with $i=1,\dots,p$, $\beta_j\sim\mathcal{N}(\mu_j, 0.5)$ is a sample-effect with $j=1,2$ and
636
-$e_{ij}\sim\mathcal{N}(0,1)$ corresponds to random noise.
637
-
638
-We will assess the statistical power to detect one differentially expressed (DE) gene set formed by
639
-30 genes, out of $p=1000$, as a function of the sample size and two varying conditions: the
640
-fraction of differentially expressed genes in the gene set (50\% and 80\%) and the signal-to-noise
641
-ratio expressed as the magnitude of the mean sample effect for one of the sample groups
642
-($\mu_1=0$ and either $\mu_2=0.5$ or $\mu_2=1$). Simulatenously, we will assess the empirical
643
-type-I error rate by building using one gene set of 30 non-DE genes.
644
-
645
-The following function enables simulating such data, computes the corresponding GSE scores with
646
-each method, performs a $t$-test on the tested gene sets between the two groups of samples for each
647
-method and returns the corresponding $p$-values:
648
-
649
-<<>>=
650
-runSim <- function(p, n, gs.sz, S2N, fracDEgs) {
651
-  sizeDEgs <- round(fracDEgs * gs.sz)
652
-  group.n <- round(n / 2)
653
-
654
-  sampleEffect <- rnorm(n, mean=0, sd=1)
655
-  sampleEffectDE <- rnorm(n, mean=S2N, sd=0.5)
656
-  probeEffect <- rnorm(p, mean=0, sd=1)
657
-  noise <- matrix(rnorm(p*n, mean=0, sd=1), nrow=p, ncol=n)
658
-  noiseDE <- matrix(rnorm(p*n, mean=0, sd=1), nrow=p, ncol=n)
659
-  M <- outer(probeEffect, sampleEffect, "+") + noise
660
-  M2 <- outer(probeEffect, sampleEffectDE, "+") + noiseDE
661
-  M[1:sizeDEgs, 1:group.n] <- M2[1:sizeDEgs, 1:group.n]
662
-
663
-  rownames(M) <- paste0("g", 1:nrow(M))
664
-  geneSets <- list(H1GeneSet=paste0("g", 1:(gs.sz)),
665
-                   H0GeneSet=paste0("g", (gs.sz+1):(2*gs.sz)))
666
-
667
-  es.gsva <- gsva(M, geneSets, verbose=FALSE, parallel.sz=1)
668
-  es.ss <- gsva(M, geneSets, method="ssgsea", verbose=FALSE, parallel.sz=1)
669
-  es.z <- gsva(M, geneSets, method="zscore", verbose=FALSE, parallel.sz=1)
670
-  es.plage <- gsva(M, geneSets, method="plage", verbose=FALSE, parallel.sz=1)
671
-
672
-  h1.gsva.pval <- t.test(es.gsva["H1GeneSet", 1:group.n],es.gsva["H1GeneSet", (group.n+1):n])$p.value
673
-  h1.ssgsea.pval <- t.test(es.ss["H1GeneSet", 1:group.n],es.ss["H1GeneSet", (group.n+1):n])$p.value
674
-  h1.zscore.pval <- t.test(es.z["H1GeneSet", 1:group.n],es.z["H1GeneSet", (group.n+1):n])$p.value
675
-  h1.plage.pval <- t.test(es.plage["H1GeneSet", 1:group.n],es.plage["H1GeneSet", (group.n+1):n])$p.value
676
-
677
-  h0.gsva.pval <- t.test(es.gsva["H0GeneSet", 1:group.n],es.gsva["H0GeneSet", (group.n+1):n])$p.value
678
-  h0.ssgsea.pval <- t.test(es.ss["H0GeneSet", 1:group.n],es.ss["H0GeneSet", (group.n+1):n])$p.value
679
-  h0.zscore.pval <- t.test(es.z["H0GeneSet", 1:group.n],es.z["H0GeneSet", (group.n+1):n])$p.value
680
-  h0.plage.pval <- t.test(es.plage["H0GeneSet", 1:group.n],es.plage["H0GeneSet", (group.n+1):n])$p.value
681
-
682
-  c(h1.gsva.pval, h1.ssgsea.pval, h1.zscore.pval, h1.plage.pval,
683
-    h0.gsva.pval, h0.ssgsea.pval, h0.zscore.pval, h0.plage.pval)
684
-}
685
-@
686
-The next function takes the $p$-values of the output of the previous function and
687
-estimates the statistical power as the fraction of non-rejections on the DE gene set,
688
-and the empirical type-I error rate as the fraction of rejections on the non-DE gene set,
689
-at a significant level $\alpha=0.05$.
690
-
691
-<<>>=
692
-estPwrTypIerr <- function(pvals, alpha=0.05) {
693
-  N <- ncol(pvals)
694
-  c(1 - sum(pvals[1, ] > alpha)/N, 1 - sum(pvals[2, ] > alpha)/N,1 - sum(pvals[3, ] > alpha)/N, 1 - sum(pvals[4, ] > alpha)/N,
695
-        sum(pvals[5, ] <= alpha)/N, sum(pvals[6, ] <= alpha)/N, sum(pvals[7, ] <= alpha)/N, sum(pvals[8, ] <= alpha)/N)
696
-}
697
-@
698
-Finally, we perform the simulation on each of the four described scenarios 60 times using
699
-the code below. The results in Fig.~\ref{simpower} show that GSVA attains higher statistical
700
-power than the other three methods in each of the simulated scenarios, while providing a
701
-similar control of the type-I error rate. Note that the fluctuations of this latter estimate
702
-are due to the limited number of times we repeat the simulation; see \citet{haenzelmann_castelo_guinney_2013}
703
-for more stable estimates obtained with a much larger number of repeated simulations.
704
-
705
-<<>>=
706
-set.seed(1234)
707
-
708
-exp1 <- cbind(estPwrTypIerr(replicate(60, runSim(1000, 10, gs.sz=30, S2N=0.5, fracDEgs=0.5))),
709
-              estPwrTypIerr(replicate(60, runSim(1000, 20, gs.sz=30, S2N=0.5, fracDEgs=0.5))),
710
-              estPwrTypIerr(replicate(60, runSim(1000, 40, gs.sz=30, S2N=0.5, fracDEgs=0.5))),
711
-              estPwrTypIerr(replicate(60, runSim(1000, 60, gs.sz=30, S2N=0.5, fracDEgs=0.5))))
712
-
713
-exp2 <- cbind(estPwrTypIerr(replicate(60, runSim(1000, 10, gs.sz=30, S2N=1.0, fracDEgs=0.5))),
714
-              estPwrTypIerr(replicate(60, runSim(1000, 20, gs.sz=30, S2N=1.0, fracDEgs=0.5))),
715
-              estPwrTypIerr(replicate(60, runSim(1000, 40, gs.sz=30, S2N=1.0, fracDEgs=0.5))),
716
-              estPwrTypIerr(replicate(60, runSim(1000, 60, gs.sz=30, S2N=1.0, fracDEgs=0.5))))
717
-
718
-exp3 <- cbind(estPwrTypIerr(replicate(60, runSim(1000, 10, gs.sz=30, S2N=0.5, fracDEgs=0.8))),
719
-              estPwrTypIerr(replicate(60, runSim(1000, 20, gs.sz=30, S2N=0.5, fracDEgs=0.8))),
720
-              estPwrTypIerr(replicate(60, runSim(1000, 40, gs.sz=30, S2N=0.5, fracDEgs=0.8))),
721
-              estPwrTypIerr(replicate(60, runSim(1000, 60, gs.sz=30, S2N=0.5, fracDEgs=0.8))))
722
-
723
-exp4 <- cbind(estPwrTypIerr(replicate(60, runSim(1000, 10, gs.sz=30, S2N=1.0, fracDEgs=0.8))),
724
-              estPwrTypIerr(replicate(60, runSim(1000, 20, gs.sz=30, S2N=1.0, fracDEgs=0.8))),
725
-              estPwrTypIerr(replicate(60, runSim(1000, 40, gs.sz=30, S2N=1.0, fracDEgs=0.8))),
726
-              estPwrTypIerr(replicate(60, runSim(1000, 60, gs.sz=30, S2N=1.0, fracDEgs=0.8))))
727
-@
728
-
729
-<<powertype1errsim, fig=TRUE, echo=FALSE, results=hide, include=FALSE, height=8, width=5>>=
730
-plotPower <- function(statpower, main, legendposition="bottomright", ...) {
731
-  plot(statpower[1,], ylim=c(0, 1.0), type="b", lwd=2, pch=1, main=main,
732
-       col="blue", ylab="Statistcal Power", xlab="Sample Size", xaxt="n")
733
-  lines(statpower[2,], col="red", type="b", lwd=2, pch=2)
734
-  lines(statpower[3,], col="darkgreen", type="b", lwd=2, pch=3)
735
-  lines(statpower[4,], col="lightgreen", type="b", lwd=2, pch=4)
736
-  if (!is.null(legendposition))
737
-    legend(legendposition, c("GSVA","ssGSEA","z-score","PLAGE"),
738
-           col=c("blue","red","darkgreen","lightgreen"),pch=1:4,lty=1,lwd=2,inset=0.02)
739
-  axis(1,at=1:4, labels=c("10","20","40","60"))
740
-}
741
-
742
-plotType1Error <- function(tmp, title, legendposition="bottomright", alpha=0.05, ...){
743
-  plot(tmp[5,],ylim=c(0, 0.2),type="b",lwd=2,pch=1,
744
-       col="blue",ylab="Empirical Type-I Error",xlab="Sample Size",xaxt="n",main=title, ...)
745
-  lines(tmp[6,],col="red",type="b",lwd=2,pch=2)
746
-  lines(tmp[7,],col="darkgreen",type="b",lwd=2,pch=3)
747
-  lines(tmp[8,],col="lightgreen",type="b",lwd=2,pch=4)
748
-  if (!is.null(legendposition))
749
-    legend(legendposition,c("GSVA","ssGSEA","z-score","PLAGE"),col=c("blue","red","darkgreen","lightgreen"),pch=1:4,lty=1,lwd=2,inset=0.02)
750
-  axis(1,at=c(1:dim(tmp)[2]), labels=c("10","20","40","60"))
751
-  abline(h=alpha, lty=2)
752
-}
753
-
754
-labelPlot <- function(lab, font, cex, offsetx=0.05, offsety=0.05) {
755
-  par(xpd=TRUE)
756
-  w <- par("usr")[2] - par("usr")[1]
757
-  h <- par("usr")[4] - par("usr")[3]
758
-  text(par("usr")[1]-w*offsetx, par("usr")[4]+h*offsety, lab, font=font, cex=cex)
759
-  par(xpd=FALSE)
760
-}
761
-
762
-par(mfrow=c(4,2), mar=c(4, 4, 2, 1))
763
-plotPower(exp1, main="", legendposition=NULL, las=1)
764
-labelPlot("A", 2, 2, 0.2, 0.15)
765
-plotType1Error(exp1,"",legendposition="topright", las=1)
766
-labelPlot("B", 2, 2, 0.2, 0.15)
767
-plotPower(exp2, main="", legendposition=NULL, las=1)
768
-labelPlot("C", 2, 2, 0.2, 0.15)
769
-plotType1Error(exp2,"",legendposition="topright", las=1)
770
-labelPlot("D", 2, 2, 0.2, 0.15)
771
-plotPower(exp3, main="", legendposition=NULL, las=1)
772
-labelPlot("E", 2, 2, 0.2, 0.15)
773
-plotType1Error(exp3,"",legendposition="topright", las=1)
774
-labelPlot("F", 2, 2, 0.2, 0.15)
775
-plotPower(exp4, main="", legendposition=NULL, las=1)
776
-labelPlot("G", 2, 2, 0.2, 0.15)
777
-plotType1Error(exp4,"",legendposition="topright", las=1)
778
-labelPlot("H", 2, 2, 0.2, 0.15)
779
-@
780
-\begin{figure}[p]
781
-\centerline{\includegraphics[height=0.8\textheight]{GSVA-powertype1errsim}}
782
-\caption{{\bf Comparison of the statistical power and type-I error rate between GSVA, PLAGE,
783
-single sample GSEA (ssGSEA) and combined z-score (zscore).}
784
-The averaged results of 60 simulations are depicted as function of the sample size on the
785
-$x$-axis, for each of the GSE methods. On the $y$-axis either the statistical power (A, C, E, G)
786
-or the empirical type-I error rate (B, D, F, H) is shown. GSE scores were calculated with each
787
-method with respect to two gene sets, one of them differentially expressed (DE) and the other one
788
-not. Statistical power and empirical type-I error rates were estimated by performing a $t$-test on
789
-the DE and non-DE gene sets, respectively, at a significance level of $\alpha=0.05$. These
790
-simulations were carried out under the following four different scenarios for the DE gene set:
791
-(A,B) weak signal-to-noise ratio, 50\% of DE genes in the DE gene set; (C,D) strong
792
-signal-to-noise ratio, 50\% of DE genes in the DE gene set; (E, F) weak signal-to-noise ratio,
793
-80\% of DE genes in the DE gene set; (G, H) strong signal-to-noise ratio, 80\% of DE gene in the
794
-DE gene set.}
795
-\label{simpower}
796
-\end{figure}
797
-
798
-\section{GSVA for RNA-Seq data}
799
-
800
-In this section we illustrate how to use GSVA with RNA-seq data and, more importantly, how the
801
-method provides pathway activity profiles analogous to the ones obtained from microarray data by
802
-using samples of lymphoblastoid cell lines (LCL) from HapMap individuals which have been profiled
803
-using both technologies \cite{huang_genome-wide_2007, pickrell_understanding_2010}. These data
804
-form part of the experimental package \Rpackage{GSVAdata} and the corresponding help pages contain
805
-details on how the data were processed. We start loading these data and verifying that
806
-they indeed contain expression data for the same genes and samples, as follows:
807
-
808
-<<>>=
809
-data(commonPickrellHuang)
810
-
811
-stopifnot(identical(featureNames(huangArrayRMAnoBatchCommon_eset),
812
-                    featureNames(pickrellCountsArgonneCQNcommon_eset)))
813
-stopifnot(identical(sampleNames(huangArrayRMAnoBatchCommon_eset),
814
-                    sampleNames(pickrellCountsArgonneCQNcommon_eset)))
815
-@
816
-Next, for the current analysis we use the subset of canonical pathways from the C2
817
-collection of MSigDB Gene Sets. These correspond to the following pathways from
818
-KEGG, REACTOME and BIOCARTA:
819
-
820
-<<>>=
821
-canonicalC2BroadSets <- c2BroadSets[c(grep("^KEGG", names(c2BroadSets)),
822
-                                      grep("^REACTOME", names(c2BroadSets)),
823
-                                      grep("^BIOCARTA", names(c2BroadSets)))]
824
-canonicalC2BroadSets
825
-@
826
-Additionally, we extend this collection of gene sets with two formed by genes
827
-with sex-specific expression:
828
-
829
-<<<>>=
830
-data(genderGenesEntrez)
831
-
832
-MSY <- GeneSet(msYgenesEntrez, geneIdType=EntrezIdentifier(),
833
-               collectionType=BroadCollection(category="c2"), setName="MSY")
834
-MSY
835
-XiE <- GeneSet(XiEgenesEntrez, geneIdType=EntrezIdentifier(),
836
-               collectionType=BroadCollection(category="c2"), setName="XiE")
837
-XiE
838
-
839
-
840
-canonicalC2BroadSets <- GeneSetCollection(c(canonicalC2BroadSets, MSY, XiE))
841
-canonicalC2BroadSets
842
-@
843
-We calculate now GSVA enrichment scores for these gene sets using first the microarray
844
-data and then the RNA-seq integer count data. Note that the only requirement to do the
845
-latter is to set the argument \Robject{kcdf="Poisson"} which is \Robject{"Gaussian"} by
846
-default. Note, however, that if our RNA-seq derived expression levels would be continous,
847
-such as log-CPMs, log-RPKMs or log-TPMs, the the default value of the \Robject{kcdf}
848
-argument should remain unchanged.
849
-
850
-<<<>>=
851
-esmicro <- gsva(huangArrayRMAnoBatchCommon_eset, canonicalC2BroadSets, min.sz=5, max.sz=500,
852
-                mx.diff=TRUE, verbose=FALSE, parallel.sz=1)
853
-dim(esmicro)
854
-esrnaseq <- gsva(pickrellCountsArgonneCQNcommon_eset, canonicalC2BroadSets, min.sz=5, max.sz=500,
855
-                 kcdf="Poisson", mx.diff=TRUE, verbose=FALSE, parallel.sz=1)
856
-dim(esrnaseq)
857
-@
858
-To compare expression values from both technologies we are going to transform
859
-the RNA-seq read counts into RPKM values. For this purpose we need gene length and G+C content
860
-information also stored in the \Rpackage{GSVAdata} package and use the \Rfunction{cpm()}
861
-function from the \Rpackage{edgeR} package. Note that RPKMs can only be calculated for those
862
-genes for which the gene length and G+C content information is available:
863
-
864
-<<>>=
865
-library(edgeR)
866
-
867
-data(annotEntrez220212)
868
-head(annotEntrez220212)
869
-
870
-cpm <- cpm(exprs(pickrellCountsArgonneCQNcommon_eset))
871
-dim(cpm)
872
-
873
-common <- intersect(rownames(cpm), rownames(annotEntrez220212))
874
-length(common)
875
-
876
-rpkm <- sweep(cpm[common, ], 1, annotEntrez220212[common, "Length"] / 10^3, FUN="/")
877
-dim(rpkm)
878
-
879
-dim(huangArrayRMAnoBatchCommon_eset[rownames(rpkm), ])
880
-@
881
-We finally calculate Spearman correlations between gene and gene-level expression values
882
-and gene set level GSVA enrichment scores produced from data obtained by microarray and
883
-RNA-seq technologies:
884
-
885
-<<>>=
886
-corsrowsgene <- sapply(1:nrow(huangArrayRMAnoBatchCommon_eset[rownames(rpkm), ]),
887
-                       function(i, expmicro, exprnaseq) cor(expmicro[i, ], exprnaseq[i, ], method="pearson"),
888
-                       exprs(huangArrayRMAnoBatchCommon_eset[rownames(rpkm), ]), log2(rpkm+0.1))
889
-names(corsrowsgene) <- rownames(rpkm)
890
-
891
-corsrowsgs <- sapply(1:nrow(esmicro),
892
-                     function(i, esmicro, esrnaseq) cor(esmicro[i, ], esrnaseq[i, ], method="spearman"),
893
-                     exprs(esmicro), exprs(esrnaseq))
894
-names(corsrowsgs) <- rownames(esmicro)
895
-@
896
-In panels A and B of Figure~\ref{rnaseqcomp} we can see the distribution of these correlations at
897
-gene and gene set level. They show that GSVA enrichment scores correlate similarly to gene
898
-expression levels produced by both profiling technologies.
899
-
900
-<<RNAseqComp, echo=FALSE, results=hide>>=
901
-png(filename="GSVA-RNAseqComp.png", width=1100, height=1100, res=150)
902
-par(mfrow=c(2,2), mar=c(4, 5, 3, 2))
903
-hist(corsrowsgene, xlab="Spearman correlation", main="Gene level\n(RNA-seq RPKM vs Microarray RMA)",
904
-     xlim=c(-1, 1), col="grey", las=1)
905
-par(xpd=TRUE)
906
-text(par("usr")[1]*1.5, par("usr")[4]*1.1, "A", font=2, cex=2)
907
-par(xpd=FALSE)
908
-hist(corsrowsgs, xlab="Spearman correlation", main="Gene set level\n(GSVA enrichment scores)",
909
-     xlim=c(-1, 1), col="grey", las=1)
910
-par(xpd=TRUE)
911
-text(par("usr")[1]*1.5, par("usr")[4]*1.1, "B", font=2, cex=2)
912
-par(xpd=FALSE)
913
-plot(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ], xlab="GSVA scores RNA-seq", ylab="GSVA scores microarray",
914
-     main=sprintf("MSY R=%.2f", cor(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ])), las=1, type="n")
915
-     sprintf("MSY R=%.2f", cor(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ]))
916
-abline(lm(exprs(esmicro)["MSY", ] ~ exprs(esrnaseq)["MSY", ]), lwd=2, lty=2, col="grey")
917
-points(exprs(esrnaseq)["MSY", pickrellCountsArgonneCQNcommon_eset$Gender == "Female"],
918
-       exprs(esmicro)["MSY", huangArrayRMAnoBatchCommon_eset$Gender == "Female"], col="red", pch=21, bg="red", cex=1)
919
-points(exprs(esrnaseq)["MSY", pickrellCountsArgonneCQNcommon_eset$Gender == "Male"],
920
-       exprs(esmicro)["MSY", huangArrayRMAnoBatchCommon_eset$Gender == "Male"], col="blue", pch=21, bg="blue", cex=1)
921
-par(xpd=TRUE)
922
-text(par("usr")[1]*1.5, par("usr")[4]*1.1, "C", font=2, cex=2)
923
-par(xpd=FALSE)
924
-plot(exprs(esrnaseq)["XiE", ], exprs(esmicro)["XiE", ], xlab="GSVA scores RNA-seq", ylab="GSVA scores microarray",
925
-     main=sprintf("XiE R=%.2f", cor(exprs(esrnaseq)["XiE", ], exprs(esmicro)["XiE", ])), las=1, type="n")
926
-abline(lm(exprs(esmicro)["XiE", ] ~ exprs(esrnaseq)["XiE", ]), lwd=2, lty=2, col="grey")
927
-points(exprs(esrnaseq["XiE", pickrellCountsArgonneCQNcommon_eset$Gender == "Female"]),
928
-       exprs(esmicro)["XiE", huangArrayRMAnoBatchCommon_eset$Gender == "Female"], col="red", pch=21, bg="red", cex=1)
929
-points(exprs(esrnaseq)["XiE", pickrellCountsArgonneCQNcommon_eset$Gender == "Male"],
930
-       exprs(esmicro)["XiE", huangArrayRMAnoBatchCommon_eset$Gender == "Male"], col="blue", pch=21, bg="blue", cex=1)
931
-par(xpd=TRUE)
932
-text(par("usr")[1]*1.5, par("usr")[4]*1.1, "D", font=2, cex=2)
933
-par(xpd=FALSE)
934
-dev.off()
935
-@
936
-\begin{figure}[p]
937
-\centerline{\includegraphics[height=0.5\textheight]{GSVA-RNAseqComp}}
938
-\caption{{\bf GSVA for RNA-seq (Argonne).} A. Distribution of Spearman correlation values between gene expression profiles of RNA-seq and microarray data. B. Distribution of Spearman correlation values between GSVA enrichment scores of gene sets calculated from RNA-seq and microarray data. C and D. Comparison of GSVA enrichment scores obtained from microarray and RNA-seq data for two gene sets containing genes with sex-specific expression: MSY formed by genes from the male-specific region of the Y chromosome, thus male-specific, and XiE formed by genes that escape X-inactivation in females, thus female-specific. Red and blue dots represent female and male samples, respectively. In both cases GSVA-scores show very high correlation between the two profiling technologies where female samples show higher enrichment scores in the female-specific gene set and male samples show higher enrichment scores in the male-specific gene set.}
939
-\label{rnaseqcomp}
940
-\end{figure}
941
-
942
-We also examined the two gene sets containing gender specific genes in detail: those that escape
943
-X-inactivation in female samples \citep{carrel_x-inactivation_2005} and those that are located on
944
-the male-specific region of the Y chrosomome \citep{skaletsky_male-specific_2003}. In panels C and D
945
-of Figure~\ref{rnaseqcomp} we can see how microarray and RNA-seq enrichment scores correlate very
946
-well in these gene sets, with $\rho=0.82$ for the male-specific gene set and $\rho=0.78$ for the
947
-female-specific gene set. Male and female samples show higher GSVA enrichment scores in their
948
-corresponding gene set. This demonstrates the flexibility of GSVA to enable analogous unsupervised
949
-and single sample GSE analyses in data coming from both, microarray and RNA-seq technologies.
950
-
951
-\section{Session Information}
952
-
953
-<<info, results=tex>>=
954
-toLatex(sessionInfo())
955
-@
956
-
957
-\bibliography{GSVA}
958
-\bibliographystyle{apalike}
959
-
960
-\end{document}
... ...
@@ -363,13 +363,14 @@
363 363
 }
364 364
 
365