@@ -63,8 +63,9 @@ Once `r Biocpkg("GSVA")` is installed, it can be loaded with the following comma
 library(GSVA)
 ```
 
-Given a gene expression data matrix with rows corresponding to genes and columns
-to samples, such as this one simulated from random Gaussian data:
+Given a gene expression data matrix, which we shall call `X`, with rows
+corresponding to genes and columns to samples, such as this one simulated from
+random Gaussian data:
 
 ```{r}
 p <- 10000 ## number of genes
@@ -75,8 +76,9 @@ X <- matrix(rnorm(p*n), nrow=p,
 X[1:5, 1:5]
 ```
 
-Given a collection of gene sets stored, for instance, in a `list` object such as
-this one with genes sampled uniformly at random without replacement into the gene sets:
+Given a collection of gene sets stored, for instance, in a `list` object, which
+we shall call `gs`, with genes sampled uniformly at random without replacement
+into 100 different gene sets:
 
 ```{r}
 ## sample gene set sizes
@@ -95,24 +97,26 @@ dim(gsva.es)
 gsva.es[1:5, 1:5]
 ```
 
-So, the first argument to the `gsva()` function is the gene expression data matrix
-and the second the collection of gene sets. The `gsva()` function can take the input
-expression data and gene sets using different specialized containers that facilitate
-the access and manipulation of molecular and phenotype data, as well as their associated
-metadata. Another advanced features include the use of on-disk and parallel backends to
-enable using GSVA on large molecular data sets and speed up computing time. You will
-find information on all these features in this vignette.
+The first argument to the `gsva()` function is the gene expression data matrix
+and the second the collection of gene sets. The `gsva()` function can take the
+input expression data and gene sets using different specialized containers that
+facilitate the access and manipulation of molecular and phenotype data, as well
+as their associated metadata. Another advanced features include the use of
+on-disk and parallel backends to enable, respectively, using GSVA on large
+molecular data sets and speed up computing time. You will find information on
+these features in this vignette.
 
 # Introduction
 
 Gene set variation analysis (GSVA) provides an estimate of pathway activity
-by transforming an input gene-by-sample expression data matrix
-into a gene-set-by-sample one. This resulting expression data matrix can be
-then used with classical analytical methods such as differential expression,
-classification, survival analysis, clustering or correlation analysis in a
-pathway-centric manner. One can also perform sample-wise comparisons between
-pathways and other molecular data types such as microRNA expression or binding
-data, copy-number variation (CNV) data or single nucleotide polymorphisms (SNPs).
+by transforming an input gene-by-sample expression data matrix into a
+corresponding gene-set-by-sample expression data matrix. This resulting
+expression data matrix can be then used with classical analytical methods such
+as differential expression, classification, survival analysis, clustering or
+correlation analysis in a pathway-centric manner. One can also perform
+sample-wise comparisons between pathways and other molecular data types such
+as microRNA expression or binding data, copy-number variation (CNV) data or
+single nucleotide polymorphisms (SNPs).
 
 The GSVA package provides an implementation of this approach for the following
 methods:
@@ -151,7 +155,7 @@ methods:
 
 The interested user may find full technical details about how these methods
 work in their corresponding articles cited above. If you use any of them in a
-publication, please cite it with the given bibliographic reference.
+publication, please cite them with the given bibliographic reference.
 
 # Overview of the GSVA functionality
 
@@ -197,13 +201,14 @@ the following filters:
 minimum and maximum size, which by default is one for the minimum size and
 has no limit for the maximum size.
 
-If, as a result of this filter, either no genes or gene sets are left, the
-`gsva()` function will prompt an error. A common cause for an error at this
-stage is that gene identifiers between the expression data matrix and the gene
-sets do not belong to the same standard nomenclature and could not be mapped,
-because either the input data were not provided using some of the specialized
-containers described above or the necessary metadata in those containers to
-successfully map gene identifiers is missing.
+If, as a result of applying these three filters, either no genes or gene sets
+are left, the `gsva()` function will prompt an error. A common cause for such
+an error at this stage is that gene identifiers between the expression data
+matrix and the gene sets do not belong to the same standard nomenclature and
+could not be mapped. This may happen because either the input data were not
+provided using some of the specialized containers described above or the
+necessary metadata in those containers that allows the software to successfully
+map gene identifiers, is missing.
 
 By default, the `gsva()` function employs the method described by
 @haenzelmann_castelo_guinney_2013 but this can be changed using the argument
@@ -253,12 +258,12 @@ continuous expression values.
 
 # Gene set definitions and gene identifier mapping
 
-Gene sets constitute a simple, yet useful, way to define pathways, essentially
-because we use pathway membership definitions only, neglecting the information
-on molecular interactions. Gene set definitions are a crucial input to any gene
-set enrichment analysis because if our gene sets do not capture the biological
+Gene sets constitute a simple, yet useful, way to define pathways because we
+use pathway membership definitions only, neglecting the information on molecular
+interactions. Gene set definitions are a crucial input to any gene set
+enrichment analysis because if our gene sets do not capture the biological
 processes we are studying, we will likely not find any relevant insights in our
-data.
+data from an analysis based on these gene sets.
 
 There are multiple sources of gene sets, the most popular ones being
 [The Gene Ontology (GO) project](http://geneontology.org) and