@@ -1,17 +1,18 @@
 Package: GSVA
-Version: 1.39.17
+Version: 1.39.18
 Title: Gene Set Variation Analysis for microarray and RNA-seq data
 Authors@R: c(person("Justin", "Guinney", role=c("aut", "cre"), email="justin.guinney@sagebase.org"),
 person("Robert", "Castelo", role="aut", email="robert.castelo@upf.edu"),
 person("Alexey", "Sergushichev", role="ctb", email="alsergbox@gmail.com"),
 person("Pablo Sebastian", "Rodriguez", role="ctb", email="pablosebastian.rodriguez@upf.edu"))
 Depends: R (>= 3.5.0)
-Imports: methods, stats, utils, graphics, BiocGenerics, S4Vectors, IRanges,
+Imports: methods, stats, utils, graphics, S4Vectors, IRanges,
 Biobase, SummarizedExperiment, GSEABase, Matrix, parallel,
 BiocParallel, SingleCellExperiment, sparseMatrixStats, DelayedArray,
 DelayedMatrixStats, HDF5Array, BiocSingular
-Suggests: RUnit, BiocStyle, knitr, markdown, limma, RColorBrewer, genefilter,
- edgeR, GSVAdata, shiny, shinythemes, ggplot2, data.table, plotly
+Suggests: BiocGenerics, RUnit, BiocStyle, knitr, markdown, limma, RColorBrewer,
+ genefilter, edgeR, GSVAdata, shiny, shinythemes, ggplot2, data.table,
+ plotly
 Description: Gene Set Variation Analysis (GSVA) is a non-parametric, unsupervised method for estimating variation of gene set enrichment through the samples of a expression data set. GSVA performs a change in coordinate systems, transforming the data from a gene by sample matrix to a gene-set by sample matrix, thereby allowing the evaluation of pathway enrichment for each sample. This new matrix of GSVA enrichment scores facilitates applying standard analytical methods like functional enrichment, survival analysis, clustering, CNV-pathway analysis or cross-tissue pathway analysis, in a pathway-centric manner.
 License: GPL (>= 2)
 VignetteBuilder: knitr
@@ -222,7 +222,6 @@ setMethod("gsva", signature(expr="SummarizedExperiment", gset.idx.list="GeneSetC
 ## 'annotation' argument
 mapped.gset.idx.list <- mapIdentifiers(gset.idx.list,
 AnnoOrEntrezIdentifier(annotpkg))
- mapped.gset.idx.list <- geneIds(mapped.gset.idx.list)
 } else {
 mapped.gset.idx.list <- gset.idx.list
 if (verbose) {
@@ -230,6 +229,7 @@ setMethod("gsva", signature(expr="SummarizedExperiment", gset.idx.list="GeneSetC
 "Attempting to directly match identifiers in 'expr' to gene sets.", sep="\n")
 }
 }
+ mapped.gset.idx.list <- geneIds(mapped.gset.idx.list)
 
 ## map to the actual features for which expression data is available
 mapped.gset.idx.list <- .mapGeneSetsToFeatures(mapped.gset.idx.list, rownames(expr))
@@ -149,7 +149,7 @@ rankHDF5 <- function(X){
 .fastRndWalk2 <- function(gSetIdx, geneRanking, ra_block) {
 n <- length(geneRanking)
 k <- length(gSetIdx)
- idxs <- sort.int(fastmatch::fmatch(gSetIdx, geneRanking))
+ idxs <- sort.int(match(gSetIdx, geneRanking))
 stepCDFinGeneSet2 <-
 sum(ra_block[geneRanking[idxs]] * (n - idxs + 1)) /
 sum((ra_block[geneRanking[idxs]]))
@@ -29,7 +29,7 @@ test_ssgsea <- function() {
 function(x, y) na.omit(match(x, y)),
 rownames(y))
 fast.gset.idx.list <- lapply(geneSets,
- function(x, y) na.omit(fastmatch::fmatch(x, y)),
+ function(x, y) na.omit(match(x, y)),
 rownames(y))
 checkIdentical(gset.idx.list, fast.gset.idx.list)
 
@@ -1,5 +1,5 @@
 ---
-title: "GSVA: gene set variation analysis for bulk expression data"
+title: "GSVA: gene set variation analysis for molecular profiling data"
 author:
 - name: Robert Castelo
 affiliation:
@@ -18,7 +18,9 @@ abstract: >
 _GSVA_. These methods transform an input gene-by-sample expression data matrix
 into a gene-set-by-sample expression data matrix. Thereby enabling the
 estimation of pathway activity for each sample and facilitating pathway-centric
- analyses of gene expression data. In this vignette we illustrate how to use
+ analyses of gene expression data. While this methodology was initially developed
+ for gene expression data, it is readily aplicable to other types of molecular
+ profiling data. In this vignette we illustrate how to use
 the GSVA package with bulk microarray and RNA-seq expression data.
 date: "`r BiocStyle::doc_date()`"
 package: "`r pkg_ver('GSVA')`"
@@ -73,7 +75,7 @@ X[1:5, 1:5]
 ```
 
 Given a collection of gene sets stored, for instance, in a `list` object such as
-this one with gene sampled uniformly at random without replacement into the gene sets:
+this one with genes sampled uniformly at random without replacement into the gene sets:
 
 ```{r}
 ## sample gene set sizes
@@ -93,7 +95,7 @@ gsva.es[1:5, 1:5]
 ```
 
 So, the first argument to the `gsva()` function is the gene expression data matrix
-and the second the collection of gene sets. The `gsva()` function can take the
+and the second the collection of gene sets. The `gsva()` function can take the input
 expression data and gene sets using different specialized containers that facilitate
 the access and manipulation of molecular and phenotype data, as well as their associated
 metadata. Another advanced features include the use of on-disk and parallel backends to
@@ -144,10 +146,10 @@ methods:
 the package and similarly to ssGSEA, is a non-parametric method that
 uses the empirical CDFs of gene expression ranks inside and outside the gene
 set, but it starts by calculating an expression-level statistic that brings
- gene expression profiles with a different dynamic range to a common scale.
+ gene expression profiles with different dynamic ranges to a common scale.
 
 The interested user may find full technical details about how these methods
-work in their corresponding article cited above. If you use any of them in a
+work in their corresponding articles cited above. If you use any of them in a
 publication, please cite it with the given bibliographic reference.
 
 # Overview of the GSVA functionality
@@ -159,15 +161,15 @@ the following two input arguments:
 following containers:
 * A `matrix` of expression values with genes corresponding to rows and samples
 corresponding to columns.
- * An `ExpressionSet` object, see package `r Biocpkg("Biobase")`.
+ * An `ExpressionSet` object; see package `r Biocpkg("Biobase")`.
 * A `SummarizedExperiment` object, see package
 `r Biocpkg("SummarizedExperiment")`.
-2. A collection of gene sets, which can be provided in one of the following
+2. A collection of gene sets; which can be provided in one of the following
 containers:
 * A `list` object where each element corresponds to a gene set defined by a
 vector of gene identifiers, and the element names correspond to the names of
 the gene sets.
- * A `GeneSetCollection` object, see package `r Biocpkg("GSEABase")`.
+ * A `GeneSetCollection` object; see package `r Biocpkg("GSEABase")`.
 
 One advantage of providing the input data using specialized containers such as
 `ExpressionSet`, `SummarizedExperiment` and `GeneSetCollection` is that the
@@ -190,14 +192,17 @@ the following filters:
 2. Discard genes in the input gene sets that do not map to a gene in the input
 gene expression data matrix.
 
-3. Discard gene sets that, after applying the previous filter, do not meet a
+3. Discard gene sets that, after applying the previous filters, do not meet a
 minimum and maximum size, which by default is one for the minimum size and
- has no limit in the maximum size.
+ has no limit for the maximum size.
 
-If as a result of this filter either no genes or gene sets are left, the
+If, as a result of this filter, either no genes or gene sets are left, the
 `gsva()` function will prompt an error. A common cause for an error at this
 stage is that gene identifiers between the expression data matrix and the gene
-sets do not belong to the same standard nomenclature and could not be mapped.
+sets do not belong to the same standard nomenclature and could not be mapped,
+because either the input data were not provided using some of the specialized
+containers described above or the necessary metadata in those containers to
+successfully map gene identifiers is missing.
 
 By default, the `gsva()` function employs the method described by
 @haenzelmann_castelo_guinney_2013 but this can be changed using the argument
@@ -231,18 +236,19 @@ parameters:
 `abs.ranking=FALSE` and it implies that a modified Kuiper statistic is used
 to calculate enrichment scores, taking the magnitude difference between the
 largest positive and negative random walk deviations. When `abs.ranking=TRUE`
- the original Kuiper statistic is used, by whih the largest positive and
- negative random walk devations add added together. In this case, gene sets
+ the original Kuiper statistic is used, by which the largest positive and
+ negative random walk deviations are added together. In this case, gene sets
 with genes enriched on either extreme (high or low) will be regarded as
 highly activated.
 
 * `tau`: Exponent defining the weight of the tail in the random walk. By
 default `tau=1`. When `method="ssgsea"`, this parameter is also used and its
- default value becomes then `tau=0.25`.
+ default value becomes then `tau=0.25` to match the methodology described in
+ [@barbie_systematic_2009].
 
 In general, the default values for the previous parameters are suitable for
-most analysis settings, which usually consist of normalized continuous
-expression values.
+most analysis settings, which usually consist of some kind of normalized
+continuous expression values.
 
 # Gene sets definitions and mapping to gene identifiers
 
@@ -259,7 +265,7 @@ neurons and cultured astroglial cells), derived from murine models by
 by using GSVA to transform the gene expression measurements into enrichment
 scores for these four gene sets, without taking the sample subtype grouping
 into account. We start by having a quick glance to the data, which forms part of
-the `r Biocpkg("GSVAdata") package:
+the `r Biocpkg("GSVAdata")` package:
 
 ```{r}
 library(GSVAdata)