Some fixes removing the dependency of fastmatch and other minor issues.

[rcastelo] authored on 03/03/2021 11:07:14
Showing 5 changed files

... ...
@@ -1,17 +1,18 @@
 Package: GSVA
-Version: 1.39.17
+Version: 1.39.18
 Title: Gene Set Variation Analysis for microarray and RNA-seq data
 Authors@R: c(person("Justin", "Guinney", role=c("aut", "cre"), email="justin.guinney@sagebase.org"),
              person("Robert", "Castelo", role="aut", email="robert.castelo@upf.edu"),
              person("Alexey", "Sergushichev", role="ctb", email="alsergbox@gmail.com"),
              person("Pablo Sebastian", "Rodriguez", role="ctb", email="pablosebastian.rodriguez@upf.edu"))
 Depends: R (>= 3.5.0)
-Imports: methods, stats, utils, graphics, BiocGenerics, S4Vectors, IRanges,
+Imports: methods, stats, utils, graphics, S4Vectors, IRanges,
          Biobase, SummarizedExperiment, GSEABase, Matrix, parallel,
          BiocParallel, SingleCellExperiment, sparseMatrixStats, DelayedArray,
          DelayedMatrixStats, HDF5Array, BiocSingular
-Suggests: RUnit, BiocStyle, knitr, markdown, limma, RColorBrewer, genefilter,
-          edgeR, GSVAdata, shiny, shinythemes, ggplot2, data.table, plotly
+Suggests: BiocGenerics, RUnit, BiocStyle, knitr, markdown, limma, RColorBrewer,
+          genefilter, edgeR, GSVAdata, shiny, shinythemes, ggplot2, data.table,
+          plotly
 Description: Gene Set Variation Analysis (GSVA) is a non-parametric, unsupervised method for estimating variation of gene set enrichment through the samples of a expression data set. GSVA performs a change in coordinate systems, transforming the data from a gene by sample matrix to a gene-set by sample matrix, thereby allowing the evaluation of pathway enrichment for each sample. This new matrix of GSVA enrichment scores facilitates applying standard analytical methods like functional enrichment, survival analysis, clustering, CNV-pathway analysis or cross-tissue pathway analysis, in a pathway-centric manner.
 License: GPL (>= 2)
 VignetteBuilder: knitr
... ...
@@ -222,7 +222,6 @@ setMethod("gsva", signature(expr="SummarizedExperiment", gset.idx.list="GeneSetC
     ## 'annotation' argument
     mapped.gset.idx.list <- mapIdentifiers(gset.idx.list,
                                            AnnoOrEntrezIdentifier(annotpkg))
-    mapped.gset.idx.list <- geneIds(mapped.gset.idx.list) 
   } else {
     mapped.gset.idx.list <- gset.idx.list
     if (verbose) {
... ...
@@ -230,6 +229,7 @@ setMethod("gsva", signature(expr="SummarizedExperiment", gset.idx.list="GeneSetC
           "Attempting to directly match identifiers in 'expr' to gene sets.", sep="\n")
     }
   }
+  mapped.gset.idx.list <- geneIds(mapped.gset.idx.list) 
 
   ## map to the actual features for which expression data is available
   mapped.gset.idx.list <- .mapGeneSetsToFeatures(mapped.gset.idx.list, rownames(expr))
... ...
@@ -149,7 +149,7 @@ rankHDF5 <- function(X){
 .fastRndWalk2 <- function(gSetIdx, geneRanking, ra_block) {
   n <- length(geneRanking)
   k <- length(gSetIdx)
-  idxs <- sort.int(fastmatch::fmatch(gSetIdx, geneRanking))
+  idxs <- sort.int(match(gSetIdx, geneRanking))
   stepCDFinGeneSet2 <-
     sum(ra_block[geneRanking[idxs]] * (n - idxs + 1)) /
     sum((ra_block[geneRanking[idxs]]))    
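
Editorial note on the hunk above: `fastmatch::fmatch()` is documented as a faster drop-in replacement for base `match()`, so the swap should change only speed and the dependency list, not the returned positions. The following is a minimal sketch in base R of what the changed line computes. The toy vectors are made up for illustration and are not taken from the package.

```r
## Hypothetical toy input: a ranking of five gene indices and a gene set
## holding two of them.
geneRanking <- c(3L, 1L, 4L, 2L, 5L)
gSetIdx <- c(2L, 3L)

## match() returns, for each gene in the set, its position in the ranking
## (NA if the gene is absent from the ranking).
pos <- match(gSetIdx, geneRanking)

## sort.int() orders those positions; with its default na.last=NA,
## any NA positions would be dropped.
idxs <- sort.int(pos)
idxs
```

Since the result depends only on the positions returned, any function that returns the same positions as `match()` produces the same `idxs`.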
... ...
@@ -29,7 +29,7 @@ test_ssgsea <- function() {
                           function(x, y) na.omit(match(x, y)),
                           rownames(y))
   fast.gset.idx.list <- lapply(geneSets,
-                               function(x, y) na.omit(fastmatch::fmatch(x, y)),
+                               function(x, y) na.omit(match(x, y)),
                                rownames(y))
   checkIdentical(gset.idx.list, fast.gset.idx.list)
 
... ...
@@ -1,5 +1,5 @@
 ---
-title: "GSVA: gene set variation analysis for bulk expression data"
+title: "GSVA: gene set variation analysis for molecular profiling data"
 author:
 - name: Robert Castelo
   affiliation:
... ...
@@ -18,7 +18,9 @@ abstract: >
   _GSVA_. These methods transform an input gene-by-sample expression data matrix
   into a gene-set-by-sample expression data matrix. Thereby enabling the
   estimation of pathway activity for each sample and facilitating pathway-centric
-  analyses of gene expression data. In this vignette we illustrate how to use
+  analyses of gene expression data. While this methodology was initially developed
+  for gene expression data, it is readily aplicable to other types of molecular
+  profiling data. In this vignette we illustrate how to use
   the GSVA package with bulk microarray and RNA-seq expression data.
 date: "`r BiocStyle::doc_date()`"
 package: "`r pkg_ver('GSVA')`"
... ...
@@ -73,7 +75,7 @@ X[1:5, 1:5]
 ```
 
 Given a collection of gene sets stored, for instance, in a `list` object such as
-this one with gene sampled uniformly at random without replacement into the gene sets:
+this one with genes sampled uniformly at random without replacement into the gene sets:
 
 ```{r}
 ## sample gene set sizes
... ...
@@ -93,7 +95,7 @@ gsva.es[1:5, 1:5]
 ```
 
 So, the first argument to the `gsva()` function is the gene expression data matrix
-and the second the collection of gene sets. The `gsva()` function can take the
+and the second the collection of gene sets. The `gsva()` function can take the input
 expression data and gene sets using different specialized containers that facilitate
 the access and manipulation of molecular and phenotype data, as well as their associated
 metadata. Another advanced features include the use of on-disk and parallel backends to
... ...
@@ -144,10 +146,10 @@ methods:
   the package and similarly to ssGSEA, is a non-parametric method that
   uses the empirical CDFs of gene expression ranks inside and outside the gene
   set, but it starts by calculating an expression-level statistic that brings
-  gene expression profiles with a different dynamic range to a common scale.
+  gene expression profiles with different dynamic ranges to a common scale.
 
 The interested user may find full technical details about how these methods
-work in their corresponding article cited above. If you use any of them in a
+work in their corresponding articles cited above. If you use any of them in a
 publication, please cite it with the given bibliographic reference.
 
 # Overview of the GSVA functionality
... ...
@@ -159,15 +161,15 @@ the following two input arguments:
    following containers:
    * A `matrix` of expression values with genes corresponding to rows and samples
      corresponding to columns.
-   * An `ExpressionSet` object, see package `r Biocpkg("Biobase")`.
+   * An `ExpressionSet` object; see package `r Biocpkg("Biobase")`.
    * A `SummarizedExperiment` object, see package
      `r Biocpkg("SummarizedExperiment")`.
-2. A collection of gene sets, which can be provided in one of the following
+2. A collection of gene sets; which can be provided in one of the following
    containers:
    * A `list` object where each element corresponds to a gene set defined by a
      vector of gene identifiers, and the element names correspond to the names of
      the gene sets.
-   * A `GeneSetCollection` object, see package `r Biocpkg("GSEABase")`.
+   * A `GeneSetCollection` object; see package `r Biocpkg("GSEABase")`.
 
 One advantage of providing the input data using specialized containers such as
 `ExpressionSet`, `SummarizedExperiment` and `GeneSetCollection` is that the
... ...
@@ -190,14 +192,17 @@ the following filters:
 2. Discard genes in the input gene sets that do not map to a gene in the input
    gene expression data matrix.
 
-3. Discard gene sets that, after applying the previous filter, do not meet a
+3. Discard gene sets that, after applying the previous filters, do not meet a
    minimum and maximum size, which by default is one for the minimum size and
-   has no limit in the maximum size.
+   has no limit for the maximum size.
 
-If as a result of this filter either no genes or gene sets are left, the
+If, as a result of this filter, either no genes or gene sets are left, the
 `gsva()` function will prompt an error. A common cause for an error at this
 stage is that gene identifiers between the expression data matrix and the gene
-sets do not belong to the same standard nomenclature and could not be mapped.
+sets do not belong to the same standard nomenclature and could not be mapped,
+because either the input data were not provided using some of the specialized
+containers described above or the necessary metadata in those containers to
+successfully map gene identifiers is missing.
 
 By default, the `gsva()` function employs the method described by
 @haenzelmann_castelo_guinney_2013 but this can be changed using the argument
... ...
@@ -231,18 +236,19 @@ parameters:
   `abs.ranking=FALSE` and it implies that a modified Kuiper statistic is used
   to calculate enrichment scores, taking the magnitude difference between the
   largest positive and negative random walk deviations. When `abs.ranking=TRUE`
-  the original Kuiper statistic is used, by whih the largest positive and
-  negative random walk devations add added together. In this case, gene sets
+  the original Kuiper statistic is used, by which the largest positive and
+  negative random walk deviations are added together. In this case, gene sets
   with genes enriched on either extreme (high or low) will be regarded as
   highly activated.
 
 * `tau`: Exponent defining the weight of the tail in the random walk. By
   default `tau=1`. When `method="ssgsea"`, this parameter is also used and its
-  default value becomes then `tau=0.25`.
+  default value becomes then `tau=0.25` to match the methodology described in
+  [@barbie_systematic_2009].
 
 In general, the default values for the previous parameters are suitable for
-most analysis settings, which usually consist of normalized continuous
-expression values.
+most analysis settings, which usually consist of some kind of normalized
+continuous expression values.
 
 # Gene sets definitions and mapping to gene identifiers
 
... ...
@@ -259,7 +265,7 @@ neurons and cultured astroglial cells), derived from murine models by
 by using GSVA to transform the gene expression measurements into enrichment
 scores for these four gene sets, without taking the sample subtype grouping
 into account. We start by having a quick glance to the data, which forms part of
-the `r Biocpkg("GSVAdata") package:
+the `r Biocpkg("GSVAdata")` package:
 
 ```{r}
 library(GSVAdata)