...
|
...
|
@@ -35,6 +35,8 @@ library(dplyr)
|
35
|
35
|
library(ggplot2)
|
36
|
36
|
library(ggrepel)
|
37
|
37
|
library(SummarizedExperiment)
|
|
38
|
+library(pheatmap)
|
|
39
|
+library(viridis)
|
38
|
40
|
```
|
39
|
41
|
|
40
|
42
|
```{r, include = FALSE}
|
...
|
...
|
@@ -71,25 +73,109 @@ if (!requireNamespace("BiocManager", quietly = TRUE)) {
|
71
|
73
|
BiocManager::install("gemma.R")
|
72
|
74
|
```
|
73
|
75
|
|
74
|
|
-# Downloading expression data
|
|
76
|
+# Searching for datasets of interest in Gemma
|
|
77
|
+
|
|
78
|
+The package includes various functions to search for datasets fitting desired
|
|
79
|
+criteria.
|
|
80
|
+
|
|
81
|
+All datasets belonging to a taxon of interest can be accessed by using [`get_taxon_datasets`](https://pavlidislab.github.io/gemma.R/reference/get_taxon_datasets.html)
|
|
82
|
+while the function [`search_datasets`](https://pavlidislab.github.io/gemma.R/reference/search_datasets.html) can be used to further limit the results by
|
|
83
|
+a specified query containing key words, experiment names or ontology term URIs
|
|
84
|
+
|
|
85
|
+```{r}
|
|
86
|
+get_taxon_datasets(taxon = 'human') %>% # all human datasets
|
|
87
|
+ select(experiment.ShortName, experiment.Name,taxon.Name)
|
|
88
|
+
|
|
89
|
+search_datasets('bipolar',taxon = 'human') %>% # human datasets mentioning bipolar
|
|
90
|
+ select(experiment.ShortName, experiment.Name,taxon.Name)
|
|
91
|
+
|
|
92
|
+search_datasets('http://purl.obolibrary.org/obo/DOID_3312', # ontology term URI for the bipolar disorder
|
|
93
|
+ taxon = 'human') %>%
|
|
94
|
+ select(experiment.ShortName, experiment.Name,taxon.Name)
|
|
95
|
+```
|
|
96
|
+
|
|
97
|
+Note that a single call of these functions will only return 20 results by default
|
|
98
|
+and a 100 results maximum, controlled by the `limit` argument. In order to get
|
|
99
|
+all available results, `offset` argument should be used to make multiple calls.
|
|
100
|
+
|
|
101
|
+To see how many available results are there, you can look at the
|
|
102
|
+attributes of the output objects where additional information from the
|
|
103
|
+API response is appended.
|
|
104
|
+
|
|
105
|
+```{r}
|
|
106
|
+# a quick call with a limit of 1 to get the result count
|
|
107
|
+result <- get_taxon_datasets(taxon = 'human',limit = 1)
|
|
108
|
+print(attributes(result)$totalElements)
|
|
109
|
+```
|
|
110
|
+
|
|
111
|
+Since the maximum limit is a 100 getting all results available will require multiple calls.
|
75
|
112
|
|
76
|
|
-The main goal of this wrapper is to enable easy access to Gemma's
|
77
|
|
-curated datasets for downstream analyses or meta-analyses combining
|
78
|
|
-multiple datasets. In this example, we want to find datasets that are
|
79
|
|
-associated with bipolar disorder, and we are only interested in human
|
80
|
|
-data. In addition, we'll subset our results to datasets that have been
|
81
|
|
-batch corrected.
|
82
|
|
-
|
83
|
|
-```{r search}
|
84
|
|
-search_gemma('bipolar') %>%
|
85
|
|
- filter(geeq.batchCorrected == TRUE) %>%
|
86
|
|
- select(experiment.ShortName, experiment.Name, experiment.ID, experiment.Accession, experiment.SampleCount)
|
|
113
|
+```{r,eval=FALSE}
|
|
114
|
+count = attributes(result)$totalElements
|
|
115
|
+all_results <- lapply(seq(0, count, 100), function(offset){
|
|
116
|
+ get_taxon_datasets(taxon = 'human',limit = 100, offset = offset)
|
|
117
|
+}) %>% do.call(rbind,.) %>%
|
|
118
|
+ select(experiment.ShortName, experiment.Name,taxon.Name) %>%
|
|
119
|
+ head()
|
|
120
|
+```
|
87
|
121
|
```
|
|
122
|
+ experiment.ShortName
|
|
123
|
+1: GSE2018
|
|
124
|
+2: GSE4036
|
|
125
|
+3: GSE3489
|
|
126
|
+4: GSE1923
|
|
127
|
+5: GSE361
|
|
128
|
+6: GSE492
|
|
129
|
+ experiment.Name
|
|
130
|
+1: Human Lung Transplant - BAL
|
|
131
|
+2: perro-affy-human-186940
|
|
132
|
+3: Patterns of gene dysregulation in the frontal cortex of patients with HIV encephalitis
|
|
133
|
+4: Identification of PDGF-dependent patterns of gene expression in U87 glioblastoma cells
|
|
134
|
+5: Mammary epithelial cell transduction
|
|
135
|
+6: Effect of prostaglandin analogs on aqueous humor outflow
|
|
136
|
+ taxon.Name
|
|
137
|
+1: human
|
|
138
|
+2: human
|
|
139
|
+3: human
|
|
140
|
+4: human
|
|
141
|
+5: human
|
|
142
|
+6: human
|
|
143
|
+```
|
|
144
|
+See [larger queries](#larger-queries) section for more details. To keep this vignette
|
|
145
|
+simpler we will keep using the first 20 results returned by default in examples below.
|
88
|
146
|
|
89
|
|
-We are left with two datasets. For simplicity, we'll pick
|
90
|
|
-[GSE46416](https://gemma.msl.ubc.ca/expressionExperiment/showExpressionExperiment.html?id=8997)
|
91
|
|
-since it has the smaller number of samples. Now that we have the ID for
|
92
|
|
-our experiment, we can fetch the data associated with it.
|
|
147
|
+Information provided about the datasets by these functions include details about
|
|
148
|
+the quality and design of the study that can be used to judge if it is suitable for
|
|
149
|
+your use case. For instance `geeq.batchEffect` column will be set to -1 if Gemma's
|
|
150
|
+preprocessing has detected batch effects that were unable to be resolved by batch
|
|
151
|
+correction and `experiment.SampleCount`
|
|
152
|
+will include the number of samples used in the experiment. More information about
|
|
153
|
+these and other fields can be found at the function documentation.
|
|
154
|
+
|
|
155
|
+```{r}
|
|
156
|
+get_taxon_datasets(taxon = 'human') %>% # get human datasets
|
|
157
|
+ filter(geeq.batchEffect !=-1 & experiment.SampleCount > 4) %>% # filter for datasets without batch effects and with a sample count more than 4
|
|
158
|
+ select(experiment.ShortName, experiment.Name,taxon.Name)
|
|
159
|
+```
|
|
160
|
+
|
|
161
|
+Gemma uses multiple ontologies when annotating datasets and using the term URIs instead of
|
|
162
|
+free text to search can lead to more specific results. [`search_annotations`](https://pavlidislab.github.io/gemma.R/reference/search_annotations.html) function
|
|
163
|
+allows searching for annotation terms that might be relevant to your query.
|
|
164
|
+
|
|
165
|
+```{r}
|
|
166
|
+search_annotations('bipolar') %>%
|
|
167
|
+ head
|
|
168
|
+```
|
|
169
|
+
|
|
170
|
+
|
|
171
|
+
|
|
172
|
+# Downloading expression data
|
|
173
|
+
|
|
174
|
+Upon identifying datasets of interest, more information about specific ones
|
|
175
|
+can be requested. In this example we will be using GSE46416 which includes samples
|
|
176
|
+taken from healthy donors along with manic/euthymic phase bipolar disorder patients.
|
|
177
|
+
|
|
178
|
+The data associated with specific experiments can be accessed by using [`get_datasets_by_ids`](https://pavlidislab.github.io/gemma.R/reference/get_datasets_by_ids.html)
|
93
|
179
|
|
94
|
180
|
```{r dataset}
|
95
|
181
|
get_datasets_by_ids("GSE46416") %>%
|
...
|
...
|
@@ -97,7 +183,7 @@ get_datasets_by_ids("GSE46416") %>%
|
97
|
183
|
```
|
98
|
184
|
|
99
|
185
|
To access the expression data in a convenient form, you can use
|
100
|
|
-[`get_dataset_object()`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_object.html).
|
|
186
|
+[`get_dataset_object`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_object.html).
|
101
|
187
|
It is a high-level wrapper that combines various endpoint calls to
|
102
|
188
|
return an annotated
|
103
|
189
|
[`SummarizedExperiment`](https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html)
|
...
|
...
|
@@ -135,23 +221,27 @@ manic
|
135
|
221
|
|
136
|
222
|
Let's check the expression for every sample to make sure they look OK:
|
137
|
223
|
|
138
|
|
-```{r boxplot, fig.cap="Gene expression distributions of bipolar patients during manic phase and controls."}
|
|
224
|
+```{r boxplot, fig.cap="Sample to sample correlations of bipolar patients during manic phase and controls."}
|
139
|
225
|
# Get Expression matrix
|
140
|
226
|
manicExpr <- assay(manic, "counts")
|
141
|
|
-boxplot(manicExpr, pch = ".", xaxt = "n", xlab = "Sample", ylab = "Expression")
|
|
227
|
+
|
|
228
|
+
|
|
229
|
+manicExpr %>%
|
|
230
|
+ cor %>%
|
|
231
|
+ pheatmap(col =viridis(10),border_color = NA,angle_col = 45,fontsize = 7)
|
142
|
232
|
```
|
143
|
233
|
|
144
|
234
|
You can also use
|
145
|
|
-[`get_dataset_expression()`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_expression.html)
|
|
235
|
+[`get_dataset_expression`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_expression.html)
|
146
|
236
|
to only get the expression matrix, and
|
147
|
|
-[`get_dataset_design()`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_design.html)
|
|
237
|
+[`get_dataset_design`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_design.html)
|
148
|
238
|
to get the experimental design matrix.
|
149
|
239
|
|
150
|
|
-## Platform Annotations
|
|
240
|
+# Platform Annotations
|
151
|
241
|
|
152
|
242
|
Expression data in Gemma comes with annotations for the gene each
|
153
|
243
|
expression profile corresponds to. Using the
|
154
|
|
-[`get_platform_annotations()`](https://pavlidislab.github.io/gemma.R/reference/get_platform_annotations.html)
|
|
244
|
+[`get_platform_annotations`](https://pavlidislab.github.io/gemma.R/reference/get_platform_annotations.html)
|
155
|
245
|
function, these annotations can be retrieved independently of the
|
156
|
246
|
expression data, along with additional annotations such as Gene Ontology
|
157
|
247
|
terms.
|
...
|
...
|
@@ -168,7 +258,7 @@ head(get_platform_annotations('Generic_human'))
|
168
|
258
|
|
169
|
259
|
If you are interested in a particular gene, you can see which platforms
|
170
|
260
|
include it using
|
171
|
|
-[`get_gene_probes()`](https://pavlidislab.github.io/gemma.R/reference/get_gene_probes.html).
|
|
261
|
+[`get_gene_probes`](https://pavlidislab.github.io/gemma.R/reference/get_gene_probes.html).
|
172
|
262
|
Note that functions to search gene work best with unambigious
|
173
|
263
|
identifiers rather than symbols.
|
174
|
264
|
|
...
|
...
|
@@ -184,7 +274,7 @@ head(probes[,.SD, .SDcols = !colnames(probes) %in% c('mapping.Description','plat
|
184
|
274
|
|
185
|
275
|
```
|
186
|
276
|
|
187
|
|
-## Differential expression analyses
|
|
277
|
+# Differential expression analyses
|
188
|
278
|
|
189
|
279
|
Gemma contains precomputed differential expression analyses for most of
|
190
|
280
|
its datasets. Analyses can involve more than one factor, such as "sex"
|
...
|
...
|
@@ -192,20 +282,60 @@ as well as "disease". Some datasets contain more than one analysis to
|
192
|
282
|
account for different factors and their interactions. The results are
|
193
|
283
|
stored as resultSets, each corresponding to one factor (or their
|
194
|
284
|
interaction). You can access them using
|
195
|
|
-[`get_differential_expression_values()`](https://pavlidislab.github.io/gemma.R/reference/get_differential_expression_values.html).
|
|
285
|
+[`get_differential_expression_values`](https://pavlidislab.github.io/gemma.R/reference/get_differential_expression_values.html).
|
196
|
286
|
From here on, we can explore and visualize the data to find the most
|
197
|
287
|
differentially-expressed genes
|
198
|
288
|
|
199
|
|
-Note that `get_differential_expression_values()` has the argument
|
200
|
|
-`readableContrasts`. By default this is set to `FALSE`, resulting in the
|
201
|
|
-output columns being named based on the contrast IDs rather the more
|
202
|
|
-human readable columns used below. These contrast IDs can be used
|
203
|
|
-together with
|
204
|
|
-[`get_dataset_differential_expression_analyses()`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_differential_expression_analyses.html)
|
205
|
|
-during a systematic analysis.
|
|
289
|
+Note that `get_differential_expression_values` can return multiple differentials
|
|
290
|
+per study if a study has multiple factors to contrast. Since GSE46416 only has one
|
|
291
|
+extracting the first element of the returned list is all we need.
|
|
292
|
+```{r}
|
|
293
|
+dif_exp <- get_differential_expression_values('GSE46416')
|
|
294
|
+(dif_exp[[1]])
|
|
295
|
+```
|
|
296
|
+
|
|
297
|
+By default the columns names of the output correspond to contrast IDs. To see what
|
|
298
|
+conditions these IDs correspond to we can either use `get_dataset_differential_expression_analyses`
|
|
299
|
+to get the metadata about differentials of a given dataset, or set`readableContrasts` argument
|
|
300
|
+of `get_differential_expression_values` to `TRUE`. The former approach is usually better for
|
|
301
|
+a large scale systematic analysis while the latter is easier to read in an interactive session.
|
|
302
|
+
|
|
303
|
+`get_dataset_differential_expression_analyses` function returns structured metadata
|
|
304
|
+about the differentials.
|
|
305
|
+
|
|
306
|
+```{r}
|
|
307
|
+(contrasts <- get_dataset_differential_expression_analyses('GSE46416'))
|
|
308
|
+```
|
|
309
|
+
|
|
310
|
+`contrast.id` column corresponds to the column names in the
|
|
311
|
+output of `get_differential_expression_values` while `result.ID` corresponds to the
|
|
312
|
+name of the differential in the output object. Using them together will let one to access
|
|
313
|
+differentially expressed gene counts for each condition contrast
|
|
314
|
+
|
|
315
|
+```{r}
|
|
316
|
+# using result.ID and contrast.ID of the output above, we can access specific
|
|
317
|
+# results. Note that one study may have multiple contrast objects
|
|
318
|
+seq_len(nrow(contrasts)) %>% sapply(function(i){
|
|
319
|
+ result_set = dif_exp[[as.character(contrasts[i,]$result.ID)]]
|
|
320
|
+ p_values = result_set[[glue::glue("contrast_{contrasts[i,]$contrast.id}_pvalue")]]
|
|
321
|
+
|
|
322
|
+ # multiple testing correction
|
|
323
|
+ sum(p.adjust(p_values,method = 'BH') < 0.05)
|
|
324
|
+}) -> dif_exp_genes
|
|
325
|
+
|
|
326
|
+data.frame(result.ID = contrasts$result.ID,
|
|
327
|
+ contrast.id = contrasts$contrast.id,
|
|
328
|
+ baseline.factorValue = contrasts$baseline.factorValue,
|
|
329
|
+ experimental.factorValue = contrasts$experimental.factorValue,
|
|
330
|
+ n_diff = dif_exp_genes)
|
|
331
|
+```
|
|
332
|
+
|
|
333
|
+Alternatively we, since we are only looking at one dataset and one contrast manually,
|
|
334
|
+we can simply use `readableContrasts`.
|
|
335
|
+
|
206
|
336
|
|
207
|
337
|
```{r diffExpr, fig.cap="Differentially-expressed genes in bipolar patients during manic phase versus controls.", fig.wide=TRUE, warning = FALSE}
|
208
|
|
-de <- get_differential_expression_values("GSE46416",readableContrasts = TRUE)[[1]]
|
|
338
|
+(de <- get_differential_expression_values("GSE46416",readableContrasts = TRUE)[[1]])
|
209
|
339
|
|
210
|
340
|
# Classify probes for plotting
|
211
|
341
|
de$diffexpr <- "No"
|
...
|
...
|
@@ -251,39 +381,7 @@ ggplot(
|
251
|
381
|
theme_minimal()
|
252
|
382
|
```
|
253
|
383
|
|
254
|
|
-Setting `readableContrasts=FALSE` (default) will allow easier
|
255
|
|
-programmatic access to differentials. For instance, lets get the number of
|
256
|
|
-differentially expressed genes for each contrast returned by [`get_dataset_differential_expression_analyses()`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_differential_expression_analyses.html) for a study
|
257
|
|
-
|
258
|
|
-```{r}
|
259
|
|
-# return all contrasts for the dataset
|
260
|
|
-(contrasts <- get_dataset_differential_expression_analyses('GSE46416'))
|
261
|
|
-
|
262
|
|
-dif_exp_values = get_differential_expression_values('GSE46416')
|
263
|
|
-
|
264
|
|
-
|
265
|
|
-# using result.ID and contrast.ID of the output above, we can access specific
|
266
|
|
-# results. Note that one study may have multiple contrast objects
|
267
|
|
-seq_len(nrow(contrasts)) %>% sapply(function(i){
|
268
|
|
- result_set = dif_exp_values[[as.character(contrasts[i,]$result.ID)]]
|
269
|
|
- p_values = result_set[[glue::glue("contrast_{contrasts[i,]$contrast.id}_pvalue")]]
|
270
|
|
-
|
271
|
|
- # multiple testing correction
|
272
|
|
- sum(p.adjust(p_values,method = 'BH') < 0.05)
|
273
|
|
-}) -> dif_exp_genes
|
274
|
|
-
|
275
|
|
-dif_exp_genes
|
276
|
|
-
|
277
|
|
-```
|
278
|
|
-## Larger queries
|
279
|
|
-
|
280
|
|
-Some endpoints accept multiple identifiers in a single function call.
|
281
|
|
-For example, getting information on 2 datasets at the same time.
|
282
|
|
-
|
283
|
|
-```{r triple-query}
|
284
|
|
-get_datasets_by_ids(datasets = c("GSE35974", "GSE46416")) %>%
|
285
|
|
- select(experiment.ShortName, experiment.Name, experiment.ID, experiment.Accession, experiment.SampleCount, taxon.Name)
|
286
|
|
-```
|
|
384
|
+# Larger queries
|
287
|
385
|
|
288
|
386
|
To query large amounts of data, the API has a pagination system which
|
289
|
387
|
uses the `limit` and `offset` parameters. To avoid overloading the
|
...
|
...
|
@@ -373,7 +471,7 @@ different directory, use `options(gemma.cache = 'path')`.
|
373
|
471
|
If you're done with your fetching and want to ensure no space is being
|
374
|
472
|
used for cached results, or if you just want to ensure you're getting
|
375
|
473
|
up-to-date data from Gemma, you can clear the cache using
|
376
|
|
-[`forget_gemma_memoised()`](https://pavlidislab.github.io/gemma.R/reference/forget_gemma_memoised.html).
|
|
474
|
+[`forget_gemma_memoised`](https://pavlidislab.github.io/gemma.R/reference/forget_gemma_memoised.html).
|
377
|
475
|
|
378
|
476
|
## Changing defaults
|
379
|
477
|
|