Browse code

vignette updates

OganM authored on 24/10/2022 21:15:42
Showing 1 changed files

... ...
@@ -35,6 +35,8 @@ library(dplyr)
35 35
 library(ggplot2)
36 36
 library(ggrepel)
37 37
 library(SummarizedExperiment)
38
+library(pheatmap)
39
+library(viridis)
38 40
 ```
39 41
 
40 42
 ```{r, include = FALSE}
... ...
@@ -71,25 +73,109 @@ if (!requireNamespace("BiocManager", quietly = TRUE)) {
71 73
 BiocManager::install("gemma.R")
72 74
 ```
73 75
 
74
-# Downloading expression data
76
+# Searching for datasets of interest in Gemma
77
+
78
+The package includes various functions to search for datasets fitting desired 
79
+criteria.
80
+
81
+All datasets belonging to a taxon of interest can be accessed by using [`get_taxon_datasets`](https://pavlidislab.github.io/gemma.R/reference/get_taxon_datasets.html)
82
+while the function [`search_datasets`](https://pavlidislab.github.io/gemma.R/reference/search_datasets.html) can be used to further limit the results by
83
+a specified query containing key words, experiment names or ontology term URIs
84
+
85
+```{r}
86
+get_taxon_datasets(taxon = 'human') %>%  # all human datasets 
87
+    select(experiment.ShortName, experiment.Name,taxon.Name)
88
+
89
+search_datasets('bipolar',taxon = 'human') %>% # human datasets mentioning bipolar
90
+    select(experiment.ShortName, experiment.Name,taxon.Name)
91
+
92
+search_datasets('http://purl.obolibrary.org/obo/DOID_3312', # ontology term URI for the bipolar disorder
93
+                taxon = 'human') %>% 
94
+    select(experiment.ShortName, experiment.Name,taxon.Name)
95
+```
96
+
97
+Note that a single call of these functions will only return 20 results by default
98
+and a 100 results maximum, controlled by the `limit` argument. In order to get
99
+all available results, `offset` argument should be used to make multiple calls.
100
+
101
+To see how many available results are there, you can look at the
102
+attributes of the output objects where additional information from the
103
+API response is appended.
104
+
105
+```{r}
106
+# a quick call with a limit of 1 to get the result count
107
+result <- get_taxon_datasets(taxon = 'human',limit = 1) 
108
+print(attributes(result)$totalElements)
109
+```
110
+
111
+Since the maximum limit is a 100 getting all results available will require multiple calls.
75 112
 
76
-The main goal of this wrapper is to enable easy access to Gemma's
77
-curated datasets for downstream analyses or meta-analyses combining
78
-multiple datasets. In this example, we want to find datasets that are
79
-associated with bipolar disorder, and we are only interested in human
80
-data. In addition, we'll subset our results to datasets that have been
81
-batch corrected.
82
-
83
-```{r search}
84
-search_gemma('bipolar') %>%
85
-    filter(geeq.batchCorrected == TRUE) %>%
86
-    select(experiment.ShortName, experiment.Name, experiment.ID, experiment.Accession, experiment.SampleCount)
113
+```{r,eval=FALSE}
114
+count = attributes(result)$totalElements
115
+all_results <- lapply(seq(0, count, 100), function(offset){
116
+    get_taxon_datasets(taxon = 'human',limit = 100, offset = offset)
117
+}) %>% do.call(rbind,.) %>% 
118
+    select(experiment.ShortName, experiment.Name,taxon.Name) %>% 
119
+    head()
120
+```
87 121
 ```
122
+   experiment.ShortName
123
+1:              GSE2018
124
+2:              GSE4036
125
+3:              GSE3489
126
+4:              GSE1923
127
+5:               GSE361
128
+6:               GSE492
129
+                                                                          experiment.Name
130
+1:                                                            Human Lung Transplant - BAL
131
+2:                                                                perro-affy-human-186940
132
+3: Patterns of gene dysregulation in the frontal cortex of patients with HIV encephalitis
133
+4: Identification of PDGF-dependent patterns of gene expression in U87 glioblastoma cells
134
+5:                                                   Mammary epithelial cell transduction
135
+6:                               Effect of prostaglandin analogs on aqueous humor outflow
136
+   taxon.Name
137
+1:      human
138
+2:      human
139
+3:      human
140
+4:      human
141
+5:      human
142
+6:      human
143
+```
144
+See [larger queries](#larger-queries) section for more details. To keep this vignette
145
+simpler we will keep using the first 20 results returned by default in examples below.
88 146
 
89
-We are left with two datasets. For simplicity, we'll pick
90
-[GSE46416](https://gemma.msl.ubc.ca/expressionExperiment/showExpressionExperiment.html?id=8997)
91
-since it has the smaller number of samples. Now that we have the ID for
92
-our experiment, we can fetch the data associated with it.
147
+Information provided about the datasets by these functions include details about
148
+the quality and design of the study that can be used to judge if it is suitable for
149
+your use case. For instance `geeq.batchEffect` column will be set to -1 if Gemma's
150
+preprocessing has detected batch effects that were unable to be resolved by batch
151
+correction and `experiment.SampleCount`
152
+will include the number of samples used in the experiment. More information about
153
+these and other fields can be found at the function documentation.
154
+
155
+```{r}
156
+get_taxon_datasets(taxon = 'human') %>% # get human datasets
157
+    filter(geeq.batchEffect !=-1 & experiment.SampleCount > 4) %>% # filter for datasets without batch effects and with a sample count more than 4
158
+    select(experiment.ShortName, experiment.Name,taxon.Name)
159
+```
160
+
161
+Gemma uses multiple ontologies when annotating datasets and using the term URIs instead of
162
+free text to search can lead to more specific results. [`search_annotations`](https://pavlidislab.github.io/gemma.R/reference/search_annotations.html) function
163
+allows searching for annotation terms that might be relevant to your query.
164
+
165
+```{r}
166
+search_annotations('bipolar') %>% 
167
+    head
168
+```
169
+
170
+
171
+
172
+# Downloading expression data
173
+
174
+Upon identifying datasets of interest, more information about specific ones 
175
+can be requested. In this example we will be using GSE46416 which includes samples
176
+taken from healthy donors along with manic/euthymic phase bipolar disorder patients.
177
+
178
+The data associated with specific experiments can be accessed by using [`get_datasets_by_ids`](https://pavlidislab.github.io/gemma.R/reference/get_datasets_by_ids.html)
93 179
 
94 180
 ```{r dataset}
95 181
 get_datasets_by_ids("GSE46416") %>%
... ...
@@ -97,7 +183,7 @@ get_datasets_by_ids("GSE46416") %>%
97 183
 ```
98 184
 
99 185
 To access the expression data in a convenient form, you can use
100
-[`get_dataset_object()`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_object.html).
186
+[`get_dataset_object`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_object.html).
101 187
 It is a high-level wrapper that combines various endpoint calls to
102 188
 return an annotated
103 189
 [`SummarizedExperiment`](https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html)
... ...
@@ -135,23 +221,27 @@ manic
135 221
 
136 222
 Let's check the expression for every sample to make sure they look OK:
137 223
 
138
-```{r boxplot, fig.cap="Gene expression distributions of bipolar patients during manic phase and controls."}
224
+```{r boxplot, fig.cap="Sample to sample correlations of bipolar patients during manic phase and controls."}
139 225
 # Get Expression matrix
140 226
 manicExpr <- assay(manic, "counts")
141
-boxplot(manicExpr, pch = ".", xaxt = "n", xlab = "Sample", ylab = "Expression")
227
+
228
+
229
+manicExpr %>% 
230
+    cor %>% 
231
+    pheatmap(col =viridis(10),border_color = NA,angle_col = 45,fontsize = 7)
142 232
 ```
143 233
 
144 234
 You can also use
145
-[`get_dataset_expression()`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_expression.html)
235
+[`get_dataset_expression`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_expression.html)
146 236
 to only get the expression matrix, and
147
-[`get_dataset_design()`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_design.html)
237
+[`get_dataset_design`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_design.html)
148 238
 to get the experimental design matrix.
149 239
 
150
-## Platform Annotations
240
+# Platform Annotations
151 241
 
152 242
 Expression data in Gemma comes with annotations for the gene each
153 243
 expression profile corresponds to. Using the
154
-[`get_platform_annotations()`](https://pavlidislab.github.io/gemma.R/reference/get_platform_annotations.html)
244
+[`get_platform_annotations`](https://pavlidislab.github.io/gemma.R/reference/get_platform_annotations.html)
155 245
 function, these annotations can be retrieved independently of the
156 246
 expression data, along with additional annotations such as Gene Ontology
157 247
 terms.
... ...
@@ -168,7 +258,7 @@ head(get_platform_annotations('Generic_human'))
168 258
 
169 259
 If you are interested in a particular gene, you can see which platforms
170 260
 include it using
171
-[`get_gene_probes()`](https://pavlidislab.github.io/gemma.R/reference/get_gene_probes.html).
261
+[`get_gene_probes`](https://pavlidislab.github.io/gemma.R/reference/get_gene_probes.html).
172 262
 Note that functions to search gene work best with unambigious
173 263
 identifiers rather than symbols.
174 264
 
... ...
@@ -184,7 +274,7 @@ head(probes[,.SD, .SDcols = !colnames(probes) %in% c('mapping.Description','plat
184 274
 
185 275
 ```
186 276
 
187
-## Differential expression analyses
277
+# Differential expression analyses
188 278
 
189 279
 Gemma contains precomputed differential expression analyses for most of
190 280
 its datasets. Analyses can involve more than one factor, such as "sex"
... ...
@@ -192,20 +282,60 @@ as well as "disease". Some datasets contain more than one analysis to
192 282
 account for different factors and their interactions. The results are
193 283
 stored as resultSets, each corresponding to one factor (or their
194 284
 interaction). You can access them using
195
-[`get_differential_expression_values()`](https://pavlidislab.github.io/gemma.R/reference/get_differential_expression_values.html).
285
+[`get_differential_expression_values`](https://pavlidislab.github.io/gemma.R/reference/get_differential_expression_values.html).
196 286
 From here on, we can explore and visualize the data to find the most
197 287
 differentially-expressed genes
198 288
 
199
-Note that `get_differential_expression_values()` has the argument
200
-`readableContrasts`. By default this is set to `FALSE`, resulting in the
201
-output columns being named based on the contrast IDs rather the more
202
-human readable columns used below. These contrast IDs can be used
203
-together with
204
-[`get_dataset_differential_expression_analyses()`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_differential_expression_analyses.html)
205
-during a systematic analysis.
289
+Note that `get_differential_expression_values` can return multiple differentials
290
+per study if a study has multiple factors to contrast. Since GSE46416 only has one
291
+extracting the first element of the returned list is all we need.
292
+```{r}
293
+dif_exp <- get_differential_expression_values('GSE46416')
294
+(dif_exp[[1]])
295
+```
296
+
297
+By default the columns names of the output correspond to contrast IDs. To see what
298
+conditions these IDs correspond to we can either use `get_dataset_differential_expression_analyses`
299
+to get the metadata about differentials of a given dataset, or set`readableContrasts` argument
300
+of `get_differential_expression_values` to `TRUE`. The former approach is usually better for
301
+a large scale systematic analysis while the latter is easier to read in an interactive session.
302
+
303
+`get_dataset_differential_expression_analyses` function returns structured metadata
304
+about the differentials.
305
+
306
+```{r}
307
+(contrasts <- get_dataset_differential_expression_analyses('GSE46416'))
308
+```
309
+
310
+`contrast.id` column corresponds to the column names in the
311
+output of `get_differential_expression_values` while `result.ID` corresponds to the
312
+name of the differential in the output object. Using them together will let one to access
313
+differentially expressed gene counts for each condition contrast
314
+
315
+```{r}
316
+# using result.ID and contrast.ID of the output above, we can access specific
317
+# results. Note that one study may have multiple contrast objects
318
+seq_len(nrow(contrasts)) %>% sapply(function(i){
319
+    result_set = dif_exp[[as.character(contrasts[i,]$result.ID)]]
320
+    p_values = result_set[[glue::glue("contrast_{contrasts[i,]$contrast.id}_pvalue")]]
321
+    
322
+    # multiple testing correction
323
+    sum(p.adjust(p_values,method = 'BH') < 0.05)
324
+}) -> dif_exp_genes
325
+
326
+data.frame(result.ID = contrasts$result.ID,
327
+           contrast.id = contrasts$contrast.id,
328
+           baseline.factorValue = contrasts$baseline.factorValue,
329
+           experimental.factorValue = contrasts$experimental.factorValue,
330
+           n_diff = dif_exp_genes)
331
+```
332
+
333
+Alternatively we, since we are only looking at one dataset and one contrast manually,
334
+we can simply use `readableContrasts`.
335
+
206 336
 
207 337
 ```{r diffExpr, fig.cap="Differentially-expressed genes in bipolar patients during manic phase versus controls.", fig.wide=TRUE, warning = FALSE}
208
-de <- get_differential_expression_values("GSE46416",readableContrasts = TRUE)[[1]]
338
+(de <- get_differential_expression_values("GSE46416",readableContrasts = TRUE)[[1]])
209 339
 
210 340
 # Classify probes for plotting
211 341
 de$diffexpr <- "No"
... ...
@@ -251,39 +381,7 @@ ggplot(
251 381
     theme_minimal()
252 382
 ```
253 383
 
254
-Setting `readableContrasts=FALSE` (default) will allow easier
255
-programmatic access to differentials. For instance, lets get the number of
256
-differentially expressed genes for each contrast returned by [`get_dataset_differential_expression_analyses()`](https://pavlidislab.github.io/gemma.R/reference/get_dataset_differential_expression_analyses.html) for a study
257
-
258
-```{r}
259
-# return all contrasts for the dataset
260
-(contrasts <- get_dataset_differential_expression_analyses('GSE46416'))
261
-
262
-dif_exp_values = get_differential_expression_values('GSE46416')
263
-
264
-
265
-# using result.ID and contrast.ID of the output above, we can access specific
266
-# results. Note that one study may have multiple contrast objects
267
-seq_len(nrow(contrasts)) %>% sapply(function(i){
268
-    result_set = dif_exp_values[[as.character(contrasts[i,]$result.ID)]]
269
-    p_values = result_set[[glue::glue("contrast_{contrasts[i,]$contrast.id}_pvalue")]]
270
-    
271
-    # multiple testing correction
272
-    sum(p.adjust(p_values,method = 'BH') < 0.05)
273
-}) -> dif_exp_genes
274
-
275
-dif_exp_genes
276
-
277
-```
278
-## Larger queries
279
-
280
-Some endpoints accept multiple identifiers in a single function call.
281
-For example, getting information on 2 datasets at the same time.
282
-
283
-```{r triple-query}
284
-get_datasets_by_ids(datasets = c("GSE35974", "GSE46416")) %>%
285
-    select(experiment.ShortName, experiment.Name, experiment.ID, experiment.Accession, experiment.SampleCount, taxon.Name)
286
-```
384
+# Larger queries
287 385
 
288 386
 To query large amounts of data, the API has a pagination system which
289 387
 uses the `limit` and `offset` parameters. To avoid overloading the
... ...
@@ -373,7 +471,7 @@ different directory, use `options(gemma.cache = 'path')`.
373 471
 If you're done with your fetching and want to ensure no space is being
374 472
 used for cached results, or if you just want to ensure you're getting
375 473
 up-to-date data from Gemma, you can clear the cache using
376
-[`forget_gemma_memoised()`](https://pavlidislab.github.io/gemma.R/reference/forget_gemma_memoised.html).
474
+[`forget_gemma_memoised`](https://pavlidislab.github.io/gemma.R/reference/forget_gemma_memoised.html).
377 475
 
378 476
 ## Changing defaults
379 477