... | ... |
@@ -250,7 +250,7 @@ In our example, we want to find the genes that change specifically in T cells. F |
250 | 250 |
```{r do_pseudobulk} |
251 | 251 |
# The contrast argument specifies what we want to compare |
252 | 252 |
# We test the expression difference of stimulated and control T-cells |
253 |
-de_res <- test_de(fit, contrast = fact(cell = "CD4 T cells", stim = "ctrl") - fact(cell = "CD4 T cells", stim = "stim")) |
|
253 |
+de_res <- test_de(fit, contrast = cond(cell = "CD4 T cells", stim = "ctrl") - cond(cell = "CD4 T cells", stim = "stim")) |
|
254 | 254 |
|
255 | 255 |
# Most different genes |
256 | 256 |
head(de_res[order(de_res$pval), ]) |
... | ... |
@@ -222,10 +222,17 @@ counts(sce_subset) <- as.matrix(counts(sce_subset)) |
222 | 222 |
sce_subset$cell <- droplevels(sce_subset$cell) |
223 | 223 |
``` |
224 | 224 |
|
225 |
+In the first step we will aggregate the counts of each patient, condition and cell type and form pseudobulk samples. This ensures that I get reliable p-value by treating each patient as a replicate and not each cell. |
|
226 |
+ |
|
227 |
+```{r} |
|
228 |
+sce_reduced <- pseudobulk(sce_subset, group_by = vars(ind, stim, cell)) |
|
229 |
+``` |
|
230 |
+ |
|
231 |
+ |
|
225 | 232 |
We will identify which genes in CD4 positive T-cells are changed most by the treatment. We will fit a full model including the interaction term `stim:cell`. The interaction term will help us identify cell type specific responses to the treatment: |
226 | 233 |
|
227 | 234 |
```{r kang_fit} |
228 |
-fit <- glm_gp(sce_subset, design = ~ cell + stim + stim:cell - 1, |
|
235 |
+fit <- glm_gp(sce_reduced, design = ~ cell + stim + stim:cell - 1, |
|
229 | 236 |
reference_level = "NK cells") |
230 | 237 |
summary(fit) |
231 | 238 |
``` |
... | ... |
@@ -238,26 +245,18 @@ colnames(fit$Beta) |
238 | 245 |
``` |
239 | 246 |
|
240 | 247 |
|
241 |
-In our example, we want to find the genes that change specifically in T cells. Finding cell type specific responses to a treatment is a big advantage of single cell data over bulk data. To get a proper estimate of the uncertainty (cells from the same donor are **not** independent replicates), we create a pseudobulk for each sample: |
|
248 |
+In our example, we want to find the genes that change specifically in T cells. Finding cell type specific responses to a treatment is a big advantage of single cell data over bulk data. |
|
242 | 249 |
|
243 | 250 |
```{r do_pseudobulk} |
244 | 251 |
# The contrast argument specifies what we want to compare |
245 | 252 |
# We test the expression difference of stimulated and control T-cells |
246 |
-# |
|
247 |
-# There is no sample label in the colData, so we create it on the fly |
|
248 |
-# from `stim` and `ind` columns in colData(fit$data). |
|
249 |
-de_res <- test_de(fit, contrast = `stimstim` + `cellCD4 T cells:stimstim`, |
|
250 |
- pseudobulk_by = paste0(stim, "-", ind)) |
|
251 |
- |
|
252 |
-# The large `lfc` values come from groups were nearly all counts are 0 |
|
253 |
-# Setting them to Inf makes the plots look nicer |
|
254 |
-de_res$lfc <- ifelse(abs(de_res$lfc) > 20, sign(de_res$lfc) * Inf, de_res$lfc) |
|
253 |
+de_res <- test_de(fit, contrast = fact(cell = "CD4 T cells", stim = "ctrl") - fact(cell = "CD4 T cells", stim = "stim")) |
|
255 | 254 |
|
256 | 255 |
# Most different genes |
257 | 256 |
head(de_res[order(de_res$pval), ]) |
258 | 257 |
``` |
259 | 258 |
|
260 |
-The test is successful and we identify interesting genes that are differentially expressed in interferon-stimulated T cells: _IFI6_, _IFIT3_, and _IRF7_ literally stand for _Interferon Induced/Regulated Protein_. |
|
259 |
+The test is successful and we identify interesting genes that are differentially expressed in interferon-stimulated T cells: _IFI6_, _IFIT3_ and _ISG15_ literally stand for _Interferon Induced/Stimulated Protein_. |
|
261 | 260 |
|
262 | 261 |
To get a more complete overview of the results, we can make a volcano plot that compares the log2-fold change (LFC) vs the logarithmized p-values. |
263 | 262 |
|
... | ... |
@@ -271,17 +270,19 @@ ggplot(de_res, aes(x = lfc, y = -log10(pval))) + |
271 | 270 |
|
272 | 271 |
Another important task in single cell data analysis is the identification of marker genes for cell clusters. For this we can also use our Gamma-Poisson fit. |
273 | 272 |
|
274 |
-Let's assume we want to find genes that differ between T cells and the B cells. We can directly compare the corresponding coefficients and find genes that differ in the control condition: |
|
273 |
+Let's assume we want to find genes that differ between T cells and the B cells. We can directly compare the corresponding coefficients and find genes that differ in the control condition (this time not accounting for the pseudo-replication structure): |
|
275 | 274 |
|
276 | 275 |
```{r find_marker_genes} |
277 |
-marker_genes <- test_de(fit, `cellCD4 T cells` - `cellB cells`, sort_by = pval) |
|
276 |
+fit_full <- glm_gp(sce_subset, design = ~ cell + stim + stim:cell - 1, |
|
277 |
+ reference_level = "NK cells") |
|
278 |
+marker_genes <- test_de(fit_full, `cellCD4 T cells` - `cellB cells`, sort_by = pval) |
|
278 | 279 |
head(marker_genes) |
279 | 280 |
``` |
280 | 281 |
|
281 | 282 |
If we want find genes that differ in the stimulated condition, we just include the additional coefficients in the contrast: |
282 | 283 |
|
283 | 284 |
```{r do_signif_genes} |
284 |
-marker_genes2 <- test_de(fit, (`cellCD4 T cells` + `cellCD4 T cells:stimstim`) - |
|
285 |
+marker_genes2 <- test_de(fit_full, (`cellCD4 T cells` + `cellCD4 T cells:stimstim`) - |
|
285 | 286 |
(`cellB cells` + `cellB cells:stimstim`), |
286 | 287 |
sort_by = pval) |
287 | 288 |
|
... | ... |
@@ -4,7 +4,7 @@ output: github_document |
4 | 4 |
|
5 | 5 |
<!-- README.md is generated from README.Rmd. Please edit that file --> |
6 | 6 |
|
7 |
-```{r, include = FALSE} |
|
7 |
+```{r init_chunk, include = FALSE} |
|
8 | 8 |
knitr::opts_chunk$set( |
9 | 9 |
collapse = TRUE, |
10 | 10 |
comment = "#>", |
... | ... |
@@ -59,14 +59,14 @@ If you use this package in a scientific publication, please cite: |
59 | 59 |
|
60 | 60 |
Load the glmGamPoi package |
61 | 61 |
|
62 |
-```{r} |
|
62 |
+```{r load_glmGamPoi} |
|
63 | 63 |
library(glmGamPoi) |
64 | 64 |
``` |
65 | 65 |
|
66 | 66 |
|
67 | 67 |
To fit a single Gamma-Poisson GLM do: |
68 | 68 |
|
69 |
-```{r} |
|
69 |
+```{r single_gp_fit} |
|
70 | 70 |
# overdispersion = 1/size |
71 | 71 |
counts <- rnbinom(n = 10, mu = 5, size = 1/0.7) |
72 | 72 |
|
... | ... |
@@ -83,12 +83,12 @@ The `glm_gp()` function returns a list with the results of the fit. Most importa |
83 | 83 |
Fitting repeated Gamma-Poisson GLMs for each gene of a single cell dataset is just as easy: |
84 | 84 |
|
85 | 85 |
I will first load an example dataset using the `TENxPBMCData` package. The dataset has 33,000 genes and 4340 cells. It takes roughly 1.5 minutes to fit the Gamma-Poisson model on the full dataset. For demonstration purposes, I will subset the dataset to 300 genes, but keep the 4340 cells: |
86 |
-```{r, warning=FALSE, message = FALSE} |
|
86 |
+```{r load_additional_packages, warning=FALSE, message = FALSE} |
|
87 | 87 |
library(SummarizedExperiment) |
88 | 88 |
library(DelayedMatrixStats) |
89 | 89 |
``` |
90 | 90 |
|
91 |
-```{r} |
|
91 |
+```{r load_pbmc_data} |
|
92 | 92 |
# The full dataset with 33,000 genes and 4340 cells |
93 | 93 |
# The first time this is run, it will download the data |
94 | 94 |
pbmcs <- TENxPBMCData::TENxPBMCData("pbmc4k") |
... | ... |
@@ -101,7 +101,7 @@ pbmcs_subset |
101 | 101 |
|
102 | 102 |
I call `glm_gp()` to fit one GLM model for each gene and force the calculation to happen in memory. |
103 | 103 |
|
104 |
-```{r} |
|
104 |
+```{r simple_fit} |
|
105 | 105 |
fit <- glm_gp(pbmcs_subset, on_disk = FALSE) |
106 | 106 |
summary(fit) |
107 | 107 |
``` |
... | ... |
@@ -114,7 +114,7 @@ I compare my method (in-memory and on-disk) with `r BiocStyle::Biocpkg("DESeq2") |
114 | 114 |
|
115 | 115 |
|
116 | 116 |
|
117 |
-```{r, warning=FALSE} |
|
117 |
+```{r run_bench_mark, warning=FALSE} |
|
118 | 118 |
# Explicitly realize count matrix in memory so that it is a fair comparison |
119 | 119 |
pbmcs_subset <- as.matrix(assay(pbmcs_subset)) |
120 | 120 |
model_matrix <- matrix(1, nrow = ncol(pbmcs_subset)) |
... | ... |
@@ -145,7 +145,7 @@ On this dataset, `glmGamPoi` is more than 5 times faster than `edgeR` and more t |
145 | 145 |
|
146 | 146 |
To demonstrate that the method does not sacrifice accuracy, I compare the parameters that each method estimates. The means and β coefficients are identical, but that the overdispersion estimates from `glmGamPoi` are more reliable: |
147 | 147 |
|
148 |
-```{r message=FALSE, warning=FALSE} |
|
148 |
+```{r compare_with_deseq2_and_edger, message=FALSE, warning=FALSE} |
|
149 | 149 |
# Results with my method |
150 | 150 |
fit <- glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE) |
151 | 151 |
|
... | ... |
@@ -201,14 +201,14 @@ The method scales linearly, with the number of rows and columns in the dataset. |
201 | 201 |
|
202 | 202 |
`glmGamPoi` provides an interface to do quasi-likelihood ratio testing to identify differentially expressed genes. To demonstrate this feature, we will use the data from [Kang _et al._ (2018)](https://www.ncbi.nlm.nih.gov/pubmed/29227470) provided by the `MuscData` package. This is a single cell dataset of 8 Lupus patients for which 10x droplet-based scRNA-seq was performed before and after treatment with interferon beta. The `SingleCellExperiment` object conveniently provides the patient id (`ind`), treatment status (`stim`) and cell type (`cell`): |
203 | 203 |
|
204 |
-```{r} |
|
204 |
+```{r load_kang_data} |
|
205 | 205 |
sce <- muscData::Kang18_8vs8() |
206 | 206 |
colData(sce) |
207 | 207 |
``` |
208 | 208 |
|
209 | 209 |
For demonstration purpose, I will work on a subset of the genes and cells: |
210 | 210 |
|
211 |
-```{r} |
|
211 |
+```{r subset_kang_data} |
|
212 | 212 |
set.seed(1) |
213 | 213 |
# Take highly expressed genes and proper cells: |
214 | 214 |
sce_subset <- sce[rowSums(counts(sce)) > 100, |
... | ... |
@@ -224,7 +224,7 @@ sce_subset$cell <- droplevels(sce_subset$cell) |
224 | 224 |
|
225 | 225 |
We will identify which genes in CD4 positive T-cells are changed most by the treatment. We will fit a full model including the interaction term `stim:cell`. The interaction term will help us identify cell type specific responses to the treatment: |
226 | 226 |
|
227 |
-```{r} |
|
227 |
+```{r kang_fit} |
|
228 | 228 |
fit <- glm_gp(sce_subset, design = ~ cell + stim + stim:cell - 1, |
229 | 229 |
reference_level = "NK cells") |
230 | 230 |
summary(fit) |
... | ... |
@@ -233,14 +233,14 @@ summary(fit) |
233 | 233 |
|
234 | 234 |
To see how the coefficient of our model are called, we look at the `colnames(fit$Beta)`: |
235 | 235 |
|
236 |
-```{r} |
|
236 |
+```{r show_beta_colnames} |
|
237 | 237 |
colnames(fit$Beta) |
238 | 238 |
``` |
239 | 239 |
|
240 | 240 |
|
241 | 241 |
In our example, we want to find the genes that change specifically in T cells. Finding cell type specific responses to a treatment is a big advantage of single cell data over bulk data. To get a proper estimate of the uncertainty (cells from the same donor are **not** independent replicates), we create a pseudobulk for each sample: |
242 | 242 |
|
243 |
-```{r} |
|
243 |
+```{r do_pseudobulk} |
|
244 | 244 |
# The contrast argument specifies what we want to compare |
245 | 245 |
# We test the expression difference of stimulated and control T-cells |
246 | 246 |
# |
... | ... |
@@ -261,7 +261,7 @@ The test is successful and we identify interesting genes that are differentially |
261 | 261 |
|
262 | 262 |
To get a more complete overview of the results, we can make a volcano plot that compares the log2-fold change (LFC) vs the logarithmized p-values. |
263 | 263 |
|
264 |
-```{r} |
|
264 |
+```{r make_volcano_plot} |
|
265 | 265 |
library(ggplot2) |
266 | 266 |
ggplot(de_res, aes(x = lfc, y = -log10(pval))) + |
267 | 267 |
geom_point(size = 0.6, aes(color = adj_pval < 0.1)) + |
... | ... |
@@ -273,14 +273,14 @@ Another important task in single cell data analysis is the identification of mar |
273 | 273 |
|
274 | 274 |
Let's assume we want to find genes that differ between T cells and the B cells. We can directly compare the corresponding coefficients and find genes that differ in the control condition: |
275 | 275 |
|
276 |
-```{r} |
|
276 |
+```{r find_marker_genes} |
|
277 | 277 |
marker_genes <- test_de(fit, `cellCD4 T cells` - `cellB cells`, sort_by = pval) |
278 | 278 |
head(marker_genes) |
279 | 279 |
``` |
280 | 280 |
|
281 | 281 |
If we want find genes that differ in the stimulated condition, we just include the additional coefficients in the contrast: |
282 | 282 |
|
283 |
-```{r} |
|
283 |
+```{r do_signif_genes} |
|
284 | 284 |
marker_genes2 <- test_de(fit, (`cellCD4 T cells` + `cellCD4 T cells:stimstim`) - |
285 | 285 |
(`cellB cells` + `cellB cells:stimstim`), |
286 | 286 |
sort_by = pval) |
... | ... |
@@ -292,7 +292,7 @@ We identify many genes related to the human leukocyte antigen (HLA) system that |
292 | 292 |
|
293 | 293 |
A note of caution: applying `test_de()` to single cell data without the pseudobulk gives overly optimistic p-values. This is due to the fact that cells from the same sample are not independent replicates! It can still be fine to use the method for identifying marker genes, as long as one is aware of the difficulties interpreting the results. |
294 | 294 |
|
295 |
-```{r} |
|
295 |
+```{r plot_marker_genes} |
|
296 | 296 |
# Create a data.frame with the expression values, gene names, and cell types |
297 | 297 |
tmp <- data.frame(gene = rep(marker_genes$name[1:6], times = ncol(sce_subset)), |
298 | 298 |
expression = c(counts(sce_subset)[marker_genes$name[1:6], ]), |
... | ... |
@@ -316,7 +316,7 @@ This work was supported by the EMBL International PhD Programme and the European |
316 | 316 |
|
317 | 317 |
# Session Info |
318 | 318 |
|
319 |
-```{r} |
|
319 |
+```{r print_sessionInfo} |
|
320 | 320 |
sessionInfo() |
321 | 321 |
``` |
322 | 322 |
|
... | ... |
@@ -307,7 +307,9 @@ ggplot(tmp, aes(x = celltype, y = expression)) + |
307 | 307 |
|
308 | 308 |
|
309 | 309 |
|
310 |
+# Acknowlegments |
|
310 | 311 |
|
312 |
+This work was supported by the EMBL International PhD Programme and the European Research Council Synergy grant DECODE under grant agreement No. 810296. |
|
311 | 313 |
|
312 | 314 |
|
313 | 315 |
|
... | ... |
@@ -53,7 +53,7 @@ If you use this package in a scientific publication, please cite: |
53 | 53 |
|
54 | 54 |
> glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data |
55 | 55 |
> Constantin Ahlmann-Eltze, Wolfgang Huber |
56 |
-> bioRxiv 2020.08.13.249623; doi: https://doi.org/10.1101/2020.08.13.249623 |
|
56 |
+> Bioinformatics; 2020-12-09; doi: https://doi.org/10.1093/bioinformatics/btaa1009 |
|
57 | 57 |
|
58 | 58 |
# Example |
59 | 59 |
|
... | ... |
@@ -245,13 +245,13 @@ In our example, we want to find the genes that change specifically in T cells. F |
245 | 245 |
# We test the expression difference of stimulated and control T-cells |
246 | 246 |
# |
247 | 247 |
# There is no sample label in the colData, so we create it on the fly |
248 |
-# (paste0(stim, "-", ind)) to aggregate the counts by sample |
|
248 |
+# from `stim` and `ind` columns in colData(fit$data). |
|
249 | 249 |
de_res <- test_de(fit, contrast = `stimstim` + `cellCD4 T cells:stimstim`, |
250 | 250 |
pseudobulk_by = paste0(stim, "-", ind)) |
251 | 251 |
|
252 | 252 |
# The large `lfc` values come from groups were nearly all counts are 0 |
253 | 253 |
# Setting them to Inf makes the plots look nicer |
254 |
-de_res$lfc = ifelse(abs(de_res$lfc) > 20, sign(de_res$lfc) * Inf, de_res$lfc) |
|
254 |
+de_res$lfc <- ifelse(abs(de_res$lfc) > 20, sign(de_res$lfc) * Inf, de_res$lfc) |
|
255 | 255 |
|
256 | 256 |
# Most different genes |
257 | 257 |
head(de_res[order(de_res$pval), ]) |
... | ... |
@@ -265,7 +265,7 @@ To get a more complete overview of the results, we can make a volcano plot that |
265 | 265 |
library(ggplot2) |
266 | 266 |
ggplot(de_res, aes(x = lfc, y = -log10(pval))) + |
267 | 267 |
geom_point(size = 0.6, aes(color = adj_pval < 0.1)) + |
268 |
- ggtitle("Volcano Plot", "Genes that change most through\ninterferon-beta treatment in T cells") |
|
268 |
+ ggtitle("Volcano Plot", "Genes that change most through interferon-beta treatment in T cells") |
|
269 | 269 |
``` |
270 | 270 |
|
271 | 271 |
|
... | ... |
@@ -288,7 +288,7 @@ marker_genes2 <- test_de(fit, (`cellCD4 T cells` + `cellCD4 T cells:stimstim`) - |
288 | 288 |
head(marker_genes2) |
289 | 289 |
``` |
290 | 290 |
|
291 |
-Unsurprisingly, we identify many genes related to the human leukocyte antigen (HLA) system that is important for antigen presenting cells like B-cells, but are not expressed by T helper cells. The plot below shows the expression differences. |
|
291 |
+We identify many genes related to the human leukocyte antigen (HLA) system that is important for antigen presenting cells like B-cells, but are not expressed by T helper cells. The plot below shows the expression differences. |
|
292 | 292 |
|
293 | 293 |
A note of caution: applying `test_de()` to single cell data without the pseudobulk gives overly optimistic p-values. This is due to the fact that cells from the same sample are not independent replicates! It can still be fine to use the method for identifying marker genes, as long as one is aware of the difficulties interpreting the results. |
294 | 294 |
|
Improve section on differential expression analysis. Use the Kang 2018
dataset provided by muscData
... | ... |
@@ -199,58 +199,118 @@ The method scales linearly, with the number of rows and columns in the dataset. |
199 | 199 |
|
200 | 200 |
## Differential expression analysis |
201 | 201 |
|
202 |
-`glmGamPoi` provides an interface to do quasi-likelihood ratio testing to identify differentially expressed genes: |
|
202 |
+`glmGamPoi` provides an interface to do quasi-likelihood ratio testing to identify differentially expressed genes. To demonstrate this feature, we will use the data from [Kang _et al._ (2018)](https://www.ncbi.nlm.nih.gov/pubmed/29227470) provided by the `MuscData` package. This is a single cell dataset of 8 Lupus patients for which 10x droplet-based scRNA-seq was performed before and after treatment with interferon beta. The `SingleCellExperiment` object conveniently provides the patient id (`ind`), treatment status (`stim`) and cell type (`cell`): |
|
203 | 203 |
|
204 | 204 |
```{r} |
205 |
-# Create random categorical assignment to demonstrate DE |
|
206 |
-group <- sample(c("Group1", "Group2"), size = ncol(pbmcs_subset), replace = TRUE) |
|
207 |
- |
|
208 |
-# Fit model with group vector as design |
|
209 |
-fit <- glm_gp(pbmcs_subset, design = group) |
|
210 |
-# Compare against model without group |
|
211 |
-res <- test_de(fit, reduced_design = ~ 1) |
|
212 |
-# Look at first 6 genes |
|
213 |
-head(res) |
|
205 |
+sce <- muscData::Kang18_8vs8() |
|
206 |
+colData(sce) |
|
214 | 207 |
``` |
215 | 208 |
|
216 |
-The p-values agree well with the ones that `edgeR` is calculating. This is because `glmGamPoi` uses the same framework of quasi-likelihood ratio tests that was invented by `edgeR` and is described in [Lund et al. (2012)](https://doi.org/10.1515/1544-6115.1826). |
|
209 |
+For demonstration purpose, I will work on a subset of the genes and cells: |
|
217 | 210 |
|
218 |
-```{r fig.height=3, fig.width=3, warning=FALSE, message=FALSE} |
|
219 |
-model_matrix <- model.matrix(~ group, data = data.frame(group = group)) |
|
220 |
-edgeR_data <- edgeR::DGEList(pbmcs_subset) |
|
221 |
-edgeR_data <- edgeR::calcNormFactors(edgeR_data) |
|
222 |
-edgeR_data <- edgeR::estimateDisp(edgeR_data, design = model_matrix) |
|
223 |
-edgeR_fit <- edgeR::glmQLFit(edgeR_data, design = model_matrix) |
|
224 |
-edgeR_test <- edgeR::glmQLFTest(edgeR_fit, coef = 2) |
|
225 |
-edgeR_res <- edgeR::topTags(edgeR_test, sort.by = "none", n = nrow(pbmcs_subset)) |
|
211 |
+```{r} |
|
212 |
+set.seed(1) |
|
213 |
+# Take highly expressed genes and proper cells: |
|
214 |
+sce_subset <- sce[rowSums(counts(sce)) > 100, |
|
215 |
+ sample(which(sce$multiplets == "singlet" & |
|
216 |
+ ! is.na(sce$cell) & |
|
217 |
+ sce$cell %in% c("CD4 T cells", "B cells", "NK cells")), |
|
218 |
+ 1000)] |
|
219 |
+# Convert counts to dense matrix |
|
220 |
+counts(sce_subset) <- as.matrix(counts(sce_subset)) |
|
221 |
+# Remove empty levels because glm_gp() will complain otherwise |
|
222 |
+sce_subset$cell <- droplevels(sce_subset$cell) |
|
226 | 223 |
``` |
227 | 224 |
|
225 |
+We will identify which genes in CD4 positive T-cells are changed most by the treatment. We will fit a full model including the interaction term `stim:cell`. The interaction term will help us identify cell type specific responses to the treatment: |
|
228 | 226 |
|
229 |
-```{r fig.height=4, fig.width=3, warning=FALSE, message=FALSE, echo = FALSE} |
|
230 |
-par(cex.main = 2, cex.lab = 1.5) |
|
231 |
-plot(res$pval, edgeR_res$table$PValue, pch = 16, log = "xy", |
|
232 |
- main = "p-values", xlab = "glmGamPoi", ylab = "edgeR") |
|
233 |
-abline(0,1) |
|
227 |
+```{r} |
|
228 |
+fit <- glm_gp(sce_subset, design = ~ cell + stim + stim:cell - 1, |
|
229 |
+ reference_level = "NK cells") |
|
230 |
+summary(fit) |
|
231 |
+``` |
|
232 |
+ |
|
233 |
+ |
|
234 |
+To see how the coefficient of our model are called, we look at the `colnames(fit$Beta)`: |
|
235 |
+ |
|
236 |
+```{r} |
|
237 |
+colnames(fit$Beta) |
|
238 |
+``` |
|
239 |
+ |
|
240 |
+ |
|
241 |
+In our example, we want to find the genes that change specifically in T cells. Finding cell type specific responses to a treatment is a big advantage of single cell data over bulk data. To get a proper estimate of the uncertainty (cells from the same donor are **not** independent replicates), we create a pseudobulk for each sample: |
|
242 |
+ |
|
243 |
+```{r} |
|
244 |
+# The contrast argument specifies what we want to compare |
|
245 |
+# We test the expression difference of stimulated and control T-cells |
|
246 |
+# |
|
247 |
+# There is no sample label in the colData, so we create it on the fly |
|
248 |
+# (paste0(stim, "-", ind)) to aggregate the counts by sample |
|
249 |
+de_res <- test_de(fit, contrast = `stimstim` + `cellCD4 T cells:stimstim`, |
|
250 |
+ pseudobulk_by = paste0(stim, "-", ind)) |
|
251 |
+ |
|
252 |
+# The large `lfc` values come from groups were nearly all counts are 0 |
|
253 |
+# Setting them to Inf makes the plots look nicer |
|
254 |
+de_res$lfc = ifelse(abs(de_res$lfc) > 20, sign(de_res$lfc) * Inf, de_res$lfc) |
|
255 |
+ |
|
256 |
+# Most different genes |
|
257 |
+head(de_res[order(de_res$pval), ]) |
|
258 |
+``` |
|
259 |
+ |
|
260 |
+The test is successful and we identify interesting genes that are differentially expressed in interferon-stimulated T cells: _IFI6_, _IFIT3_, and _IRF7_ literally stand for _Interferon Induced/Regulated Protein_. |
|
261 |
+ |
|
262 |
+To get a more complete overview of the results, we can make a volcano plot that compares the log2-fold change (LFC) vs the logarithmized p-values. |
|
263 |
+ |
|
264 |
+```{r} |
|
265 |
+library(ggplot2) |
|
266 |
+ggplot(de_res, aes(x = lfc, y = -log10(pval))) + |
|
267 |
+ geom_point(size = 0.6, aes(color = adj_pval < 0.1)) + |
|
268 |
+ ggtitle("Volcano Plot", "Genes that change most through\ninterferon-beta treatment in T cells") |
|
234 | 269 |
``` |
235 | 270 |
|
236 | 271 |
|
237 |
-#### Pseudobulk |
|
272 |
+Another important task in single cell data analysis is the identification of marker genes for cell clusters. For this we can also use our Gamma-Poisson fit. |
|
273 |
+ |
|
274 |
+Let's assume we want to find genes that differ between T cells and the B cells. We can directly compare the corresponding coefficients and find genes that differ in the control condition: |
|
238 | 275 |
|
239 |
-Be very careful how you interpret the p-values of a single cell experiment. Cells that come from one individual are not independent replicates. That means that you cannot turn your RNA-seq experiment with 3 treated and 3 control samples into a 3000 vs 3000 experiment by measuring 1000 cells per sample. The actual unit of replication are still the 3 samples in each condition. |
|
276 |
+```{r} |
|
277 |
+marker_genes <- test_de(fit, `cellCD4 T cells` - `cellB cells`, sort_by = pval) |
|
278 |
+head(marker_genes) |
|
279 |
+``` |
|
240 | 280 |
|
241 |
-Nonetheless, single cell data is valuable because it allows you to compare the effect of a treatment on specific cell types. The simplest way to do such a test is called pseudobulk. This means that the data is subset to the cells of a specific cell type. Then the counts of cells from the same sample are combined to form a "pseudobulk" sample. The `test_de()` function of glmGamPoi supports this feature directly through the `pseudobulk_by` and `subset_to` parameters: |
|
281 |
+If we want find genes that differ in the stimulated condition, we just include the additional coefficients in the contrast: |
|
242 | 282 |
|
243 | 283 |
```{r} |
244 |
-# say we have cell type labels for each cell and know from which sample they come originally |
|
245 |
-sample_labels <- rep(paste0("sample_", 1:6), length = ncol(pbmcs_subset)) |
|
246 |
-cell_type_labels <- sample(c("T-cells", "B-cells", "Macrophages"), ncol(pbmcs_subset), replace = TRUE) |
|
247 |
- |
|
248 |
-test_de(fit, contrast = Group1 - Group2, |
|
249 |
- pseudobulk_by = sample_labels, |
|
250 |
- subset_to = cell_type_labels == "T-cells", |
|
251 |
- n_max = 4, sort_by = pval, decreasing = FALSE) |
|
284 |
+marker_genes2 <- test_de(fit, (`cellCD4 T cells` + `cellCD4 T cells:stimstim`) - |
|
285 |
+ (`cellB cells` + `cellB cells:stimstim`), |
|
286 |
+ sort_by = pval) |
|
287 |
+ |
|
288 |
+head(marker_genes2) |
|
252 | 289 |
``` |
253 | 290 |
|
291 |
+Unsurprisingly, we identify many genes related to the human leukocyte antigen (HLA) system that is important for antigen presenting cells like B-cells, but are not expressed by T helper cells. The plot below shows the expression differences. |
|
292 |
+ |
|
293 |
+A note of caution: applying `test_de()` to single cell data without the pseudobulk gives overly optimistic p-values. This is due to the fact that cells from the same sample are not independent replicates! It can still be fine to use the method for identifying marker genes, as long as one is aware of the difficulties interpreting the results. |
|
294 |
+ |
|
295 |
+```{r} |
|
296 |
+# Create a data.frame with the expression values, gene names, and cell types |
|
297 |
+tmp <- data.frame(gene = rep(marker_genes$name[1:6], times = ncol(sce_subset)), |
|
298 |
+ expression = c(counts(sce_subset)[marker_genes$name[1:6], ]), |
|
299 |
+ celltype = rep(sce_subset$cell, each = 6)) |
|
300 |
+ |
|
301 |
+ggplot(tmp, aes(x = celltype, y = expression)) + |
|
302 |
+ geom_jitter(height = 0.1) + |
|
303 |
+ stat_summary(geom = "crossbar", fun = "mean", color = "red") + |
|
304 |
+ facet_wrap(~ gene, scales = "free_y") + |
|
305 |
+ ggtitle("Marker genes of B vs. T cells") |
|
306 |
+``` |
|
307 |
+ |
|
308 |
+ |
|
309 |
+ |
|
310 |
+ |
|
311 |
+ |
|
312 |
+ |
|
313 |
+ |
|
254 | 314 |
|
255 | 315 |
# Session Info |
256 | 316 |
|
... | ... |
@@ -49,6 +49,11 @@ BiocManager::install("glmGamPoi") |
49 | 49 |
|
50 | 50 |
For the latest developments, see the `r BiocStyle::Githubpkg("const-ae/glmGamPoi", "GitHub")` repo. |
51 | 51 |
|
52 |
+If you use this package in a scientific publication, please cite: |
|
53 |
+ |
|
54 |
+> glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data |
|
55 |
+> Constantin Ahlmann-Eltze, Wolfgang Huber |
|
56 |
+> bioRxiv 2020.08.13.249623; doi: https://doi.org/10.1101/2020.08.13.249623 |
|
52 | 57 |
|
53 | 58 |
# Example |
54 | 59 |
|
... | ... |
@@ -24,12 +24,12 @@ set.seed(2) |
24 | 24 |
|
25 | 25 |
Pronounciation: [`dʒi əl əm ɡam ˈpwɑ`](http://ipa-reader.xyz/?text=d%CA%92i%20%C9%99l%20%C9%99m%20%C9%A1am%20%CB%88pw%C9%91) |
26 | 26 |
|
27 |
-The core design aims of `gmlGamPoi` are: |
|
27 |
+The core design aims of `glmGamPoi` are: |
|
28 | 28 |
|
29 |
-* Fit the Gamma-Poisson models on arbitrarily large or small datasets |
|
29 |
+* Fit Gamma-Poisson models on arbitrarily large or small datasets |
|
30 | 30 |
* Be faster than alternative methods, such as `DESeq2` or `edgeR` |
31 | 31 |
* Calculate exact or approximate results based on user preference |
32 |
-* Support in memory or on-disk data |
|
32 |
+* Support in-memory or on-disk data |
|
33 | 33 |
* Follow established conventions around tools for RNA-seq analysis |
34 | 34 |
* Present a simple user-interface |
35 | 35 |
* Avoid unnecessary dependencies |
... | ... |
@@ -64,6 +64,7 @@ To fit a single Gamma-Poisson GLM do: |
64 | 64 |
```{r} |
65 | 65 |
# overdispersion = 1/size |
66 | 66 |
counts <- rnbinom(n = 10, mu = 5, size = 1/0.7) |
67 |
+ |
|
67 | 68 |
# design = ~ 1 means that an intercept-only model is fit |
68 | 69 |
fit <- glm_gp(counts, design = ~ 1) |
69 | 70 |
fit |
... | ... |
@@ -86,6 +87,7 @@ library(DelayedMatrixStats) |
86 | 87 |
# The full dataset with 33,000 genes and 4340 cells |
87 | 88 |
# The first time this is run, it will download the data |
88 | 89 |
pbmcs <- TENxPBMCData::TENxPBMCData("pbmc4k") |
90 |
+ |
|
89 | 91 |
# I want genes where at least some counts are non-zero |
90 | 92 |
non_empty_rows <- which(rowSums2(assay(pbmcs)) > 0) |
91 | 93 |
pbmcs_subset <- pbmcs[sample(non_empty_rows, 300), ] |
... | ... |
@@ -187,7 +189,7 @@ I am comparing the gene-wise estimates of the coefficients from all three method |
187 | 189 |
|
188 | 190 |
## Scalability |
189 | 191 |
|
190 |
-The method scales linearly, with the number of rows and columns in the dataset. For example: fitting the full `pbmc4k` dataset with subsampling on a modern MacBook Pro in-memory takes ~1 minute and on-disk a little over 4 minutes. Fitting the `pbmc68k` (17x the size) takes ~73 minutes (17x the time) on-disk. Fitting that dataset in-memory is not possible because it is just too big: the maximum in-memory matrix size is `2^31-1 ≈ 2.1e9` is elements, the `pbmc68k` dataset however has nearly 100 million elements more than that. |
|
192 |
+The method scales linearly, with the number of rows and columns in the dataset. For example: fitting the full `pbmc4k` dataset with subsampling on a modern MacBook Pro in-memory takes ~1 minute and on-disk a little over 4 minutes. Fitting the `pbmc68k` (17x the size) takes ~73 minutes (17x the time) on-disk. |
|
191 | 193 |
|
192 | 194 |
|
193 | 195 |
## Differential expression analysis |
... | ... |
@@ -229,7 +231,7 @@ abline(0,1) |
229 | 231 |
|
230 | 232 |
#### Pseudobulk |
231 | 233 |
|
232 |
-Be very careful how you interpret the p-values of a single cell experiment. Cells that come from one individual are not independent replicates. That means that you cannot turn your RNA-seq experiment with 3 treated and 3 control samples into a 3000 vs 3000 experiment through single cell experiment with 1000 cells per sample. The actual unit of replication are still the 3 samples in each condition. |
|
234 |
+Be very careful how you interpret the p-values of a single cell experiment. Cells that come from one individual are not independent replicates. That means that you cannot turn your RNA-seq experiment with 3 treated and 3 control samples into a 3000 vs 3000 experiment by measuring 1000 cells per sample. The actual unit of replication are still the 3 samples in each condition. |
|
233 | 235 |
|
234 | 236 |
Nonetheless, single cell data is valuable because it allows you to compare the effect of a treatment on specific cell types. The simplest way to do such a test is called pseudobulk. This means that the data is subset to the cells of a specific cell type. Then the counts of cells from the same sample are combined to form a "pseudobulk" sample. The `test_de()` function of glmGamPoi supports this feature directly through the `pseudobulk_by` and `subset_to` parameters: |
235 | 237 |
|
... | ... |
@@ -13,15 +13,17 @@ knitr::opts_chunk$set( |
13 | 13 |
) |
14 | 14 |
set.seed(2) |
15 | 15 |
``` |
16 |
-# glmGamPoi |
|
16 |
+ |
|
17 |
+# glmGamPoi <a href='https://github.com/const-ae/glmGamPoi'><img src='man/figures/logo.svg' align="right" height="139" /></a> |
|
17 | 18 |
|
18 | 19 |
<!-- badges: start --> |
19 | 20 |
[](https://codecov.io/gh/const-ae/glmGamPoi) |
20 | 21 |
<!-- badges: end --> |
21 | 22 |
|
22 |
- |
|
23 | 23 |
> Fit Gamma-Poisson Generalized Linear Models Reliably. |
24 | 24 |
|
25 |
+Pronounciation: [`dʒi əl əm ɡam ˈpwɑ`](http://ipa-reader.xyz/?text=d%CA%92i%20%C9%99l%20%C9%99m%20%C9%A1am%20%CB%88pw%C9%91) |
|
26 |
+ |
|
25 | 27 |
The core design aims of `gmlGamPoi` are: |
26 | 28 |
|
27 | 29 |
* Fit the Gamma-Poisson models on arbitrarily large or small datasets |
... | ... |
@@ -132,7 +132,7 @@ bench::mark( |
132 | 132 |
) |
133 | 133 |
``` |
134 | 134 |
|
135 |
-On this dataset, `glmGamPoi` is more than 6 times faster than `edgeR` and more than 20 times faster than `DESeq2`. `glmGamPoi` does **not** use approximations to achieve this performance increase. The performance comes from an optimized algorithm for inferring the overdispersion for each gene. It is tuned for datasets typically encountered in single RNA-seq with many samples and many small counts, by avoiding duplicate calculations. |
|
135 |
+On this dataset, `glmGamPoi` is more than 5 times faster than `edgeR` and more than 18 times faster than `DESeq2`. `glmGamPoi` does **not** use approximations to achieve this performance increase. The performance comes from an optimized algorithm for inferring the overdispersion for each gene. It is tuned for datasets typically encountered in single RNA-seq with many samples and many small counts, by avoiding duplicate calculations. |
|
136 | 136 |
|
137 | 137 |
To demonstrate that the method does not sacrifice accuracy, I compare the parameters that each method estimates. The means and β coefficients are identical, but that the overdispersion estimates from `glmGamPoi` are more reliable: |
138 | 138 |
|
... | ... |
@@ -144,13 +144,12 @@ fit <- glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE) |
144 | 144 |
dds <- DESeq2::DESeqDataSetFromMatrix(pbmcs_subset, |
145 | 145 |
colData = data.frame(name = seq_len(4340)), |
146 | 146 |
design = ~ 1) |
147 |
-dds <- DESeq2::estimateSizeFactors(dds, "poscounts") |
|
147 |
+sizeFactors(dds) <- fit$size_factors |
|
148 | 148 |
dds <- DESeq2::estimateDispersions(dds, quiet = TRUE) |
149 | 149 |
dds <- DESeq2::nbinomWaldTest(dds, minmu = 1e-6) |
150 | 150 |
|
151 | 151 |
#edgeR |
152 |
-edgeR_data <- edgeR::DGEList(pbmcs_subset) |
|
153 |
-edgeR_data <- edgeR::calcNormFactors(edgeR_data) |
|
152 |
+edgeR_data <- edgeR::DGEList(pbmcs_subset, lib.size = fit$size_factors) |
|
154 | 153 |
edgeR_data <- edgeR::estimateDisp(edgeR_data, model_matrix) |
155 | 154 |
edgeR_fit <- edgeR::glmFit(edgeR_data, design = model_matrix) |
156 | 155 |
``` |
... | ... |
@@ -165,7 +164,7 @@ plot(fit$Beta[,1], edgeR_fit$unshrunk.coefficients[,1], pch = 16, |
165 | 164 |
main = "Beta Coefficients", xlab = "glmGamPoi", ylab = "edgeR") |
166 | 165 |
abline(0,1) |
167 | 166 |
|
168 |
-plot(fit$Mu[,1], rowData(dds)$baseMean, pch = 16, log="xy", |
|
167 |
+plot(fit$Mu[,1], assay(dds, "mu")[,1], pch = 16, log="xy", |
|
169 | 168 |
main = "Gene Mean", xlab = "glmGamPoi", ylab = "DESeq2") |
170 | 169 |
abline(0,1) |
171 | 170 |
plot(fit$Mu[,1], edgeR_fit$fitted.values[,1], pch = 16, log="xy", |
* remove comment on size_factor that is outdated.
* add explanation of desgin = ~ 1
... | ... |
@@ -62,7 +62,7 @@ To fit a single Gamma-Poisson GLM do: |
62 | 62 |
```{r} |
63 | 63 |
# overdispersion = 1/size |
64 | 64 |
counts <- rnbinom(n = 10, mu = 5, size = 1/0.7) |
65 |
-# size_factors = FALSE, because only a single GLM is fitted |
|
65 |
+# design = ~ 1 means that an intercept-only model is fit |
|
66 | 66 |
fit <- glm_gp(counts, design = ~ 1) |
67 | 67 |
fit |
68 | 68 |
|
... | ... |
@@ -226,6 +226,24 @@ abline(0,1) |
226 | 226 |
``` |
227 | 227 |
|
228 | 228 |
|
229 |
+#### Pseudobulk |
|
230 |
+ |
|
231 |
+Be very careful how you interpret the p-values of a single cell experiment. Cells that come from one individual are not independent replicates. That means that you cannot turn your RNA-seq experiment with 3 treated and 3 control samples into a 3000 vs 3000 experiment through single cell experiment with 1000 cells per sample. The actual unit of replication are still the 3 samples in each condition. |
|
232 |
+ |
|
233 |
+Nonetheless, single cell data is valuable because it allows you to compare the effect of a treatment on specific cell types. The simplest way to do such a test is called pseudobulk. This means that the data is subset to the cells of a specific cell type. Then the counts of cells from the same sample are combined to form a "pseudobulk" sample. The `test_de()` function of glmGamPoi supports this feature directly through the `pseudobulk_by` and `subset_to` parameters: |
|
234 |
+ |
|
235 |
+```{r} |
|
236 |
+# say we have cell type labels for each cell and know from which sample they come originally |
|
237 |
+sample_labels <- rep(paste0("sample_", 1:6), length = ncol(pbmcs_subset)) |
|
238 |
+cell_type_labels <- sample(c("T-cells", "B-cells", "Macrophages"), ncol(pbmcs_subset), replace = TRUE) |
|
239 |
+ |
|
240 |
+test_de(fit, contrast = Group1 - Group2, |
|
241 |
+ pseudobulk_by = sample_labels, |
|
242 |
+ subset_to = cell_type_labels == "T-cells", |
|
243 |
+ n_max = 4, sort_by = pval, decreasing = FALSE) |
|
244 |
+``` |
|
245 |
+ |
|
246 |
+ |
|
229 | 247 |
# Session Info |
230 | 248 |
|
231 | 249 |
```{r} |
... | ... |
@@ -50,13 +50,20 @@ For the latest developments, see the `r BiocStyle::Githubpkg("const-ae/glmGamPoi |
50 | 50 |
|
51 | 51 |
# Example |
52 | 52 |
|
53 |
+Load the glmGamPoi package |
|
54 |
+ |
|
55 |
+```{r} |
|
56 |
+library(glmGamPoi) |
|
57 |
+``` |
|
58 |
+ |
|
59 |
+ |
|
53 | 60 |
To fit a single Gamma-Poisson GLM do: |
54 | 61 |
|
55 | 62 |
```{r} |
56 | 63 |
# overdispersion = 1/size |
57 | 64 |
counts <- rnbinom(n = 10, mu = 5, size = 1/0.7) |
58 | 65 |
# size_factors = FALSE, because only a single GLM is fitted |
59 |
-fit <- glmGamPoi::glm_gp(counts, design = ~ 1) |
|
66 |
+fit <- glm_gp(counts, design = ~ 1) |
|
60 | 67 |
fit |
61 | 68 |
|
62 | 69 |
# Internally fit is just a list: |
... | ... |
@@ -86,7 +93,7 @@ pbmcs_subset |
86 | 93 |
I call `glm_gp()` to fit one GLM model for each gene and force the calculation to happen in memory. |
87 | 94 |
|
88 | 95 |
```{r} |
89 |
-fit <- glmGamPoi::glm_gp(pbmcs_subset, on_disk = FALSE) |
|
96 |
+fit <- glm_gp(pbmcs_subset, on_disk = FALSE) |
|
90 | 97 |
summary(fit) |
91 | 98 |
``` |
92 | 99 |
|
... | ... |
@@ -106,9 +113,9 @@ model_matrix <- matrix(1, nrow = ncol(pbmcs_subset)) |
106 | 113 |
|
107 | 114 |
bench::mark( |
108 | 115 |
glmGamPoi_in_memory = { |
109 |
- glmGamPoi::glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE) |
|
116 |
+ glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE) |
|
110 | 117 |
}, glmGamPoi_on_disk = { |
111 |
- glmGamPoi::glm_gp(pbmcs_subset, design = model_matrix, on_disk = TRUE) |
|
118 |
+ glm_gp(pbmcs_subset, design = model_matrix, on_disk = TRUE) |
|
112 | 119 |
}, DESeq2 = suppressMessages({ |
113 | 120 |
dds <- DESeq2::DESeqDataSetFromMatrix(pbmcs_subset, |
114 | 121 |
colData = data.frame(name = seq_len(4340)), |
... | ... |
@@ -131,7 +138,7 @@ To demonstrate that the method does not sacrifice accuracy, I compare the parame |
131 | 138 |
|
132 | 139 |
```{r message=FALSE, warning=FALSE} |
133 | 140 |
# Results with my method |
134 |
-fit <- glmGamPoi::glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE) |
|
141 |
+fit <- glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE) |
|
135 | 142 |
|
136 | 143 |
# DESeq2 |
137 | 144 |
dds <- DESeq2::DESeqDataSetFromMatrix(pbmcs_subset, |
... | ... |
@@ -191,9 +198,9 @@ The method scales linearly, with the number of rows and columns in the dataset. |
191 | 198 |
group <- sample(c("Group1", "Group2"), size = ncol(pbmcs_subset), replace = TRUE) |
192 | 199 |
|
193 | 200 |
# Fit model with group vector as design |
194 |
-fit <- glmGamPoi::glm_gp(pbmcs_subset, design = group) |
|
201 |
+fit <- glm_gp(pbmcs_subset, design = group) |
|
195 | 202 |
# Compare against model without group |
196 |
-res <- glmGamPoi::gampoi_test_qlr(pbmcs_subset, fit, reduced_design = ~ 1) |
|
203 |
+res <- test_de(fit, reduced_design = ~ 1) |
|
197 | 204 |
# Look at first 6 genes |
198 | 205 |
head(res) |
199 | 206 |
``` |
... | ... |
@@ -56,11 +56,11 @@ To fit a single Gamma-Poisson GLM do: |
56 | 56 |
# overdispersion = 1/size |
57 | 57 |
counts <- rnbinom(n = 10, mu = 5, size = 1/0.7) |
58 | 58 |
# size_factors = FALSE, because only a single GLM is fitted |
59 |
-fit <- glmGamPoi::glm_gp(counts, design = ~ 1, size_factors = FALSE, overdispersion_shrinkage = FALSE) |
|
59 |
+fit <- glmGamPoi::glm_gp(counts, design = ~ 1) |
|
60 | 60 |
fit |
61 | 61 |
|
62 | 62 |
# Internally fit is just a list: |
63 |
-as.list(fit) |
|
63 |
+as.list(fit)[1:2] |
|
64 | 64 |
``` |
65 | 65 |
|
66 | 66 |
The `glm_gp()` function returns a list with the results of the fit. Most importantly, it contains the estimates for the coefficients β and the overdispersion. |
... | ... |
@@ -127,7 +127,7 @@ bench::mark( |
127 | 127 |
|
128 | 128 |
On this dataset, `glmGamPoi` is more than 6 times faster than `edgeR` and more than 20 times faster than `DESeq2`. `glmGamPoi` does **not** use approximations to achieve this performance increase. The performance comes from an optimized algorithm for inferring the overdispersion for each gene. It is tuned for datasets typically encountered in single RNA-seq with many samples and many small counts, by avoiding duplicate calculations. |
129 | 129 |
|
130 |
-To demonstrate that the method is not sacrificing accuracy, I compare the parameters that each method estimates. I find that means and β coefficients are identical, but that the estimates of the overdispersion estimates from `glmGamPoi` are more reliable: |
|
130 |
+To demonstrate that the method does not sacrifice accuracy, I compare the parameters that each method estimates. The means and β coefficients are identical, but that the overdispersion estimates from `glmGamPoi` are more reliable: |
|
131 | 131 |
|
132 | 132 |
```{r message=FALSE, warning=FALSE} |
133 | 133 |
# Results with my method |
... | ... |
@@ -182,6 +182,43 @@ I am comparing the gene-wise estimates of the coefficients from all three method |
182 | 182 |
The method scales linearly, with the number of rows and columns in the dataset. For example: fitting the full `pbmc4k` dataset with subsampling on a modern MacBook Pro in-memory takes ~1 minute and on-disk a little over 4 minutes. Fitting the `pbmc68k` (17x the size) takes ~73 minutes (17x the time) on-disk. Fitting that dataset in-memory is not possible because it is just too big: the maximum in-memory matrix size is `2^31-1 ≈ 2.1e9` is elements, the `pbmc68k` dataset however has nearly 100 million elements more than that. |
183 | 183 |
|
184 | 184 |
|
185 |
+## Differential expression analysis |
|
186 |
+ |
|
187 |
+`glmGamPoi` provides an interface to do quasi-likelihood ratio testing to identify differentially expressed genes: |
|
188 |
+ |
|
189 |
+```{r} |
|
190 |
+# Create random categorical assignment to demonstrate DE |
|
191 |
+group <- sample(c("Group1", "Group2"), size = ncol(pbmcs_subset), replace = TRUE) |
|
192 |
+ |
|
193 |
+# Fit model with group vector as design |
|
194 |
+fit <- glmGamPoi::glm_gp(pbmcs_subset, design = group) |
|
195 |
+# Compare against model without group |
|
196 |
+res <- glmGamPoi::gampoi_test_qlr(pbmcs_subset, fit, reduced_design = ~ 1) |
|
197 |
+# Look at first 6 genes |
|
198 |
+head(res) |
|
199 |
+``` |
|
200 |
+ |
|
201 |
+The p-values agree well with the ones that `edgeR` is calculating. This is because `glmGamPoi` uses the same framework of quasi-likelihood ratio tests that was invented by `edgeR` and is described in [Lund et al. (2012)](https://doi.org/10.1515/1544-6115.1826). |
|
202 |
+ |
|
203 |
+```{r fig.height=3, fig.width=3, warning=FALSE, message=FALSE} |
|
204 |
+model_matrix <- model.matrix(~ group, data = data.frame(group = group)) |
|
205 |
+edgeR_data <- edgeR::DGEList(pbmcs_subset) |
|
206 |
+edgeR_data <- edgeR::calcNormFactors(edgeR_data) |
|
207 |
+edgeR_data <- edgeR::estimateDisp(edgeR_data, design = model_matrix) |
|
208 |
+edgeR_fit <- edgeR::glmQLFit(edgeR_data, design = model_matrix) |
|
209 |
+edgeR_test <- edgeR::glmQLFTest(edgeR_fit, coef = 2) |
|
210 |
+edgeR_res <- edgeR::topTags(edgeR_test, sort.by = "none", n = nrow(pbmcs_subset)) |
|
211 |
+``` |
|
212 |
+ |
|
213 |
+ |
|
214 |
+```{r fig.height=4, fig.width=3, warning=FALSE, message=FALSE, echo = FALSE} |
|
215 |
+par(cex.main = 2, cex.lab = 1.5) |
|
216 |
+plot(res$pval, edgeR_res$table$PValue, pch = 16, log = "xy", |
|
217 |
+ main = "p-values", xlab = "glmGamPoi", ylab = "edgeR") |
|
218 |
+abline(0,1) |
|
219 |
+``` |
|
220 |
+ |
|
221 |
+ |
|
185 | 222 |
# Session Info |
186 | 223 |
|
187 | 224 |
```{r} |
... | ... |
@@ -56,11 +56,11 @@ To fit a single Gamma-Poisson GLM do: |
56 | 56 |
# overdispersion = 1/size |
57 | 57 |
counts <- rnbinom(n = 10, mu = 5, size = 1/0.7) |
58 | 58 |
# size_factors = FALSE, because only a single GLM is fitted |
59 |
-fit <- glmGamPoi::glm_gp(counts, design = ~ 1, size_factors = FALSE) |
|
59 |
+fit <- glmGamPoi::glm_gp(counts, design = ~ 1, size_factors = FALSE, overdispersion_shrinkage = FALSE) |
|
60 | 60 |
fit |
61 | 61 |
|
62 | 62 |
# Internally fit is just a list: |
63 |
-c(fit) |
|
63 |
+as.list(fit) |
|
64 | 64 |
``` |
65 | 65 |
|
66 | 66 |
The `glm_gp()` function returns a list with the results of the fit. Most importantly, it contains the estimates for the coefficients β and the overdispersion. |
... | ... |
@@ -96,8 +96,10 @@ summary(fit) |
96 | 96 |
|
97 | 97 |
I compare my method (in-memory and on-disk) with `r BiocStyle::Biocpkg("DESeq2")` and `r BiocStyle::Biocpkg("edgeR")`. Both are classical methods for analyzing RNA-Seq datasets and have been around for almost 10 years. Note that both tools can do a lot more than just fitting the Gamma-Poisson model, so this benchmark only serves to give a general impression of the performance. |
98 | 98 |
|
99 |
+ |
|
100 |
+ |
|
99 | 101 |
```{r, warning=FALSE} |
100 |
-# Explicitly realize count matrix in memory |
|
102 |
+# Explicitly realize count matrix in memory so that it is a fair comparison |
|
101 | 103 |
pbmcs_subset <- as.matrix(assay(pbmcs_subset)) |
102 | 104 |
model_matrix <- matrix(1, nrow = ncol(pbmcs_subset)) |
103 | 105 |
|
... | ... |
@@ -119,7 +121,7 @@ bench::mark( |
119 | 121 |
edgeR_data <- edgeR::calcNormFactors(edgeR_data) |
120 | 122 |
edgeR_data <- edgeR::estimateDisp(edgeR_data, model_matrix) |
121 | 123 |
edgeR_fit <- edgeR::glmFit(edgeR_data, design = model_matrix) |
122 |
- }, check = FALSE |
|
124 |
+ }, check = FALSE, min_iterations = 3 |
|
123 | 125 |
) |
124 | 126 |
``` |
125 | 127 |
|
... | ... |
@@ -16,7 +16,6 @@ set.seed(2) |
16 | 16 |
# glmGamPoi |
17 | 17 |
|
18 | 18 |
<!-- badges: start --> |
19 |
-[](https://github.com/const-ae/glmGamPoi/actions) |
|
20 | 19 |
[](https://codecov.io/gh/const-ae/glmGamPoi) |
21 | 20 |
<!-- badges: end --> |
22 | 21 |
|
... | ... |
@@ -37,7 +37,7 @@ The core design aims of `gmlGamPoi` are: |
37 | 37 |
|
38 | 38 |
# Installation |
39 | 39 |
|
40 |
-You can install the release version of `r BiocStyle::Biocpkg("glmGamPoi")` from BioConductor: |
|
40 |
+You can install the release version of *[glmGamPoi](https://bioconductor.org/packages/glmGamPoi)* from BioConductor: |
|
41 | 41 |
|
42 | 42 |
``` r |
43 | 43 |
if (!requireNamespace("BiocManager", quietly = TRUE)) |
... | ... |
@@ -11,42 +11,180 @@ knitr::opts_chunk$set( |
11 | 11 |
fig.path = "man/figures/README-" |
12 | 12 |
# out.width = "100%" |
13 | 13 |
) |
14 |
-set.seed(1) |
|
14 |
+set.seed(2) |
|
15 | 15 |
``` |
16 | 16 |
# glmGamPoi |
17 | 17 |
|
18 | 18 |
<!-- badges: start --> |
19 |
-[](https://github.com/const-ae/glmGamPoi) |
|
19 |
+[](https://github.com/const-ae/glmGamPoi/actions) |
|
20 | 20 |
[](https://codecov.io/gh/const-ae/glmGamPoi) |
21 | 21 |
<!-- badges: end --> |
22 | 22 |
|
23 |
-```{r child = "man/fragments/paragraph-intro.Rmd"} |
|
24 |
-``` |
|
25 | 23 |
|
24 |
+> Fit Gamma-Poisson Generalized Linear Models Reliably. |
|
25 |
+ |
|
26 |
+The core design aims of `gmlGamPoi` are: |
|
27 |
+ |
|
28 |
+* Fit the Gamma-Poisson models on arbitrarily large or small datasets |
|
29 |
+* Be faster than alternative methods, such as `DESeq2` or `edgeR` |
|
30 |
+* Calculate exact or approximate results based on user preference |
|
31 |
+* Support in memory or on-disk data |
|
32 |
+* Follow established conventions around tools for RNA-seq analysis |
|
33 |
+* Present a simple user-interface |
|
34 |
+* Avoid unnecessary dependencies |
|
35 |
+* Make integration into other tools easy |
|
36 |
+ |
|
37 |
+ |
|
38 |
+# Installation |
|
26 | 39 |
|
40 |
+You can install the release version of `r BiocStyle::Biocpkg("glmGamPoi")` from BioConductor: |
|
27 | 41 |
|
28 |
-## Installation |
|
42 |
+``` r |
|
43 |
+if (!requireNamespace("BiocManager", quietly = TRUE)) |
|
44 |
+ install.packages("BiocManager") |
|
29 | 45 |
|
30 |
-```{r child = "man/fragments/paragraph-installation.Rmd"} |
|
46 |
+BiocManager::install("glmGamPoi") |
|
31 | 47 |
``` |
32 | 48 |
|
49 |
+For the latest developments, see the `r BiocStyle::Githubpkg("const-ae/glmGamPoi", "GitHub")` repo. |
|
33 | 50 |
|
34 |
-## Example |
|
35 | 51 |
|
36 |
-```{r child = "man/fragments/paragraph-simple_example.Rmd"} |
|
52 |
+# Example |
|
53 |
+ |
|
54 |
+To fit a single Gamma-Poisson GLM do: |
|
55 |
+ |
|
56 |
+```{r} |
|
57 |
+# overdispersion = 1/size |
|
58 |
+counts <- rnbinom(n = 10, mu = 5, size = 1/0.7) |
|
59 |
+# size_factors = FALSE, because only a single GLM is fitted |
|
60 |
+fit <- glmGamPoi::glm_gp(counts, design = ~ 1, size_factors = FALSE) |
|
61 |
+fit |
|
62 |
+ |
|
63 |
+# Internally fit is just a list: |
|
64 |
+c(fit) |
|
37 | 65 |
``` |
38 | 66 |
|
67 |
+The `glm_gp()` function returns a list with the results of the fit. Most importantly, it contains the estimates for the coefficients β and the overdispersion. |
|
39 | 68 |
|
40 | 69 |
Fitting repeated Gamma-Poisson GLMs for each gene of a single cell dataset is just as easy: |
41 | 70 |
|
42 |
-```{r child = "man/fragments/paragraph-pbmc4k_example.Rmd"} |
|
71 |
+I will first load an example dataset using the `TENxPBMCData` package. The dataset has 33,000 genes and 4340 cells. It takes roughly 1.5 minutes to fit the Gamma-Poisson model on the full dataset. For demonstration purposes, I will subset the dataset to 300 genes, but keep the 4340 cells: |
|
72 |
+```{r, warning=FALSE, message = FALSE} |
|
73 |
+library(SummarizedExperiment) |
|
74 |
+library(DelayedMatrixStats) |
|
75 |
+``` |
|
76 |
+ |
|
77 |
+```{r} |
|
78 |
+# The full dataset with 33,000 genes and 4340 cells |
|
79 |
+# The first time this is run, it will download the data |
|
80 |
+pbmcs <- TENxPBMCData::TENxPBMCData("pbmc4k") |
|
81 |
+# I want genes where at least some counts are non-zero |
|
82 |
+non_empty_rows <- which(rowSums2(assay(pbmcs)) > 0) |
|
83 |
+pbmcs_subset <- pbmcs[sample(non_empty_rows, 300), ] |
|
84 |
+pbmcs_subset |
|
43 | 85 |
``` |
44 | 86 |
|
87 |
+I call `glm_gp()` to fit one GLM model for each gene and force the calculation to happen in memory. |
|
88 |
+ |
|
89 |
+```{r} |
|
90 |
+fit <- glmGamPoi::glm_gp(pbmcs_subset, on_disk = FALSE) |
|
91 |
+summary(fit) |
|
92 |
+``` |
|
93 |
+ |
|
94 |
+ |
|
95 |
+ |
|
96 |
+# Benchmark |
|
97 |
+ |
|
98 |
+I compare my method (in-memory and on-disk) with `r BiocStyle::Biocpkg("DESeq2")` and `r BiocStyle::Biocpkg("edgeR")`. Both are classical methods for analyzing RNA-Seq datasets and have been around for almost 10 years. Note that both tools can do a lot more than just fitting the Gamma-Poisson model, so this benchmark only serves to give a general impression of the performance. |
|
99 |
+ |
|
100 |
+```{r, warning=FALSE} |
|
101 |
+# Explicitly realize count matrix in memory |
|
102 |
+pbmcs_subset <- as.matrix(assay(pbmcs_subset)) |
|
103 |
+model_matrix <- matrix(1, nrow = ncol(pbmcs_subset)) |
|
104 |
+ |
|
105 |
+ |
|
106 |
+bench::mark( |
|
107 |
+ glmGamPoi_in_memory = { |
|
108 |
+ glmGamPoi::glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE) |
|