Browse code

Change 'fact' to 'cond' for contrast formation

const-ae authored on 15/02/2023 17:28:19
Showing 1 changed files
... ...
@@ -250,7 +250,7 @@ In our example, we want to find the genes that change specifically in T cells. F
250 250
 ```{r do_pseudobulk}
251 251
 # The contrast argument specifies what we want to compare
252 252
 # We test the expression difference of stimulated and control T-cells
253
-de_res <- test_de(fit, contrast = fact(cell = "CD4 T cells", stim = "ctrl") - fact(cell = "CD4 T cells", stim = "stim")) 
253
+de_res <- test_de(fit, contrast = cond(cell = "CD4 T cells", stim = "ctrl") - cond(cell = "CD4 T cells", stim = "stim")) 
254 254
 
255 255
 # Most different genes
256 256
 head(de_res[order(de_res$pval), ])
Browse code

Use new 'pseudobulk' workflow in README

const-ae authored on 03/01/2023 18:07:42
Showing 1 changed files
... ...
@@ -222,10 +222,17 @@ counts(sce_subset) <- as.matrix(counts(sce_subset))
222 222
 sce_subset$cell <- droplevels(sce_subset$cell)
223 223
 ```
224 224
 
225
+In the first step we will aggregate the counts of each patient, condition and cell type and form pseudobulk samples. This ensures that I get reliable p-value by treating each patient as a replicate and not each cell.
226
+
227
+```{r}
228
+sce_reduced <- pseudobulk(sce_subset, group_by = vars(ind, stim, cell))
229
+```
230
+
231
+
225 232
 We will identify which genes in CD4 positive T-cells are changed most by the treatment. We will fit a full model including the interaction term `stim:cell`. The interaction term will help us identify cell type specific responses to the treatment:
226 233
 
227 234
 ```{r kang_fit}
228
-fit <- glm_gp(sce_subset, design = ~ cell + stim +  stim:cell - 1,
235
+fit <- glm_gp(sce_reduced, design = ~ cell + stim +  stim:cell - 1,
229 236
               reference_level = "NK cells")
230 237
 summary(fit)
231 238
 ```
... ...
@@ -238,26 +245,18 @@ colnames(fit$Beta)
238 245
 ```
239 246
 
240 247
 
241
-In our example, we want to find the genes that change specifically in T cells. Finding cell type specific responses to a treatment is a big advantage of single cell data over bulk data. To get a proper estimate of the uncertainty (cells from the same donor are **not** independent replicates), we create a pseudobulk for each sample:
248
+In our example, we want to find the genes that change specifically in T cells. Finding cell type specific responses to a treatment is a big advantage of single cell data over bulk data. 
242 249
 
243 250
 ```{r do_pseudobulk}
244 251
 # The contrast argument specifies what we want to compare
245 252
 # We test the expression difference of stimulated and control T-cells
246
-#
247
-# There is no sample label in the colData, so we create it on the fly
248
-# from `stim` and `ind` columns in colData(fit$data).
249
-de_res <- test_de(fit, contrast = `stimstim` + `cellCD4 T cells:stimstim`, 
250
-                  pseudobulk_by = paste0(stim, "-", ind)) 
251
-
252
-# The large `lfc` values come from groups were nearly all counts are 0
253
-# Setting them to Inf makes the plots look nicer
254
-de_res$lfc <- ifelse(abs(de_res$lfc) > 20, sign(de_res$lfc) * Inf, de_res$lfc)
253
+de_res <- test_de(fit, contrast = fact(cell = "CD4 T cells", stim = "ctrl") - fact(cell = "CD4 T cells", stim = "stim")) 
255 254
 
256 255
 # Most different genes
257 256
 head(de_res[order(de_res$pval), ])
258 257
 ```
259 258
 
260
-The test is successful and we identify interesting genes that are differentially expressed in interferon-stimulated T cells:  _IFI6_, _IFIT3_, and _IRF7_ literally stand for _Interferon Induced/Regulated Protein_.
259
+The test is successful and we identify interesting genes that are differentially expressed in interferon-stimulated T cells:  _IFI6_, _IFIT3_ and _ISG15_ literally stand for _Interferon Induced/Stimulated Protein_.
261 260
 
262 261
 To get a more complete overview of the results, we can make a volcano plot that compares the log2-fold change (LFC) vs the logarithmized p-values.
263 262
 
... ...
@@ -271,17 +270,19 @@ ggplot(de_res, aes(x = lfc, y = -log10(pval))) +
271 270
 
272 271
 Another important task in single cell data analysis is the identification of marker genes for cell clusters. For this we can also use our Gamma-Poisson fit. 
273 272
 
274
-Let's assume we want to find genes that differ between T cells and the B cells. We can directly compare the corresponding coefficients and find genes that differ in the control condition:
273
+Let's assume we want to find genes that differ between T cells and the B cells. We can directly compare the corresponding coefficients and find genes that differ in the control condition (this time not accounting for the pseudo-replication structure):
275 274
 
276 275
 ```{r find_marker_genes}
277
-marker_genes <- test_de(fit, `cellCD4 T cells` - `cellB cells`, sort_by = pval)
276
+fit_full <- glm_gp(sce_subset, design = ~ cell + stim +  stim:cell - 1,
277
+                   reference_level = "NK cells")
278
+marker_genes <- test_de(fit_full, `cellCD4 T cells` - `cellB cells`, sort_by = pval)
278 279
 head(marker_genes)
279 280
 ```
280 281
 
281 282
 If we want find genes that differ in the stimulated condition, we just include the additional coefficients in the contrast:
282 283
 
283 284
 ```{r do_signif_genes}
284
-marker_genes2 <- test_de(fit, (`cellCD4 T cells` + `cellCD4 T cells:stimstim`) - 
285
+marker_genes2 <- test_de(fit_full, (`cellCD4 T cells` + `cellCD4 T cells:stimstim`) - 
285 286
                                (`cellB cells` + `cellB cells:stimstim`), 
286 287
                         sort_by = pval)
287 288
 
Browse code

Update README

const-ae authored on 19/03/2021 15:06:47
Showing 1 changed files
... ...
@@ -4,7 +4,7 @@ output: github_document
4 4
 
5 5
 <!-- README.md is generated from README.Rmd. Please edit that file -->
6 6
 
7
-```{r, include = FALSE}
7
+```{r init_chunk, include = FALSE}
8 8
 knitr::opts_chunk$set(
9 9
   collapse = TRUE,
10 10
   comment = "#>",
... ...
@@ -59,14 +59,14 @@ If you use this package in a scientific publication, please cite:
59 59
 
60 60
 Load the glmGamPoi package
61 61
 
62
-```{r}
62
+```{r load_glmGamPoi}
63 63
 library(glmGamPoi)
64 64
 ```
65 65
 
66 66
 
67 67
 To fit a single Gamma-Poisson GLM do:
68 68
 
69
-```{r}
69
+```{r single_gp_fit}
70 70
 # overdispersion = 1/size
71 71
 counts <- rnbinom(n = 10, mu = 5, size = 1/0.7)
72 72
 
... ...
@@ -83,12 +83,12 @@ The `glm_gp()` function returns a list with the results of the fit. Most importa
83 83
 Fitting repeated Gamma-Poisson GLMs for each gene of a single cell dataset is just as easy:
84 84
 
85 85
 I will first load an example dataset using the `TENxPBMCData` package. The dataset has 33,000 genes and 4340 cells. It takes roughly 1.5 minutes to fit the Gamma-Poisson model on the full dataset. For demonstration purposes, I will subset the dataset to 300 genes, but keep the 4340 cells:
86
-```{r, warning=FALSE, message = FALSE}
86
+```{r load_additional_packages, warning=FALSE, message = FALSE}
87 87
 library(SummarizedExperiment)
88 88
 library(DelayedMatrixStats)
89 89
 ```
90 90
 
91
-```{r}
91
+```{r load_pbmc_data}
92 92
 # The full dataset with 33,000 genes and 4340 cells
93 93
 # The first time this is run, it will download the data
94 94
 pbmcs <- TENxPBMCData::TENxPBMCData("pbmc4k")
... ...
@@ -101,7 +101,7 @@ pbmcs_subset
101 101
 
102 102
 I call `glm_gp()` to fit one GLM model for each gene and force the calculation to happen in memory.
103 103
 
104
-```{r}
104
+```{r simple_fit}
105 105
 fit <- glm_gp(pbmcs_subset, on_disk = FALSE)
106 106
 summary(fit)
107 107
 ```
... ...
@@ -114,7 +114,7 @@ I compare my method (in-memory and on-disk) with `r BiocStyle::Biocpkg("DESeq2")
114 114
 
115 115
 
116 116
 
117
-```{r, warning=FALSE}
117
+```{r run_bench_mark, warning=FALSE}
118 118
 # Explicitly realize count matrix in memory so that it is a fair comparison
119 119
 pbmcs_subset <- as.matrix(assay(pbmcs_subset))
120 120
 model_matrix <- matrix(1, nrow = ncol(pbmcs_subset))
... ...
@@ -145,7 +145,7 @@ On this dataset, `glmGamPoi` is more than 5 times faster than `edgeR` and more t
145 145
 
146 146
 To demonstrate that the method does not sacrifice accuracy, I compare the parameters that each method estimates. The means and β coefficients are identical, but that the overdispersion estimates from `glmGamPoi` are more reliable:
147 147
 
148
-```{r message=FALSE, warning=FALSE}
148
+```{r compare_with_deseq2_and_edger, message=FALSE, warning=FALSE}
149 149
 # Results with my method
150 150
 fit <- glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE)
151 151
 
... ...
@@ -201,14 +201,14 @@ The method scales linearly, with the number of rows and columns in the dataset.
201 201
 
202 202
 `glmGamPoi` provides an interface to do quasi-likelihood ratio testing to identify differentially expressed genes. To demonstrate this feature, we will use the data from [Kang _et al._ (2018)](https://www.ncbi.nlm.nih.gov/pubmed/29227470) provided by the `MuscData` package. This is a single cell dataset of 8 Lupus patients for which 10x droplet-based scRNA-seq was performed before and after treatment with interferon beta. The `SingleCellExperiment` object conveniently provides the patient id (`ind`), treatment status (`stim`) and cell type (`cell`):
203 203
 
204
-```{r}
204
+```{r load_kang_data}
205 205
 sce <- muscData::Kang18_8vs8()
206 206
 colData(sce)
207 207
 ```
208 208
 
209 209
 For demonstration purpose, I will work on a subset of the genes and cells:
210 210
 
211
-```{r}
211
+```{r subset_kang_data}
212 212
 set.seed(1)
213 213
 # Take highly expressed genes and proper cells:
214 214
 sce_subset <- sce[rowSums(counts(sce)) > 100, 
... ...
@@ -224,7 +224,7 @@ sce_subset$cell <- droplevels(sce_subset$cell)
224 224
 
225 225
 We will identify which genes in CD4 positive T-cells are changed most by the treatment. We will fit a full model including the interaction term `stim:cell`. The interaction term will help us identify cell type specific responses to the treatment:
226 226
 
227
-```{r}
227
+```{r kang_fit}
228 228
 fit <- glm_gp(sce_subset, design = ~ cell + stim +  stim:cell - 1,
229 229
               reference_level = "NK cells")
230 230
 summary(fit)
... ...
@@ -233,14 +233,14 @@ summary(fit)
233 233
 
234 234
 To see how the coefficient of our model are called, we look at the `colnames(fit$Beta)`:
235 235
 
236
-```{r}
236
+```{r show_beta_colnames}
237 237
 colnames(fit$Beta)
238 238
 ```
239 239
 
240 240
 
241 241
 In our example, we want to find the genes that change specifically in T cells. Finding cell type specific responses to a treatment is a big advantage of single cell data over bulk data. To get a proper estimate of the uncertainty (cells from the same donor are **not** independent replicates), we create a pseudobulk for each sample:
242 242
 
243
-```{r}
243
+```{r do_pseudobulk}
244 244
 # The contrast argument specifies what we want to compare
245 245
 # We test the expression difference of stimulated and control T-cells
246 246
 #
... ...
@@ -261,7 +261,7 @@ The test is successful and we identify interesting genes that are differentially
261 261
 
262 262
 To get a more complete overview of the results, we can make a volcano plot that compares the log2-fold change (LFC) vs the logarithmized p-values.
263 263
 
264
-```{r}
264
+```{r make_volcano_plot}
265 265
 library(ggplot2)
266 266
 ggplot(de_res, aes(x = lfc, y = -log10(pval))) +
267 267
   geom_point(size = 0.6, aes(color = adj_pval < 0.1)) +
... ...
@@ -273,14 +273,14 @@ Another important task in single cell data analysis is the identification of mar
273 273
 
274 274
 Let's assume we want to find genes that differ between T cells and the B cells. We can directly compare the corresponding coefficients and find genes that differ in the control condition:
275 275
 
276
-```{r}
276
+```{r find_marker_genes}
277 277
 marker_genes <- test_de(fit, `cellCD4 T cells` - `cellB cells`, sort_by = pval)
278 278
 head(marker_genes)
279 279
 ```
280 280
 
281 281
 If we want find genes that differ in the stimulated condition, we just include the additional coefficients in the contrast:
282 282
 
283
-```{r}
283
+```{r do_signif_genes}
284 284
 marker_genes2 <- test_de(fit, (`cellCD4 T cells` + `cellCD4 T cells:stimstim`) - 
285 285
                                (`cellB cells` + `cellB cells:stimstim`), 
286 286
                         sort_by = pval)
... ...
@@ -292,7 +292,7 @@ We identify many genes related to the human leukocyte antigen (HLA) system that
292 292
 
293 293
 A note of caution: applying `test_de()` to single cell data without the pseudobulk gives overly optimistic p-values. This is due to the fact that cells from the same sample are not independent replicates! It can still be fine to use the method for identifying marker genes, as long as one is aware of the difficulties interpreting the results.
294 294
 
295
-```{r}
295
+```{r plot_marker_genes}
296 296
 # Create a data.frame with the expression values, gene names, and cell types
297 297
 tmp <- data.frame(gene = rep(marker_genes$name[1:6], times = ncol(sce_subset)),
298 298
                   expression = c(counts(sce_subset)[marker_genes$name[1:6], ]),
... ...
@@ -316,7 +316,7 @@ This work was supported by the EMBL International PhD Programme and the European
316 316
 
317 317
 # Session Info
318 318
 
319
-```{r}
319
+```{r print_sessionInfo}
320 320
 sessionInfo()
321 321
 ```
322 322
 
Browse code

Add Acknowlegments section to README

const-ae authored on 16/02/2021 14:39:33
Showing 1 changed files
... ...
@@ -307,7 +307,9 @@ ggplot(tmp, aes(x = celltype, y = expression)) +
307 307
 
308 308
 
309 309
 
310
+# Acknowlegments
310 311
 
312
+This work was supported by the EMBL International PhD Programme and the European Research Council Synergy grant DECODE under grant agreement No. 810296.
311 313
 
312 314
 
313 315
 
Browse code

Update Citation in README

const-ae authored on 28/12/2020 17:36:24
Showing 1 changed files
... ...
@@ -53,7 +53,7 @@ If you use this package in a scientific publication, please cite:
53 53
 
54 54
 > glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data  
55 55
 > Constantin Ahlmann-Eltze, Wolfgang Huber  
56
-> bioRxiv 2020.08.13.249623; doi: https://doi.org/10.1101/2020.08.13.249623
56
+> Bioinformatics; 2020-12-09; doi: https://doi.org/10.1093/bioinformatics/btaa1009
57 57
 
58 58
 # Example
59 59
 
Browse code

Update README / vignette

const-ae authored on 17/11/2020 11:43:58
Showing 1 changed files
... ...
@@ -245,13 +245,13 @@ In our example, we want to find the genes that change specifically in T cells. F
245 245
 # We test the expression difference of stimulated and control T-cells
246 246
 #
247 247
 # There is no sample label in the colData, so we create it on the fly
248
-# (paste0(stim, "-", ind)) to aggregate the counts by sample
248
+# from `stim` and `ind` columns in colData(fit$data).
249 249
 de_res <- test_de(fit, contrast = `stimstim` + `cellCD4 T cells:stimstim`, 
250 250
                   pseudobulk_by = paste0(stim, "-", ind)) 
251 251
 
252 252
 # The large `lfc` values come from groups were nearly all counts are 0
253 253
 # Setting them to Inf makes the plots look nicer
254
-de_res$lfc = ifelse(abs(de_res$lfc) > 20, sign(de_res$lfc) * Inf, de_res$lfc)
254
+de_res$lfc <- ifelse(abs(de_res$lfc) > 20, sign(de_res$lfc) * Inf, de_res$lfc)
255 255
 
256 256
 # Most different genes
257 257
 head(de_res[order(de_res$pval), ])
... ...
@@ -265,7 +265,7 @@ To get a more complete overview of the results, we can make a volcano plot that
265 265
 library(ggplot2)
266 266
 ggplot(de_res, aes(x = lfc, y = -log10(pval))) +
267 267
   geom_point(size = 0.6, aes(color = adj_pval < 0.1)) +
268
-  ggtitle("Volcano Plot", "Genes that change most through\ninterferon-beta treatment in T cells")
268
+  ggtitle("Volcano Plot", "Genes that change most through interferon-beta treatment in T cells")
269 269
 ```
270 270
 
271 271
 
... ...
@@ -288,7 +288,7 @@ marker_genes2 <- test_de(fit, (`cellCD4 T cells` + `cellCD4 T cells:stimstim`) -
288 288
 head(marker_genes2)
289 289
 ```
290 290
 
291
-Unsurprisingly, we identify many genes related to the human leukocyte antigen (HLA) system that is important for antigen presenting cells like B-cells, but are not expressed by T helper cells. The plot below shows the expression differences.
291
+We identify many genes related to the human leukocyte antigen (HLA) system that is important for antigen presenting cells like B-cells, but are not expressed by T helper cells. The plot below shows the expression differences.
292 292
 
293 293
 A note of caution: applying `test_de()` to single cell data without the pseudobulk gives overly optimistic p-values. This is due to the fact that cells from the same sample are not independent replicates! It can still be fine to use the method for identifying marker genes, as long as one is aware of the difficulties interpreting the results.
294 294
 
Browse code

Update README / vignette

Improve section on differential expression analysis. Use the Kang 2018
dataset provided by muscData

const-ae authored on 16/11/2020 17:38:34
Showing 1 changed files
... ...
@@ -199,58 +199,118 @@ The method scales linearly, with the number of rows and columns in the dataset.
199 199
 
200 200
 ## Differential expression analysis
201 201
 
202
-`glmGamPoi` provides an interface to do quasi-likelihood ratio testing to identify differentially expressed genes:
202
+`glmGamPoi` provides an interface to do quasi-likelihood ratio testing to identify differentially expressed genes. To demonstrate this feature, we will use the data from [Kang _et al._ (2018)](https://www.ncbi.nlm.nih.gov/pubmed/29227470) provided by the `MuscData` package. This is a single cell dataset of 8 Lupus patients for which 10x droplet-based scRNA-seq was performed before and after treatment with interferon beta. The `SingleCellExperiment` object conveniently provides the patient id (`ind`), treatment status (`stim`) and cell type (`cell`):
203 203
 
204 204
 ```{r}
205
-# Create random categorical assignment to demonstrate DE
206
-group <- sample(c("Group1", "Group2"), size = ncol(pbmcs_subset), replace = TRUE)
207
-
208
-# Fit model with group vector as design
209
-fit <- glm_gp(pbmcs_subset, design = group)
210
-# Compare against model without group 
211
-res <- test_de(fit, reduced_design = ~ 1)
212
-# Look at first 6 genes
213
-head(res)
205
+sce <- muscData::Kang18_8vs8()
206
+colData(sce)
214 207
 ```
215 208
 
216
-The p-values agree well with the ones that `edgeR` is calculating. This is because `glmGamPoi` uses the same framework of quasi-likelihood ratio tests that was invented by `edgeR` and is described in [Lund et al. (2012)](https://doi.org/10.1515/1544-6115.1826).
209
+For demonstration purpose, I will work on a subset of the genes and cells:
217 210
 
218
-```{r fig.height=3, fig.width=3, warning=FALSE, message=FALSE}
219
-model_matrix <- model.matrix(~ group, data = data.frame(group = group))
220
-edgeR_data <- edgeR::DGEList(pbmcs_subset)
221
-edgeR_data <- edgeR::calcNormFactors(edgeR_data)
222
-edgeR_data <- edgeR::estimateDisp(edgeR_data, design = model_matrix)
223
-edgeR_fit <- edgeR::glmQLFit(edgeR_data, design = model_matrix)
224
-edgeR_test <- edgeR::glmQLFTest(edgeR_fit, coef = 2)
225
-edgeR_res <- edgeR::topTags(edgeR_test, sort.by = "none", n = nrow(pbmcs_subset))
211
+```{r}
212
+set.seed(1)
213
+# Take highly expressed genes and proper cells:
214
+sce_subset <- sce[rowSums(counts(sce)) > 100, 
215
+                  sample(which(sce$multiplets == "singlet" & 
216
+                              ! is.na(sce$cell) &
217
+                              sce$cell %in% c("CD4 T cells", "B cells", "NK cells")), 
218
+                         1000)]
219
+# Convert counts to dense matrix
220
+counts(sce_subset) <- as.matrix(counts(sce_subset))
221
+# Remove empty levels because glm_gp() will complain otherwise
222
+sce_subset$cell <- droplevels(sce_subset$cell)
226 223
 ```
227 224
 
225
+We will identify which genes in CD4 positive T-cells are changed most by the treatment. We will fit a full model including the interaction term `stim:cell`. The interaction term will help us identify cell type specific responses to the treatment:
228 226
 
229
-```{r fig.height=4, fig.width=3, warning=FALSE, message=FALSE, echo = FALSE}
230
-par(cex.main = 2, cex.lab = 1.5)
231
-plot(res$pval, edgeR_res$table$PValue, pch = 16, log = "xy",
232
-     main = "p-values", xlab = "glmGamPoi", ylab = "edgeR")
233
-abline(0,1)
227
+```{r}
228
+fit <- glm_gp(sce_subset, design = ~ cell + stim +  stim:cell - 1,
229
+              reference_level = "NK cells")
230
+summary(fit)
231
+```
232
+
233
+
234
+To see how the coefficient of our model are called, we look at the `colnames(fit$Beta)`:
235
+
236
+```{r}
237
+colnames(fit$Beta)
238
+```
239
+
240
+
241
+In our example, we want to find the genes that change specifically in T cells. Finding cell type specific responses to a treatment is a big advantage of single cell data over bulk data. To get a proper estimate of the uncertainty (cells from the same donor are **not** independent replicates), we create a pseudobulk for each sample:
242
+
243
+```{r}
244
+# The contrast argument specifies what we want to compare
245
+# We test the expression difference of stimulated and control T-cells
246
+#
247
+# There is no sample label in the colData, so we create it on the fly
248
+# (paste0(stim, "-", ind)) to aggregate the counts by sample
249
+de_res <- test_de(fit, contrast = `stimstim` + `cellCD4 T cells:stimstim`, 
250
+                  pseudobulk_by = paste0(stim, "-", ind)) 
251
+
252
+# The large `lfc` values come from groups were nearly all counts are 0
253
+# Setting them to Inf makes the plots look nicer
254
+de_res$lfc = ifelse(abs(de_res$lfc) > 20, sign(de_res$lfc) * Inf, de_res$lfc)
255
+
256
+# Most different genes
257
+head(de_res[order(de_res$pval), ])
258
+```
259
+
260
+The test is successful and we identify interesting genes that are differentially expressed in interferon-stimulated T cells:  _IFI6_, _IFIT3_, and _IRF7_ literally stand for _Interferon Induced/Regulated Protein_.
261
+
262
+To get a more complete overview of the results, we can make a volcano plot that compares the log2-fold change (LFC) vs the logarithmized p-values.
263
+
264
+```{r}
265
+library(ggplot2)
266
+ggplot(de_res, aes(x = lfc, y = -log10(pval))) +
267
+  geom_point(size = 0.6, aes(color = adj_pval < 0.1)) +
268
+  ggtitle("Volcano Plot", "Genes that change most through\ninterferon-beta treatment in T cells")
234 269
 ```
235 270
 
236 271
 
237
-#### Pseudobulk
272
+Another important task in single cell data analysis is the identification of marker genes for cell clusters. For this we can also use our Gamma-Poisson fit. 
273
+
274
+Let's assume we want to find genes that differ between T cells and the B cells. We can directly compare the corresponding coefficients and find genes that differ in the control condition:
238 275
 
239
-Be very careful how you interpret the p-values of a single cell experiment. Cells that come from one individual are not independent replicates. That means that you cannot turn your RNA-seq experiment with 3 treated and 3 control samples into a 3000 vs 3000 experiment by measuring 1000 cells per sample. The actual unit of replication are still the 3 samples in each condition. 
276
+```{r}
277
+marker_genes <- test_de(fit, `cellCD4 T cells` - `cellB cells`, sort_by = pval)
278
+head(marker_genes)
279
+```
240 280
 
241
-Nonetheless, single cell data is valuable because it allows you to compare the effect of a treatment on specific cell types. The simplest way to do such a test is called pseudobulk. This means that the data is subset to the cells of a specific cell type. Then the counts of cells from the same sample are combined to form a "pseudobulk" sample. The `test_de()` function of glmGamPoi supports this feature directly through the `pseudobulk_by` and `subset_to` parameters:
281
+If we want find genes that differ in the stimulated condition, we just include the additional coefficients in the contrast:
242 282
 
243 283
 ```{r}
244
-# say we have cell type labels for each cell and know from which sample they come originally
245
-sample_labels <- rep(paste0("sample_", 1:6), length = ncol(pbmcs_subset))
246
-cell_type_labels <- sample(c("T-cells", "B-cells", "Macrophages"), ncol(pbmcs_subset), replace = TRUE)
247
-
248
-test_de(fit, contrast = Group1 - Group2,
249
-        pseudobulk_by = sample_labels, 
250
-        subset_to = cell_type_labels == "T-cells",
251
-        n_max = 4, sort_by = pval, decreasing = FALSE)
284
+marker_genes2 <- test_de(fit, (`cellCD4 T cells` + `cellCD4 T cells:stimstim`) - 
285
+                               (`cellB cells` + `cellB cells:stimstim`), 
286
+                        sort_by = pval)
287
+
288
+head(marker_genes2)
252 289
 ```
253 290
 
291
+Unsurprisingly, we identify many genes related to the human leukocyte antigen (HLA) system that is important for antigen presenting cells like B-cells, but are not expressed by T helper cells. The plot below shows the expression differences.
292
+
293
+A note of caution: applying `test_de()` to single cell data without the pseudobulk gives overly optimistic p-values. This is due to the fact that cells from the same sample are not independent replicates! It can still be fine to use the method for identifying marker genes, as long as one is aware of the difficulties interpreting the results.
294
+
295
+```{r}
296
+# Create a data.frame with the expression values, gene names, and cell types
297
+tmp <- data.frame(gene = rep(marker_genes$name[1:6], times = ncol(sce_subset)),
298
+                  expression = c(counts(sce_subset)[marker_genes$name[1:6], ]),
299
+                  celltype = rep(sce_subset$cell, each = 6))
300
+
301
+ggplot(tmp, aes(x = celltype, y = expression)) +
302
+  geom_jitter(height = 0.1) +
303
+  stat_summary(geom = "crossbar", fun = "mean", color = "red") +
304
+  facet_wrap(~ gene, scales = "free_y") +
305
+  ggtitle("Marker genes of B vs. T cells")
306
+```
307
+
308
+
309
+
310
+
311
+
312
+
313
+
254 314
 
255 315
 # Session Info
256 316
 
Browse code

Add reference to bioRxiv

const-ae authored on 17/08/2020 17:40:03
Showing 1 changed files
... ...
@@ -49,6 +49,11 @@ BiocManager::install("glmGamPoi")
49 49
 
50 50
 For the latest developments, see the `r BiocStyle::Githubpkg("const-ae/glmGamPoi", "GitHub")` repo.
51 51
 
52
+If you use this package in a scientific publication, please cite:
53
+
54
+> glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data  
55
+> Constantin Ahlmann-Eltze, Wolfgang Huber  
56
+> bioRxiv 2020.08.13.249623; doi: https://doi.org/10.1101/2020.08.13.249623
52 57
 
53 58
 # Example
54 59
 
Browse code

Fix typos in README

const-ae authored on 11/08/2020 12:22:53
Showing 1 changed files
... ...
@@ -24,12 +24,12 @@ set.seed(2)
24 24
 
25 25
 Pronounciation: [`dʒi əl əm ɡam ˈpwɑ`](http://ipa-reader.xyz/?text=d%CA%92i%20%C9%99l%20%C9%99m%20%C9%A1am%20%CB%88pw%C9%91)
26 26
 
27
-The core design aims of `gmlGamPoi` are:
27
+The core design aims of `glmGamPoi` are:
28 28
 
29
-* Fit the Gamma-Poisson models on arbitrarily large or small datasets
29
+* Fit Gamma-Poisson models on arbitrarily large or small datasets
30 30
 * Be faster than alternative methods, such as `DESeq2` or `edgeR`
31 31
 * Calculate exact or approximate results based on user preference
32
-* Support in memory or on-disk data
32
+* Support in-memory or on-disk data
33 33
 * Follow established conventions around tools for RNA-seq analysis
34 34
 * Present a simple user-interface
35 35
 * Avoid unnecessary dependencies
... ...
@@ -64,6 +64,7 @@ To fit a single Gamma-Poisson GLM do:
64 64
 ```{r}
65 65
 # overdispersion = 1/size
66 66
 counts <- rnbinom(n = 10, mu = 5, size = 1/0.7)
67
+
67 68
 # design = ~ 1 means that an intercept-only model is fit
68 69
 fit <- glm_gp(counts, design = ~ 1)
69 70
 fit
... ...
@@ -86,6 +87,7 @@ library(DelayedMatrixStats)
86 87
 # The full dataset with 33,000 genes and 4340 cells
87 88
 # The first time this is run, it will download the data
88 89
 pbmcs <- TENxPBMCData::TENxPBMCData("pbmc4k")
90
+
89 91
 # I want genes where at least some counts are non-zero
90 92
 non_empty_rows <- which(rowSums2(assay(pbmcs)) > 0)
91 93
 pbmcs_subset <- pbmcs[sample(non_empty_rows, 300), ]
... ...
@@ -187,7 +189,7 @@ I am comparing the gene-wise estimates of the coefficients from all three method
187 189
 
188 190
 ## Scalability
189 191
 
190
-The method scales linearly, with the number of rows and columns in the dataset. For example: fitting the full `pbmc4k` dataset with subsampling on a modern MacBook Pro in-memory takes ~1 minute and on-disk a little over 4 minutes. Fitting the `pbmc68k` (17x the size) takes ~73 minutes (17x the time) on-disk. Fitting that dataset in-memory is not possible because it is just too big: the maximum in-memory matrix size is `2^31-1 ≈ 2.1e9` is elements, the `pbmc68k` dataset however has nearly 100 million elements more than that.
192
+The method scales linearly, with the number of rows and columns in the dataset. For example: fitting the full `pbmc4k` dataset with subsampling on a modern MacBook Pro in-memory takes ~1 minute and on-disk a little over 4 minutes. Fitting the `pbmc68k` (17x the size) takes ~73 minutes (17x the time) on-disk.
191 193
 
192 194
 
193 195
 ## Differential expression analysis
... ...
@@ -229,7 +231,7 @@ abline(0,1)
229 231
 
230 232
 #### Pseudobulk
231 233
 
232
-Be very careful how you interpret the p-values of a single cell experiment. Cells that come from one individual are not independent replicates. That means that you cannot turn your RNA-seq experiment with 3 treated and 3 control samples into a 3000 vs 3000 experiment through single cell experiment with 1000 cells per sample. The actual unit of replication are still the 3 samples in each condition. 
234
+Be very careful how you interpret the p-values of a single cell experiment. Cells that come from one individual are not independent replicates. That means that you cannot turn your RNA-seq experiment with 3 treated and 3 control samples into a 3000 vs 3000 experiment by measuring 1000 cells per sample. The actual unit of replication are still the 3 samples in each condition. 
233 235
 
234 236
 Nonetheless, single cell data is valuable because it allows you to compare the effect of a treatment on specific cell types. The simplest way to do such a test is called pseudobulk. This means that the data is subset to the cells of a specific cell type. Then the counts of cells from the same sample are combined to form a "pseudobulk" sample. The `test_de()` function of glmGamPoi supports this feature directly through the `pseudobulk_by` and `subset_to` parameters:
235 237
 
Browse code

Add hexsticker

const-ae authored on 06/08/2020 08:30:40
Showing 1 changed files
... ...
@@ -13,15 +13,17 @@ knitr::opts_chunk$set(
13 13
 )
14 14
 set.seed(2)
15 15
 ```
16
-# glmGamPoi
16
+
17
+# glmGamPoi <a href='https://github.com/const-ae/glmGamPoi'><img src='man/figures/logo.svg' align="right" height="139" /></a>
17 18
 
18 19
 <!-- badges: start -->
19 20
 [![codecov](https://codecov.io/gh/const-ae/glmGamPoi/branch/master/graph/badge.svg)](https://codecov.io/gh/const-ae/glmGamPoi)
20 21
 <!-- badges: end -->
21 22
 
22
-
23 23
 > Fit Gamma-Poisson Generalized Linear Models Reliably.
24 24
 
25
+Pronounciation: [`dʒi əl əm ɡam ˈpwɑ`](http://ipa-reader.xyz/?text=d%CA%92i%20%C9%99l%20%C9%99m%20%C9%A1am%20%CB%88pw%C9%91)
26
+
25 27
 The core design aims of `gmlGamPoi` are:
26 28
 
27 29
 * Fit the Gamma-Poisson models on arbitrarily large or small datasets
Browse code

Update README and vignette

const-ae authored on 05/08/2020 11:37:54
Showing 1 changed files
... ...
@@ -132,7 +132,7 @@ bench::mark(
132 132
 )
133 133
 ```
134 134
 
135
-On this dataset, `glmGamPoi` is more than 6 times faster than `edgeR` and more than 20 times faster than `DESeq2`. `glmGamPoi` does **not** use approximations to achieve this performance increase. The performance comes from an optimized algorithm for inferring the overdispersion for each gene. It is tuned for datasets typically encountered in single RNA-seq with many samples and many small counts, by avoiding duplicate calculations.
135
+On this dataset, `glmGamPoi` is more than 5 times faster than `edgeR` and more than 18 times faster than `DESeq2`. `glmGamPoi` does **not** use approximations to achieve this performance increase. The performance comes from an optimized algorithm for inferring the overdispersion for each gene. It is tuned for datasets typically encountered in single RNA-seq with many samples and many small counts, by avoiding duplicate calculations.
136 136
 
137 137
 To demonstrate that the method does not sacrifice accuracy, I compare the parameters that each method estimates. The means and β coefficients are identical, but that the overdispersion estimates from `glmGamPoi` are more reliable:
138 138
 
... ...
@@ -144,13 +144,12 @@ fit <- glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE)
144 144
 dds <- DESeq2::DESeqDataSetFromMatrix(pbmcs_subset, 
145 145
                         colData = data.frame(name = seq_len(4340)),
146 146
                         design = ~ 1)
147
-dds <- DESeq2::estimateSizeFactors(dds, "poscounts")
147
+sizeFactors(dds)  <- fit$size_factors
148 148
 dds <- DESeq2::estimateDispersions(dds, quiet = TRUE)
149 149
 dds <- DESeq2::nbinomWaldTest(dds, minmu = 1e-6)
150 150
 
151 151
 #edgeR
152
-edgeR_data <- edgeR::DGEList(pbmcs_subset)
153
-edgeR_data <- edgeR::calcNormFactors(edgeR_data)
152
+edgeR_data <- edgeR::DGEList(pbmcs_subset, lib.size = fit$size_factors)
154 153
 edgeR_data <- edgeR::estimateDisp(edgeR_data, model_matrix)
155 154
 edgeR_fit <- edgeR::glmFit(edgeR_data, design = model_matrix)
156 155
 ```
... ...
@@ -165,7 +164,7 @@ plot(fit$Beta[,1], edgeR_fit$unshrunk.coefficients[,1], pch = 16,
165 164
      main = "Beta Coefficients", xlab = "glmGamPoi", ylab = "edgeR")
166 165
 abline(0,1)
167 166
 
168
-plot(fit$Mu[,1], rowData(dds)$baseMean, pch = 16, log="xy",
167
+plot(fit$Mu[,1], assay(dds, "mu")[,1], pch = 16, log="xy",
169 168
      main = "Gene Mean", xlab = "glmGamPoi", ylab = "DESeq2")
170 169
 abline(0,1)
171 170
 plot(fit$Mu[,1], edgeR_fit$fitted.values[,1], pch = 16, log="xy",
Browse code

Update README and vignette (fixes #3)

* remove comment on size_factor that is outdated.
* add explanation of desgin = ~ 1

const-ae authored on 11/07/2020 12:47:28
Showing 1 changed files
... ...
@@ -62,7 +62,7 @@ To fit a single Gamma-Poisson GLM do:
62 62
 ```{r}
63 63
 # overdispersion = 1/size
64 64
 counts <- rnbinom(n = 10, mu = 5, size = 1/0.7)
65
-# size_factors = FALSE, because only a single GLM is fitted
65
+# design = ~ 1 means that an intercept-only model is fit
66 66
 fit <- glm_gp(counts, design = ~ 1)
67 67
 fit
68 68
 
Browse code

Update README with description of pseudobulk

const-ae authored on 18/06/2020 19:21:18
Showing 1 changed files
... ...
@@ -226,6 +226,24 @@ abline(0,1)
226 226
 ```
227 227
 
228 228
 
229
+#### Pseudobulk
230
+
231
+Be very careful how you interpret the p-values of a single cell experiment. Cells that come from one individual are not independent replicates. That means that you cannot turn your RNA-seq experiment with 3 treated and 3 control samples into a 3000 vs 3000 experiment through single cell experiment with 1000 cells per sample. The actual unit of replication are still the 3 samples in each condition. 
232
+
233
+Nonetheless, single cell data is valuable because it allows you to compare the effect of a treatment on specific cell types. The simplest way to do such a test is called pseudobulk. This means that the data is subset to the cells of a specific cell type. Then the counts of cells from the same sample are combined to form a "pseudobulk" sample. The `test_de()` function of glmGamPoi supports this feature directly through the `pseudobulk_by` and `subset_to` parameters:
234
+
235
+```{r}
236
+# say we have cell type labels for each cell and know from which sample they come originally
237
+sample_labels <- rep(paste0("sample_", 1:6), length = ncol(pbmcs_subset))
238
+cell_type_labels <- sample(c("T-cells", "B-cells", "Macrophages"), ncol(pbmcs_subset), replace = TRUE)
239
+
240
+test_de(fit, contrast = Group1 - Group2,
241
+        pseudobulk_by = sample_labels, 
242
+        subset_to = cell_type_labels == "T-cells",
243
+        n_max = 4, sort_by = pval, decreasing = FALSE)
244
+```
245
+
246
+
229 247
 # Session Info
230 248
 
231 249
 ```{r}
Browse code

Update README

const-ae authored on 09/06/2020 09:51:12
Showing 1 changed files
... ...
@@ -50,13 +50,20 @@ For the latest developments, see the `r BiocStyle::Githubpkg("const-ae/glmGamPoi
50 50
 
51 51
 # Example
52 52
 
53
+Load the glmGamPoi package
54
+
55
+```{r}
56
+library(glmGamPoi)
57
+```
58
+
59
+
53 60
 To fit a single Gamma-Poisson GLM do:
54 61
 
55 62
 ```{r}
56 63
 # overdispersion = 1/size
57 64
 counts <- rnbinom(n = 10, mu = 5, size = 1/0.7)
58 65
 # size_factors = FALSE, because only a single GLM is fitted
59
-fit <- glmGamPoi::glm_gp(counts, design = ~ 1)
66
+fit <- glm_gp(counts, design = ~ 1)
60 67
 fit
61 68
 
62 69
 # Internally fit is just a list:
... ...
@@ -86,7 +93,7 @@ pbmcs_subset
86 93
 I call `glm_gp()` to fit one GLM model for each gene and force the calculation to happen in memory.
87 94
 
88 95
 ```{r}
89
-fit <- glmGamPoi::glm_gp(pbmcs_subset, on_disk = FALSE)
96
+fit <- glm_gp(pbmcs_subset, on_disk = FALSE)
90 97
 summary(fit)
91 98
 ```
92 99
 
... ...
@@ -106,9 +113,9 @@ model_matrix <- matrix(1, nrow = ncol(pbmcs_subset))
106 113
 
107 114
 bench::mark(
108 115
   glmGamPoi_in_memory = {
109
-    glmGamPoi::glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE)
116
+    glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE)
110 117
   }, glmGamPoi_on_disk = {
111
-    glmGamPoi::glm_gp(pbmcs_subset, design = model_matrix, on_disk = TRUE)
118
+    glm_gp(pbmcs_subset, design = model_matrix, on_disk = TRUE)
112 119
   }, DESeq2 = suppressMessages({
113 120
     dds <- DESeq2::DESeqDataSetFromMatrix(pbmcs_subset,
114 121
                         colData = data.frame(name = seq_len(4340)),
... ...
@@ -131,7 +138,7 @@ To demonstrate that the method does not sacrifice accuracy, I compare the parame
131 138
 
132 139
 ```{r message=FALSE, warning=FALSE}
133 140
 # Results with my method
134
-fit <- glmGamPoi::glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE)
141
+fit <- glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE)
135 142
 
136 143
 # DESeq2
137 144
 dds <- DESeq2::DESeqDataSetFromMatrix(pbmcs_subset, 
... ...
@@ -191,9 +198,9 @@ The method scales linearly, with the number of rows and columns in the dataset.
191 198
 group <- sample(c("Group1", "Group2"), size = ncol(pbmcs_subset), replace = TRUE)
192 199
 
193 200
 # Fit model with group vector as design
194
-fit <- glmGamPoi::glm_gp(pbmcs_subset, design = group)
201
+fit <- glm_gp(pbmcs_subset, design = group)
195 202
 # Compare against model without group 
196
-res <- glmGamPoi::gampoi_test_qlr(pbmcs_subset, fit, reduced_design = ~ 1)
203
+res <- test_de(fit, reduced_design = ~ 1)
197 204
 # Look at first 6 genes
198 205
 head(res)
199 206
 ```
Browse code

Update README

const-ae authored on 03/06/2020 10:03:27
Showing 1 changed files
... ...
@@ -56,11 +56,11 @@ To fit a single Gamma-Poisson GLM do:
56 56
 # overdispersion = 1/size
57 57
 counts <- rnbinom(n = 10, mu = 5, size = 1/0.7)
58 58
 # size_factors = FALSE, because only a single GLM is fitted
59
-fit <- glmGamPoi::glm_gp(counts, design = ~ 1, size_factors = FALSE, overdispersion_shrinkage = FALSE)
59
+fit <- glmGamPoi::glm_gp(counts, design = ~ 1)
60 60
 fit
61 61
 
62 62
 # Internally fit is just a list:
63
-as.list(fit)
63
+as.list(fit)[1:2]
64 64
 ```
65 65
 
66 66
 The `glm_gp()` function returns a list with the results of the fit. Most importantly, it contains the estimates for the coefficients β and the overdispersion.
... ...
@@ -127,7 +127,7 @@ bench::mark(
127 127
 
128 128
 On this dataset, `glmGamPoi` is more than 6 times faster than `edgeR` and more than 20 times faster than `DESeq2`. `glmGamPoi` does **not** use approximations to achieve this performance increase. The performance comes from an optimized algorithm for inferring the overdispersion for each gene. It is tuned for datasets typically encountered in single RNA-seq with many samples and many small counts, by avoiding duplicate calculations.
129 129
 
130
-To demonstrate that the method is not sacrificing accuracy, I compare the parameters that each method estimates. I find that means and β coefficients are identical, but that the estimates of the overdispersion estimates from `glmGamPoi` are more reliable:
130
+To demonstrate that the method does not sacrifice accuracy, I compare the parameters that each method estimates. The means and β coefficients are identical, but that the overdispersion estimates from `glmGamPoi` are more reliable:
131 131
 
132 132
 ```{r message=FALSE, warning=FALSE}
133 133
 # Results with my method
... ...
@@ -182,6 +182,43 @@ I am comparing the gene-wise estimates of the coefficients from all three method
182 182
 The method scales linearly, with the number of rows and columns in the dataset. For example: fitting the full `pbmc4k` dataset with subsampling on a modern MacBook Pro in-memory takes ~1 minute and on-disk a little over 4 minutes. Fitting the `pbmc68k` (17x the size) takes ~73 minutes (17x the time) on-disk. Fitting that dataset in-memory is not possible because it is just too big: the maximum in-memory matrix size is `2^31-1 ≈ 2.1e9` is elements, the `pbmc68k` dataset however has nearly 100 million elements more than that.
183 183
 
184 184
 
185
+## Differential expression analysis
186
+
187
+`glmGamPoi` provides an interface to do quasi-likelihood ratio testing to identify differentially expressed genes:
188
+
189
+```{r}
190
+# Create random categorical assignment to demonstrate DE
191
+group <- sample(c("Group1", "Group2"), size = ncol(pbmcs_subset), replace = TRUE)
192
+
193
+# Fit model with group vector as design
194
+fit <- glmGamPoi::glm_gp(pbmcs_subset, design = group)
195
+# Compare against model without group 
196
+res <- glmGamPoi::gampoi_test_qlr(pbmcs_subset, fit, reduced_design = ~ 1)
197
+# Look at first 6 genes
198
+head(res)
199
+```
200
+
201
+The p-values agree well with the ones that `edgeR` is calculating. This is because `glmGamPoi` uses the same framework of quasi-likelihood ratio tests that was invented by `edgeR` and is described in [Lund et al. (2012)](https://doi.org/10.1515/1544-6115.1826).
202
+
203
+```{r fig.height=3, fig.width=3, warning=FALSE, message=FALSE}
204
+model_matrix <- model.matrix(~ group, data = data.frame(group = group))
205
+edgeR_data <- edgeR::DGEList(pbmcs_subset)
206
+edgeR_data <- edgeR::calcNormFactors(edgeR_data)
207
+edgeR_data <- edgeR::estimateDisp(edgeR_data, design = model_matrix)
208
+edgeR_fit <- edgeR::glmQLFit(edgeR_data, design = model_matrix)
209
+edgeR_test <- edgeR::glmQLFTest(edgeR_fit, coef = 2)
210
+edgeR_res <- edgeR::topTags(edgeR_test, sort.by = "none", n = nrow(pbmcs_subset))
211
+```
212
+
213
+
214
+```{r fig.height=4, fig.width=3, warning=FALSE, message=FALSE, echo = FALSE}
215
+par(cex.main = 2, cex.lab = 1.5)
216
+plot(res$pval, edgeR_res$table$PValue, pch = 16, log = "xy",
217
+     main = "p-values", xlab = "glmGamPoi", ylab = "edgeR")
218
+abline(0,1)
219
+```
220
+
221
+
185 222
 # Session Info
186 223
 
187 224
 ```{r}
Browse code

Update README

const-ae authored on 29/05/2020 11:56:58
Showing 1 changed files
... ...
@@ -56,11 +56,11 @@ To fit a single Gamma-Poisson GLM do:
56 56
 # overdispersion = 1/size
57 57
 counts <- rnbinom(n = 10, mu = 5, size = 1/0.7)
58 58
 # size_factors = FALSE, because only a single GLM is fitted
59
-fit <- glmGamPoi::glm_gp(counts, design = ~ 1, size_factors = FALSE)
59
+fit <- glmGamPoi::glm_gp(counts, design = ~ 1, size_factors = FALSE, overdispersion_shrinkage = FALSE)
60 60
 fit
61 61
 
62 62
 # Internally fit is just a list:
63
-c(fit)
63
+as.list(fit)
64 64
 ```
65 65
 
66 66
 The `glm_gp()` function returns a list with the results of the fit. Most importantly, it contains the estimates for the coefficients β and the overdispersion.
... ...
@@ -96,8 +96,10 @@ summary(fit)
96 96
 
97 97
 I compare my method (in-memory and on-disk) with `r BiocStyle::Biocpkg("DESeq2")` and `r BiocStyle::Biocpkg("edgeR")`. Both are classical methods for analyzing RNA-Seq datasets and have been around for almost 10 years. Note that both tools can do a lot more than just fitting the Gamma-Poisson model, so this benchmark only serves to give a general impression of the performance.
98 98
 
99
+
100
+
99 101
 ```{r, warning=FALSE}
100
-# Explicitly realize count matrix in memory
102
+# Explicitly realize count matrix in memory so that it is a fair comparison
101 103
 pbmcs_subset <- as.matrix(assay(pbmcs_subset))
102 104
 model_matrix <- matrix(1, nrow = ncol(pbmcs_subset))
103 105
 
... ...
@@ -119,7 +121,7 @@ bench::mark(
119 121
     edgeR_data <- edgeR::calcNormFactors(edgeR_data)
120 122
     edgeR_data <- edgeR::estimateDisp(edgeR_data, model_matrix)
121 123
     edgeR_fit <- edgeR::glmFit(edgeR_data, design = model_matrix)
122
-  }, check = FALSE
124
+  }, check = FALSE, min_iterations = 3
123 125
 )
124 126
 ```
125 127
 
Browse code

Remove Github Actions Badge

const-ae authored on 30/04/2020 16:11:14
Showing 1 changed files
... ...
@@ -16,7 +16,6 @@ set.seed(2)
16 16
 # glmGamPoi
17 17
 
18 18
 <!-- badges: start -->
19
-[![R build status](https://github.com/const-ae/glmGamPoi/workflows/R-CMD-check/badge.svg)](https://github.com/const-ae/glmGamPoi/actions)
20 19
 [![codecov](https://codecov.io/gh/const-ae/glmGamPoi/branch/master/graph/badge.svg)](https://codecov.io/gh/const-ae/glmGamPoi)
21 20
 <!-- badges: end -->
22 21
 
Browse code

Update README.Rmd

const-ae authored on 30/04/2020 15:58:51
Showing 1 changed files
... ...
@@ -37,7 +37,7 @@ The core design aims of `gmlGamPoi` are:
37 37
 
38 38
 # Installation
39 39
 
40
-You can install the release version of `r BiocStyle::Biocpkg("glmGamPoi")` from BioConductor:
40
+You can install the release version of *[glmGamPoi](https://bioconductor.org/packages/glmGamPoi)* from BioConductor:
41 41
 
42 42
 ``` r
43 43
 if (!requireNamespace("BiocManager", quietly = TRUE))
Browse code

Use same code in README and vignette

const-ae authored on 27/03/2020 16:01:37
Showing 1 changed files
... ...
@@ -11,42 +11,180 @@ knitr::opts_chunk$set(
11 11
   fig.path = "man/figures/README-"
12 12
   # out.width = "100%"
13 13
 )
14
-set.seed(1)
14
+set.seed(2)
15 15
 ```
16 16
 # glmGamPoi
17 17
 
18 18
 <!-- badges: start -->
19
-[![R build status](https://github.com/const-ae/glmGamPoi/workflows/R-CMD-check/badge.svg)](https://github.com/const-ae/glmGamPoi)
19
+[![R build status](https://github.com/const-ae/glmGamPoi/workflows/R-CMD-check/badge.svg)](https://github.com/const-ae/glmGamPoi/actions)
20 20
 [![codecov](https://codecov.io/gh/const-ae/glmGamPoi/branch/master/graph/badge.svg)](https://codecov.io/gh/const-ae/glmGamPoi)
21 21
 <!-- badges: end -->
22 22
 
23
-```{r child = "man/fragments/paragraph-intro.Rmd"}
24
-```
25 23
 
24
+> Fit Gamma-Poisson Generalized Linear Models Reliably.
25
+
26
+The core design aims of `gmlGamPoi` are:
27
+
28
+* Fit the Gamma-Poisson models on arbitrarily large or small datasets
29
+* Be faster than alternative methods, such as `DESeq2` or `edgeR`
30
+* Calculate exact or approximate results based on user preference
31
+* Support in memory or on-disk data
32
+* Follow established conventions around tools for RNA-seq analysis
33
+* Present a simple user-interface
34
+* Avoid unnecessary dependencies
35
+* Make integration into other tools easy
36
+
37
+
38
+# Installation
26 39
 
40
+You can install the release version of `r BiocStyle::Biocpkg("glmGamPoi")` from BioConductor:
27 41
 
28
-## Installation
42
+``` r
43
+if (!requireNamespace("BiocManager", quietly = TRUE))
44
+    install.packages("BiocManager")
29 45
 
30
-```{r child = "man/fragments/paragraph-installation.Rmd"}
46
+BiocManager::install("glmGamPoi")
31 47
 ```
32 48
 
49
+For the latest developments, see the `r BiocStyle::Githubpkg("const-ae/glmGamPoi", "GitHub")` repo.
33 50
 
34
-## Example
35 51
 
36
-```{r child = "man/fragments/paragraph-simple_example.Rmd"}
52
+# Example
53
+
54
+To fit a single Gamma-Poisson GLM do:
55
+
56
+```{r}
57
+# overdispersion = 1/size
58
+counts <- rnbinom(n = 10, mu = 5, size = 1/0.7)
59
+# size_factors = FALSE, because only a single GLM is fitted
60
+fit <- glmGamPoi::glm_gp(counts, design = ~ 1, size_factors = FALSE)
61
+fit
62
+
63
+# Internally fit is just a list:
64
+c(fit)
37 65
 ```
38 66
 
67
+The `glm_gp()` function returns a list with the results of the fit. Most importantly, it contains the estimates for the coefficients β and the overdispersion.
39 68
 
40 69
 Fitting repeated Gamma-Poisson GLMs for each gene of a single cell dataset is just as easy:
41 70
 
42
-```{r child = "man/fragments/paragraph-pbmc4k_example.Rmd"}
71
+I will first load an example dataset using the `TENxPBMCData` package. The dataset has 33,000 genes and 4340 cells. It takes roughly 1.5 minutes to fit the Gamma-Poisson model on the full dataset. For demonstration purposes, I will subset the dataset to 300 genes, but keep the 4340 cells:
72
+```{r, warning=FALSE, message = FALSE}
73
+library(SummarizedExperiment)
74
+library(DelayedMatrixStats)
75
+```
76
+
77
+```{r}
78
+# The full dataset with 33,000 genes and 4340 cells
79
+# The first time this is run, it will download the data
80
+pbmcs <- TENxPBMCData::TENxPBMCData("pbmc4k")
81
+# I want genes where at least some counts are non-zero
82
+non_empty_rows <- which(rowSums2(assay(pbmcs)) > 0)
83
+pbmcs_subset <- pbmcs[sample(non_empty_rows, 300), ]
84
+pbmcs_subset
43 85
 ```
44 86
 
87
+I call `glm_gp()` to fit one GLM model for each gene and force the calculation to happen in memory.
88
+
89
+```{r}
90
+fit <- glmGamPoi::glm_gp(pbmcs_subset, on_disk = FALSE)
91
+summary(fit)
92
+```
93
+
94
+
95
+
96
+# Benchmark
97
+
98
+I compare my method (in-memory and on-disk) with `r BiocStyle::Biocpkg("DESeq2")` and `r BiocStyle::Biocpkg("edgeR")`. Both are classical methods for analyzing RNA-Seq datasets and have been around for almost 10 years. Note that both tools can do a lot more than just fitting the Gamma-Poisson model, so this benchmark only serves to give a general impression of the performance.
99
+
100
+```{r, warning=FALSE}
101
+# Explicitly realize count matrix in memory
102
+pbmcs_subset <- as.matrix(assay(pbmcs_subset))
103
+model_matrix <- matrix(1, nrow = ncol(pbmcs_subset))
104
+
105
+
106
+bench::mark(
107
+  glmGamPoi_in_memory = {
108
+    glmGamPoi::glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE)
</