README.Rmd
f99a7093
 ---
 output: github_document
 ---
 
 <!-- README.md is generated from README.Rmd. Please edit that file -->
 
f95f49a7
 ```{r init_chunk, include = FALSE}
f99a7093
 knitr::opts_chunk$set(
   collapse = TRUE,
   comment = "#>",
f61e9312
   fig.path = "man/figures/README-"
   # out.width = "100%"
f99a7093
 )
b58ed312
 set.seed(2)
f99a7093
 ```
2bf4e41b
 
 # glmGamPoi <a href='https://github.com/const-ae/glmGamPoi'><img src='man/figures/logo.svg' align="right" height="139" /></a>
f99a7093
 
 <!-- badges: start -->
ebec0841
 [![codecov](https://codecov.io/gh/const-ae/glmGamPoi/branch/master/graph/badge.svg)](https://codecov.io/gh/const-ae/glmGamPoi)
f99a7093
 <!-- badges: end -->
 
b58ed312
 > Fit Gamma-Poisson Generalized Linear Models Reliably.
 
2bf4e41b
 Pronounciation: [`dʒi əl əm ɡam ˈpwɑ`](http://ipa-reader.xyz/?text=d%CA%92i%20%C9%99l%20%C9%99m%20%C9%A1am%20%CB%88pw%C9%91)
 
3084c144
 The core design aims of `glmGamPoi` are:
b58ed312
 
3084c144
 * Fit Gamma-Poisson models on arbitrarily large or small datasets
b58ed312
 * Be faster than alternative methods, such as `DESeq2` or `edgeR`
 * Calculate exact or approximate results based on user preference
3084c144
 * Support in-memory or on-disk data
b58ed312
 * Follow established conventions around tools for RNA-seq analysis
 * Present a simple user-interface
 * Avoid unnecessary dependencies
 * Make integration into other tools easy
 
 
 # Installation
f99a7093
 
3fd00537
 You can install the release version of *[glmGamPoi](https://bioconductor.org/packages/glmGamPoi)* from BioConductor:
f99a7093
 
b58ed312
 ``` r
 if (!requireNamespace("BiocManager", quietly = TRUE))
     install.packages("BiocManager")
f99a7093
 
b58ed312
 BiocManager::install("glmGamPoi")
f99a7093
 ```
 
b58ed312
 For the latest developments, see the `r BiocStyle::Githubpkg("const-ae/glmGamPoi", "GitHub")` repo.
f99a7093
 
19270ae4
 If you use this package in a scientific publication, please cite:
 
 > glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data  
 > Constantin Ahlmann-Eltze, Wolfgang Huber  
533ca910
 > Bioinformatics; 2020-12-09; doi: https://doi.org/10.1093/bioinformatics/btaa1009
f99a7093
 
b58ed312
 # Example
 
f627ade7
 Load the glmGamPoi package
 
f95f49a7
 ```{r load_glmGamPoi}
f627ade7
 library(glmGamPoi)
 ```
 
 
b58ed312
 To fit a single Gamma-Poisson GLM do:
 
f95f49a7
 ```{r single_gp_fit}
b58ed312
 # overdispersion = 1/size
 counts <- rnbinom(n = 10, mu = 5, size = 1/0.7)
3084c144
 
e196029d
 # design = ~ 1 means that an intercept-only model is fit
f627ade7
 fit <- glm_gp(counts, design = ~ 1)
b58ed312
 fit
 
 # Internally fit is just a list:
891eda3c
 as.list(fit)[1:2]
f99a7093
 ```
 
b58ed312
 The `glm_gp()` function returns a list with the results of the fit. Most importantly, it contains the estimates for the coefficients β and the overdispersion.
f99a7093
 
 Fitting repeated Gamma-Poisson GLMs for each gene of a single cell dataset is just as easy:
 
b58ed312
 I will first load an example dataset using the `TENxPBMCData` package. The dataset has 33,000 genes and 4340 cells. It takes roughly 1.5 minutes to fit the Gamma-Poisson model on the full dataset. For demonstration purposes, I will subset the dataset to 300 genes, but keep the 4340 cells:
f95f49a7
 ```{r load_additional_packages, warning=FALSE, message = FALSE}
b58ed312
 library(SummarizedExperiment)
 library(DelayedMatrixStats)
 ```
 
f95f49a7
 ```{r load_pbmc_data}
b58ed312
 # The full dataset with 33,000 genes and 4340 cells
 # The first time this is run, it will download the data
 pbmcs <- TENxPBMCData::TENxPBMCData("pbmc4k")
3084c144
 
b58ed312
 # I want genes where at least some counts are non-zero
 non_empty_rows <- which(rowSums2(assay(pbmcs)) > 0)
 pbmcs_subset <- pbmcs[sample(non_empty_rows, 300), ]
 pbmcs_subset
f99a7093
 ```
 
b58ed312
 I call `glm_gp()` to fit one GLM model for each gene and force the calculation to happen in memory.
 
f95f49a7
 ```{r simple_fit}
f627ade7
 fit <- glm_gp(pbmcs_subset, on_disk = FALSE)
b58ed312
 summary(fit)
 ```
 
 
 
 # Benchmark
 
 I compare my method (in-memory and on-disk) with `r BiocStyle::Biocpkg("DESeq2")` and `r BiocStyle::Biocpkg("edgeR")`. Both are classical methods for analyzing RNA-Seq datasets and have been around for almost 10 years. Note that both tools can do a lot more than just fitting the Gamma-Poisson model, so this benchmark only serves to give a general impression of the performance.
 
5d2b89e8
 
 
f95f49a7
 ```{r run_bench_mark, warning=FALSE}
5d2b89e8
 # Explicitly realize count matrix in memory so that it is a fair comparison
b58ed312
 pbmcs_subset <- as.matrix(assay(pbmcs_subset))
 model_matrix <- matrix(1, nrow = ncol(pbmcs_subset))
 
 
 bench::mark(
   glmGamPoi_in_memory = {
f627ade7
     glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE)
b58ed312
   }, glmGamPoi_on_disk = {
f627ade7
     glm_gp(pbmcs_subset, design = model_matrix, on_disk = TRUE)
b58ed312
   }, DESeq2 = suppressMessages({
     dds <- DESeq2::DESeqDataSetFromMatrix(pbmcs_subset,
                         colData = data.frame(name = seq_len(4340)),
                         design = ~ 1)
     dds <- DESeq2::estimateSizeFactors(dds, "poscounts")
     dds <- DESeq2::estimateDispersions(dds, quiet = TRUE)
     dds <- DESeq2::nbinomWaldTest(dds, minmu = 1e-6)
   }), edgeR = {
     edgeR_data <- edgeR::DGEList(pbmcs_subset)
     edgeR_data <- edgeR::calcNormFactors(edgeR_data)
     edgeR_data <- edgeR::estimateDisp(edgeR_data, model_matrix)
     edgeR_fit <- edgeR::glmFit(edgeR_data, design = model_matrix)
5d2b89e8
   }, check = FALSE, min_iterations = 3
b58ed312
 )
 ```
 
00ef94a3
 On this dataset, `glmGamPoi` is more than 5 times faster than `edgeR` and more than 18 times faster than `DESeq2`. `glmGamPoi` does **not** use approximations to achieve this performance increase. The performance comes from an optimized algorithm for inferring the overdispersion for each gene. It is tuned for datasets typically encountered in single RNA-seq with many samples and many small counts, by avoiding duplicate calculations.
b58ed312
 
891eda3c
 To demonstrate that the method does not sacrifice accuracy, I compare the parameters that each method estimates. The means and β coefficients are identical, but that the overdispersion estimates from `glmGamPoi` are more reliable:
b58ed312
 
f95f49a7
 ```{r compare_with_deseq2_and_edger, message=FALSE, warning=FALSE}
b58ed312
 # Results with my method
f627ade7
 fit <- glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE)
b58ed312
 
 # DESeq2
 dds <- DESeq2::DESeqDataSetFromMatrix(pbmcs_subset, 
                         colData = data.frame(name = seq_len(4340)),
                         design = ~ 1)
00ef94a3
 sizeFactors(dds)  <- fit$size_factors
b58ed312
 dds <- DESeq2::estimateDispersions(dds, quiet = TRUE)
 dds <- DESeq2::nbinomWaldTest(dds, minmu = 1e-6)
 
 #edgeR
00ef94a3
 edgeR_data <- edgeR::DGEList(pbmcs_subset, lib.size = fit$size_factors)
b58ed312
 edgeR_data <- edgeR::estimateDisp(edgeR_data, model_matrix)
 edgeR_fit <- edgeR::glmFit(edgeR_data, design = model_matrix)
 ```
 
 
 ```{r coefficientComparison, fig.height=5, fig.width=10, warning=FALSE, echo = FALSE}
 par(mfrow = c(2, 4), cex.main = 2, cex.lab = 1.5)
 plot(fit$Beta[,1], coef(dds)[,1] / log2(exp(1)), pch = 16, 
      main = "Beta Coefficients", xlab = "glmGamPoi", ylab = "DESeq2")
 abline(0,1)
 plot(fit$Beta[,1], edgeR_fit$unshrunk.coefficients[,1], pch = 16,
      main = "Beta Coefficients", xlab = "glmGamPoi", ylab = "edgeR")
 abline(0,1)
 
00ef94a3
 plot(fit$Mu[,1], assay(dds, "mu")[,1], pch = 16, log="xy",
b58ed312
      main = "Gene Mean", xlab = "glmGamPoi", ylab = "DESeq2")
 abline(0,1)
 plot(fit$Mu[,1], edgeR_fit$fitted.values[,1], pch = 16, log="xy",
      main = "Gene Mean", xlab = "glmGamPoi", ylab = "edgeR")
 abline(0,1)
 
 plot(fit$overdispersions, rowData(dds)$dispGeneEst, pch = 16, log="xy",
      main = "Overdispersion", xlab = "glmGamPoi", ylab = "DESeq2")
 abline(0,1)
 plot(fit$overdispersions, edgeR_fit$dispersion, pch = 16, log="xy",
      main = "Overdispersion", xlab = "glmGamPoi", ylab = "edgeR")
 abline(0,1)
 
 ```
 
 I am comparing the gene-wise estimates of the coefficients from all three methods. Points on the diagonal line are identical. The inferred Beta coefficients and gene means agree well between the methods, however the overdispersion differs quite a bit. `DESeq2` has problems estimating most of the overdispersions and sets them to `1e-8`. `edgeR` only approximates the overdispersions which explains the variation around the overdispersions calculated with `glmGamPoi`. 
 
 
 ## Scalability
 
3084c144
 The method scales linearly, with the number of rows and columns in the dataset. For example: fitting the full `pbmc4k` dataset with subsampling on a modern MacBook Pro in-memory takes ~1 minute and on-disk a little over 4 minutes. Fitting the `pbmc68k` (17x the size) takes ~73 minutes (17x the time) on-disk.
f99a7093
 
 
891eda3c
 ## Differential expression analysis
 
782bbe6f
 `glmGamPoi` provides an interface to do quasi-likelihood ratio testing to identify differentially expressed genes. To demonstrate this feature, we will use the data from [Kang _et al._ (2018)](https://www.ncbi.nlm.nih.gov/pubmed/29227470) provided by the `MuscData` package. This is a single cell dataset of 8 Lupus patients for which 10x droplet-based scRNA-seq was performed before and after treatment with interferon beta. The `SingleCellExperiment` object conveniently provides the patient id (`ind`), treatment status (`stim`) and cell type (`cell`):
891eda3c
 
f95f49a7
 ```{r load_kang_data}
782bbe6f
 sce <- muscData::Kang18_8vs8()
 colData(sce)
891eda3c
 ```
 
782bbe6f
 For demonstration purpose, I will work on a subset of the genes and cells:
891eda3c
 
f95f49a7
 ```{r subset_kang_data}
782bbe6f
 set.seed(1)
 # Take highly expressed genes and proper cells:
 sce_subset <- sce[rowSums(counts(sce)) > 100, 
                   sample(which(sce$multiplets == "singlet" & 
                               ! is.na(sce$cell) &
                               sce$cell %in% c("CD4 T cells", "B cells", "NK cells")), 
                          1000)]
 # Convert counts to dense matrix
 counts(sce_subset) <- as.matrix(counts(sce_subset))
 # Remove empty levels because glm_gp() will complain otherwise
 sce_subset$cell <- droplevels(sce_subset$cell)
891eda3c
 ```
 
ff7b7ac7
 In the first step we will aggregate the counts of each patient, condition and cell type and form pseudobulk samples. This ensures that I get reliable p-value by treating each patient as a replicate and not each cell.
 
 ```{r}
 sce_reduced <- pseudobulk(sce_subset, group_by = vars(ind, stim, cell))
 ```
 
 
782bbe6f
 We will identify which genes in CD4 positive T-cells are changed most by the treatment. We will fit a full model including the interaction term `stim:cell`. The interaction term will help us identify cell type specific responses to the treatment:
891eda3c
 
f95f49a7
 ```{r kang_fit}
ff7b7ac7
 fit <- glm_gp(sce_reduced, design = ~ cell + stim +  stim:cell - 1,
782bbe6f
               reference_level = "NK cells")
 summary(fit)
 ```
 
 
 To see how the coefficient of our model are called, we look at the `colnames(fit$Beta)`:
 
f95f49a7
 ```{r show_beta_colnames}
782bbe6f
 colnames(fit$Beta)
 ```
 
 
ff7b7ac7
 In our example, we want to find the genes that change specifically in T cells. Finding cell type specific responses to a treatment is a big advantage of single cell data over bulk data. 
782bbe6f
 
f95f49a7
 ```{r do_pseudobulk}
782bbe6f
 # The contrast argument specifies what we want to compare
 # We test the expression difference of stimulated and control T-cells
7502131f
 de_res <- test_de(fit, contrast = cond(cell = "CD4 T cells", stim = "ctrl") - cond(cell = "CD4 T cells", stim = "stim")) 
782bbe6f
 
 # Most different genes
 head(de_res[order(de_res$pval), ])
 ```
 
ff7b7ac7
 The test is successful and we identify interesting genes that are differentially expressed in interferon-stimulated T cells:  _IFI6_, _IFIT3_ and _ISG15_ literally stand for _Interferon Induced/Stimulated Protein_.
782bbe6f
 
 To get a more complete overview of the results, we can make a volcano plot that compares the log2-fold change (LFC) vs the logarithmized p-values.
 
f95f49a7
 ```{r make_volcano_plot}
782bbe6f
 library(ggplot2)
 ggplot(de_res, aes(x = lfc, y = -log10(pval))) +
   geom_point(size = 0.6, aes(color = adj_pval < 0.1)) +
64b4c333
   ggtitle("Volcano Plot", "Genes that change most through interferon-beta treatment in T cells")
891eda3c
 ```
 
 
782bbe6f
 Another important task in single cell data analysis is the identification of marker genes for cell clusters. For this we can also use our Gamma-Poisson fit. 
 
ff7b7ac7
 Let's assume we want to find genes that differ between T cells and the B cells. We can directly compare the corresponding coefficients and find genes that differ in the control condition (this time not accounting for the pseudo-replication structure):
1031f621
 
f95f49a7
 ```{r find_marker_genes}
ff7b7ac7
 fit_full <- glm_gp(sce_subset, design = ~ cell + stim +  stim:cell - 1,
                    reference_level = "NK cells")
 marker_genes <- test_de(fit_full, `cellCD4 T cells` - `cellB cells`, sort_by = pval)
782bbe6f
 head(marker_genes)
 ```
1031f621
 
782bbe6f
 If we want find genes that differ in the stimulated condition, we just include the additional coefficients in the contrast:
1031f621
 
f95f49a7
 ```{r do_signif_genes}
ff7b7ac7
 marker_genes2 <- test_de(fit_full, (`cellCD4 T cells` + `cellCD4 T cells:stimstim`) - 
782bbe6f
                                (`cellB cells` + `cellB cells:stimstim`), 
                         sort_by = pval)
 
 head(marker_genes2)
1031f621
 ```
 
64b4c333
 We identify many genes related to the human leukocyte antigen (HLA) system that is important for antigen presenting cells like B-cells, but are not expressed by T helper cells. The plot below shows the expression differences.
782bbe6f
 
 A note of caution: applying `test_de()` to single cell data without the pseudobulk gives overly optimistic p-values. This is due to the fact that cells from the same sample are not independent replicates! It can still be fine to use the method for identifying marker genes, as long as one is aware of the difficulties interpreting the results.
 
f95f49a7
 ```{r plot_marker_genes}
782bbe6f
 # Create a data.frame with the expression values, gene names, and cell types
 tmp <- data.frame(gene = rep(marker_genes$name[1:6], times = ncol(sce_subset)),
                   expression = c(counts(sce_subset)[marker_genes$name[1:6], ]),
                   celltype = rep(sce_subset$cell, each = 6))
 
 ggplot(tmp, aes(x = celltype, y = expression)) +
   geom_jitter(height = 0.1) +
   stat_summary(geom = "crossbar", fun = "mean", color = "red") +
   facet_wrap(~ gene, scales = "free_y") +
   ggtitle("Marker genes of B vs. T cells")
 ```
 
 
 
f67d3050
 # Acknowlegments
782bbe6f
 
f67d3050
 This work was supported by the EMBL International PhD Programme and the European Research Council Synergy grant DECODE under grant agreement No. 810296.
782bbe6f
 
 
 
1031f621
 
b58ed312
 # Session Info
f99a7093
 
f95f49a7
 ```{r print_sessionInfo}
b58ed312
 sessionInfo()
f61e9312
 ```
 
f99a7093