# celda: CEllular Latent Dirichlet Allocation
"celda" stands for "**CE**llular **L**atent **D**irichlet **A**llocation", which is a suite of Bayesian hierarchical models and supporting functions to perform gene and cell clustering for count data generated by single cell RNA-seq platforms. This algorithm is an extension of the Latent Dirichlet Allocation (LDA) topic modeling framework that has been popular in text mining applications. Celda has advantages over other clustering frameworks:
1. Celda can simultaneously cluster genes into transcriptional states and cells into subpopulations.
2. Celda uses count-based Dirichlet-multinomial distributions, so no additional normalization is required for 3' DGE single-cell RNA-seq data.
3. Models of this type have shown good performance with sparse data.
## Installation Instructions
To install the most recent release of celda (used in the preprint version of the celda paper) via devtools:
The most up-to-date (but potentially less stable) version of celda can similarly be installed with:
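
Both installs can be sketched as follows, assuming the package is hosted in the `campbio/celda` GitHub repository (the one linked for the issue tracker). The release tag name `v0.4` is an assumption based on the release mentioned in this README and is not confirmed; check the repository's tags before relying on it. Run only one of the two options.

```r
# Install devtools first if it is not already available.
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Option 1: a tagged release.
# Assumption: "v0.4" is a guessed tag name, not confirmed by this README.
release_ref <- "v0.4"
devtools::install_github("campbio/celda", ref = release_ref)

# Option 2: the most recent (potentially less stable) version.
# Omitting `ref` installs from the repository's default branch.
devtools::install_github("campbio/celda")
```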
**NOTE** On macOS, `devtools::install_github()` requires **libgit2** to be installed. It can be installed via Homebrew:

```
brew install libgit2
```
## Examples and vignettes
Vignettes are available in the package.
An example analysis of RNA-seq data with celda can be opened with `vignette("celda-analysis")`.
### Decontamination with DecontX
In droplet-based systems, contamination causes genes that are highly expressed in one cell cluster to be detected at low levels in other clusters. DecontX decomposes an observed count matrix into a decontaminated expression matrix and a contamination matrix. Besides the count matrix, the only other parameter needed is a vector of cell cluster labels.
The following simulates two 300 (gene) x 100 (cell) count matrices from 3 different cell types, with total reads per cell ranging from 5000 to 40000. One is the true expression matrix (`rmat`) and the other is the contamination count matrix (`cmat`):

```r
library(celda)

sim.con <- simulateContaminatedMatrix(C = 100, G = 300, K = 3, N.Range = c(5000, 40000), seed = 9124)
# Per-cell share of counts that come from contamination
true.contamination.percentage <- colSums(sim.con$cmat) / colSums(sim.con$cmat + sim.con$rmat)
# N.by.C: total transcripts per cell
# z: cell type label
```
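
The contamination calculation above can be illustrated without celda. The sketch below uses two small hand-written matrices (hypothetical values, 3 genes x 2 cells) in place of the simulated `rmat` and `cmat`, and assumes the observed counts are simply their sum. Note that the result is a proportion between 0 and 1 rather than a percentage on a 0 to 100 scale.

```r
# Toy matrices with hypothetical values; matrix() fills column by column,
# so each group of three numbers is one cell.
gene_names <- paste0("gene", 1:3)
cell_names <- paste0("cell", 1:2)

rmat <- matrix(c(8, 0, 2,
                 0, 9, 6),
               nrow = 3, dimnames = list(gene_names, cell_names))
cmat <- matrix(c(1, 1, 0,
                 2, 1, 2),
               nrow = 3, dimnames = list(gene_names, cell_names))

# Assumption for this sketch: observed counts = true counts + contamination counts.
observed <- rmat + cmat

# Per-cell share of observed counts that come from contamination.
contamination_fraction <- colSums(cmat) / colSums(observed)
print(contamination_fraction)
```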
Use DecontX to decompose the observed (contaminated) count matrix back into a true expression matrix and a contamination matrix, given the cell labels:

```r
observedCounts <- sim.con$observedCounts
cell.label <- sim.con$z
new.counts <- DecontX(counts = observedCounts, z = cell.label, max.iter = 200, seed = 123)
# Decontaminated matrix: new.counts$res.list$est.rmat
# Percentage of contamination per cell: new.counts$res.list$est.conp
```
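To check DecontX performance, plot the estimated contamination percentage against the true one. Points close to the identity line indicate accurate estimates:

```r
estimated.contamination.percentage <- new.counts$res.list$est.conp
plot(true.contamination.percentage, estimated.contamination.percentage)
abline(0, 1)  # identity line: estimated == true
```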
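
A numeric summary can complement the plot. The helper below is a hypothetical sketch in base R, not part of celda, and the four input values are made up purely for illustration. It reports the mean absolute error, root mean squared error, and Pearson correlation between true and estimated per-cell contamination.

```r
# Hypothetical helper (not a celda function): summarise agreement between
# true and estimated per-cell contamination.
contamination_agreement <- function(true_frac, est_frac) {
  stopifnot(length(true_frac) == length(est_frac))
  err <- est_frac - true_frac
  c(mae = mean(abs(err)),
    rmse = sqrt(mean(err^2)),
    pearson = cor(true_frac, est_frac))
}

# Made-up values for four cells, for illustration only.
true_frac <- c(0.05, 0.10, 0.20, 0.40)
est_frac <- c(0.06, 0.09, 0.22, 0.37)

agreement <- contamination_agreement(true_frac, est_frac)
print(round(agreement, 4))
```

With the objects from this README, the same helper could be called as `contamination_agreement(true.contamination.percentage, estimated.contamination.percentage)`.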
## New Features and announcements
The v0.4 release of celda represents a useable implementation of the various celda clustering models.
Please submit any usability issues or bugs to the issue tracker at https://github.com/campbio/celda
You can discuss celda, or ask the developers usage questions, in our [Google Group](https://groups.google.com/forum/#!forum/celda-list).