title: "Estimate and remove cross-contamination from ambient RNA for scRNA-seq data with DecontX"
author: "Shiyi Yang, Sean Corbett, Yusuke Koga, Zhe Wang, W. Evan Johnson, Masanao Yajima, Joshua D. Campbell"
date: "`r Sys.Date()`"
    toc: true
vignette: >
  %\VignetteIndexEntry{Estimate and remove cross-contamination from ambient RNA for scRNA-seq data with DecontX}

# Introduction

DecontX is a Bayesian hierarchical model to estimate and remove cross-contamination from ambient RNA in single-cell RNA-seq count data generated from droplet-based sequencing devices. DecontX will take the count matrix with/without the cell labels and estimate the contamination level and deliver a decontaminted count matrix for downstream analysis. 

In this vignette we will demonstrate how to use DecontX to estimate and remove contamination.  

# Installation

celda can be installed from Bioconductor:

```{r, eval= FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE)){

The package can be loaded using the `library` command.

```{r, eval=TRUE, message=FALSE} 

To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/celda. To ask questions about running celda, post a thread on Bioconductor support site at https://support.bioconductor.org/.

# Reproducibility note

Many functions in *celda* make use of stochastic algorithms or procedures which require the use of random number generator (RNG) for simulation or sampling. To maintain reproducibility, all these functions use a **default seed of 12345** to make sure same results are generated each time one of these functions is called. Explicitly setting the `seed` arguments is needed for greater control and randomness.

# Generation of a cross-contaminated dataset 
DecontX will take a matrix of counts (referred as observed counts) where each row is a feature, each column is a cell, and each entry in the matrix is the number of counts of each feature in each cell. To illustrate the utility of DecontX, we will apply it to a simulated dataset.

In the function `simulateContaminatedMatrix`, the K parameter designates the number of cell clusters, the C parameter determines the number of cells, the G parameter determines the number of genes in the simulated dataset.

simCounts <- simulateContaminatedMatrix(G = 300, C = 100, K = 3)

The `nativeCounts` is the natively expressed counts matrix, and `observedCounts` is the observed counts matrix that contains both contaminated and natively expressed transctripts. The `NByC` is the total number of observed transcripts per cell. The counts matrix which only contains contamianted transcripts can be obtained by subtracting the observed counts matrix from the observed counts matrix. 

contamination <- simCounts$observedCounts - simCounts$nativeCounts

The `z` variable contains the population label for each cell.


The `phi` and `eta` variables contain the expression distributions and contamination distributions for each population, respectively. Each column corresponds to a population, each row represents a gene. The sum of the rows equal to 1.


# Decontamination using DecontX
DecontX uses bayesian method to estimate and remove contamination via varitaional inference.

```{r, warning = FALSE, message = FALSE}
decontxModel <- decontX(counts = simCounts$observedCounts, z = simCounts$z)

## Check convergence
Use log-likelihood to check convergence

```{r, eval = TRUE, fig.width = 5, fig.height = 5}

## Evaluate model performance

`decontX` estimates a contamination proportion for each cell. We compare the estimated contamination proportion with the real contamination proportion.

```{r, eval = TRUE, fig.width = 5, fig.height = 5}
    colSums(contamination) / simCounts$NByC, col = simCounts$z)
abline(0, 1)

# Session Information