---
title: "Estimate and remove cross-contamination from ambient RNA for scRNA-seq data with DecontX"
author: "Shiyi Yang, Sean Corbett, Yusuke Koga, Zhe Wang, W. Evan Johnson, Masanao Yajima, Joshua D. Campbell"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{Estimate and remove cross-contamination from ambient RNA for scRNA-seq data with DecontX}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

DecontX is a Bayesian hierarchical model to estimate and remove cross-contamination from ambient RNA in single-cell RNA-seq count data generated from droplet-based sequencing devices. DecontX will take the count matrix with/without the cell labels and estimate the contamination level and deliver a decontaminted count matrix for downstream analysis. 

In this vignette we will demonstrate how to use DecontX to estimate and remove contamination.  

# Installation

celda can be installed from Bioconductor:

```{r, eval= FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE)){
    install.packages("BiocManager")}
BiocManager::install("celda")
```

The package can be loaded using the `library` command.

```{r, eval=TRUE, message=FALSE} 
library(celda)
```

To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/celda. To ask questions about running celda, post a thread on Bioconductor support site at https://support.bioconductor.org/.

# Reproducibility note

Many functions in *celda* make use of stochastic algorithms or procedures which require the use of random number generator (RNG) for simulation or sampling. To maintain reproducibility, all these functions use a **default seed of 12345** to make sure same results are generated each time one of these functions is called. Explicitly setting the `seed` arguments is needed for greater control and randomness.

# Generation of a cross-contaminated dataset 
DecontX will take a matrix of counts (referred as observed counts) where each row is a feature, each column is a cell, and each entry in the matrix is the number of counts of each feature in each cell. To illustrate the utility of DecontX, we will apply it to a simulated dataset.

In the function `simulateContaminatedMatrix`, the K parameter designates the number of cell clusters, the C parameter determines the number of cells, the G parameter determines the number of genes in the simulated dataset.

```{r}
simCounts <- simulateContaminatedMatrix(G = 300, C = 100, K = 3)
```

The `nativeCounts` is the natively expressed counts matrix, and `observedCounts` is the observed counts matrix that contains both contaminated and natively expressed transctripts. The `NByC` is the total number of observed transcripts per cell. The counts matrix which only contains contamianted transcripts can be obtained by subtracting the observed counts matrix from the observed counts matrix. 

```{r}
contamination <- simCounts$observedCounts - simCounts$nativeCounts
```

The `z` variable contains the population label for each cell.

```{r}
table(simCounts$z)
```

The `phi` and `eta` variables contain the expression distributions and contamination distributions for each population, respectively. Each column corresponds to a population, each row represents a gene. The sum of the rows equal to 1.

```{r}
colSums(simCounts$phi)
colSums(simCounts$eta)
```

# Decontamination using DecontX
DecontX uses bayesian method to estimate and remove contamination via varitaional inference.

```{r, warning = FALSE, message = FALSE}
decontxModel <- decontX(counts = simCounts$observedCounts, z = simCounts$z)
```

## Check convergence
Use log-likelihood to check convergence

```{r, eval = TRUE, fig.width = 5, fig.height = 5}
plot(decontxModel$resList$logLikelihood)
```

## Evaluate model performance

`decontX` estimates a contamination proportion for each cell. We compare the estimated contamination proportion with the real contamination proportion.

```{r, eval = TRUE, fig.width = 5, fig.height = 5}
plot(decontxModel$resList$estConp,
    colSums(contamination) / simCounts$NByC, col = simCounts$z)
abline(0, 1)
```

# Session Information

```{r}
sessionInfo()
```