<!-- README.md is generated from README.Rmd. Please edit that file -->
# BLASE: Bulk Linking Analysis for Single-cell Experiments <a href="man/figures/logo.png"><img src="man/figures/logo.png" align="right" height="138" /></a>
<!-- badges: start -->
[R-CMD-check](https://github.com/andrewmccluskey-uog/BLASE/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
## Overview
The goal of BLASE is to enable you to map bulk RNA-seq samples onto
single-cell RNA-seq (scRNA-seq) data for further analysis, with an
emphasis on trajectories (although it can work with any continuous
variable across your data).
It provides:
- Configurable discretisation of pseudotime into “pseudotime bins”
- A custom “Gene Peakedness” method for identifying temporally variable
  genes
- Annotation of scRNA-seq based on bulk samples
- Mapping of bulk RNA-seq onto these bins
- Plotting functions
## Installation
### Bioconductor
The package has been submitted to Bioconductor. Once it is available
there, it can be installed with:

``` r
if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("blase")
```
### GitHub
You can install the development version of BLASE from
[GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("andrewmccluskey-uog/BLASE")
```
## Getting Started & Usage Notes
See the vignette for a worked example using a SingleCellExperiment object.
Gene selection is important for BLASE, because it bases its predictions
on the expression of the selected genes only. Any genes omitted from
this list are ignored. Selecting too few genes reduces BLASE’s ability
to distinguish different stages from each other. Conversely, selecting
too many genes (i.e. including genes whose expression does not change
over the biological process) only introduces noise, reducing BLASE’s
precision. Ideally, the list should contain only genes which show
substantial change over the pseudotime trajectory in the single-cell
reference that you wish to map to.
For organisms in which many genes are highly expressed for a limited
period of the biological process in a tightly regulated way (for
example, Plasmodium spp.), we recommend using every gene in the genome.
For organisms which do not show this pattern, such as human or mouse, we
recommend using a subset of genes, selected by either tradeSeq or
BLASE’s gene peakedness selection. To make this process easier, we
provide a function `get_top_n_genes()`, which enables simple selection
of a given number of genes from an `associationTest` result generated by
the `tradeSeq` package, and convenience functions for calculating the
gene peakedness as described here.
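
The idea of keeping only genes that change substantially over pseudotime can be sketched in base R. This is a toy illustration on simulated data, not BLASE’s gene peakedness method and not a tradeSeq test: the `peak_ratio` score (highest per-bin mean divided by the average per-bin mean) and all object names are assumptions made for this example.

``` r
set.seed(1)
n_cells <- 200
pseudotime <- sort(runif(n_cells))

# Toy expression matrix (genes x cells): two genes peak at one point in
# pseudotime, two genes are flat.
expr <- rbind(
    peak_early = 10 * dnorm(pseudotime, mean = 0.3, sd = 0.05),
    peak_late = 10 * dnorm(pseudotime, mean = 0.7, sd = 0.05),
    flat_a = rep(5, n_cells),
    flat_b = rep(20, n_cells)
) + matrix(rnorm(4 * n_cells, sd = 0.1), nrow = 4)

# Split cells into five equal-width pseudotime bins and average each gene
# within each bin.
bins <- cut(pseudotime, breaks = 5, labels = FALSE)
bin_means <- t(apply(expr, 1, function(x) tapply(x, bins, mean)))

# Hypothetical score: how far the best bin stands out from the average bin.
peak_ratio <- apply(bin_means, 1, max) / rowMeans(bin_means)

# Keep the two genes with the highest score.
selected <- names(sort(peak_ratio, decreasing = TRUE))[1:2]
selected
```

Flat genes score close to 1 regardless of their expression level, so a highly expressed but uninformative gene such as `flat_b` is not selected.
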
Each reference scRNA-seq trajectory has a unique fingerprint, defined by
the genes activated over the course of the process. To use BLASE
optimally, consider how many genes meaningfully contribute to that
fingerprint, and how many bins are needed to balance accuracy and
precision. The selected genes should describe the trajectory in a
meaningful way: too few genes risks losing useful signal, whereas too
many may introduce unhelpful noise. When selecting the number of bins,
there is a trade-off between a small number of larger bins, which can
give very accurate (i.e. correct) calls, and a larger number of smaller
bins, which gives greater precision (i.e. more granularity).
BLASE uses a discretised pseudotime value, which we refer to as
“pseudotime bins.” BLASE calculates these bins when creating a
BlaseData object, or when the `assign_pseudotime_bins()` function is
used to add them to the metadata of a SingleCellExperiment or Seurat
object. Because the BLASE algorithm relies heavily on these bins, it is
important to split them with a reliable and consistent method.
Depending on the dataset, different methods may be required to ensure
high-quality mappings. We have found that the `pseudotime_range`
splitting method works best for most datasets.
The pseudotime range bin assignment method is fast and, provided that
the pseudotime calculation is correct and reflects transcriptional
change, implies that adjacent bins are separated by a constant
transcriptional distance. However, this method may perform poorly when a
reference dataset contains stretches of pseudotime with few or no cells.
In this case, splitting by cells may be a better option. When bins are
assigned to contain a constant number of cells, the pseudotime range
covered by each bin is not constant, which may be less useful for
mapping purposes but can overcome some of the issues with the pseudotime
range method.
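
The difference between the two splitting strategies can be illustrated in base R. This is a minimal sketch on simulated pseudotime values. It does not call `assign_pseudotime_bins()`, and the variable names are chosen for this example only.

``` r
set.seed(42)

# Toy pseudotime with a sparse stretch: most cells early, few cells late.
pseudotime <- c(runif(180, 0, 0.4), runif(20, 0.4, 1))
n_bins <- 4

# Split by pseudotime range: equal-width intervals, uneven cell counts.
range_bins <- cut(pseudotime, breaks = n_bins, labels = FALSE)

# Split by cells: rank the cells and cut the ranks into equal-sized groups,
# so every bin holds the same number of cells but spans a different range.
cell_rank <- rank(pseudotime, ties.method = "first")
cell_bins <- ceiling(cell_rank / length(pseudotime) * n_bins)

table(range_bins)
table(cell_bins)

# Pseudotime span covered by each equal-count bin.
tapply(pseudotime, cell_bins, function(x) diff(range(x)))
```

With the range method, the late bins contain very few cells. With the cell-count method, every bin contains 50 cells but the last bin covers a much wider stretch of pseudotime.
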
BLASE’s main focus is on mapping bulk RNA-seq samples to the discretised
pseudotime in a scRNA-seq dataset. Other tools (e.g. CIBERSORTx, DWLS,
MuSiC) estimate a proportion of cells per reference group (typically
cell type). BLASE instead calculates a score for how well each
reference group (for BLASE, typically a pseudotime bin) matches a
sample, and reports a single “best match” together with the
correlations for all other bins. The values produced by BLASE should
not necessarily be treated as proportions of the population in the bulk
sample.
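
The scoring idea can be sketched in base R as a correlation between one bulk sample and the average expression profile of each bin. This is a simplified illustration on simulated data, not BLASE’s implementation: the choice of Spearman correlation, the toy profiles and all object names are assumptions made for this example.

``` r
set.seed(7)
n_genes <- 50
n_bins <- 5

# Toy reference: mean expression of each selected gene in each pseudotime bin.
bin_profiles <- matrix(
    rexp(n_genes * n_bins),
    nrow = n_genes,
    dimnames = list(
        paste0("gene", seq_len(n_genes)),
        paste0("bin", seq_len(n_bins))
    )
)

# Toy bulk sample: a noisy copy of the bin 3 profile.
bulk <- bin_profiles[, "bin3"] + rnorm(n_genes, sd = 0.05)

# One score per bin: rank correlation between the bulk sample and the bin.
scores <- apply(bin_profiles, 2, function(profile) {
    cor(bulk, profile, method = "spearman")
})

# The best match is the bin with the highest score. The other scores are
# kept for inspection, but they are not cell-type or bin proportions.
best_bin <- names(which.max(scores))
round(scores, 2)
best_bin
```
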
## Development
### Automatic Style Corrections
``` r
styler::style_pkg(transformers = styler::tidyverse_style(indent_by = 4))
```
### Quality Checks
We follow the `R CMD check` and `BiocCheck` guidelines:
``` r
devtools::check()
BiocCheck::BiocCheck(`no-check-deprecated` = TRUE)
```