# CellMentor
CellMentor is a novel supervised, cell-type-aware non-negative matrix factorization (NMF) method designed to improve cell type resolution in single-cell RNA sequencing analysis. By integrating cell type annotations into the NMF framework, CellMentor improves cell type separation, clustering, and annotation by leveraging latent patterns from reference datasets.
## Features
- Improved cell type separation through constrained supervised factorization (CSFNMF)
- Automated parameter optimization for optimal performance
- Efficient projection of query datasets onto learned cell type spaces
- Seamless integration with Seurat for visualization and downstream analysis
# System Requirements
## Hardware requirements
- `CellMentor` requires only a standard computer with enough RAM to support in-memory operations.
- R ≥ 4.3.
- Recommended RAM: ≥ 16 GB for medium-sized datasets.
- A GPU is not required.
## Software requirements
### OS Requirements
This package is supported on *macOS*, *Linux* and *Windows*. It has been tested on the following systems:
+ macOS: Sequoia (15.6.1)
+ Linux: Ubuntu 20.04.6
### R Dependencies
`CellMentor` relies on several R packages, all of which are specified in the package DESCRIPTION file under the Imports field. These dependencies are installed automatically when you install CellMentor.
## Installation
You can install CellMentor from Bioconductor:
```R
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("CellMentor")
```
Alternatively, you can install CellMentor directly from GitHub using devtools:
```R
if (!require("devtools")) install.packages("devtools")
devtools::install_github("petrenkokate/CellMentor", dependencies = TRUE)
```
### Typical Installation and Runtime
- **Installation time:** ~5 minutes on a standard desktop computer
(may vary depending on the number of dependencies to install).
- **Demo runtime:** ~8–10 minutes on a standard desktop computer
(tested on macOS Sequoia 15.6.1 with R ≥ 4.3, 16 GB RAM).
## Basic Usage
Set a random seed for reproducibility:
```R
set.seed(100)
```
Load the required packages:
```R
library(CellMentor)
library(Matrix)
library(ggplot2)
library(dplyr)
library(Seurat)
library(scRNAseq)
```
### 1. Load the datasets
```R
# Loading reference dataset (Baron)
baron <- h.baron_dataset()
reference_matrix <- baron$data
reference_celltypes <- baron$celltypes
# Loading query dataset (Muraro)
muraro <- muraro_dataset()
query_matrix <- muraro$data
query_celltypes <- muraro$celltypes # This would be unknown in a real application
# We keep it here for evaluation
```
### (Optional) Create smaller subsets for faster demonstration
```R
# Function to create balanced subsets
create_subset <- function(matrix, celltypes, cells_per_type = 30) {
  # Get unique cell types
  unique_types <- unique(celltypes)
  # Select cells for each type
  selected_cells <- c()
  for (cell_type in unique_types) {
    # Get cells of this type (celltypes must be named by cell ID)
    type_cells <- names(celltypes)[celltypes == cell_type]
    # If fewer cells than requested, take all of them
    n_to_select <- min(cells_per_type, length(type_cells))
    # Randomly select cells
    selected <- sample(type_cells, n_to_select)
    selected_cells <- c(selected_cells, selected)
  }
  # Return subset
  list(
    matrix = matrix[, selected_cells],
    celltypes = celltypes[selected_cells]
  )
}
# Create balanced subsets with 30 cells per type
baron_subset <- create_subset(reference_matrix, reference_celltypes, 30)
muraro_subset <- create_subset(query_matrix, query_celltypes, 30)
# Update variable names for clarity
reference_matrix <- baron_subset$matrix
reference_celltypes <- baron_subset$celltypes
query_matrix <- muraro_subset$matrix
query_celltypes <- muraro_subset$celltypes # This would be unknown in a real application
# We keep it here for evaluation
```
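As a quick check of the subsetting logic, the sketch below applies the same per-type sampling rule to toy labels (the cell IDs and type counts are made up for illustration). Types with fewer than 30 cells keep all of their cells, so the subset is balanced only up to the size of the rarest type.

```R
set.seed(100)

# Toy annotation vector: 50 alpha, 40 beta and 10 delta cells (hypothetical data)
toy_celltypes <- setNames(
  rep(c("alpha", "beta", "delta"), times = c(50, 40, 10)),
  paste0("cell", 1:100)
)

# Sample up to 30 cell IDs per type, mirroring the rule used in create_subset()
toy_selected <- unlist(
  lapply(split(names(toy_celltypes), toy_celltypes), function(ids) {
    sample(ids, min(30, length(ids)))
  }),
  use.names = FALSE
)

# Cells retained per type
subset_counts <- table(toy_celltypes[toy_selected])
print(subset_counts)
```

Running `table()` on your own subset labels in the same way is a cheap way to confirm that rare cell types were not dropped.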
### 2. Create a CellMentor Object
```R
csfnmf_obj <- CreateCSFNMFobject(
  ref_matrix = reference_matrix,
  ref_celltype = reference_celltypes,
  data_matrix = query_matrix,
  norm = TRUE,
  most.variable = TRUE,
  scale = TRUE,
  scale_by = "cells",
  verbose = TRUE,
  num_cores = 1
)
```
***Expected Output:***
```
[11:45:56] Starting CSFNMF object creation
[11:45:59] Validating inputs
[11:45:59] Creating CSFNMF object
[11:45:59] Converting matrices to sparse format
[11:46:00] Setting up annotations
[11:46:00] Cleaning matrices
[11:46:00] Removed 3319 empty rows from reference matrix
[11:46:00] Removed 496 empty rows from query matrix
[11:46:00] Finding common genes
[11:46:00] Found 14904 common genes
[11:46:00] Normalizing data
[11:46:04] Selecting variable genes
[11:46:05] Selected 999 variable genes
[11:46:05] Scaling data by cells
[11:46:06] Encoding cell types
[11:46:06] Reordering data
[11:46:06] Validating final object
[11:46:06] CSFNMF object creation complete
```
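The log above mentions removing empty rows and finding common genes. The sketch below illustrates those two steps on toy matrices with base R. It is an assumption about what the steps amount to, not the package's internal code, and the gene and cell names are made up.

```R
# Toy count matrices, genes x cells (hypothetical values and names)
ref <- matrix(c(1, 0, 0, 2,
                0, 0, 0, 0,
                3, 1, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("INS", "GCG", "SST"), paste0("r", 1:4)))
qry <- matrix(c(0, 5,
                2, 0,
                0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("INS", "SST", "PPY"), paste0("q", 1:2)))

# Drop genes with no counts in any cell ("empty rows")
ref <- ref[rowSums(ref) > 0, , drop = FALSE]
qry <- qry[rowSums(qry) > 0, , drop = FALSE]

# Keep only genes present in both matrices, in the same order
common_genes <- intersect(rownames(ref), rownames(qry))
ref <- ref[common_genes, , drop = FALSE]
qry <- qry[common_genes, , drop = FALSE]

print(common_genes)
```

If the number of common genes reported for your own data is unexpectedly small, mismatched gene identifiers (for example symbols in one matrix and Ensembl IDs in the other) are a likely cause.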
### 3. Run CellMentor and choose optimal params
Parameter selection optimizes several hyperparameters to achieve the best cell type separation. Because the search is computationally intensive, this demo tests only a limited range of values for each parameter.
#### Parameter descriptions:
- `alpha_range`: Controls within-class scatter (cell similarity within the same type)
- `beta_range`: Controls between-class scatter (cell separation between different types)
- `gamma_range`: Controls sparsity of the factorization
- `delta_range`: Controls orthogonality between factors
The settings below work well across most datasets without extensive tuning.
```R
# Find optimal parameters
optimal_params <- CellMentor(
  csfnmf_obj,
  alpha_range = c(1, 5),   # Limited alpha range
  beta_range = c(1, 5),    # Limited beta range
  gamma_range = c(0.1),    # Use only one gamma for speed
  delta_range = c(1),      # Use only one delta for speed
  num_cores = 5,
  verbose = TRUE
)
# Get best model
best_model <- optimal_params$best_model
K_VALUE <- best_model@parameters$rank
```
***Expected Output:***
```
[11:46:11] Creating training object
[11:46:11] Determining optimal rank
[11:47:02] Optimal rank determined: 80
[11:47:02] Starting parameter grid search
[11:47:02] Testing configuration 1/4
[11:49:09] Testing configuration 2/4
[11:51:20] Testing configuration 3/4
[11:53:25] Testing configuration 4/4
[11:55:40] Training final model with best parameters on full dataset
[11:55:40] Initializing W and H matrices
[11:55:40] Calculating helper matrices
[11:55:40] Calculating alpha
[11:55:40] Calculating H constants
[11:55:47] Updating W and H matrices
```
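The four configurations in the log are consistent with a full grid over the supplied ranges (2 alpha values × 2 beta values × 1 gamma × 1 delta). A minimal sketch, assuming the search enumerates the Cartesian product of the ranges, which the source does not state explicitly:

```R
# Ranges copied from the CellMentor() call in this section
grid <- expand.grid(
  alpha = c(1, 5),
  beta  = c(1, 5),
  gamma = 0.1,
  delta = 1
)

# Number of configurations a full grid search would visit
n_configs <- nrow(grid)
print(n_configs)
print(grid)
```

Under this assumption, adding a second value to `gamma_range` or `delta_range` doubles the number of configurations, which is why the demo keeps both at a single value.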
Check the best parameters:
```R
print(optimal_params$best_params)
```
***Expected Output:***
```
$k
[1] 80
$init_method
[1] "regulated"
$alpha
[1] 5
$beta
[1] 5
$gamma
[1] 0.1
$delta
[1] 1
```
### 4. Project Data
```R
# Project query data onto the learned space
h_project <- project_data(
  W = best_model@W,               # Learned gene weights
  X = best_model@matrices@data,   # Query data matrix
  num_cores = 5,
  verbose = TRUE
)
```
### 5. Integration with Seurat
```R
rownames(query_matrix) <- make.unique(rownames(query_matrix))
seu_muraro <- CreateSeuratObject(counts = query_matrix)
seu_muraro$celltype <- query_celltypes
# Add the CellMentor dimensionality reduction to the Seurat object
seu_muraro[["CellMentor"]] <- CreateDimReducObject(
  embeddings = t(as.matrix(h_project)),
  key = "CellMentor_",
  assay = DefaultAssay(seu_muraro),
  loadings = as.matrix(best_model@W)
)
# Visualization
seu_muraro <- RunUMAP(seu_muraro, reduction = "CellMentor", dims = 1:K_VALUE)
DimPlot(seu_muraro, group.by = "celltype")
```
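The `t()` call above suggests that `project_data()` returns factors in rows and cells in columns, while Seurat expects embeddings with cells in rows. The toy sketch below (random values, hypothetical names) only illustrates that shape convention; it does not call CellMentor or Seurat.

```R
set.seed(1)
n_genes <- 6
k <- 3
n_cells <- 4

# W: genes x factors, H: factors x cells (assumed layout of the factorization)
W <- matrix(runif(n_genes * k), nrow = n_genes,
            dimnames = list(paste0("gene", 1:n_genes), NULL))
H <- matrix(runif(k * n_cells), nrow = k,
            dimnames = list(NULL, paste0("cell", 1:n_cells)))

# Transpose so that each row is a cell and each column is a factor
embeddings <- t(H)
colnames(embeddings) <- paste0("CellMentor_", seq_len(k))

print(dim(embeddings))       # cells x factors
print(rownames(embeddings))  # cell IDs, which should match the Seurat object's cells
```

If the cell IDs in the embedding rows do not match the cell names of the Seurat object, adding the reduction is likely to fail, so it is worth comparing the two before that step.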
# Run CellMentor on Your Own Data
This quick guide shows how to create a **CSFNMF** object from your data. After creating the object, **follow the same steps as in the demo** (parameter search with `CellMentor()`, projection with `project_data()`, optional Seurat integration).
## Inputs
- **Reference counts matrix**: genes × cells (`ref_counts`)
- **Reference annotations**: vector of length `ncol(ref_counts)` (`ref_celltypes`), names must match `colnames(ref_counts)`
- **Query counts matrix**: genes × cells (`qry_counts`), with overlapping gene IDs
> Tip: rows = genes, columns = cells. Use sparse matrices (`Matrix::dgCMatrix`) for speed/memory.
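Before building the object, it can help to verify these input requirements explicitly. The sketch below uses toy stand-ins for `ref_counts`, `ref_celltypes` and `qry_counts` (all values and names are hypothetical) and converts the matrices to sparse form with the Matrix package.

```R
library(Matrix)

# Toy inputs standing in for your own data (genes x cells)
ref_counts <- matrix(c(1, 0, 3, 0, 2, 1, 4, 0, 0), nrow = 3,
                     dimnames = list(c("INS", "GCG", "SST"), c("c1", "c2", "c3")))
ref_celltypes <- c(c1 = "beta", c2 = "alpha", c3 = "delta")
qry_counts <- matrix(c(0, 2, 1, 5), nrow = 2,
                     dimnames = list(c("INS", "SST"), c("q1", "q2")))

# Annotation names must match the reference cell IDs, in the same order
stopifnot(identical(names(ref_celltypes), colnames(ref_counts)))

# Reference and query must share at least some gene IDs
shared_genes <- intersect(rownames(ref_counts), rownames(qry_counts))
stopifnot(length(shared_genes) > 0)

# Store counts in sparse column-compressed form for speed and memory
ref_counts <- as(ref_counts, "CsparseMatrix")
qry_counts <- as(qry_counts, "CsparseMatrix")

print(shared_genes)
```

Each `stopifnot()` fails early with the offending condition, which is usually easier to debug than an error raised later during object creation.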
## Create the object
```r
library(Matrix)
library(CellMentor)
# 1) Build CSFNMF object
csfnmf_obj <- CreateCSFNMFobject(
  ref_matrix = ref_counts,
  ref_celltype = ref_celltypes,   # names(ref_celltypes) == colnames(ref_counts)
  data_matrix = qry_counts,
  norm = TRUE,
  most.variable = TRUE,
  scale = TRUE,
  scale_by = "cells",
  num_cores = 1,
  verbose = TRUE
)
```
## Next steps
Proceed exactly as in the **Basic Usage** section:
1. **Hyperparameter search & training:** `optimal <- CellMentor(csfnmf_obj, ...)`
2. **Best model:** `best_model <- optimal$best_model`
3. **Projection:** `h_project <- project_data(W = best_model@W, X = best_model@matrices@data)`
4. *(Optional)* **Seurat integration & UMAP:** same code as in the demo.
# Reproduce the analysis and figures from the paper
All scripts used to generate the analyses and figures reported in the manuscript are openly available at [petrenkokate/CellMentor_paper](https://github.com/petrenkokate/CellMentor_paper).
The repository contains:
- Code for data preprocessing, model training, and evaluation
- Scripts to reproduce each figure in the paper
# Citation
If you use *CellMentor* in your work, please cite:
CellMentor: Cell-Type Aware Dimensionality Reduction for Single-cell RNA-Sequencing Data
Or Hevdeli†, Ekaterina Petrenko†, Dvir Aran
*bioRxiv* 2025.06.17.660094
doi: [https://doi.org/10.1101/2025.06.17.660094](https://doi.org/10.1101/2025.06.17.660094)
† These authors contributed equally to this work.
# License
This project is covered under the **Apache 2.0 License**.