# nnSVG

[![R build status](https://github.com/lmweber/nnSVG/workflows/R-CMD-check-bioc/badge.svg)](https://github.com/lmweber/nnSVG/actions)

## Overview

`nnSVG` is a method for scalable identification of spatially variable genes (SVGs) in spatially-resolved transcriptomics data.

The `nnSVG` method is based on nearest-neighbor Gaussian processes ([Datta et al., 2016](https://www.tandfonline.com/doi/full/10.1080/01621459.2015.1044091), [Finley et al., 2019](https://www.tandfonline.com/doi/full/10.1080/10618600.2018.1537924)) and uses the BRISC algorithm ([Saha and Datta, 2018](https://onlinelibrary.wiley.com/doi/full/10.1002/sta4.184)) for model fitting and parameter estimation. `nnSVG` allows identification and ranking of SVGs with flexible length scales across a tissue slide or within spatial domains defined by covariates. The method scales linearly with the number of spatial locations and can be applied to datasets containing thousands or more spatial locations.

`nnSVG` is implemented as an R package within the Bioconductor framework, and is available from [Bioconductor](https://bioconductor.org/packages/nnSVG).

Our preprint describing the method is available from [bioRxiv](https://www.biorxiv.org/content/10.1101/2022.05.16.492124v1).

## Installation

The package can be installed from [Bioconductor](https://bioconductor.org/packages/nnSVG) as follows, using R version 4.2 or above:

```r
install.packages("BiocManager")
BiocManager::install("nnSVG")
```

Alternatively, the latest development version of the package can be installed from GitHub:

```r
remotes::install_github("lmweber/nnSVG")
```

If you are installing from GitHub, the following dependency packages may also need to be installed manually from Bioconductor and CRAN:

```r
install.packages("BiocManager")
BiocManager::install("SpatialExperiment")
BiocManager::install("STexampleData")
install.packages("BRISC")
```

## Input data format

In the examples below, we assume the input data are provided as a [SpatialExperiment](https://bioconductor.org/packages/SpatialExperiment) Bioconductor object. In this case, the outputs are stored in the `rowData` of the `SpatialExperiment` object.

However, the inputs can also be provided as a numeric matrix of normalized and transformed counts (e.g. log-transformed normalized counts) and a numeric matrix of spatial coordinates.

To provide the inputs as numeric matrices, please install the development version of the package from [GitHub](https://github.com/lmweber/nnSVG) or the `devel` version of Bioconductor (which will become the new Bioconductor release version in October 2022).

## Workflow

An example workflow is provided in the package vignette from [Bioconductor](https://bioconductor.org/packages/nnSVG) and shown here in short form.

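Before the step-by-step workflow, here is a minimal sketch of the matrix-input alternative described under "Input data format". The data are simulated purely for illustration, and the object names (`logcounts_mat`, `coords_mat`) are hypothetical. The sketch assumes that genes are in rows and spatial locations in columns of the expression matrix, that the coordinates matrix has one row per spatial location, and that the installed version of `nnSVG` accepts a `spatial_coords` argument; the call is skipped otherwise.

```r
# Simulated inputs for illustration only (not real data)
set.seed(123)
n_genes <- 5
n_spots <- 100

# numeric matrix of transformed expression values
# (assumed orientation: rows = genes, columns = spatial locations)
logcounts_mat <- matrix(
  rnorm(n_genes * n_spots, mean = 2, sd = 1),
  nrow = n_genes,
  dimnames = list(
    paste0("gene", seq_len(n_genes)),
    paste0("spot", seq_len(n_spots))
  )
)

# numeric matrix of spatial coordinates
# (assumed orientation: rows = spatial locations, columns = x and y)
coords_mat <- cbind(x = runif(n_spots), y = runif(n_spots))
rownames(coords_mat) <- colnames(logcounts_mat)

# the two matrices must describe the same spatial locations
stopifnot(ncol(logcounts_mat) == nrow(coords_mat))

# run nnSVG only if the package is installed and supports matrix inputs
# (assumption: matrix inputs are passed via the 'spatial_coords' argument)
has_nnSVG <- requireNamespace("nnSVG", quietly = TRUE)
if (has_nnSVG && "spatial_coords" %in% names(formals(nnSVG::nnSVG))) {
  res <- nnSVG::nnSVG(logcounts_mat, spatial_coords = coords_mat, n_threads = 1)
  print(head(res))
}
```

With matrix inputs there is no `rowData` slot to write to, so the results are expected to come back as a standalone object rather than inside a `SpatialExperiment`.
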
**Load packages**

```r
library(nnSVG)
library(STexampleData)
library(scran)
library(ggplot2)
```

**Load example dataset**

```r
# load example dataset from STexampleData package
spe <- Visium_humanDLPFC()

dim(spe)
```

```r
## [1] 33538  4992
```

**Preprocessing**

```r
# keep spots over tissue
spe <- spe[, colData(spe)$in_tissue == 1]

dim(spe)
```

```r
## [1] 33538  3639
```

```r
# spot-level quality control: already performed on this example dataset
```

```r
# filter low-expressed and mitochondrial genes
# using function from nnSVG package with default filtering parameters
spe <- filter_genes(spe)
```

```r
## Gene filtering: removing mitochondrial genes
## removed 13 mitochondrial genes
## Gene filtering: retaining genes with at least 3 counts in at least 0.5% (n = 19) of spatial locations
## removed 30216 out of 33525 genes due to low expression
```

```r
# calculate log-transformed normalized counts (logcounts) using scran package
set.seed(123)
qclus <- quickCluster(spe)
spe <- computeSumFactors(spe, cluster = qclus)
spe <- logNormCounts(spe)

assayNames(spe)
```

```r
## [1] "counts"    "logcounts"
```

**Subset data for this example**

```r
# select small set of random genes and several known SVGs for faster runtime in this example workflow
set.seed(123)
ix_random <- sample(seq_len(nrow(spe)), 10)

known_genes <- c("MOBP", "PCP4", "SNAP25", "HBB", "IGKC", "NPY")
ix_known <- which(rowData(spe)$gene_name %in% known_genes)

ix <- c(ix_known, ix_random)

spe <- spe[ix, ]
dim(spe)
```

```r
## [1]   16 3639
```

**Run nnSVG**

```r
# set seed for reproducibility
# run nnSVG using a single thread for this example workflow
set.seed(123)
spe <- nnSVG(spe, n_threads = 1)

# show results
rowData(spe)
```

```r
## DataFrame with 16 rows and 17 columns
## [...]
```

**Investigate results**

The results are stored in the `rowData` of the `SpatialExperiment` object.

The main results of interest are:

- `LR_stat`: likelihood ratio (LR) statistics used to rank SVGs
- `rank`: rank of top SVGs according to LR statistics
- `pval`: approximate p-values
- `padj`: approximate p-values adjusted for multiple testing
- `prop_sv`: effect size defined as proportion of spatial variance

```r
# number of significant SVGs
table(rowData(spe)$padj <= 0.05)
```

```r
## 
## FALSE  TRUE 
##     7     9
```

```r
# show results for top n SVGs
n <- 10
rowData(spe)[order(rowData(spe)$rank)[1:n], ]
```

```r
## DataFrame with 10 rows and 17 columns
##                         gene_id   gene_name    feature_type   sigma.sq
##                     <character> <character>     <character>  <numeric>
## ENSG00000168314 ENSG00000168314        MOBP Gene Expression 1.86459294
## ENSG00000132639 ENSG00000132639      SNAP25 Gene Expression 0.33829228
## ENSG00000211592 ENSG00000211592        IGKC Gene Expression 0.59161928
## ENSG00000244734 ENSG00000244734         HBB Gene Expression 0.34819123
## ENSG00000183036 ENSG00000183036        PCP4 Gene Expression 0.22354847
## ENSG00000122585 ENSG00000122585         NPY Gene Expression 0.29511061
## ENSG00000129562 ENSG00000129562        DAD1 Gene Expression 0.03687246
## ENSG00000114923 ENSG00000114923      SLC4A3 Gene Expression 0.01123674
## ENSG00000133606 ENSG00000133606       MKRN1 Gene Expression 0.00543859
## ENSG00000149923 ENSG00000149923       PPP4C Gene Expression 0.12004347
##                    tau.sq        phi    loglik   runtime      mean       var
##                 <numeric>  <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000168314  0.371646   0.922937  -3716.46     0.959  0.841100  1.382681
## ENSG00000132639  0.440346   3.570016  -3940.04     0.668  3.464790  0.779762
## ENSG00000211592  0.464762  20.035566  -4580.98     1.618  0.630200  1.042847
## ENSG00000244734  0.365750  27.611193  -4114.59     2.233  0.418996  0.729640
## ENSG00000183036  0.456889   8.700988  -4041.98     1.001  0.684281  0.681316
## ENSG00000122585  0.302841  68.183198  -4087.69     1.375  0.401353  0.599801
## ENSG00000129562  0.484816   8.805056  -3942.80     2.051  0.561114  0.523034
## ENSG00000114923  0.237750  16.239042  -2621.78     0.992  0.249525  0.249055
## ENSG00000133606  0.277671   0.537947  -2861.92     1.873  0.295867  0.283165
## ENSG00000149923  0.132992 198.872410  -2660.25     5.132  0.235632  0.253096
##                     spcov   prop_sv loglik_lm    LR_stat      rank        pval
##                 <numeric> <numeric> <numeric>  <numeric> <numeric>   <numeric>
## ENSG00000168314  1.623470 0.8338075  -5752.58 4072.24582         1 0.00000e+00
## ENSG00000132639  0.167868 0.4344665  -4710.39 1540.68468         2 0.00000e+00
## ENSG00000211592  1.220514 0.5600433  -5239.35 1316.74199         3 0.00000e+00
## ENSG00000244734  1.408312 0.4877029  -4589.50  949.82287         4 0.00000e+00
## ENSG00000183036  0.690958 0.3285363  -4464.82  845.69334         5 0.00000e+00
## ENSG00000122585  1.353525 0.4935363  -4232.97  290.54324         6 0.00000e+00
## ENSG00000129562  0.342215 0.0706791  -3983.78   81.97837         7 0.00000e+00
## ENSG00000114923  0.424820 0.0451298  -2633.77   23.96612         8 6.24917e-06
## ENSG00000133606  0.249257 0.0192102  -2867.31   10.77036         9 4.58402e-03
## ENSG00000149923  1.470399 0.4744140  -2663.05    5.60524        10 6.06510e-02
##                        padj
##                   <numeric>
## ENSG00000168314 0.00000e+00
## ENSG00000132639 0.00000e+00
## ENSG00000211592 0.00000e+00
## ENSG00000244734 0.00000e+00
## ENSG00000183036 0.00000e+00
## ENSG00000122585 0.00000e+00
## ENSG00000129562 0.00000e+00
## ENSG00000114923 1.24983e-05
## ENSG00000133606 8.14937e-03
## ENSG00000149923 9.70416e-02
```

**Plot expression of top SVG**

Plot expression of the top-ranked SVG.

```r
# plot spatial expression of top-ranked SVG
ix <- which(rowData(spe)$rank == 1)
ix_name <- rowData(spe)$gene_name[ix]
ix_name
```

```r
## [1] "MOBP"
```

```r
df <- as.data.frame(cbind(spatialCoords(spe), expr = counts(spe)[ix, ]))

ggplot(df, aes(x = pxl_col_in_fullres, y = pxl_row_in_fullres, color = expr)) +
  geom_point(size = 0.8) +
  coord_fixed() +
  scale_y_reverse() +
  scale_color_gradient(low = "gray90", high = "blue",
                       trans = "sqrt", breaks = range(df$expr),
                       name = "counts") +
  ggtitle(ix_name) +
  theme_bw() +
  theme(plot.title = element_text(face = "italic"),
        panel.grid = element_blank(),
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())
```

<img src="https://user-images.githubusercontent.com/8062417/179638201-a6d0cc21-a625-4899-8ab4-9b082b1d3a8c.png" alt="Spatial expression plot of top-ranked SVG" title="Spatial expression plot of top-ranked SVG" width="350px">

## Citation

Our preprint describing `nnSVG` is available from bioRxiv:

- [Weber L.M. et al. (2022), "nnSVG: scalable identification of spatially variable genes using nearest-neighbor Gaussian processes", bioRxiv](https://www.biorxiv.org/content/10.1101/2022.05.16.492124v1)
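
**Appendix: checking the reported statistics by hand**

Returning to the results table in the workflow, the following base R sketch recomputes `prop_sv`, `pval`, and `padj` for four genes from the other columns of that table. It is a sanity check under stated assumptions, not the package's implementation. It assumes that `prop_sv` is `sigma.sq / (sigma.sq + tau.sq)`, that the approximate p-values come from a chi-squared distribution with 2 degrees of freedom, and that the adjustment is Benjamini-Hochberg over the 16 genes in the subset. The column names ending in `_check` are hypothetical. The printed values can be compared against the corresponding rows of the table.

```r
# Values copied from the results table in the workflow
# (genes ranked 1, 8, 9 and 10)
res <- data.frame(
  gene_name = c("MOBP", "SLC4A3", "MKRN1", "PPP4C"),
  sigma.sq  = c(1.86459294, 0.01123674, 0.00543859, 0.12004347),
  tau.sq    = c(0.371646, 0.237750, 0.277671, 0.132992),
  LR_stat   = c(4072.24582, 23.96612, 10.77036, 5.60524),
  rank      = c(1, 8, 9, 10)
)

# number of genes tested in the example subset
n_genes <- 16

# assumption: effect size = spatial variance / (spatial + nonspatial variance)
res$prop_sv_check <- res$sigma.sq / (res$sigma.sq + res$tau.sq)

# assumption: LR statistic compared against a chi-squared distribution, 2 df
res$pval_check <- pchisq(res$LR_stat, df = 2, lower.tail = FALSE)

# assumption: Benjamini-Hochberg adjustment, p * n / rank, capped at 1
# (the monotonicity step of the procedure is omitted, since the remaining
# six p-values are not shown in the table)
res$padj_check <- pmin(1, res$pval_check * n_genes / res$rank)

print(res[, c("gene_name", "prop_sv_check", "pval_check", "padj_check")],
      digits = 6)
```

The very large LR statistic for MOBP underflows to a p-value of exactly zero in double precision, which is why the top-ranked genes are reported with `0.00000e+00`.
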