---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# sparseMatrixStats <a href='https://github.com/const-ae/sparseMatrixStats'><img src='man/figures/logo.png' align="right" height="209" /></a>

<!-- badges: start -->
[![codecov](https://codecov.io/gh/const-ae/sparseMatrixStats/branch/master/graph/badge.svg)](https://codecov.io/gh/const-ae/sparseMatrixStats)

<!-- badges: end -->

The goal of `sparseMatrixStats` is to make the API of [matrixStats](https://github.com/HenrikBengtsson/matrixStats) available
for sparse matrices.

## Installation

You can install the release version of *[sparseMatrixStats](https://bioconductor.org/packages/sparseMatrixStats)* from Bioconductor:

``` r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("sparseMatrixStats")
```

Alternatively, you can get the development version of the package from [GitHub](https://github.com/const-ae/sparseMatrixStats) with:

``` r
# install.packages("devtools")
devtools::install_github("const-ae/sparseMatrixStats")
```


## Example

```{r include=FALSE}
set.seed(1)
```


```{r}
library(sparseMatrixStats)
```


```{r}
mat <- matrix(0, nrow=10, ncol=6)
mat[sample(seq_len(60), 4)] <- 1:4
# Convert dense matrix to sparse matrix
sparse_mat <- as(mat, "dgCMatrix")
sparse_mat
```

The package provides an interface for quickly performing common operations on the rows or columns of a sparse matrix. For example, to calculate
the variance of each column:

```{r}
apply(mat, 2, var)
matrixStats::colVars(mat)
sparseMatrixStats::colVars(sparse_mat)
```

On this small example, all three approaches are about equally fast, but on a much larger dataset the
optimizations for sparse data start to show.

I generate a dataset with 10,000 rows and 50 columns that is 99% empty:

```{r}
big_mat <- matrix(0, nrow=1e4, ncol=50)
big_mat[sample(seq_len(1e4 * 50), 5000)] <- rnorm(5000)
# Convert dense matrix to sparse matrix
big_sparse_mat <- as(big_mat, "dgCMatrix")
```
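
As a quick sanity check of the "99% empty" claim, the fraction of zero entries and the memory footprint of both representations can be inspected. This is a minimal sketch that rebuilds the same data so it can be run on its own; it assumes the `Matrix` package is installed, and the names `frac_zero`, `dense_size`, and `sparse_size` are only illustrative.

```r
library(Matrix)

set.seed(1)
big_mat <- matrix(0, nrow = 1e4, ncol = 50)
big_mat[sample(seq_len(1e4 * 50), 5000)] <- rnorm(5000)
big_sparse_mat <- as(big_mat, "dgCMatrix")

# Fraction of entries that are exactly zero
frac_zero <- 1 - nnzero(big_sparse_mat) / prod(dim(big_sparse_mat))
frac_zero

# Memory used by the dense and the sparse representation
dense_size <- object.size(big_mat)
sparse_size <- object.size(big_sparse_mat)
format(dense_size, units = "Kb")
format(sparse_size, units = "Kb")
```

The sparse representation only stores the non-zero values together with their row indices and column pointers, which is why it is much smaller for data like this.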

I use the `bench` package to benchmark the performance difference:

```{r}
bench::mark(
  sparseMatrixStats=sparseMatrixStats::colVars(big_sparse_mat),
  matrixStats=matrixStats::colVars(big_mat),
  apply=apply(big_mat, 2, var)
)
```


As you can see, `sparseMatrixStats` is about 35 times faster than `matrixStats`, which in turn is about 7 times faster than the `apply()` version.
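
To give an intuition for where such a speed-up can come from, here is a rough pure-R sketch of a column variance that only touches the stored non-zero values of a `dgCMatrix`. It is an illustration under my own assumptions, not the package's actual implementation; the function name `col_vars_sparse` is hypothetical.

```r
library(Matrix)

# Hypothetical helper: column variances computed from the non-zero entries only.
# A dgCMatrix stores the non-zero values in @x and, for column j, the (0-based)
# range of positions in @x between @p[j] and @p[j + 1].
col_vars_sparse <- function(x) {
  stopifnot(is(x, "dgCMatrix"))
  n <- nrow(x)
  vapply(seq_len(ncol(x)), function(j) {
    idx <- x@p[j] + seq_len(x@p[j + 1] - x@p[j])
    vals <- x@x[idx]
    mu <- sum(vals) / n
    # Each of the (n - length(vals)) zeros contributes (0 - mu)^2 = mu^2
    (sum((vals - mu)^2) + (n - length(vals)) * mu^2) / (n - 1)
  }, numeric(1))
}

demo_mat <- as(matrix(c(0, 1, 0, 0,
                        2, 0, 3, 0,
                        0, 0, 0, 0), nrow = 4), "dgCMatrix")
sparse_result <- col_vars_sparse(demo_mat)
dense_result <- apply(as.matrix(demo_mat), 2, var)
all.equal(sparse_result, dense_result)
```

The work per column scales with the number of non-zero entries rather than with the number of rows, which is the general idea behind operating directly on the sparse representation.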





## API

The package now supports all functions from the `matrixStats` API for column sparse matrices (`dgCMatrix`). Thanks to the [`MatrixGenerics`](https://bioconductor.org/packages/MatrixGenerics/) package, it integrates easily alongside [`matrixStats`](https://cran.r-project.org/package=matrixStats) and [`DelayedMatrixStats`](https://bioconductor.org/packages/DelayedMatrixStats/).
Note that the `rowXXX()` functions generally work by transposing the input and calling the corresponding `colXXX()` function. Specially optimized implementations are available for `rowSums2()`, `rowMeans2()`, and `rowVars()`.
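
The two points above can be illustrated with a small sketch. It assumes that `Matrix`, `MatrixGenerics`, `matrixStats`, and `sparseMatrixStats` are installed; the variable names are only for illustration. The same generic is called on a dense and on a sparse matrix, and a `rowXXX()` result is compared with the `colXXX()` result on the transposed input.

```r
library(Matrix)
library(MatrixGenerics)
library(sparseMatrixStats)

set.seed(1)
mat <- matrix(0, nrow = 10, ncol = 6)
mat[sample(seq_len(60), 4)] <- 1:4
sparse_mat <- as(mat, "dgCMatrix")

# One generic, two classes: the method is selected based on the input class
dense_vars <- MatrixGenerics::colVars(mat)
sparse_vars <- MatrixGenerics::colVars(sparse_mat)
all.equal(dense_vars, sparse_vars)

# A row-wise statistic compared with the column-wise one on the transpose
row_medians <- rowMedians(sparse_mat)
col_medians_of_t <- colMedians(t(sparse_mat))
all.equal(row_medians, col_medians_of_t)
```

Code written against the `MatrixGenerics` functions therefore does not need to know whether it receives a dense or a sparse matrix.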

```{r, echo=FALSE, results="asis"}
matrixStats_functions <- sort(
  c("colsum", "rowsum", grep("^(col|row)", 
                             getNamespaceExports("matrixStats"), 
                             value = TRUE)))
DelayedMatrixStats_functions <- grep("^(col|row)", getNamespaceExports("DelayedMatrixStats"), value=TRUE)
DelayedArray_functions <- grep("^(col|row)", getNamespaceExports("DelayedArray"), value=TRUE)
sparseMatrixStats_functions <- grep("^(col|row)", getNamespaceExports("sparseMatrixStats"), value=TRUE)

notes <- c("colAnyMissings"="Not implemented because it is deprecated in favor of `colAnyNAs()`",
           "rowAnyMissings"="Not implemented because it is deprecated in favor of `rowAnyNAs()`",
           "colsum"="Base R function",
           "rowsum"="Base R function",
           "colWeightedMedians"="Only equivalent if `interpolate=FALSE`",
           "rowWeightedMedians"="Only equivalent if `interpolate=FALSE`",
           "colWeightedMads"="Sparse version behaves slightly differently, because it always uses `interpolate=FALSE`.",
           "rowWeightedMads"="Sparse version behaves slightly differently, because it always uses `interpolate=FALSE`.")

api_df <- data.frame(
  Method = paste0(matrixStats_functions, "()"),
  # Compare against the actual exports: `colsum`/`rowsum` were added by hand above
  matrixStats = ifelse(matrixStats_functions %in% getNamespaceExports("matrixStats"), "✔", "❌"),
  sparseMatrixStats = ifelse(matrixStats_functions %in% sparseMatrixStats_functions, "✔", "❌"),
  Notes = ifelse(matrixStats_functions %in% names(notes), notes[matrixStats_functions], ""),
  stringsAsFactors = FALSE
)
knitr::kable(api_df, row.names = FALSE)

```