# qsmooth

### Global normalization methods and their assumptions

Global normalization methods such as quantile normalization have become a standard part of the analysis pipeline for high-throughput data to remove unwanted technical variation. These methods, and others that rely solely on observed data without external information (e.g. spike-ins), are based on the assumption that only a minority of genes are expected to be differentially expressed (or that an equivalent number of genes increase and decrease across biological conditions). This assumption can be interpreted in different ways, leading to different global normalization procedures. For example, one normalization procedure assumes the mean expression level across genes should be the same across samples. In contrast, quantile normalization assumes the only difference between the statistical distributions of the samples is technical variation. Normalization is achieved by forcing the observed distributions to be the same, and the average distribution, obtained by taking the average of each quantile across samples, is used as the reference.

### How to evaluate if global normalization methods are appropriate?

While these assumptions may be reasonable in certain experiments, they may not always be appropriate. Recently, an R/Bioconductor package ([`quantro`](http://www.bioconductor.org/packages/release/bioc/html/quantro.html)) has been developed to test for global differences between groups of distributions to evaluate whether global normalization methods such as quantile normalization should be applied. If global differences are found between groups of distributions, these changes may be of technical or biological interest. If these changes are of technical interest (e.g. batch effects), then global normalization methods should be applied. If these changes are related to a biological factor (e.g. normal/tumor or two tissues), then global normalization methods should not be applied because the methods will remove the interesting biological variation (i.e. differentially expressed genes) and artificially induce differences between genes that were not differentially expressed.

When the global differences between groups of distributions correspond to biological conditions, quantile normalization is not an appropriate normalization method. In these cases, we can consider **a more relaxed assumption** about the data, namely that the statistical distribution of each sample should be the same within biological conditions or groups (compared to the more stringent assumption of quantile normalization, which states the statistical distribution is the same across all samples).

### qsmooth: a generalization of quantile normalization

Here we introduce a generalization of quantile normalization, referred to as *smooth quantile normalization* (**qsmooth**), which is a weighted average of the two types of assumptions about the data. The **qsmooth** R package contains the `qsmooth()` function, which computes a weight at every quantile that compares the variability between groups relative to the variability within groups. At one extreme, quantile normalization is applied; at the other extreme, quantile normalization within each biological condition is applied. The weight shrinks the group-level quantile normalized data towards the overall reference quantiles if the variability between groups is sufficiently smaller than the variability within groups. The algorithm is described in the figure below (see [`vignettes/qsmooth-vignette.pdf`](https://github.com/stephaniehicks/qsmooth/blob/master/vignettes/qsmooth-vignette.pdf) for more details).
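
The weighting idea described above can be sketched in a few lines of base R. This is a minimal illustration and not the package's implementation: the function name `sketch_qsmooth()` is hypothetical, and the weight used here (one minus the ratio of between-group to total sum of squares at each quantile, with no smoothing across quantiles) is an assumption chosen to match the description. See the vignette for the actual algorithm.

```r
# Simplified sketch of a weighted average between quantile normalization
# and within-group quantile normalization. NOT the qsmooth implementation;
# the weight formula below is an assumption made for illustration.
# Assumes `x` has no missing values and more than one row.
sketch_qsmooth <- function(x, group) {
  x <- as.matrix(x)
  group <- as.factor(group)
  stopifnot(length(group) == ncol(x), nrow(x) > 1)

  # Sort each sample so that row i holds the i-th smallest value per sample
  q_sorted <- apply(x, 2, sort)

  # Overall reference quantiles: what plain quantile normalization would use
  q_ref <- rowMeans(q_sorted)

  # Group-level reference quantiles, then expanded to one column per sample
  q_group <- sapply(levels(group), function(g) {
    rowMeans(q_sorted[, group == g, drop = FALSE])
  })
  q_group_full <- q_group[, as.character(group), drop = FALSE]

  # Per-quantile weight: close to 1 when between-group variability is small
  # relative to the total, close to 0 when groups differ strongly
  ss_total <- rowSums((q_sorted - q_ref)^2)
  ss_between <- rowSums((q_group_full - q_ref)^2)
  w <- ifelse(ss_total > 0, 1 - ss_between / ss_total, 1)
  w <- pmin(pmax(w, 0), 1)  # guard against floating-point drift

  # Weighted average of the overall and group-level reference quantiles
  q_smooth <- w * q_ref + (1 - w) * q_group_full

  # Put the normalized values back in each sample's original row order
  out <- x
  for (j in seq_len(ncol(x))) {
    out[, j] <- q_smooth[rank(x[, j], ties.method = "first"), j]
  }
  list(data = out, weights = w)
}
```

In this sketch, passing a single group gives weights of 1 everywhere, which reduces to ordinary quantile normalization. With several groups, samples within the same group always end up with identical distributions.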

![qsmooth algorithm](https://github.com/stephaniehicks/qsmooth/raw/master/vignettes/qsmooth_algo.jpg)

### Installing qsmooth

The latest version of the R package **qsmooth** can be installed from GitHub using the R package [devtools](https://github.com/hadley/devtools):

```s
library(devtools)
install_github("stephaniehicks/qsmooth")
```

It can also be installed using Bioconductor:

```s
# install BiocManager from CRAN (if not already installed)
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

# install qsmooth package
BiocManager::install("qsmooth")
```

After installation, the package can be loaded into R.

```s
library(qsmooth)
```

# Using qsmooth

The main function in the **qsmooth** package is `qsmooth()`. The `qsmooth()` function needs two objects: (1) a data frame or matrix with observations (e.g. probes or genes) on the rows and samples on the columns (let's call it `eset`) and (2) a group-level factor passed as the `group_factor` argument (let's call it `outcome`). The order of this factor must match the order of the columns in the `eset` object because it records which group each sample belongs to.

To run the `qsmooth()` function:

```s
qs <- qsmooth(object = eset, group_factor = outcome)
```

Individual slots can be extracted using accessor methods:

```s
qsmoothData(qs) # extract smoothed quantile normalized data
qsmoothWeights(qs) # extract smoothed quantile normalized weights
```

The weights can be plotted directly using the `qsmoothPlotWeights()` function.

```s
qsmoothPlotWeights(qs) # plot weights
```

See `vignettes/qsmooth-vignette.pdf` for more details.

# Bug reports

Report bugs as issues on the [GitHub repository](https://github.com/stephaniehicks/qsmooth).

# Contributors

* [Stephanie Hicks](https://github.com/stephaniehicks)
* [Kwame Okrah](https://github.com/kokrah)