Name Mode Size
R 040000
README_figs 040000
data 040000
inst 040000
man 040000
tests 040000
vignettes 040000
.Rbuildignore 100644 0 kb
.gitignore 100644 0 kb
DESCRIPTION 100644 1 kb
NAMESPACE 100644 1 kb
README.Rmd 100644 10 kb
README.md 100644 9 kb
README.md
# Manual for the use of the RCM functions ## Publication The underlying method of the RCM package is described in detail in the following article: [“A unified framework for unconstrained and constrained ordination of microbiome read count data”](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0205474). ## Install and load packages This repo contains R-code to fit and plot the RC(M)-models augmented with the negative binomial. The basic usage is demonstrated here, for more advanced examples see the [RCM vignette](https://bioconductor.org/packages/release/bioc/vignettes/RCM/inst/doc/RCMvignette.html) and the help pages of *RCM()*, *RCM\_NB()*, and *plot.RCM()* functions. The package can be installed from BioConductor and loaded using the following commands: ``` r library(BiocManager) install("RCM") ``` ``` r suppressPackageStartupMessages(library(RCM)) cat("RCM package version", as.character(packageVersion("RCM")), "\n") ``` ## RCM package version 1.5.8 Alternatively, the latest version can be installed directly from this GitHub repo as follows: ``` r library(devtools) install_github("CenterForStatistics-UGent/RCM") ``` ## Dataset As example data we use a study on the microbiome of colorectal cancer patients “Potential of fecal microbiota for early-stage detection of colorectal cancer” (2014) by Zeller *et al.*. ``` r library(phyloseq) data(Zeller) ``` The *Zeller* object is a phyloseq object, which contains all the information of a microbiome experiment. The *RCM* package is tailor-made for phyloseq objects. We therefore strongly recommend building a phyloseq object out of your data, and feeding that object into the *RCM()* function. More information on building the phyloseq object can be found [here](http://joey711.github.io/phyloseq/import-data.html). ## Unconstrained RCM ### Fitting the unconstrained RCM model The unconstrained RC(M) method represents all variability present in the data, regardless of covariate information. It should be used as a first step in an exploratory analysis. The *RCM* function is the front end function for this purpose, behind it is the *RCM\_NB* function that does the hard work but requires numeric matrix imputs. We apply it to the Zeller data. ``` r ZellerRCM2 = RCM(Zeller) ``` which took 1.6 minutes. ### Plotting the unconstrained RCM A simple plot-command will yield a plot of the RCM ordination ``` r plot(ZellerRCM2) ``` ![](README_figs/README-plotUnconstrainedRCMall-1.png)<!-- --> Samples are represented by dots, taxa by arrows. Both represent vectors with the origin as starting point. Valid interpretatations are the following: - Samples (endpoints of sample vectors, the red dots) close together have a similar taxon composition - Taxa are more abundant than average in samples to which their arrows point, and less abundant when their arrows point away from the samples. For example Fusobacterium mortiferum is more abundant than average in samples on the left side of the plot, and more abundant in samples on the right side. - The importance parameters \(\psi\) shown for every axis reflect the relative importance of the dimensions Distances between endpoints of taxon vectors are meaningless. The plot can be made more interpretable by adding some colour, e.g. by colouring the samples by cancer diagnosis. For this one can simply provide the name of the sample variable as present in the phyloseq object. For more plotting options see *?plot.RCM*. ``` r plot(ZellerRCM2, samColour = "Diagnosis") ``` ![](README_figs/README-plotUnconstrainedRCMallColour-1.png)<!-- --> Any richness measure defined in the *phyloseq* package (see *?estimateRichness*) can also be supplied as sample colour. ``` r plot(ZellerRCM2, samColour = "Shannon") ``` ![](README_figs/README-plotRichness-1.png)<!-- --> #### Conditioning In order to condition on certain variables, supply their names to the *confounders* argument. Here we condition on the *country* variable ``` r ZellerRCM2cond = RCM(Zeller, confounders = "Country") ``` Conditioning can be applied for unconstrained as well as constrained RCM, the ensuing plots have exactly the same interpretationas before, except that all variability attributable to the confounding variables has been removed. ``` r plot(ZellerRCM2cond) ``` ![](README_figs/README-plotCond-1.png)<!-- --> ## Constrained RCM In this second step we look for the variability in the dataset explained sample-specific variables. This should be done preferably only with variables that are believed to have an impact on the species’ abundances. Here we used the variables age, gender, BMI, country and diagnosis in the gradient. In this analysis all covariates values of a sample are projected onto a single scalar, the *environmental score* of this sample. The projection vector is called the *environmental gradient*, the magnitude of its components reveals the importance of each variable. The taxon-wise response functions then describe how the logged mean abundance depends on the environmental score. ### Fitting the constrained RCM model In order to request a constrained RCM fit it suffises to supply the names of the constraining variables to the *covariates* argument. The shape of the response function can be either “linear”, “quadratic” or “nonparametric” and must be provided to the *responseFun* argument. Here we illustrate the use of linear and nonparametric response functions. Linear response functions may be too simplistic, they have the advantage of being easy to interpret (and plot). Non-parametric ones (based on splines) are more flexible but are harder to plot. ``` r # Linear ZellerRCM2constr = RCM(Zeller, covariates = c("Age", "Gender", "BMI", "Country", "Diagnosis"), responseFun = "linear") # Nonparametric ZellerRCM2constrNonParam = RCM(Zeller, covariates = c("Age", "Gender", "BMI", "Country", "Diagnosis"), responseFun = "nonparametric") ``` #### Biplots In the constrained case two different biplots are meaningful: sample-taxon biplots and variable-taxon biplots. ##### Sample-taxon biplot ``` r plot(ZellerRCM2constr, plotType = c("species", "samples")) ``` ![](README_figs/README-plotlin2cor-1.png)<!-- --> The interpretation is similar as before: the orthogonal projection of a taxon’s arrow on a sample represents the departure from independence for that taxon in that sample, *explained by environmental variables*. New is also that the taxa arrows do not start from the origin, but all have their own starting point. This starting point represents the environmental scores for which there is no departure from independence. The direction of the arrow then represents in which direction of the environmental gradient its expected abundance increases. ##### Variable-taxon biplot ``` r plot(ZellerRCM2constr, plotType = c("species", "variables")) ``` ![](README_figs/README-plotlin3-1.png)<!-- --> The projection of species arrows on environmental variables (starting from the origin) represents the sensitivity of this taxon to changes in this variables. E.g. Pseudomonas fluorescens is much more abundant in healthy and small adenoma patients than in cancer patients. Note that the fact that the arrows of BMI and gender are of similar length indicates that one *standard deviation* in BMI has a similar effect to gender. We observe that country and diagnosis are the main drivers of the environmental gradient. We also see that healthy and small adenoma patients have similar taxon compositions, which are very different from the taxon composition of cancer patients. #### Triplot: linear response functions Triplots combine all the three components in a single plot. This is the default behaviour of the *plot.RCM* function. ``` r plot(ZellerRCM2constr, samColour = "Diagnosis") ``` ![](README_figs/README-plotlin3Triplot-1.png)<!-- --> Note that the samples and the environmental variables cannot be related to each other, but the sample-taxon and variable-taxon relationships discussed before are still valid on the triplot. #### Triplot: Non-parametric response functions For the non-parametric response functions we can only plot one dimensional triplots, whereby the shape of the response functions of the most strongly reacting taxa is shown. For this we use the *plotRespFun* function. ``` r plotRespFun(ZellerRCM2constrNonParam) ``` ![](README_figs/README-plotNPTriplot-1.png)<!-- --> The first dimension is shown by default, the environmental scores of the samples are shown below as ticks, the y-axis shows the response functions of the 8 most reacting species, whereby the x-axis represents the independence model. This graph demonstrates the power of non-parametric response function modelling to unravel differences in species niches. This document covers only the more basic use of the RCM plotting functions. Discover more advanced features in the [RCM vignette](https://bioconductor.org/packages/release/bioc/vignettes/RCM/inst/doc/RCMvignette.html) or in the help pages of the functions used.