Bioconductor Code: qsvaR

Name	Mode	Size
.github	040000
R	040000
data-raw	040000
data	040000
dev	040000
inst	040000
man	040000
pkgdown	040000
tests	040000
vignettes	040000
.Rbuildignore	100644	0 kb
.gitignore	100644	0 kb
DESCRIPTION	100644	2 kb
NAMESPACE	100644	0 kb
NEWS.md	100644	1 kb
README.Rmd	100644	12 kb
README.md	100644	13 kb
codecov.yml	100644	0 kb

README.md

# qsvaR  [![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable) [![Bioc release status](http://www.bioconductor.org/shields/build/release/bioc/qsvaR.svg)](https://bioconductor.org/checkResults/release/bioc-LATEST/qsvaR) [![Bioc devel status](http://www.bioconductor.org/shields/build/devel/bioc/qsvaR.svg)](https://bioconductor.org/checkResults/devel/bioc-LATEST/qsvaR) [![Bioc downloads rank](https://bioconductor.org/shields/downloads/release/qsvaR.svg)](http://bioconductor.org/packages/stats/bioc/qsvaR/) [![Bioc support](https://bioconductor.org/shields/posts/qsvaR.svg)](https://support.bioconductor.org/tag/qsvaR) [![Bioc history](https://bioconductor.org/shields/years-in-bioc/qsvaR.svg)](https://bioconductor.org/packages/release/bioc/html/qsvaR.html#since) [![Bioc last commit](https://bioconductor.org/shields/lastcommit/devel/bioc/qsvaR.svg)](http://bioconductor.org/checkResults/devel/bioc-LATEST/qsvaR/) [![Bioc dependencies](https://bioconductor.org/shields/dependencies/release/qsvaR.svg)](https://bioconductor.org/packages/release/bioc/html/qsvaR.html#since) [![Codecov test coverage](https://codecov.io/gh/LieberInstitute/qsvaR/branch/devel/graph/badge.svg)](https://codecov.io/gh/LieberInstitute/qsvaR?branch=devel) [![R build status](https://github.com/LieberInstitute/qsvaR/actions/workflows/check-bioc.yml/badge.svg)](https://github.com/LieberInstitute/qsvaR/actions/workflows/check-bioc.yml) [![GitHub issues](https://img.shields.io/github/issues/LieberInstitute/qsvaR)](https://github.com/LieberInstitute/qsvaR/issues) [![GitHub pulls](https://img.shields.io/github/issues-pr/LieberInstitute/qsvaR)](https://github.com/LieberInstitute/qsvaR/pulls) [![DOI](https://zenodo.org/badge/421556636.svg)](https://zenodo.org/badge/latestdoi/421556636)  Differential expressions analysis requires the ability to normalize complex datasets. In the case of postmortem brain tissue we are tasked with removing the effects of bench degradation. The `qsvaR` package combines an established method for removing the effects of degradation from RNA-seq data with easy to use functions. It is the second iteration of the qSVA framework ([Jaffe et al, PNAS, 2017](https://doi.org/10.1073/pnas.1617384114)). The first step in the `qsvaR` workflow is to create an [`RangedSummarizedExperiment`](https://www.rdocumentation.org/packages/SummarizedExperiment/versions/1.2.3/topics/RangedSummarizedExperiment-class) object with the transcripts identified in our qSVA experiment. If you already have a [`RangedSummarizedExperiment`](https://www.rdocumentation.org/packages/SummarizedExperiment/versions/1.2.3/topics/RangedSummarizedExperiment-class) of transcripts we can do this with the `getDegTx()` function as shown below.If not this can be generated with the [`SPEAQeasy`](http://research.libd.org/SPEAQeasy/index.html) (a RNA-seq pipeline maintained by our lab) pipeline using the `--qsva` flag. If you already have a [`RangedSummarizedExperiment`](https://www.rdocumentation.org/packages/SummarizedExperiment/versions/1.2.3/topics/RangedSummarizedExperiment-class) object with transcripts then you do not need to run [`SPEAQeasy`](http://research.libd.org/SPEAQeasy/index.html). This flag requires a full path to a text file, containing one Ensembl transcript ID per line for each transcript desired in the final transcripts R output object (called `rse_tx`). The `sig_transcripts` argument in this package should contain the same Ensembl transcript IDs as the text file for the `--qsva` flag.The goal of `qsvaR` is to provide software that can remove the effects of bench degradation from RNA-seq data. ## Installation Instructions Get the latest stable R release from CRAN. Then install `qsvaR` using from Bioconductor the following code: ``` r if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("qsvaR") ``` And the development version from GitHub with: ``` r BiocManager::install("LieberInstitute/qsvaR") ``` ## Example This is a basic example which shows how to obtain the quality surrogate variables (qSVs) for the brainseq [phase II dataset](http://eqtl.brainseq.org/phase2). qSVs are essentially principal components from an rna-seq experiment designed to model bench degradation. For more on principal components you can read and introductory article [here](https://towardsdatascience.com/tidying-up-with-pca-an-introduction-to-principal-components-analysis-f876599af383#:~:text=The%20goal%20of%20PCA%20is,eliminate%20ones%20that%20do%20not). At the start of this script we will have an [`RangedSummarizedExperiment`](https://www.rdocumentation.org/packages/SummarizedExperiment/versions/1.2.3/topics/RangedSummarizedExperiment-class) and a list of all the transcripts found in our degradation study. At the end we will have a table with differential expression results that is adjusted for qSVs. ``` r ## R packages we'll use library("qsvaR") library("limma") ``` ``` r library("qsvaR") ## We'll download example data from the BrainSeq Phase II project ## described at http://eqtl.brainseq.org/phase2/. ## ## We'll use BiocFileCache to cache these files so you don't have to download ## them again for other examples. bfc <- BiocFileCache::BiocFileCache() rse_file <- BiocFileCache::bfcrpath( "https://s3.us-east-2.amazonaws.com/libd-brainseq2/rse_tx_unfiltered.Rdata", x = bfc ) #> adding rname 'https://s3.us-east-2.amazonaws.com/libd-brainseq2/rse_tx_unfiltered.Rdata' ## Now that we have the data in our computer, we can load it. load(rse_file, verbose = TRUE) #> Loading objects: #> rse_tx ``` In this next step, we subset to the transcripts associated with degradation. `qsvaR` provides significant transcripts determined in four different linear models of transcript expression against degradation time, brain region, and potentially cell-type proportions: 1. `exp ~ DegradationTime + Region` 2. `exp ~ DegradationTime * Region` 3. `exp ~ DegradationTime + Region + CellTypeProp` 4. `exp ~ DegradationTime * Region + CellTypeProp` `select_transcripts()` returns degradation-associated transcripts and supports two parameters. First, `top_n` controls how many significant transcripts to extract from each model. When `cell_component = TRUE`, all four models are used; otherwise, just the first two are used. The union of significant transcripts from all used models is returned. As an example, we’ll subset our `RangedSummarizedExperiment` to the union of the top 1000 significant transcripts derived from each of the four models. ``` r # Subset 'rse_tx' to the top 1000 significant transcripts from the four # degradation models DegTx <- getDegTx( rse_tx, sig_transcripts = select_transcripts(top_n = 1000, cell_component = TRUE) ) #> Using 2496 degradation-associated transcripts. ## Now we can compute the Principal Components (PCs) of the degraded ## transcripts pcTx <- getPCs(DegTx, "tpm") ``` Next we use the `k_qsvs()` function to calculate how many PCs we will need to account for the variation. A model matrix accounting for relevant variables should be used. Common variables such as Age, Sex, Race and Region are often included in the model. Again we are using our `RangedSummarizedExperiment` `DegTx` as the `rse_tx` option. Next we specify the `mod` with our `model.matrix()`. `model.matrix()` creates a design (or model) matrix, e.g., by expanding factors to a set of dummy variables (depending on the contrasts) and expanding interactions similarly. For more information on creating a design matrix for your experiment see the documentation [here](http://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/designmatrices.html). Again we use the `assayname` option to specify the we are using the `tpm` assay, where TPM stands for *transcripts per million*. ``` r ## Using a simple statistical model we determine the number of PCs needed (k) mod <- model.matrix(~ Dx + Age + Sex + Race + Region, data = colData(rse_tx) ) k <- k_qsvs(DegTx, mod, "tpm") print(k) #> [1] 20 ``` Now that we have our PCs and the number we need we can generate our qSVs. ``` r ## Obtain the k qSVs qsvs <- get_qsvs(pcTx, k) dim(qsvs) #> [1] 900 20 ``` This can be done in one step with our wrapper function `qSVA` which just combinds all the previous mentioned functions. ``` r ## Example use of the wrapper function qSVA() qsvs_wrapper <- qSVA( rse_tx = rse_tx, sig_transcripts = select_transcripts(top_n = 1000, cell_component = TRUE), mod = mod, assayname = "tpm" ) #> Using 2496 degradation-associated transcripts. dim(qsvs_wrapper) #> [1] 900 20 ``` ## Differential Expression Next we can use a standard `limma` package approach to do differential expression on the data. The key here is that we add our qSVs to the statistical model we use through `model.matrix()`. Here we input our [`Ranged SummarizedExperiment`](https://www.rdocumentation.org/packages/SummarizedExperiment/versions/1.2.3/topics/RangedSummarizedExperiment-class) object and our `model.matrix` with qSVs. Note here that the `Ranged SummarizedExperiment` object is the original object loaded with the full list of transcripts, not the the one we subsetted for qSVs. This is because while PCs can be generated from a subset of genes, differential expression is best done on the full dataset. The expected output is a `sigTx` object that shows the results of differential expression. ``` r library("limma") ## Add the qSVs to our statistical model mod_qSVA <- cbind( mod, qsvs ) ## Extract the transcript expression values and put them in the ## log2(TPM + 1) scale txExprs <- log2(assays(rse_tx)$tpm + 1) ## Run the standard linear model for differential expression fitTx <- lmFit(txExprs, mod_qSVA) eBTx <- eBayes(fitTx) ## Extract the differential expression results sigTx <- topTable(eBTx, coef = 2, p.value = 1, number = nrow(rse_tx) ) ## Explore the top results head(sigTx) #> logFC AveExpr t P.Value adj.P.Val #> ENST00000484223.1 -0.17439018 1.144051 -6.685583 4.099898e-11 8.121610e-06 #> ENST00000344423.9 0.09212678 1.837102 6.449533 1.855943e-10 1.838246e-05 #> ENST00000399808.4 0.28974369 4.246788 6.320041 4.165237e-10 2.233477e-05 #> ENST00000467370.5 0.06313938 0.301711 6.307179 4.509956e-10 2.233477e-05 #> ENST00000264657.9 0.09913353 2.450684 5.933186 4.280565e-09 1.375288e-04 #> ENST00000415912.6 0.09028757 1.736581 5.918230 4.671963e-09 1.375288e-04 #> B #> ENST00000484223.1 14.338379 #> ENST00000344423.9 12.865110 #> ENST00000399808.4 12.077344 #> ENST00000467370.5 11.999896 #> ENST00000264657.9 9.811110 #> ENST00000415912.6 9.726142 ``` Finally, you can compare the resulting t-statistics from your differential expression model against the degradation time t-statistics adjusting for the six different brain regions. This type of plot is called `DEqual` plot and was shown in the initial qSVA framework paper ([Jaffe et al, PNAS, 2017](https://doi.org/10.1073/pnas.1617384114)). We are really looking for two patterns exemplified here in Figure 1 (cartoon shown earlier). A direct positive correlation with degradation shown in Figure 1 on the right tells us that there is signal in the data associated with qSVs. An example of nonconfounded data or data that has been modeled can be seen in Figure 1 on the right with its lack of relationship between the x and y variables. <div class="figure"> <img src="./man/figures/DEqual_example.png" alt="Cartoon showing patterns in DEqual plots" width="100%" /> <p class="caption"> Cartoon showing patterns in DEqual plots </p> </div> ``` r ## Generate a DEqual() plot using the model results with qSVs DEqual(sigTx) ``` <div class="figure"> <img src="man/figures/README-DEqual-1.png" alt="Result of Differential Expression with qSVA normalization." width="100%" /> <p class="caption"> Result of Differential Expression with qSVA normalization. </p> </div> For comparison, here is the `DEqual()` plot for the model without qSVs. ``` r ## Generate a DEqual() plot using the model results without qSVs DEqual(topTable(eBayes(lmFit(txExprs, mod)), coef = 2, p.value = 1, number = nrow(rse_tx))) ``` <div class="figure"> <img src="man/figures/README-DEqual-no-qSVs-1.png" alt="Result of Differential Expression without qSVA normalization." width="100%" /> <p class="caption"> Result of Differential Expression without qSVA normalization. </p> </div> In these two DEqual plots we can see that the first is much better. With a correlation of -0.014 we can effectively conclude that we have removed the effects of degradation from the data. In the second plot after modeling for several common variables we still have a correlation of 0.5 with the degradation experiment. This high correlation shows we still have a large amount of signal from degradation in our data potentially confounding our case-control (SCZD vs neurotypical controls) differential expression results.