... | ... |
@@ -1,6 +1,6 @@ |
1 | 1 |
Package: GRaNIE |
2 | 2 |
Title: GRaNIE: Reconstruction cell type specific gene regulatory networks including enhancers using chromatin accessibility and RNA-seq data |
3 |
-Version: 1.3.2 |
|
3 |
+Version: 1.3.3 |
|
4 | 4 |
Encoding: UTF-8 |
5 | 5 |
Authors@R: c(person("Christian", "Arnold", email = |
6 | 6 |
"chrarnold@web.de", role = c("cre","aut")), |
... | ... |
@@ -1,4 +1,15 @@ |
1 |
-# GRaNIE 1.1.14 to 1.1.21 (2022-12-13) |
|
1 |
+# GRaNIE 1.1.22 to 1.3.3 (2022-11-29) |
|
2 |
+ |
|
3 |
+## Major changes |
|
4 |
+- additional normalization schemes have been implemented, including GC-aware normalization schemes for peaks, and existing normalization methods have been renamed for clarity. See `?addData` for details. |
|
5 |
+- further reduced the package burden; the large genome annotation packages are now more or less fully optional and only needed when a GC-aware normalization has been chosen or when additional peak annotation is wanted. However, in contrast to before, none of these annotation packages are strictly required anywhere anymore |
|
6 |
+ |
|
7 |
+## Minor changes |
|
8 |
+- various small changes in the code |
|
9 |
+- vignette updates |
|
10 |
+ |
|
11 |
+ |
|
12 |
+# GRaNIE 1.1.14 to 1.1.21 (2022-11-13) |
|
2 | 13 |
|
3 | 14 |
## Major changes |
4 | 15 |
- major object changes and optimizations, particularly related to storing the count matrices in an optimized and simpler format. In short, the count matrices are now stored either as normal or sparse matrices, depending on the amount of zeros present. In addition, only the counts after normalization are saved, the raw counts before applying normalization are not stored anymore. If no normalization is wished by the user, as before, the "normalized" counts are equal to the raw counts. `GRaNIE` is now more readily applicable for larger analyses and single-cell analysis even though we just started actively optimizing for it, so we cannot yet recommend applying our framework in a single-cell manner. Older GRN objects are automatically changed internally when executing the major functions upon the first invocation. |
... | ... |
@@ -685,7 +685,7 @@ addData <- function(GRN, counts_peaks, normalization_peaks = "DESeq2_sizeFactors |
685 | 685 |
peak.ID = query$peakID, |
686 | 686 |
peak.GC.class = cut(.data$`G|C`, breaks = seq(0,1,1/nBins), include.lowest = TRUE, ordered_result = TRUE)) %>% |
687 | 687 |
dplyr::rename(peak.GC.perc = .data$`G|C`) %>% |
688 |
- dplyr::select(peak.ID, everything()) |
|
688 |
+ dplyr::select(.data$peak.ID, tidyselect::everything()) |
|
689 | 689 |
|
690 | 690 |
|
691 | 691 |
.printExecutionTime(start) |
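Reviewer note: the GC binning in this hunk can be reproduced in isolation with base R. The sketch below uses made-up GC fractions and an assumed `nBins` of 4 (neither is taken from the package) to show what the `cut()` call with these arguments returns.

```r
# Hypothetical GC fractions for five peaks (illustration only)
gc_fraction <- c(0, 0.12, 0.48, 0.75, 1)
nBins <- 4  # assumed value; the package passes its own nBins

# Same arguments as in the hunk above: equal-width bins on [0, 1].
# include.lowest = TRUE closes the first bin on the left, so a GC
# fraction of exactly 0 is binned instead of becoming NA.
gc_class <- cut(gc_fraction,
                breaks = seq(0, 1, 1 / nBins),
                include.lowest = TRUE,
                ordered_result = TRUE)

print(table(gc_class))
```

Because `ordered_result = TRUE`, the resulting factor is ordered, so GC classes can be compared or sorted from low to high GC content.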
... | ... |
@@ -819,7 +819,7 @@ addData <- function(GRN, counts_peaks, normalization_peaks = "DESeq2_sizeFactors |
819 | 819 |
|
820 | 820 |
futile.logger::flog.info(paste0(" Using the csaw-derived TMM-derived normalization factors as size factors, overriding the DESeq-default size factors.")) |
821 | 821 |
|
822 |
- sizeFactors(dd) <- sizeFactors |
|
822 |
+ DESeq2::sizeFactors(dd) <- sizeFactors |
|
823 | 823 |
} |
824 | 824 |
|
825 | 825 |
dataNorm = DESeq2::counts(dd, normalized=TRUE) |
... | ... |
@@ -877,8 +877,8 @@ addData <- function(GRN, counts_peaks, normalization_peaks = "DESeq2_sizeFactors |
877 | 877 |
# (2) effects related to between-lane distributional differences, e.g., sequencing depth. |
878 | 878 |
# Accordingly, withinLaneNormalization and betweenLaneNormalization adjust for the first and second type of effects, respectively. |
879 | 879 |
# We recommend to normalize for within-lane effects prior to between-lane normalization. |
880 |
- dataWithin <- withinLaneNormalization(data, y = peaks_GC_fraction, which= withinLane_method, num.bins = nBins, round = roundResults) |
|
881 |
- dataNorm <- betweenLaneNormalization(dataWithin, which=betweenLane_method, round = roundResults) |
|
880 |
+ dataWithin <- EDASeq::withinLaneNormalization(data, y = peaks_GC_fraction, which= withinLane_method, num.bins = nBins, round = roundResults) |
|
881 |
+ dataNorm <- EDASeq::betweenLaneNormalization(dataWithin, which=betweenLane_method, round = roundResults) |
|
882 | 882 |
|
883 | 883 |
|
884 | 884 |
} else if (normalization == "gcqn_peaks") { |
... | ... |
@@ -28,6 +28,9 @@ GRaNIE and GRaNPA: Inference and evaluation of enhancer-mediated gene regulatory |
28 | 28 |
|
29 | 29 |
For issues, bugs, and feature requests, please see the [Issue Tracker](https://git.embl.de/grp-zaugg/GRaNIE/issues). |
30 | 30 |
|
31 |
+**We are actively working on the package and regularly improve upon features, add features, or change features for increased clarity. This sometimes results in minor changes to the workflow, changed argument names or other small incompatibilities that may result in errors when running a version of the package that differs from the version this vignette has been run for.** |
|
32 |
+**Thus, make sure to run a version of `GRaNIE` that is compatible with this vignette. If in doubt or when you receive errors, check the R help, which always contains the most up-to-date documentation.** |
|
33 |
+ |
|
31 | 34 |
If you have other questions or comments, feel free to contact us. We will be happy to answer any questions related to this project as well as questions related to the software implementation. For method-related questions, contact Judith B. Zaugg (judith.zaugg@embl.de). For technical questions, contact Christian Arnold (christian.arnold@embl.de). We will aim to respond in a timely manner. |
32 | 35 |
|
33 | 36 |
|
... | ... |
@@ -7,10 +7,10 @@ |
7 | 7 |
addData( |
8 | 8 |
GRN, |
9 | 9 |
counts_peaks, |
10 |
- normalization_peaks = "DESeq2_sizeFactor", |
|
10 |
+ normalization_peaks = "DESeq2_sizeFactors", |
|
11 | 11 |
idColumn_peaks = "peakID", |
12 | 12 |
counts_rna, |
13 |
- normalization_rna = "quantile", |
|
13 |
+ normalization_rna = "limma_quantile", |
|
14 | 14 |
idColumn_RNA = "ENSEMBL", |
15 | 15 |
sampleMetadata = NULL, |
16 | 16 |
additionalParams.l = list(), |
... | ... |
@@ -26,7 +26,7 @@ addData( |
26 | 26 |
In addition to the count data, it must also contain one ID column with a particular format, see the argument \code{idColumn_peaks} below. |
27 | 27 |
Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table.} |
28 | 28 |
|
29 |
-\item{normalization_peaks}{Character. Default \code{DESeq2_sizeFactor}. Normalization procedure for peak data. |
|
29 |
+\item{normalization_peaks}{Character. Default \code{DESeq2_sizeFactors}. Normalization procedure for peak data. |
|
30 | 30 |
Must be one of \code{limma_cyclicloess}, \code{limma_quantile}, \code{limma_scale}, \code{csaw_cyclicLoess_orig}, \code{csaw_TMM}, |
31 | 31 |
\code{EDASeq_GC_peaks}, \code{gcqn_peaks}, \code{DESeq2_sizeFactors}, \code{none}.} |
32 | 32 |
|
... | ... |
@@ -38,7 +38,7 @@ of the peak coordinates, respectively. End must be bigger than start. Examples f |
38 | 38 |
In addition to the count data, it must also contain one ID column with a particular format, see the argument \code{idColumn_rna} below. |
39 | 39 |
Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table.} |
40 | 40 |
|
41 |
-\item{normalization_rna}{Character. Default \code{quantile}. Normalization procedure for peak data. |
|
41 |
+\item{normalization_rna}{Character. Default \code{limma_quantile}. Normalization procedure for RNA data. |
|
42 | 42 |
Must be one of \code{limma_cyclicloess}, \code{limma_quantile}, \code{limma_scale}, \code{csaw_cyclicLoess_orig}, \code{csaw_TMM}, \code{DESeq2_sizeFactors}, \code{none}.} |
43 | 43 |
|
44 | 44 |
\item{idColumn_RNA}{Character. Default \code{ENSEMBL}. Name of the column in the \code{counts_rna} data frame that contains Ensembl IDs.} |
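Reviewer note: since the defaults were renamed (`quantile` to `limma_quantile`, `DESeq2_sizeFactor` to `DESeq2_sizeFactors`), existing scripts that pass the old names may need updating. The following base-R sketch checks a value against the lists documented above. `check_normalization()` is a hypothetical helper written for illustration; it is not part of `GRaNIE`, and the package's own argument checks may behave differently.

```r
# Allowed values as listed in the addData documentation above
allowed_rna <- c("limma_cyclicloess", "limma_quantile", "limma_scale",
                 "csaw_cyclicLoess_orig", "csaw_TMM",
                 "DESeq2_sizeFactors", "none")
# Peaks additionally support the two GC-aware schemes
allowed_peaks <- c(allowed_rna, "EDASeq_GC_peaks", "gcqn_peaks")

# Hypothetical helper (not a GRaNIE function): stop early on unknown names
check_normalization <- function(value, type = c("peaks", "rna")) {
  type <- match.arg(type)
  allowed <- if (type == "peaks") allowed_peaks else allowed_rna
  if (!is.character(value) || length(value) != 1 || !value %in% allowed) {
    stop("Unsupported ", type, " normalization '", value,
         "'. Must be one of: ", paste(allowed, collapse = ", "),
         call. = FALSE)
  }
  invisible(value)
}

check_normalization("limma_quantile", type = "rna")   # new name: accepted
check_normalization("gcqn_peaks", type = "peaks")     # GC-aware: peaks only
```

Calling the helper with the pre-rename value `"quantile"`, or with a GC-aware scheme for RNA, raises an error that lists the accepted names.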
... | ... |
@@ -6,9 +6,9 @@ |
6 | 6 |
\usage{ |
7 | 7 |
filterData( |
8 | 8 |
GRN, |
9 |
- minNormalizedMean_peaks = 5, |
|
9 |
+ minNormalizedMean_peaks = NULL, |
|
10 | 10 |
maxNormalizedMean_peaks = NULL, |
11 |
- minNormalizedMeanRNA = 1, |
|
11 |
+ minNormalizedMeanRNA = NULL, |
|
12 | 12 |
maxNormalizedMeanRNA = NULL, |
13 | 13 |
chrToKeep_peaks = NULL, |
14 | 14 |
minSize_peaks = 20, |
... | ... |
@@ -23,13 +23,17 @@ filterData( |
23 | 23 |
\arguments{ |
24 | 24 |
\item{GRN}{Object of class \code{\linkS4class{GRN}}} |
25 | 25 |
|
26 |
-\item{minNormalizedMean_peaks}{Numeric[0,] or \code{NULL}. Default 5. Minimum mean across all samples for a peak to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter.} |
|
26 |
+\item{minNormalizedMean_peaks}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Minimum mean across all samples for a peak to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter. |
|
27 |
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.} |
|
27 | 28 |
|
28 |
-\item{maxNormalizedMean_peaks}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum mean across all samples for a peak to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter.} |
|
29 |
+\item{maxNormalizedMean_peaks}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum mean across all samples for a peak to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter. |
|
30 |
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.} |
|
29 | 31 |
|
30 |
-\item{minNormalizedMeanRNA}{Numeric[0,] or \code{NULL}. Default 5. Minimum mean across all samples for a gene to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter.} |
|
32 |
+\item{minNormalizedMeanRNA}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Minimum mean across all samples for a gene to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter. |
|
33 |
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.} |
|
31 | 34 |
|
32 |
-\item{maxNormalizedMeanRNA}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum mean across all samples for a gene to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter.} |
|
35 |
+\item{maxNormalizedMeanRNA}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum mean across all samples for a gene to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter. |
|
36 |
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.} |
|
33 | 37 |
|
34 | 38 |
\item{chrToKeep_peaks}{Character vector or \code{NULL}. Default \code{NULL}. Vector of chromosomes that peaks are allowed to come from. This filter can be used to filter sex chromosomes from the peaks, for example (e.g, \code{c(paste0("chr", 1:22), "chrX", "chrY")})} |
35 | 39 |
|
... | ... |
@@ -37,13 +41,17 @@ filterData( |
37 | 41 |
|
38 | 42 |
\item{maxSize_peaks}{Integer[1,] or \code{NULL}. Default 10000. Maximum peak size (width, end - start) for a peak to be retained. Set to \code{NULL} for not applying the filter.} |
39 | 43 |
|
40 |
-\item{minCV_peaks}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Minimum CV (coefficient of variation, a unitless measure of variation) for a peak to be retained. Set to \code{NULL} for not applying the filter.} |
|
44 |
+\item{minCV_peaks}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Minimum CV (coefficient of variation, a unitless measure of variation) for a peak to be retained. Set to \code{NULL} for not applying the filter. |
|
45 |
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.} |
|
41 | 46 |
|
42 |
-\item{maxCV_peaks}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum CV (coefficient of variation, a unitless measure of variation) for a peak to be retained. Set to \code{NULL} for not applying the filter.} |
|
47 |
+\item{maxCV_peaks}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum CV (coefficient of variation, a unitless measure of variation) for a peak to be retained. Set to \code{NULL} for not applying the filter. |
|
48 |
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.} |
|
43 | 49 |
|
44 |
-\item{minCV_genes}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Minimum CV (coefficient of variation, a unitless measure of variation) for a gene to be retained. Set to \code{NULL} for not applying the filter.} |
|
50 |
+\item{minCV_genes}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Minimum CV (coefficient of variation, a unitless measure of variation) for a gene to be retained. Set to \code{NULL} for not applying the filter. |
|
51 |
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.} |
|
45 | 52 |
|
46 |
-\item{maxCV_genes}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum CV (coefficient of variation, a unitless measure of variation) for a gene to be retained. Set to \code{NULL} for not applying the filter.} |
|
53 |
+\item{maxCV_genes}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum CV (coefficient of variation, a unitless measure of variation) for a gene to be retained. Set to \code{NULL} for not applying the filter. |
|
54 |
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.} |
|
47 | 55 |
|
48 | 56 |
\item{forceRerun}{\code{TRUE} or \code{FALSE}. Default \code{FALSE}. Force execution, even if the GRN object already contains the result. Overwrites the old results.} |
49 | 57 |
} |
... | ... |
@@ -51,8 +59,11 @@ filterData( |
51 | 59 |
An updated \code{\linkS4class{GRN}} object, with added data from this function. |
52 | 60 |
} |
53 | 61 |
\description{ |
54 |
-This function marks genes and/or peaks as \code{filtered} depending on the chosen filtering criteria. Filtered genes / peaks will then be |
|
55 |
-disregarded when adding connections in subsequent steps via \code{\link{addConnections_TF_peak}} and \code{\link{addConnections_peak_gene}}. \strong{This function does NOT (re)filter existing connections when the \code{\linkS4class{GRN}} object already contains connections. Thus, upon re-execution of this function with different filtering criteria, all downstream steps have to be re-run.} |
|
62 |
+This function marks genes and/or peaks as \code{filtered} depending on the chosen filtering criteria and is based on the count data AFTER |
|
63 |
+potential normalization as chosen when using the \code{\link{addData}} function. Most of the filters may not be meaningful and useful anymore to apply |
|
64 |
+after using particular normalization schemes that can give rise to, for example, negative values such as cyclic loess normalization. If normalized counts do |
|
65 |
+not represent counts anymore but rather a deviation from a mean or something alike, the filtering criteria usually do not make sense anymore. |
|
66 |
+Filtered genes / peaks will then be disregarded when adding connections in subsequent steps via \code{\link{addConnections_TF_peak}} and \code{\link{addConnections_peak_gene}}. \strong{This function does NOT (re)filter existing connections when the \code{\linkS4class{GRN}} object already contains connections. Thus, upon re-execution of this function with different filtering criteria, all downstream steps have to be re-run.} |
|
56 | 67 |
} |
57 | 68 |
\details{ |
58 | 69 |
All this function does is setting (or modifying) the filtering flag in \code{GRN@data$peaks$counts_metadata} and \code{GRN@data$RNA$counts_metadata}, respectively. |
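Reviewer note: the warning added to the filter arguments can be illustrated with simulated data. The sketch below is base R only and does not call `GRaNIE`; the two matrices are assumptions chosen to mimic count-like values on the one hand and values centered around zero (as some normalizations may produce) on the other.

```r
set.seed(1)

# Hypothetical normalized values for 100 peaks x 6 samples.
# Count-like: non-negative, as after a size-factor normalization.
counts_like <- matrix(rpois(600, lambda = 20), nrow = 100)
# Centered: values scattered around zero (assumption for illustration).
centered <- matrix(rnorm(600, mean = 0, sd = 1), nrow = 100)

# Coefficient of variation per row: standard deviation divided by mean
row_cv <- function(m) apply(m, 1, sd) / rowMeans(m)

# A minimum-mean filter of 5 is meaningful for count-like data ...
keep_counts <- rowMeans(counts_like) >= 5
# ... but discards essentially every feature in the centered data
keep_centered <- rowMeans(centered) >= 5

# The CV divides by the mean, so it turns negative or becomes very
# large in magnitude when the mean is close to zero
cv_centered <- row_cv(centered)

cat("kept (count-like):", sum(keep_counts), "of", length(keep_counts), "\n")
cat("kept (centered):  ", sum(keep_centered), "of", length(keep_centered), "\n")
cat("negative CVs (centered):", sum(cv_centered < 0), "\n")
```

This is why switching the mean filters to `NULL` by default is the safer choice: the thresholds only have a clear interpretation when the normalized values still behave like counts.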
... | ... |
@@ -71,7 +71,7 @@ Overall, we tried to minimize the installation burden and only require packages |
71 | 71 |
|
72 | 72 |
- **Genome assembly and annotation packages** (only one of the four is optionally needed, which of them depends on your genome assembly version) |
73 | 73 |
|
74 |
- - all of the following packages are ONLY needed for either additional peak annotation in combination with `ChIPseeker` (see below) or if the chosen peak normalization method is GC based: `EDASeq_GC_peaks`, `gcqn_peaks` |
|
74 |
+ - **all of the following packages are ONLY needed for either additional peak annotation in combination with `ChIPseeker` (see below) or if the chosen peak normalization method is GC based: `EDASeq_GC_peaks`, `gcqn_peaks`** |
|
75 | 75 |
- `org.Hs.eg.db`: Needed only when genome assembly version is `hg19` or `hg38` |
76 | 76 |
- `org.Mm.eg.db`: Needed only when genome assembly version is `mm9` or `mm10` |
77 | 77 |
- `BSgenome.Hsapiens.UCSC.hg19`, `TxDb.Hsapiens.UCSC.hg19.knownGene`: Needed only when genome assembly version is `hg19` |
... | ... |
@@ -162,7 +162,7 @@ In this section, we give methodological details and guidelines. |
162 | 162 |
|
163 | 163 |
## Data normalization {#methods_dataNorm} |
164 | 164 |
|
165 |
-An important consideration is data normalization for RNA and open chromatin data. We currently support three choices of normalization of either peak or RNA-Seq data: `quantile` (using `limma::normalizeQuantiles`), `DESeq_sizeFactor` and `none` and refer to the R help for more details (`?addData`). The default for RNA-Seq is a quantile normalization, while for the open chromatin peak data, it is `DESeq_sizeFactor` (i.e., a "regular" `DESeq` size factor normalization). Importantly, `DESeq_sizeFactor` requires raw data, while `quantile` does not necessarily. We nevertheless recommend raw data as input, although it is also possible to provide pre-normalized data as input and then topping this up with another normalization method or "none". |
|
165 |
+An important consideration is data normalization for RNA and open chromatin data. We currently support six choices of normalization of either peak or RNA-Seq data: `limma_quantile`, `DESeq2_sizeFactors`, `limma_cyclicloess`, `limma_scale`, `csaw_cyclicLoess_orig`, `csaw_TMM` plus `none` to skip normalization altogether. For peaks, two additional GC-based normalization schemes are offered: `EDASeq_GC_peaks` and `gcqn_peaks`. We refer to the R help for more details (`?addData`). The default for RNA-Seq is a quantile normalization, while for the open chromatin peak data, it is `DESeq2_sizeFactors` (i.e., a "regular" `DESeq2` size factor normalization). Importantly, `DESeq2_sizeFactors` requires raw data, while `limma_quantile` does not necessarily. We nevertheless recommend raw data as input, although it is also possible to provide pre-normalized data as input and then top this up with another normalization method or `none`. |
|
166 | 166 |
|
167 | 167 |
While we recommend raw counts for both peaks and RNA-Seq as input and offer several normalization choices in the pipeline, it is also possible to provide pre-normalized data. Note that the normalization method may have a large influence on the resulting *eGRN* network, so make sure the choice of normalization is reasonable. For more details, see the next sections. |
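Reviewer note: for readers who want to see what a quantile normalization does to the data, here is a simplified base-R sketch of the idea. It is not the implementation used by the package, which delegates to `limma`; tie and missing-value handling in particular are simplified, and `quantile_normalize()` and the toy matrix are hypothetical.

```r
# Simplified quantile normalization (illustration only, not limma's code)
quantile_normalize <- function(m) {
  m0 <- unname(m)
  # Shared reference distribution: mean of the sorted columns
  reference <- rowMeans(apply(m0, 2, sort))
  # Replace each value by the reference value at its within-column rank
  normalized <- apply(m0, 2, function(col) {
    reference[rank(col, ties.method = "first")]
  })
  dimnames(normalized) <- dimnames(m)
  normalized
}

# Toy matrix: 4 genes x 3 samples (made-up numbers)
toy <- matrix(c(5, 2, 3, 4,
                4, 1, 4, 2,
                3, 4, 6, 8),
              nrow = 4,
              dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))

toy_norm <- quantile_normalize(toy)
print(toy_norm)
```

After this transformation every sample has the same set of values and only the ranking of genes within each sample differs, which illustrates why the choice of normalization can have a large influence on downstream correlations.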
168 | 168 |
|
... | ... |
@@ -105,6 +105,12 @@ library(GRaNIE) |
105 | 105 |
|
106 | 106 |
When installing *GRaNIE*, all required dependency packages are automatically installed. In addition, *GRaNIE* needs some additional packages for special functionality, packages that are not strictly necessary for the workflow but which enhance the functionality, may be required depending on certain parameters (such as your genome assembly version), or may be required only when using a particular functionality (such as the `robust` package for robust regression). The package will automatically check if any of these packages are missing during execution, and inform the user when a package is missing, along with a line to copy for pasting into R for installation. |
107 | 107 |
|
108 |
+## Note on version compatibility and errors in the vignette |
|
109 |
+ |
|
110 |
+**We are actively working on the package and regularly improve upon features, add features, or change features for increased clarity. This sometimes results in minor changes to the workflow, changed argument names or other small incompatibilities that may result in errors when running a version of the package that differs from the version this vignette has been run for.** |
|
111 |
+ |
|
112 |
+**Thus, make sure to run a version of `GRaNIE` that is compatible with this vignette. If in doubt or when you receive errors, check the R help, which always contains the most up-to-date documentation.** |
|
113 |
+ |
|
108 | 114 |
|
109 | 115 |
|
110 | 116 |
## General notes |
... | ... |
@@ -193,7 +199,7 @@ At any time point, we can simply "print" a `GRaNIE` object by typing its name an |
193 | 199 |
|
194 | 200 |
We are now ready to fill our empty object with data! After preparing the data beforehand, we can now use the data import function `addData()` to import both enhancers and RNA-seq data to the `GRaNIE` object. In addition to the count tables, we explicitly specify the name of the ID columns. As mentioned before, the sample metadata is optional but recommended if available. |
195 | 201 |
|
196 |
-An important consideration is data normalization for RNA and ATAC. We support many different choices of normalization, the selection of which also depends on whether RNA or peaks is considered, and possible choices are: `limma_quantile`, `DESeq2_sizeFactors` and `none` and refer to the R help for more details (`?addData`). The default for RNA-Seq is a quantile normalization, while for the open chromatin enhancer data, it is `DESeq2_sizeFactor2` (i.e., a "regular" `DESeq` size factor normalization). Importantly, `DESeq2_sizeFactors` requires raw data, while `quantile` does not necessarily. We nevertheless recommend raw data as input, although it is also possible to provide pre-normalized data as input and then topping this up with another normalization method or "none". |
|
202 |
+An important consideration is data normalization for RNA and ATAC. We support many different choices of normalization, the selection of which also depends on whether RNA or peaks is considered. Possible choices include `limma_quantile`, `DESeq2_sizeFactors` and `none`; we refer to the R help for more details (`?addData`). The default for RNA-Seq is a quantile normalization, while for the open chromatin enhancer data, it is `DESeq2_sizeFactors` (i.e., a "regular" `DESeq2` size factor normalization). Importantly, `DESeq2_sizeFactors` requires raw data, while `limma_quantile` does not necessarily. We nevertheless recommend raw data as input, although it is also possible to provide pre-normalized data as input and then top this up with another normalization method or `none`. |
|
197 | 203 |
|
198 | 204 |
|
199 | 205 |
```{r addData, echo=TRUE, include=TRUE, eval = FALSE, class.output="scroll-200"} |