Documentation updates

Christian Arnold authored on 29/11/2022 17:53:49
Showing 9 changed files

... ...
@@ -1,6 +1,6 @@
1 1
 Package: GRaNIE
2 2
 Title: GRaNIE: Reconstruction cell type specific gene regulatory networks including enhancers using chromatin accessibility and RNA-seq data
3
-Version: 1.3.2
3
+Version: 1.3.3
4 4
 Encoding: UTF-8
5 5
 Authors@R: c(person("Christian", "Arnold", email =
6 6
         "chrarnold@web.de", role = c("cre","aut")),
... ...
@@ -38,6 +38,7 @@ export(plotPCA_all)
38 38
 export(plotTFEnrichment)
39 39
 export(plot_stats_connectionSummary)
40 40
 export(visualizeGRN)
41
+import(GenomeInfoDb)
41 42
 import(GenomicRanges)
42 43
 import(checkmate)
43 44
 import(ggplot2)
... ...
@@ -1,4 +1,15 @@
1
-# GRaNIE 1.1.14 to 1.1.21 (2022-12-13)
1
+# GRaNIE 1.1.22 to 1.3.3 (2022-11-29)
2
+
3
+## Major changes
4
+- additional normalization schemes have been implemented, including GC-aware normalization schemes for peaks, and existing normalization methods have been renamed for clarity. See `?addData` for details.
5
+- further reduced the package burden; the large genome annotation packages are now more or less fully optional and only needed when a GC-aware normalization has been chosen or when additional peak annotation is wanted. However, in contrast to before, none of these annotation packages are strictly required anywhere anymore
6
+
7
+## Minor changes
8
+- various small changes in the code
9
+- vignette updates
10
+
11
+
12
+# GRaNIE 1.1.14 to 1.1.21 (2022-11-13)
2 13
 
3 14
 ## Major changes
4 15
 - major object changes and optimizations, particularly related to storing the count matrices in an optimized and simpler format. In short, the count matrices are now stored either as normal or sparse matrices, depending on the amount of zeros present. In addition, only the counts after normalization are saved, the raw counts before applying normalization are not stored anymore. If no normalization is wished by the user, as before, the "normalized" counts are equal to the raw counts. `GRaNIE` is now more readily applicable for larger analyses and single-cell analysis even though we just started actively optimizing for it, so we cannot yet recommend applying our framework in a single-cell manner. Older GRN objects are automatically changed internally when executing the major functions upon the first invocation.
... ...
@@ -685,7 +685,7 @@ addData <- function(GRN, counts_peaks, normalization_peaks = "DESeq2_sizeFactors
685 685
                   peak.ID = query$peakID,
686 686
                   peak.GC.class = cut(.data$`G|C`, breaks = seq(0,1,1/nBins), include.lowest = TRUE, ordered_result = TRUE)) %>%
687 687
     dplyr::rename(peak.GC.perc    = .data$`G|C`) %>%
688
-    dplyr::select(peak.ID, everything())
688
+    dplyr::select(.data$peak.ID, tidyselect::everything())
689 689
   
690 690
 
691 691
   .printExecutionTime(start)
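The `peak.GC.class` column in the hunk above is built with base R's `cut()`. The following is a minimal, self-contained sketch of that binning step only; the GC fractions and `nBins = 10` are made-up illustration values and are not taken from the package.

```r
# Hypothetical GC fractions for five peaks (illustration only)
gc_fraction <- c(0, 0.05, 0.42, 0.97, 1)
nBins <- 10  # assumed value; the package takes nBins from its own settings

# Equal-width bins over [0, 1]; include.lowest = TRUE keeps a GC fraction of
# exactly 0 in the first bin, ordered_result = TRUE yields an ordered factor
gc_class <- cut(gc_fraction,
                breaks = seq(0, 1, 1 / nBins),
                include.lowest = TRUE,
                ordered_result = TRUE)

# Number of peaks per GC bin
table(gc_class)
```

Without `include.lowest = TRUE`, a peak with a GC fraction of exactly 0 would fall outside the first right-closed interval and be assigned `NA`.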
... ...
@@ -819,7 +819,7 @@ addData <- function(GRN, counts_peaks, normalization_peaks = "DESeq2_sizeFactors
819 819
             
820 820
             futile.logger::flog.info(paste0("  Using the csaw-derived TMM-derived normalization factors as size factors, overriding the DESeq-default size factors."))
821 821
             
822
-            sizeFactors(dd) <- sizeFactors
822
+            DESeq2::sizeFactors(dd) <- sizeFactors
823 823
         }
824 824
         
825 825
         dataNorm = DESeq2::counts(dd, normalized=TRUE)
... ...
@@ -877,8 +877,8 @@ addData <- function(GRN, counts_peaks, normalization_peaks = "DESeq2_sizeFactors
877 877
         # (2) effects related to between-lane distributional differences, e.g., sequencing depth. 
878 878
         # Accordingly, withinLaneNormalization and betweenLaneNormalization adjust for the first and second type of effects, respectively. 
879 879
         # We recommend to normalize for within-lane effects prior to between-lane normalization.
880
-        dataWithin <- withinLaneNormalization(data, y = peaks_GC_fraction, which= withinLane_method, num.bins = nBins, round = roundResults)
881
-        dataNorm <- betweenLaneNormalization(dataWithin, which=betweenLane_method, round = roundResults)
880
+        dataWithin <- EDASeq::withinLaneNormalization(data, y = peaks_GC_fraction, which= withinLane_method, num.bins = nBins, round = roundResults)
881
+        dataNorm <- EDASeq::betweenLaneNormalization(dataWithin, which=betweenLane_method, round = roundResults)
882 882
         
883 883
         
884 884
     } else if (normalization == "gcqn_peaks") {
... ...
@@ -28,6 +28,9 @@ GRaNIE and GRaNPA: Inference and evaluation of enhancer-mediated gene regulatory
28 28
 
29 29
 For issues, bugs, and feature requests, please see the [Issue Tracker](https://git.embl.de/grp-zaugg/GRaNIE/issues). 
30 30
 
31
+**We are actively working on the package and regularly improve upon features, add features, or change features for increased clarity. This sometimes results in minor changes to the workflow, changed argument names or other small incompatibilities that may result in errors when running a version of the package that differs from the version this vignette has been run for.**
32
+**Thus, make sure to run a version of `GRaNIE` that is compatible with this vignette. If in doubt or when you receive errors, check the R help, which always contains the most up-to-date documentation.**
33
+
31 34
 If you have other questions or comments, feel free to contact us. We will be happy to answer any questions related to this project as well as questions related to the software implementation. For method-related questions, contact Judith B. Zaugg (judith.zaugg@embl.de). For technical questions, contact Christian Arnold (christian.arnold@embl.de). We will aim to respond in a timely manner.
32 35
 
33 36
  
... ...
@@ -7,10 +7,10 @@
7 7
 addData(
8 8
   GRN,
9 9
   counts_peaks,
10
-  normalization_peaks = "DESeq2_sizeFactor",
10
+  normalization_peaks = "DESeq2_sizeFactors",
11 11
   idColumn_peaks = "peakID",
12 12
   counts_rna,
13
-  normalization_rna = "quantile",
13
+  normalization_rna = "limma_quantile",
14 14
   idColumn_RNA = "ENSEMBL",
15 15
   sampleMetadata = NULL,
16 16
   additionalParams.l = list(),
... ...
@@ -26,7 +26,7 @@ addData(
26 26
 In addition to the count data, it must also contain one ID column with a particular format, see the argument \code{idColumn_peaks} below. 
27 27
 Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table.}
28 28
 
29
-\item{normalization_peaks}{Character. Default \code{DESeq2_sizeFactor}. Normalization procedure for peak data. 
29
+\item{normalization_peaks}{Character. Default \code{DESeq2_sizeFactors}. Normalization procedure for peak data. 
30 30
 Must be one of \code{limma_cyclicloess}, \code{limma_quantile}, \code{limma_scale}, \code{csaw_cyclicLoess_orig}, \code{csaw_TMM}, 
31 31
 \code{EDASeq_GC_peaks}, \code{gcqn_peaks}, \code{DESeq2_sizeFactors}, \code{none}.}
32 32
 
... ...
@@ -38,7 +38,7 @@ of the peak coordinates, respectively. End must be bigger than start. Examples f
38 38
 In addition to the count data, it must also contain one ID column with a particular format, see the argument \code{idColumn_rna} below. 
39 39
 Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table.}
40 40
 
41
-\item{normalization_rna}{Character. Default \code{quantile}. Normalization procedure for peak data. 
41
+\item{normalization_rna}{Character. Default \code{limma_quantile}. Normalization procedure for RNA data. 
42 42
 Must be one of \code{limma_cyclicloess}, \code{limma_quantile}, \code{limma_scale}, \code{csaw_cyclicLoess_orig}, \code{csaw_TMM}, \code{DESeq2_sizeFactors}, \code{none}.}
43 43
 
44 44
 \item{idColumn_RNA}{Character. Default \code{ENSEMBL}. Name of the column in the \code{counts_rna} data frame that contains Ensembl IDs.}
... ...
@@ -6,9 +6,9 @@
6 6
 \usage{
7 7
 filterData(
8 8
   GRN,
9
-  minNormalizedMean_peaks = 5,
9
+  minNormalizedMean_peaks = NULL,
10 10
   maxNormalizedMean_peaks = NULL,
11
-  minNormalizedMeanRNA = 1,
11
+  minNormalizedMeanRNA = NULL,
12 12
   maxNormalizedMeanRNA = NULL,
13 13
   chrToKeep_peaks = NULL,
14 14
   minSize_peaks = 20,
... ...
@@ -23,13 +23,17 @@ filterData(
23 23
 \arguments{
24 24
 \item{GRN}{Object of class \code{\linkS4class{GRN}}}
25 25
 
26
-\item{minNormalizedMean_peaks}{Numeric[0,] or \code{NULL}. Default 5. Minimum mean across all samples for a peak to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter.}
26
+\item{minNormalizedMean_peaks}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Minimum mean across all samples for a peak to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter.
27
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.}
27 28
 
28
-\item{maxNormalizedMean_peaks}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum mean across all samples for a peak to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter.}
29
+\item{maxNormalizedMean_peaks}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum mean across all samples for a peak to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter.
30
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.}
29 31
 
30
-\item{minNormalizedMeanRNA}{Numeric[0,] or \code{NULL}. Default 5. Minimum mean across all samples for a gene to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter.}
32
+\item{minNormalizedMeanRNA}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Minimum mean across all samples for a gene to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter.
33
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.}
31 34
 
32
-\item{maxNormalizedMeanRNA}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum mean across all samples for a gene to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter.}
35
+\item{maxNormalizedMeanRNA}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum mean across all samples for a gene to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter.
36
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.}
33 37
 
34 38
 \item{chrToKeep_peaks}{Character vector or \code{NULL}. Default \code{NULL}. Vector of chromosomes that peaks are allowed to come from. This filter can be used to filter sex chromosomes from the peaks, for example (e.g, \code{c(paste0("chr", 1:22), "chrX", "chrY")})}
35 39
 
... ...
@@ -37,13 +41,17 @@ filterData(
37 41
 
38 42
 \item{maxSize_peaks}{Integer[1,] or \code{NULL}. Default 10000. Maximum peak size (width, end - start) for a peak to be retained. Set to \code{NULL} for not applying the filter.}
39 43
 
40
-\item{minCV_peaks}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Minimum CV (coefficient of variation, a unitless measure of variation) for a peak to be retained. Set to \code{NULL} for not applying the filter.}
44
+\item{minCV_peaks}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Minimum CV (coefficient of variation, a unitless measure of variation) for a peak to be retained. Set to \code{NULL} for not applying the filter.
45
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.}
41 46
 
42
-\item{maxCV_peaks}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum CV (coefficient of variation, a unitless measure of variation) for a peak to be retained. Set to \code{NULL} for not applying the filter.}
47
+\item{maxCV_peaks}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum CV (coefficient of variation, a unitless measure of variation) for a peak to be retained. Set to \code{NULL} for not applying the filter.
48
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.}
43 49
 
44
-\item{minCV_genes}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Minimum CV (coefficient of variation, a unitless measure of variation) for a gene to be retained. Set to \code{NULL} for not applying the filter.}
50
+\item{minCV_genes}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Minimum CV (coefficient of variation, a unitless measure of variation) for a gene to be retained. Set to \code{NULL} for not applying the filter.
51
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.}
45 52
 
46
-\item{maxCV_genes}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum CV (coefficient of variation, a unitless measure of variation) for a gene to be retained. Set to \code{NULL} for not applying the filter.}
53
+\item{maxCV_genes}{Numeric[0,] or \code{NULL}. Default \code{NULL}. Maximum CV (coefficient of variation, a unitless measure of variation) for a gene to be retained. Set to \code{NULL} for not applying the filter.
54
+Be aware that depending on the chosen normalization, this filter may not make sense and should NOT be applied. See the notes for this function.}
47 55
 
48 56
 \item{forceRerun}{\code{TRUE} or \code{FALSE}. Default \code{FALSE}. Force execution, even if the GRN object already contains the result. Overwrites the old results.}
49 57
 }
... ...
@@ -51,8 +59,11 @@ filterData(
51 59
 An updated \code{\linkS4class{GRN}} object, with added data from this function.
52 60
 }
53 61
 \description{
54
-This function marks genes and/or peaks as \code{filtered} depending on the chosen filtering criteria. Filtered genes / peaks will then be 
55
-disregarded when adding connections in subsequent steps via \code{\link{addConnections_TF_peak}} and  \code{\link{addConnections_peak_gene}}. \strong{This function does NOT (re)filter existing connections when the \code{\linkS4class{GRN}} object already contains connections. Thus, upon re-execution of this function with different filtering criteria, all downstream steps have to be re-run.}
62
+This function marks genes and/or peaks as \code{filtered} depending on the chosen filtering criteria and is based on the count data AFTER
63
+potential normalization as chosen when using the \code{\link{addData}} function. Most of the filters may not be meaningful and useful anymore to apply
64
+after using particular normalization schemes that can give rise to, for example, negative values such as cyclic loess normalization. If normalized counts do
65
+not represent counts anymore but rather a deviation from a mean or something alike, the filtering criteria usually do not make sense anymore.
66
+Filtered genes / peaks will then be disregarded when adding connections in subsequent steps via \code{\link{addConnections_TF_peak}} and  \code{\link{addConnections_peak_gene}}. \strong{This function does NOT (re)filter existing connections when the \code{\linkS4class{GRN}} object already contains connections. Thus, upon re-execution of this function with different filtering criteria, all downstream steps have to be re-run.}
56 67
 }
57 68
 \details{
58 69
 All this function does is setting (or modifying) the filtering flag in \code{GRN@data$peaks$counts_metadata} and \code{GRN@data$RNA$counts_metadata}, respectively.
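The warning in the description above can be illustrated with base R alone. This is a sketch under stated assumptions, not the package's implementation: the values are invented, and the coefficient of variation is taken as standard deviation divided by mean, which is the usual definition but is not shown in this diff.

```r
# Hypothetical values for one peak across six samples (illustration only)
counts_like <- c(12, 15, 9, 14, 11, 13)          # still on a count-like scale
centered    <- counts_like - mean(counts_like)   # deviation from the mean, can be negative

# Coefficient of variation, assumed here to be sd / mean
cv <- function(x) stats::sd(x) / mean(x)

mean(counts_like)  # a minimum-mean threshold is interpretable on this scale
cv(counts_like)    # small, unitless measure of variation

mean(centered)     # essentially zero, so a positive minimum-mean filter would drop this peak
cv(centered)       # division by a (near-)zero mean, no longer a meaningful measure
```

On the centered scale, both the mean-based and the CV-based filters react to the transformation rather than to the signal, which is why the filters default to not being applied unless explicitly requested.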
... ...
@@ -71,7 +71,7 @@ Overall, we tried to minimize the installation burden and only require packages
71 71
 
72 72
 - **Genome assembly and annotation packages** (only one of the four is optionally needed, which of them depends on your genome assembly version)
73 73
 
74
-    - all of the following packages are ONLY needed for either additional peak annotation in combination with `ChIPseeker` (see below) or if the chosen peak normalization method is GC based: `EDASeq_GC_peaks`, `gcqn_peaks`
74
+    - **all of the following packages are ONLY needed for either additional peak annotation in combination with `ChIPseeker` (see below) or if the chosen peak normalization method is GC based: `EDASeq_GC_peaks`, `gcqn_peaks`**
75 75
     - `org.Hs.eg.db`: Needed only when genome assembly version is `hg19` or `hg38` 
76 76
     - `org.Mm.eg.db`: Needed only when genome assembly version is `mm9` or `mm10`
77 77
     - `BSgenome.Hsapiens.UCSC.hg19`,  `TxDb.Hsapiens.UCSC.hg19.knownGene`: Needed only when genome assembly version is `hg19`  
... ...
@@ -162,7 +162,7 @@ In this section, we give methodological details and guidelines.
162 162
 
163 163
 ## Data normalization {#methods_dataNorm}
164 164
 
165
-An important consideration is data normalization for RNA and open chromatin data. We currently support three choices of normalization of either peak or RNA-Seq data: `quantile` (using `limma::normalizeQuantiles`), `DESeq_sizeFactor` and `none` and refer to the R help for more details (`?addData`). The default for RNA-Seq is a quantile normalization, while for the open chromatin peak data, it is `DESeq_sizeFactor` (i.e., a "regular" `DESeq` size factor normalization). Importantly, `DESeq_sizeFactor` requires raw data, while `quantile` does not necessarily. We nevertheless recommend raw data as input, although it is also possible to provide pre-normalized data as input and then topping this up with another normalization method or "none". 
165
+An important consideration is data normalization for RNA and open chromatin data. We currently support six choices of normalization of either peak or RNA-Seq data: `limma_quantile`, `DESeq2_sizeFactors`, `limma_cyclicloess`, `limma_scale`, `csaw_cyclicLoess_orig`, `csaw_TMM` plus `none` to skip normalization altogether. For peaks, two additional GC-based normalization schemes are offered: `EDASeq_GC_peaks` and `gcqn_peaks`. We refer to the R help for more details (`?addData`). The default for RNA-Seq is a quantile normalization, while for the open chromatin peak data, it is `DESeq2_sizeFactors` (i.e., a "regular" `DESeq2` size factor normalization). Importantly, `DESeq2_sizeFactors` requires raw data, while `limma_quantile` does not necessarily. We nevertheless recommend raw data as input, although it is also possible to provide pre-normalized data as input and then topping this up with another normalization method or `none`. 
166 166
 
167 167
 While we recommend raw counts for both peaks and RNA-Seq as input and offer several normalization choices in the pipeline, it is also possible to provide pre-normalized data. Note that the normalization method may have a large influence on the resulting *eGRN* network, so make sure the choice of normalization is reasonable. For more details, see the next sections.
168 168
 
... ...
@@ -105,6 +105,12 @@ library(GRaNIE)
105 105
 
106 106
 When installing *GRaNIE*, all required dependency packages are automatically installed. In addition, *GRaNIE* needs some additional packages for special functionality, packages that are not strictly necessary for the workflow but which enhance the functionality, may be required depending on certain parameters (such as your genome assembly version), or may be required only when using a particular functionality (such as the `robust` package for robust regression). The package will automatically check if any of these packages are missing during execution, and inform the user when a package is missing, along with a line to copy for pasting into R for installation.
107 107
 
108
+## Note on version compatibility and errors in the vignette
109
+
110
+**We are actively working on the package and regularly improve upon features, add features, or change features for increased clarity. This sometimes results in minor changes to the workflow, changed argument names or other small incompatibilities that may result in errors when running a version of the package that differs from the version this vignette has been run for.**
111
+
112
+**Thus, make sure to run a version of `GRaNIE` that is compatible with this vignette. If in doubt or when you receive errors, check the R help, which always contains the most up-to-date documentation.**
113
+
108 114
 
109 115
 
110 116
 ## General notes
... ...
@@ -193,7 +199,7 @@ At any time point, we can simply "print" a `GRaNIE` object by typing its name an
193 199
 
194 200
 We are now ready to fill our empty object with data! After preparing the data beforehand, we can now use the data import function `addData()` to import both enhancers and RNA-seq data to the `GRaNIE` object. In addition to the count tables, we explicitly specify the name of the ID columns. As mentioned before, the sample metadata is optional but recommended if available.
195 201
 
196
-An important consideration is data normalization for RNA and ATAC. We support many different choices of normalization, the selection of which also depends on whether RNA or peaks is considered, and possible choices are: `limma_quantile`, `DESeq2_sizeFactors` and `none` and refer to the R help for more details (`?addData`). The default for RNA-Seq is a quantile normalization, while for the open chromatin enhancer data, it is `DESeq2_sizeFactor2` (i.e., a "regular" `DESeq` size factor normalization). Importantly, `DESeq2_sizeFactors` requires raw data, while `quantile` does not necessarily. We nevertheless recommend raw data as input, although it is also possible to provide pre-normalized data as input and then topping this up with another normalization method or "none". 
202
+An important consideration is data normalization for RNA and ATAC. We support many different choices of normalization, the selection of which also depends on whether RNA or peaks is considered, and possible choices are: `limma_quantile`, `DESeq2_sizeFactors` and `none` and refer to the R help for more details (`?addData`). The default for RNA-Seq is a quantile normalization, while for the open chromatin enhancer data, it is `DESeq2_sizeFactors` (i.e., a "regular" `DESeq2` size factor normalization). Importantly, `DESeq2_sizeFactors` requires raw data, while `limma_quantile` does not necessarily. We nevertheless recommend raw data as input, although it is also possible to provide pre-normalized data as input and then topping this up with another normalization method or `none`. 
197 203
 
198 204
 
199 205
 ```{r addData, echo=TRUE, include=TRUE, eval = FALSE, class.output="scroll-200"}