Browse code

Doc updates and forceRerun adjustments

Christian Arnold authored on 05/03/2022 01:26:49
Showing29 changed files

... ...
@@ -3,9 +3,9 @@
3 3
 #' Initialize a \code{\linkS4class{GRN}} object
4 4
 #' 
5 5
 #' @export
6
-#' @param objectMetadata List. Default list(). Optional (named) list with an arbitrary number of elements, all of which capture metadata for the object. This is mainly used to distinguish GRN objects from one another by storing object-specific metadata along with the data.
6
+#' @param objectMetadata List. Default \code{list()}. Optional (named) list with an arbitrary number of elements, all of which capture metadata for the object. This is mainly used to distinguish GRN objects from one another by storing object-specific metadata along with the data.
7 7
 #' @param outputFolder Absolute file path. No default. Default output folder where all pipeline output will be put unless specified otherwise. We recommend specifying an absolute path. Note that for Windows-based systems, the path must be correctly specified with "/" as path separator.
8
-#' @param genomeAssembly Character. No default. The genome assembly of all data that to be used within this object. Currently, supported genomes are: \code{hg19}, \code{hg38}, and \code{mm10}
8
+#' @param genomeAssembly Character. No default. The genome assembly of all data that to be used within this object. Currently, supported genomes are: \code{hg19}, \code{hg38}, and \code{mm10}.
9 9
 #' @return Empty \code{\linkS4class{GRN}} object
10 10
 #' @examples 
11 11
 #' meta.l = list(name = "exampleName", date = "01.03.22")
... ...
@@ -101,14 +101,14 @@ initializeGRN <- function(objectMetadata = list(),
101 101
 #' 
102 102
 #' @export
103 103
 #' @template GRN 
104
-#' @param counts_peaks Data frame. No default. Counts for the peaks, with raw or normalized counts for each peak (rows) across all samples (columns). In addition to the count data, it must also contain one ID column with a particular format, see the argument *idColumn_peaks* below. Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table.
105
-#' @param normalization_peaks Character. Default "DESeq_sizeFactor". Normalization procedure for peak data. Must be one of "DESeq_sizeFactor", "none", or "quantile"
106
-#' @param idColumn_peaks Character. Default "peakID". Name of the column in the counts_peaks data frame that contains peak IDs. The required format must be "{chr}:{start}-{end}", with {chr} denoting the abbreviated chromosome name, and {start} and {end} the begin and end of the peak coordinates, respectively. End must be bigger than start. Examples for valid peak IDs are chr1:400-800 or chrX:20-25.
107
-#' @param counts_rna Data frame. No default. Counts for the RNA-seq data, with raw or normalized counts for each gene (rows) across all samples (columns). In addition to the count data, it must also contain one ID column with a particular format, see the argument *idColumn_rna* below. Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table.
108
-#' @param normalization_rna Character. Default "quantile". Normalization procedure for peak data. Must be one of "DESeq_sizeFactor", "none", or "quantile"
109
-#' @param idColumn_RNA Character. Default "ENSEMBL". Name of the column in the counts_rna data frame that contains Ensembl IDs.
110
-#' @param sampleMetadata Data frame. Default NULL. Optional, additional metadata for the samples, such as age, sex, gender etc. If provided, the @seealso [plotPCA_all()] function can then incorporate and plot it. Sample names must match with those from both peak and RNA-Seq data. The first column is expected to contain the sample IDs, the actual column name is irrelevant.
111
-#' @param allowOverlappingPeaks \code{TRUE} or \code{FALSE}.  Default \code{FALSE}. Should overlapping peaks be allowed (then only a warning is issued when overlapping peaks are found) or (the default) should an error be raised?
104
+#' @param counts_peaks Data frame. No default. Counts for the peaks, with raw or normalized counts for each peak (rows) across all samples (columns). In addition to the count data, it must also contain one ID column with a particular format, see the argument \code{idColumn_peaks} below. Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table.
105
+#' @param normalization_peaks Character. Default \code{DESeq_sizeFactor}. Normalization procedure for peak data. Must be one of \code{DESeq_sizeFactor}, \code{none}, or \code{quantile}.
106
+#' @param idColumn_peaks Character. Default \code{peakID}. Name of the column in the counts_peaks data frame that contains peak IDs. The required format must be {chr}:{start}-{end}", with {chr} denoting the abbreviated chromosome name, and {start} and {end} the begin and end of the peak coordinates, respectively. End must be bigger than start. Examples for valid peak IDs are \code{chr1:400-800} or \code{chrX:20-25}.
107
+#' @param counts_rna Data frame. No default. Counts for the RNA-seq data, with raw or normalized counts for each gene (rows) across all samples (columns). In addition to the count data, it must also contain one ID column with a particular format, see the argument \code{idColumn_rna} below. Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table.
108
+#' @param normalization_rna Character. Default \code{quantile}. Normalization procedure for peak data. Must be one of "DESeq_sizeFactor", "none", or "quantile"
109
+#' @param idColumn_RNA Character. Default \code{ENSEMBL}. Name of the column in the \code{counts_rna} data frame that contains Ensembl IDs.
110
+#' @param sampleMetadata Data frame. Default \code{NULL}. Optional, additional metadata for the samples, such as age, sex, gender etc. If provided, the @seealso [plotPCA_all()] function can then incorporate and plot it. Sample names must match with those from both peak and RNA-Seq data. The first column is expected to contain the sample IDs, the actual column name is irrelevant.
111
+#' @param allowOverlappingPeaks \code{TRUE} or \code{FALSE}. Default \code{FALSE}. Should overlapping peaks be allowed (then only a warning is issued when overlapping peaks are found) or (the default) should an error be raised?
112 112
 #' @template forceRerun
113 113
 #' @return The same \code{\linkS4class{GRN}} object, with added data from this function.
114 114
 #' @examples 
... ...
@@ -117,9 +117,9 @@ initializeGRN <- function(objectMetadata = list(),
117 117
 #' # rna.df   = read_tsv("https://www.embl.de/download/zaugg/GRaNIE/rna.tsv.gz")
118 118
 #' # peaks.df = read_tsv("https://www.embl.de/download/zaugg/GRaNIE/peaks.tsv.gz")
119 119
 #' # meta.df  = read_tsv("https://www.embl.de/download/zaugg/GRaNIE/sampleMetadata.tsv.gz")
120
-#' GRN = loadExampleObject()
121
-#' # We omit sampleMetadata = meta.df here, lines becomes too long otherwise
122
-#' GRN = addData(GRN, counts_peaks = peaks.df, counts_rna = rna.df, forceRerun = FALSE)
120
+#' # GRN = loadExampleObject()
121
+#' # We omit sampleMetadata = meta.df in the following line, becomes too long otherwise
122
+#' # GRN = addData(GRN, counts_peaks = peaks.df, counts_rna = rna.df, forceRerun = FALSE)
123 123
 
124 124
 addData <- function(GRN, counts_peaks, normalization_peaks = "DESeq_sizeFactor", idColumn_peaks = "peakID", 
125 125
                     counts_rna, normalization_rna = "quantile", idColumn_RNA = "ENSEMBL", sampleMetadata = NULL,
... ...
@@ -591,7 +591,7 @@ addData <- function(GRN, counts_peaks, normalization_peaks = "DESeq_sizeFactor",
591 591
 #' @param maxNormalizedMean_peaks Numeric or \code{NULL}. Default \code{NULL}. Maximum mean across all samples for a peak to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter.
592 592
 #' @param minNormalizedMeanRNA Numeric or \code{NULL}. Default 5. Minimum mean across all samples for a gene to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter.
593 593
 #' @param maxNormalizedMeanRNA Numeric or \code{NULL}. Default \code{NULL}. Maximum mean across all samples for a gene to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter.
594
-#' @param chrToKeep_peaks Character vector. Default c(paste0("chr", 1:22), "chrX", "chrY"). Vector of chromosomes that peaks are allowed to come from. This filter can be used to filter sex chromosomes from the peaks, for example.
594
+#' @param chrToKeep_peaks Character vector. Default \code{c(paste0("chr", 1:22), "chrX", "chrY")}. Vector of chromosomes that peaks are allowed to come from. This filter can be used to filter sex chromosomes from the peaks, for example.
595 595
 #' @param minSize_peaks Integer or \code{NULL}. Default \code{NULL}. Minimum peak size (width, end - start) for a peak to be retained. Set to \code{NULL} for not applying the filter.
596 596
 #' @param maxSize_peaks Integer or \code{NULL}. Default 10000. Maximum peak size (width, end - start) for a peak to be retained. Set to \code{NULL} for not applying the filter.
597 597
 #' @param minCV_peaks Numeric or \code{NULL}. Default \code{NULL}. Minimum CV (coefficient of variation, a unitless measure of variation) for a peak to be retained. Set to \code{NULL} for not applying the filter.
... ...
@@ -845,9 +845,9 @@ filterData <- function (GRN,
845 845
 #' 
846 846
 #' @template GRN 
847 847
 #' @param motifFolder Character. No default. Path to the folder that contains the TFBS predictions. The files must be in BED format, 6 columns, one file per TF. See the other parameters for more details.
848
-#' @param TFs Character vector. Default "all". Vector of TF names to include. The special keyword "all" can be used to include all TF found in the folder as specified by motifFolder. If "all" is specified anywhere, all TFs will be included. TF names must otherwise match the file names that are found in the folder, without the file suffix.
849
-#' @param filesTFBSPattern Character. Default "_TFBS". Suffix for the file names in the TFBS folder that is not part of the TF name. Can be empty. For example, for the TF CTCF, if the file is called CTCF.all.TFBS.bed, set this parameter to ".all.TFBS".
850
-#' @param fileEnding Character. Default ".bed". File ending for the files from the motif folder.
848
+#' @param TFs Character vector. Default \code{all}. Vector of TF names to include. The special keyword \code{all} can be used to include all TF found in the folder as specified by \code{motifFolder}. If \code{all} is specified anywhere, all TFs will be included. TF names must otherwise match the file names that are found in the folder, without the file suffix.
849
+#' @param filesTFBSPattern Character. Default \code{"_TFBS"}. Suffix for the file names in the TFBS folder that is not part of the TF name. Can be empty. For example, for the TF CTCF, if the file is called \code{CTCF.all.TFBS.bed}, set this parameter to \code{".all.TFBS"}.
850
+#' @param fileEnding Character. Default \code{".bed"}. File ending for the files from the motif folder.
851 851
 #' @param nTFMax \code{NULL} or integer. Default \code{NULL}. Maximal number of TFs to import. Can be used for testing purposes, e.g., setting to 5 only imports 5 TFs even though the whole \code{motifFolder} has many more TFs defined.
852 852
 #' @template forceRerun
853 853
 #' @return The same \code{\linkS4class{GRN}} object, with added data from this function.
... ...
@@ -1496,14 +1496,14 @@ importTFData <- function(GRN, data, name, idColumn = "ENSEMBL", nameColumn = "TF
1496 1496
 #' Run the activator-repressor classification for the TFs for a \code{\linkS4class{GRN}} object
1497 1497
 #' 
1498 1498
 #' @template GRN
1499
-#' @param significanceThreshold_Wilcoxon Numeric between 0 and 1. Default 0.05. Significance threshold for Wilcoxon test that is run in the end for the final classification. See the Vignette and diffTF paper for details.
1499
+#' @param significanceThreshold_Wilcoxon Numeric between 0 and 1. Default 0.05. Significance threshold for Wilcoxon test that is run in the end for the final classification. See the Vignette and *diffTF* paper for details.
1500 1500
 #' @param plot_minNoTFBS_heatmap Integer. Default 100. Minimum number of TFBS for a TF to be included in the heatmap that is part of the output of this function.
1501 1501
 #' @param deleteIntermediateData \code{TRUE} or \code{FALSE}.  Default \code{TRUE}. Should intermediate data be deleted before returning the object after a successful run? Due to the size of the produced intermediate data, we recommend setting this to \code{TRUE}, but if memory or object size are not an issue, the information can also be kept.
1502 1502
 #' @template plotDiagnosticPlots
1503 1503
 #' @template outputFolder
1504 1504
 #' @template corMethod
1505 1505
 #' @template forceRerun
1506
-#' @return The same \code{\linkS4class{GRN}} object, with added data from this function.  TF_classification_densityPlotsForegroundBackground_expression_perm{0,1}.pdf, TF_classification_stringencyThresholds_expression_perm0.pdf, TF_classification_summaryHeatmap_expression_perm0.pdf,
1506
+#' @return The same \code{\linkS4class{GRN}} object, with added data from this function.  
1507 1507
 #' @examples 
1508 1508
 #' # See the Workflow vignette on the GRaNIE website for examples
1509 1509
 #' GRN = loadExampleObject()
... ...
@@ -1631,7 +1631,7 @@ AR_classification_wrapper<- function (GRN, significanceThreshold_Wilcoxon = 0.05
1631 1631
                        corMethod,
1632 1632
                        fileCur, width = 5, height = 5)
1633 1633
         } else {
1634
-          futile.logger::flog.info(paste0("File ", fileCur, " already exists, not overwriting since forceRerun = FALSE"))
1634
+          futile.logger::flog.info(paste0("  File ", fileCur, " already exists, not overwriting since forceRerun = FALSE"))
1635 1635
         }
1636 1636
         
1637 1637
         fileCur = paste0(outputFolder, .getOutputFileName("plot_class_medianClass"), "_", connectionTypeCur, suffixFile)
... ...
@@ -1644,7 +1644,7 @@ AR_classification_wrapper<- function (GRN, significanceThreshold_Wilcoxon = 0.05
1644 1644
             corMethod = corMethod,
1645 1645
             file = fileCur,  width = 4, height = 8)
1646 1646
         } else {
1647
-          futile.logger::flog.info(paste0("File ", fileCur, " already exists, not overwriting since forceRerun = FALSE"))
1647
+          futile.logger::flog.info(paste0("  File ", fileCur, " already exists, not overwriting since forceRerun = FALSE"))
1648 1648
         }
1649 1649
         
1650 1650
         fileCur = paste0(outputFolder, .getOutputFileName("plot_class_densityClass"), "_", connectionTypeCur, suffixFile)
... ...
@@ -1661,7 +1661,7 @@ AR_classification_wrapper<- function (GRN, significanceThreshold_Wilcoxon = 0.05
1661 1661
                          finalClassification = GRN@data$TFs$classification[[permIndex]] [[connectionTypeCur]]$TF.classification,
1662 1662
                          file = fileCur, width = 8, height = 15)
1663 1663
         } else {
1664
-          futile.logger::flog.info(paste0("File ", fileCur, " already exists, not overwriting since forceRerun = FALSE"))
1664
+          futile.logger::flog.info(paste0("  File ", fileCur, " already exists, not overwriting since forceRerun = FALSE"))
1665 1665
         }
1666 1666
       }
1667 1667
       
... ...
@@ -1714,14 +1714,14 @@ AR_classification_wrapper<- function (GRN, significanceThreshold_Wilcoxon = 0.05
1714 1714
 #' @template plotDetails
1715 1715
 #' @template outputFolder
1716 1716
 #' @template corMethod
1717
-#' @param connectionTypes Character vector. Default \code{expression}. Vector of connection types to include for the TF-peak connections. If an additional connection type is specified here, it has to be available already within the object (EXPERIMENTAL). See the function \code{addData_TFActivity} for details.
1718
-#' @param removeNegativeCorrelation  \code{TRUE} or \code{FALSE} (vector). Default \code{FALSE}. EXPERIMENTAL. Must be a logical vector of the same length as the parameter \code{connectionType}. Should negatively correlated TF-peak connections be removed for the specific connection type? For connection type expression, the default is FALSE, while for any TF Activity related connection type, we recommend setting this to \code{TRUE}.  
1717
+#' @param connectionTypes Character vector. Default \code{expression}. Vector of connection types to include for the TF-peak connections. If an additional connection type is specified here, it has to be available already within the object (EXPERIMENTAL). See the function \code{\link{addData_TFActivity}} for details.
1718
+#' @param removeNegativeCorrelation  Vector of \code{TRUE} or \code{FALSE}. Default \code{FALSE}. EXPERIMENTAL. Must be a logical vector of the same length as the parameter \code{connectionType}. Should negatively correlated TF-peak connections be removed for the specific connection type? For connection type expression, the default is \code{FALSE}, while for any TF Activity related connection type, we recommend setting this to \code{TRUE}.  
1719 1719
 #' @param maxFDRToStore Numeric. Default 0.3. Maximum TF-peak FDR value to permanently store a particular TF-peak connection in the object? This parameter has a large influence on the overall memory size of the object, and we recommend not storing connections with a high FDR due to their sheer number.
1720 1720
 #' @param useGCCorrection \code{TRUE} or \code{FALSE}.  Default \code{FALSE}. EXPERIMENTAL. Should a GC-matched background be used when calculating FDRs?
1721 1721
 #' @param percBackground_size Numeric (0 to 100). Default 75. EXPERIMENTAL. Description will follow. Only relevant if \code{useGCCorrection} is set to \code{TRUE}, ignored otherwise.
1722 1722
 #' @param percBackground_resample \code{TRUE} or \code{FALSE}.  Default \code{TRUE}. EXPERIMENTAL. Should resampling be enabled for those GC bins for which not enough background peaks are available?. Only relevant if \code{useGCCorrection} is set to \code{TRUE}, ignored otherwise.
1723 1723
 #' @template forceRerun
1724
-#' @return The same \code{\linkS4class{GRN}} object, with added data from this function.  TF_peak.fdrCurves_perm{o,1}.pdf
1724
+#' @return The same \code{\linkS4class{GRN}} object, with added data from this function.
1725 1725
 #' @examples 
1726 1726
 #' # See the Workflow vignette on the GRaNIE website for examples
1727 1727
 #' GRN = loadExampleObject()
... ...
@@ -2257,13 +2257,13 @@ addConnections_TF_peak <- function (GRN, plotDiagnosticPlots = TRUE, plotDetails
2257 2257
 #' 
2258 2258
 #' @export
2259 2259
 #' @template GRN
2260
-#' @param  overlapTypeGene Character. "TSS" or "full". Default "TSS". If set to "TSS", only the TSS of the gene is used as reference for finding genes in the neighborhood of a peak. If set to "full", the whole annotated gene (including all exons and introns) is used instead. 
2260
+#' @param  overlapTypeGene Character. \code{"TSS"} or \code{"full"}. Default \code{"TSS"}. If set to \code{"TSS"}, only the TSS of the gene is used as reference for finding genes in the neighborhood of a peak. If set to \code{"full"}, the whole annotated gene (including all exons and introns) is used instead. 
2261 2261
 #' @template corMethod
2262 2262
 #' @param  promoterRange Integer >=0. Default 250000. The size of the neighborhood in bp to correlate peaks and genes in vicinity. Only peak-gene pairs will be correlated if they are within the specified range. Increasing this value leads to higher running times and more peak-gene pairs to be associated, while decreasing results in the opposite.
2263
-#' @param TADs Data frame with TAD domains. Default NULL. If provided, the neighborhood of a peak is defined by the TAD domain the peak is in rather than a fixed-sized neighborhood. The expected format is a BED-like data frame with at least 3 columns in this particular order: chromosome, start, end, the 4th column is optional and will be taken as ID column. All additional columns as well as column names are ignored. For the first 3 columns, the type is checked as part of a data integrity check.
2263
+#' @param TADs Data frame with TAD domains. Default \code{NULL}. If provided, the neighborhood of a peak is defined by the TAD domain the peak is in rather than a fixed-sized neighborhood. The expected format is a BED-like data frame with at least 3 columns in this particular order: chromosome, start, end, the 4th column is optional and will be taken as ID column. All additional columns as well as column names are ignored. For the first 3 columns, the type is checked as part of a data integrity check.
2264 2264
 #' @template nCores
2265 2265
 #' @template plotDiagnosticPlots
2266
-#' @param plotGeneTypes List of character vectors. Default list(c("all"), c("protein_coding"), c("protein_coding", "lincRNA")). Each list element may consist of one or multiple gene types that are plotted collectively in one PDF. The special keyword "all" denotes all gene types that are found (be aware: this typically contains 20+ gene types, see https://www.gencodegenes.org/pages/biotypes.html for details).
2266
+#' @param plotGeneTypes List of character vectors. Default \code{list(c("all"), c("protein_coding"), c("protein_coding", "lincRNA"))}. Each list element may consist of one or multiple gene types that are plotted collectively in one PDF. The special keyword \code{"all"} denotes all gene types that are found (be aware: this typically contains 20+ gene types, see \url{https://www.gencodegenes.org/pages/biotypes.html} for details).
2267 2267
 #' @template outputFolder
2268 2268
 #' @template addRobustRegression
2269 2269
 #' @template forceRerun
... ...
@@ -2799,17 +2799,17 @@ addConnections_peak_gene <- function(GRN, overlapTypeGene = "TSS", corMethod = "
2799 2799
 #' @template gene.types
2800 2800
 #' @param allowMissingTFs \code{TRUE} or \code{FALSE}.  Default \code{FALSE}. Should connections be returned for which the TF is NA (i.e., connections consisting only of peak-gene links?). If set to \code{TRUE}, this generally greatly increases the number of connections but it may not be what you aim for.
2801 2801
 #' @param allowMissingGenes \code{TRUE} or \code{FALSE}.  Default \code{TRUE}. Should connections be returned for which the gene is NA (i.e., connections consisting only of TF-peak links?). If set to \code{TRUE}, this generally increases the number of connections.
2802
-#' @param peak_gene.r_range Numeric(2) . Default c(0,1). Filter for lower and upper limit for the peak-gene links. Only links will be retained if the correlation coefficient is within the specified interval. This filter is usually used to filter out negatively correlated peak-gene links.
2803
-#' @param peak_gene.selection "all" or "closest". Default "all". Filter for the selection of genes for each peak. If set to "all", all previously identified peak-gene are used, while "closest" only retains the closest gene for each peak that is retained until the point the filter is applied.
2804
-#' @param peak_gene.maxDistance Integer >0. Default NULL. Maximum peak-gene distance to retain a peak-gene connection.
2805
-#' @param filterTFs Character vector. Default NULL. Vector of TFs (as named in the GRN object) to retain. All TFs not listed will be filtered out.
2806
-#' @param filterGenes Character vector. Default NULL. Vector of gene IDs (as named in the GRN object) to retain. All genes not listed will be filtered out.
2807
-#' @param filterPeaks Character vector. Default NULL. Vector of peak IDs (as named in the GRN object) to retain. All peaks not listed will be filtered out.
2802
+#' @param peak_gene.r_range Numeric(2). Default \code{c(0,1)}. Filter for lower and upper limit for the peak-gene links. Only links will be retained if the correlation coefficient is within the specified interval. This filter is usually used to filter out negatively correlated peak-gene links.
2803
+#' @param peak_gene.selection \code{"all"} or \code{"closest"}. Default \code{"all"}. Filter for the selection of genes for each peak. If set to \code{"all"}, all previously identified peak-gene are used, while \code{"closest"} only retains the closest gene for each peak that is retained until the point the filter is applied.
2804
+#' @param peak_gene.maxDistance Integer >0. Default \code{NULL}. Maximum peak-gene distance to retain a peak-gene connection.
2805
+#' @param filterTFs Character vector. Default \code{NULL}. Vector of TFs (as named in the GRN object) to retain. All TFs not listed will be filtered out.
2806
+#' @param filterGenes Character vector. Default \code{NULL}. Vector of gene IDs (as named in the GRN object) to retain. All genes not listed will be filtered out.
2807
+#' @param filterPeaks Character vector. Default \code{NULL}. Vector of peak IDs (as named in the GRN object) to retain. All peaks not listed will be filtered out.
2808 2808
 #' @param TF_peak_FDR_selectViaCorBins \code{TRUE} or \code{FALSE}.  Default \code{FALSE}. Use a modified procedure for selecting TF-peak links that is based on the user-specified FDR but that retains also links that may have a higher FDR but a more extreme correlation.
2809 2809
 #' @param silent \code{TRUE} or \code{FALSE}.  Default \code{FALSE}. Print progress messages and filter statistics.
2810 2810
 #' @param filterLoops  \code{TRUE} or \code{FALSE}. Default \code{TRUE}. If a TF regulates itself (i.e., the TF and the gene are the same entity), should such loops be filtered from the GRN?
2811 2811
 #' @template outputFolder
2812
-#' @return The same \code{\linkS4class{GRN}} object, with the filtered and merged TF-peak and peak-gene connections in the slot connections$all.filtered. The filtered
2812
+#' @return The same \code{\linkS4class{GRN}} object, with the filtered and merged TF-peak and peak-gene connections in the slot connections$all.filtered.
2813 2813
 #' @examples 
2814 2814
 #' # See the Workflow vignette on the GRaNIE website for examples
2815 2815
 #' GRN = loadExampleObject()
... ...
@@ -3621,18 +3621,18 @@ addSNPOverlap <- function(grn, SNPData, col_chr = "chr", col_pos = "pos", col_pe
3621 3621
 
3622 3622
 #' Generate a summary PDF for the number of connections for a \code{\linkS4class{GRN}} object. 
3623 3623
 #' 
3624
-#' Essentially, this functions calls \code{filterGRNAndConnectGenes} repeatedly and stores the total number of connections and other statistics each time to summarize them afterwards. All arguments are identical to the ones in \code{filterGRNAndConnectGenes}, see the help for this function for details.
3624
+#' Essentially, this functions calls \code{\link{filterGRNAndConnectGenes}} repeatedly and stores the total number of connections and other statistics each time to summarize them afterwards. All arguments are identical to the ones in \code{\link{filterGRNAndConnectGenes}}, see the help for this function for details.
3625 3625
 #' 
3626 3626
 #' @export
3627 3627
 #' @template GRN 
3628
-#' @param TF_peak.fdr Numeric vector. Default c(0.001, 0.01, 0.05, 0.1, 0.2). TF-peak FDR values to iterate over.
3628
+#' @param TF_peak.fdr Numeric vector. Default \code{c(0.001, 0.01, 0.05, 0.1, 0.2)}. TF-peak FDR values to iterate over.
3629 3629
 #' @template TF_peak.connectionTypes
3630
-#' @param peak_gene.fdr Numeric vector. Default c(0.001, 0.01, 0.05, 0.1, 0.2). Peak-gene FDR values to iterate over.
3631
-#' @param peak_gene.p_raw  Numeric vector. Default NULL. Peak-gene raw p-value values to iterate over. Skipepd if set to NULL.
3632
-#' @param peak_gene.r_range Numeric vector of length 2 (minimum -1, maximum 1). Default c(0,1). The correlation range of peak-gene connections to keep.
3630
+#' @param peak_gene.fdr Numeric vector. Default \code{c(0.001, 0.01, 0.05, 0.1, 0.2)}. Peak-gene FDR values to iterate over.
3631
+#' @param peak_gene.p_raw  Numeric vector. Default \code{NULL}. Peak-gene raw p-value values to iterate over. Skipped if set to NULL.
3632
+#' @param peak_gene.r_range Numeric vector of length 2 (minimum -1, maximum 1). Default \code{c(0,1)}. The correlation range of peak-gene connections to keep.
3633 3633
 #' @template gene.types
3634
-#' @param allowMissingGenes Logical vector. Default c(FALSE, TRUE). Allow genes to be missing for peak-gene connections?
3635
-#' @param allowMissingTFs Logical vector. Default c(FALSE). Allow TFs to be missing for TF-peak connections?
3634
+#' @param allowMissingGenes Logical vector. Default \code{c(FALSE, TRUE)}. Allow genes to be missing for peak-gene connections?
3635
+#' @param allowMissingTFs Logical vector. Default \code{c(FALSE)}. Allow TFs to be missing for TF-peak connections?
3636 3636
 #' @template forceRerun
3637 3637
 #' @return The same \code{\linkS4class{GRN}} object, with added data from this function.
3638 3638
 #' @examples 
... ...
@@ -3666,128 +3666,135 @@ generateStatsSummary <- function(GRN,
3666 3666
 
3667 3667
   checkmate::assertFlag(forceRerun)
3668 3668
   
3669
-  GRN@stats$connections = .initializeStatsDF()
3670
-  
3671
-  if (TF_peak.connectionTypes == "all") {
3672
-    TF_peak.connectionTypesAllComb =.getAll_TF_peak_connectionTypes(GRN)
3673
-  } else {
3674
-    TF_peak.connectionTypesAllComb = unique(TF_peak.connectionTypes)
3675
-  }
3676
-  
3677
-  
3678
-  futile.logger::flog.info(paste0("Generating summary. This may take a while..."))
3679
-  
3680
-  
3681
-  for (permutationCur in 0:.getMaxPermutation(GRN)) {
3682
-    
3683
-    futile.logger::flog.info(paste0("\n", .getPermStr(permutationCur), "...\n"))
3684
-    
3685
-    permIndex = as.character(permutationCur)
3686
-    GRN@stats$connectionDetails.l[[permIndex]] = list()
3687
-    
3688
-    # Iterate over different stringency thresholds and collect statistics
3689
-    
3690
-    for (TF_peak.fdr_cur in TF_peak.fdr) {
3669
+  if (is.null(GRN@stats$connections) | is.null(GRN@stats$connectionDetails.l) | forceRerun) {
3691 3670
       
3692
-      TF_peak.fdr_cur.str = as.character(TF_peak.fdr_cur)
3671
+      GRN@stats$connections = .initializeStatsDF()
3693 3672
       
3694
-      futile.logger::flog.info(paste0("Calculate network stats for TF-peak FDR of ", TF_peak.fdr_cur))
3695
-      GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] =list()
3673
+      if (TF_peak.connectionTypes == "all") {
3674
+          TF_peak.connectionTypesAllComb =.getAll_TF_peak_connectionTypes(GRN)
3675
+      } else {
3676
+          TF_peak.connectionTypesAllComb = unique(TF_peak.connectionTypes)
3677
+      }
3696 3678
       
3697
-      futile.logger::flog.debug(paste0("Iterating over different peak-gene FDR thresholds..."))
3698
-      for (peak_gene.fdr_cur in peak_gene.fdr) {
3699
-        
3700
-        futile.logger::flog.debug(paste0("Peak-gene FDR = ", peak_gene.fdr_cur))
3701
-        peak_gene.fdr_cur.str = as.character(peak_gene.fdr_cur)
3702
-        GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] =list()
3703
-        
3704
-        for (allowMissingTFsCur in allowMissingTFs) {
3679
+      
3680
+      futile.logger::flog.info(paste0("Generating summary. This may take a while..."))
3681
+      
3682
+      
3683
+      for (permutationCur in 0:.getMaxPermutation(GRN)) {
3705 3684
           
3706
-          futile.logger::flog.debug(paste0("  allowMissingTFs = ", allowMissingTFsCur))
3707
-          allowMissingTFsCur.str = as.character(allowMissingTFsCur)
3708
-          GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] = list()
3685
+          futile.logger::flog.info(paste0("\n", .getPermStr(permutationCur), "...\n"))
3709 3686
           
3710
-          for (allowMissingGenesCur in allowMissingGenes) {
3711
-            
3712
-            futile.logger::flog.debug(paste0("  allowMissingGenes = ", allowMissingGenesCur))
3713
-            allowMissingGenesCur.str = as.character(allowMissingGenesCur)
3714
-            GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] [[allowMissingGenesCur.str]] = list()
3715
-            
3716
-            for (TF_peak.connectionTypeCur in TF_peak.connectionTypesAllComb) {
3687
+          permIndex = as.character(permutationCur)
3688
+          GRN@stats$connectionDetails.l[[permIndex]] = list()
3689
+          
3690
+          # Iterate over different stringency thresholds and collect statistics
3691
+          
3692
+          for (TF_peak.fdr_cur in TF_peak.fdr) {
3717 3693
               
3718
-              GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] [[allowMissingGenesCur.str]] [[TF_peak.connectionTypeCur]] = list()
3694
+              TF_peak.fdr_cur.str = as.character(TF_peak.fdr_cur)
3719 3695
               
3720
-              futile.logger::flog.debug(paste0("    TF_peak.connectionType = ", TF_peak.connectionTypeCur))
3696
+              futile.logger::flog.info(paste0("Calculate network stats for TF-peak FDR of ", TF_peak.fdr_cur))
3697
+              GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] =list()
3721 3698
               
3722
-              GRN = filterGRNAndConnectGenes(GRN, 
3723
-                                             TF_peak.fdr.threshold = TF_peak.fdr_cur, 
3724
-                                             TF_peak.connectionTypes = TF_peak.connectionTypeCur, 
3725
-                                             peak_gene.p_raw.threshold = NULL, 
3726
-                                             peak_gene.fdr.threshold= peak_gene.fdr_cur,
3727
-                                             gene.types = gene.types, 
3728
-                                             allowMissingGenes = allowMissingGenesCur, 
3729
-                                             allowMissingTFs = allowMissingTFsCur,
3730
-                                             peak_gene.r_range = peak_gene.r_range,
3731
-                                             filterTFs = NULL, filterGenes = NULL, filterPeaks = NULL,
3732
-                                             silent = TRUE)
3699
+              futile.logger::flog.debug(paste0("Iterating over different peak-gene FDR thresholds..."))
3700
+              for (peak_gene.fdr_cur in peak_gene.fdr) {
3701
+                  
3702
+                  futile.logger::flog.debug(paste0("Peak-gene FDR = ", peak_gene.fdr_cur))
3703
+                  peak_gene.fdr_cur.str = as.character(peak_gene.fdr_cur)
3704
+                  GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] =list()
3705
+                  
3706
+                  for (allowMissingTFsCur in allowMissingTFs) {
3707
+                      
3708
+                      futile.logger::flog.debug(paste0("  allowMissingTFs = ", allowMissingTFsCur))
3709
+                      allowMissingTFsCur.str = as.character(allowMissingTFsCur)
3710
+                      GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] = list()
3711
+                      
3712
+                      for (allowMissingGenesCur in allowMissingGenes) {
3713
+                          
3714
+                          futile.logger::flog.debug(paste0("  allowMissingGenes = ", allowMissingGenesCur))
3715
+                          allowMissingGenesCur.str = as.character(allowMissingGenesCur)
3716
+                          GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] [[allowMissingGenesCur.str]] = list()
3717
+                          
3718
+                          for (TF_peak.connectionTypeCur in TF_peak.connectionTypesAllComb) {
3719
+                              
3720
+                              GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] [[allowMissingGenesCur.str]] [[TF_peak.connectionTypeCur]] = list()
3721
+                              
3722
+                              futile.logger::flog.debug(paste0("    TF_peak.connectionType = ", TF_peak.connectionTypeCur))
3723
+                              
3724
+                              GRN = filterGRNAndConnectGenes(GRN, 
3725
+                                                             TF_peak.fdr.threshold = TF_peak.fdr_cur, 
3726
+                                                             TF_peak.connectionTypes = TF_peak.connectionTypeCur, 
3727
+                                                             peak_gene.p_raw.threshold = NULL, 
3728
+                                                             peak_gene.fdr.threshold= peak_gene.fdr_cur,
3729
+                                                             gene.types = gene.types, 
3730
+                                                             allowMissingGenes = allowMissingGenesCur, 
3731
+                                                             allowMissingTFs = allowMissingTFsCur,
3732
+                                                             peak_gene.r_range = peak_gene.r_range,
3733
+                                                             filterTFs = NULL, filterGenes = NULL, filterPeaks = NULL,
3734
+                                                             silent = TRUE)
3735
+                              
3736
+                              results.l = .addStats(GRN@stats$connections, GRN@connections$all.filtered[[permIndex]], 
3737
+                                                    perm = permutationCur, 
3738
+                                                    TF_peak.fdr = TF_peak.fdr_cur, TF_peak.connectionType = TF_peak.connectionTypeCur,
3739
+                                                    peak_gene.p_raw = NA,
3740
+                                                    peak_gene.fdr = peak_gene.fdr_cur, 
3741
+                                                    peak_gene.r_range = paste0(peak_gene.r_range, collapse = ","),
3742
+                                                    gene.types = paste0(gene.types, collapse = ","),
3743
+                                                    allowMissingGenes = allowMissingGenesCur, 
3744
+                                                    allowMissingTFs   = allowMissingTFsCur)
3745
+                              
3746
+                              GRN@stats$connections  = results.l[["summary"]]
3747
+                              
3748
+                              GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] [[allowMissingGenesCur.str]] [[TF_peak.connectionTypeCur]] = 
3749
+                                  results.l[["details"]]
3750
+                              
3751
+                          } # end of  for (TF_peak.connectionTypeCur in TF_peak.connectionTypesAllComb)
3752
+                          
3753
+                      } # end of  for (allowMissingGenesCur in allowMissingGenes) 
3754
+                      
3755
+                  } # end of  for (allowMissingTFsCur in allowMissingTFs)
3756
+                  
3757
+              } # end of for (peak_gene.fdr_cur in peak_gene.fdr)
3733 3758
               
3734
-              results.l = .addStats(GRN@stats$connections, GRN@connections$all.filtered[[permIndex]], 
3735
-                                    perm = permutationCur, 
3736
-                                    TF_peak.fdr = TF_peak.fdr_cur, TF_peak.connectionType = TF_peak.connectionTypeCur,
3737
-                                    peak_gene.p_raw = NA,
3738
-                                    peak_gene.fdr = peak_gene.fdr_cur, 
3739
-                                    peak_gene.r_range = paste0(peak_gene.r_range, collapse = ","),
3740
-                                    gene.types = paste0(gene.types, collapse = ","),
3741
-                                    allowMissingGenes = allowMissingGenesCur, 
3742
-                                    allowMissingTFs   = allowMissingTFsCur)
3743 3759
               
3744
-              GRN@stats$connections  = results.l[["summary"]]
3760
+              if (!is.null(peak_gene.p_raw)) {
3761
+                  
3762
+                  futile.logger::flog.info(paste0(" Iterating over different peak-gene raw p-value thresholds..."))
3763
+                  for (peak_gene.p_raw_cur in peak_gene.p_raw) {
3764
+                      
3765
+                      futile.logger::flog.info(paste0("  Peak-gene raw p-value = ", peak_gene.p_raw_cur))
3766
+                      
3767
+                      # TODO: Add the other ones here also
3768
+                      for (allowMissingGenesCur in allowMissingGenes) {
3769
+                          
3770
+                          GRN = filterGRNAndConnectGenes(GRN, 
3771
+                                                         TF_peak.fdr.threshold = TF_peak.fdr_cur, peak_gene.p_raw.threshold = peak_gene.p_raw_cur, 
3772
+                                                         peak_gene.fdr.threshold= NULL,
3773
+                                                         gene.types = gene.types, 
3774
+                                                         allowMissingGenes = allowMissingGenesCur, peak_gene.r_range = peak_gene.r_range,
3775
+                                                         filterTFs = NULL, filterGenes = NULL, filterPeaks = NULL,
3776
+                                                         silent = TRUE)
3777
+                          
3778
+                          
3779
+                          GRN@stats$connections = .addStats(GRN@stats$connections, GRN@connections$all.filtered[[permIndex]], 
3780
+                                                            perm = permutationCur, TF_peak.fdr = TF_peak.fdr_cur, peak_gene.fdr = NA, 
3781
+                                                            allowMissingGenes = allowMissingGenesCur, peak_gene.p_raw = peak_gene.p_raw_cur,
3782
+                                                            peak_gene.r_range = paste0(peak_gene.r_range, collapse = ","),
3783
+                                                            gene.types = paste0(gene.types, collapse = ","))
3784
+                          
3785
+                      } # end of for (allowMissingGenesCur in allowMissingGenes)
3786
+                      
3787
+                  } # end of   for (peak_gene.p_raw_cur in peak_gene.p_raw) 
3788
+              } # end of  if (!is.null(peak_gene.p_raw))
3745 3789
               
3746
-              GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] [[allowMissingGenesCur.str]] [[TF_peak.connectionTypeCur]] = 
3747
-                results.l[["details"]]
3748 3790
               
3749
-            } # end of  for (TF_peak.connectionTypeCur in TF_peak.connectionTypesAllComb)
3750
-            
3751
-          } # end of  for (allowMissingGenesCur in allowMissingGenes) 
3752
-          
3753
-        } # end of  for (allowMissingTFsCur in allowMissingTFs)
3754
-        
3755
-      } # end of for (peak_gene.fdr_cur in peak_gene.fdr)
3756
-      
3757
-      
3758
-      if (!is.null(peak_gene.p_raw)) {
3759
-        
3760
-        futile.logger::flog.info(paste0(" Iterating over different peak-gene raw p-value thresholds..."))
3761
-        for (peak_gene.p_raw_cur in peak_gene.p_raw) {
3762
-          
3763
-          futile.logger::flog.info(paste0("  Peak-gene raw p-value = ", peak_gene.p_raw_cur))
3764
-          
3765
-          # TODO: Add the other ones here also
3766
-          for (allowMissingGenesCur in allowMissingGenes) {
3767
-            
3768
-            GRN = filterGRNAndConnectGenes(GRN, 
3769
-                                           TF_peak.fdr.threshold = TF_peak.fdr_cur, peak_gene.p_raw.threshold = peak_gene.p_raw_cur, 
3770
-                                           peak_gene.fdr.threshold= NULL,
3771
-                                           gene.types = gene.types, 
3772
-                                           allowMissingGenes = allowMissingGenesCur, peak_gene.r_range = peak_gene.r_range,
3773
-                                           filterTFs = NULL, filterGenes = NULL, filterPeaks = NULL,
3774
-                                           silent = TRUE)
3775
-            
3776
-            
3777
-            GRN@stats$connections = .addStats(GRN@stats$connections, GRN@connections$all.filtered[[permIndex]], 
3778
-                                              perm = permutationCur, TF_peak.fdr = TF_peak.fdr_cur, peak_gene.fdr = NA, 
3779
-                                              allowMissingGenes = allowMissingGenesCur, peak_gene.p_raw = peak_gene.p_raw_cur,
3780
-                                              peak_gene.r_range = paste0(peak_gene.r_range, collapse = ","),
3781
-                                              gene.types = paste0(gene.types, collapse = ","))
3782
-            
3783
-          } # end of for (allowMissingGenesCur in allowMissingGenes)
3784
-          
3785
-        } # end of   for (peak_gene.p_raw_cur in peak_gene.p_raw) 
3786
-      } # end of  if (!is.null(peak_gene.p_raw))
3787
-      
3791
+          }
3792
+      } # end for each permutation
3788 3793
       
3789
-    }
3790
-  } # end for each permutation
3794
+  } else {
3795
+      futile.logger::flog.info(paste0("Data already exists in object. Set forceRerun = TRUE to regenerate and overwrite."))
3796
+  }
3797
+  
3791 3798
   
3792 3799
   
3793 3800
   GRN
... ...
@@ -3906,7 +3913,7 @@ generateStatsSummary <- function(GRN,
3906 3913
 #' 
3907 3914
 #' @export
3908 3915
 #' @param forceDownload \code{TRUE} or \code{FALSE}. Default \code{FALSE}. Should the download be enforced even if the local cached file is already present?
3909
-#' @param fileURL Character. Default https://www.embl.de/download/zaugg/GRaNIE/GRN.rds. URL to the GRN example object in rds format.
3916
+#' @param fileURL Character. Default \url{https://www.embl.de/download/zaugg/GRaNIE/GRN.rds}. URL to the GRN example object in rds format.
3910 3917
 #' @examples 
3911 3918
 #' GRN = loadExampleObject()
3912 3919
 #' @return 
... ...
@@ -3943,7 +3950,7 @@ loadExampleObject <- function(forceDownload = FALSE, fileURL = "https://www.embl
3943 3950
 #' Get counts for the various data defined in a \code{\linkS4class{GRN}} object.
3944 3951
 #' 
3945 3952
 #' @template GRN 
3946
-#' @param type Character. Either "peaks" or "rna". "peaks" corresponds to the counts for the open chromatin data, while "rna" refers to th RNA-seq counts. If set to "rna", both permuted and non-permuted data can be retrieved, while for "peaks", only the non-permuted one (i.e., 0) can be retrieved.
3953
+#' @param type Character. Either \code{peaks} or \code{rna}. \code{peaks} corresponds to the counts for the open chromatin data, while \code{rna} refers to th RNA-seq counts. If set to \code{rna}, both permuted and non-permuted data can be retrieved, while for \code{peaks}, only the non-permuted one (i.e., 0) can be retrieved.
3947 3954
 #' @param norm Logical. \code{TRUE} or \code{FALSE}.  Should original (often raw, but this may not necessarily be the case) or normalized counts be returned?
3948 3955
 #' @template permuted
3949 3956
 #' @export
... ...
@@ -3951,9 +3958,9 @@ loadExampleObject <- function(forceDownload = FALSE, fileURL = "https://www.embl
3951 3958
 #' @examples 
3952 3959
 #' # See the Workflow vignette on the GRaNIE website for examples
3953 3960
 #' GRN = loadExampleObject()
3954
-#' GRN = getCounts(GRN, type = "peaks", norm = TRUE, permuted = FALSE)
3961
+#' counts.df = getCounts(GRN, type = "peaks", norm = TRUE, permuted = FALSE)
3955 3962
 #' @return Data frame of counts, with the type as indicated by the function parameters.
3956
-#' getCounts(GRN, type = "peaks", norm = TRUE)
3963
+
3957 3964
 getCounts <- function(GRN, type, norm, permuted = FALSE) {
3958 3965
   
3959 3966
   GRN = .addFunctionLogToObject(GRN)      
... ...
@@ -4036,9 +4043,9 @@ getCounts <- function(GRN, type, norm, permuted = FALSE) {
4036 4043
 #' @export
4037 4044
 #' @template GRN 
4038 4045
 #' @template permuted
4039
-#' @param type Character. Default "all.filtered". Must be one of "TF_peaks", "peak_genes", "all.filtered". The type of connections to retrieve.
4040
-#' @param include_TF_gene_correlations Logical. \code{TRUE} or \code{FALSE}.  Should TFs and gene correlations be returned as well? If set to \code{TRUE}, they must have been computed beforehand with \code{\link{add_TF_gene_correlation}}.
4041
-#' @return A data frame with the connections. Importantly, this function does NOT return a \code{\linkS4class{GRN}} object.
4046
+#' @param type Character. Default \code{all.filtered}. Must be one of \code{TF_peaks}, \code{peak_genes}, \code{all.filtered}. The type of connections to retrieve.
4047
+#' @param include_TF_gene_correlations Logical. \code{TRUE} or \code{FALSE}. Should TFs and gene correlations be returned as well? If set to \code{TRUE}, they must have been computed beforehand with \code{\link{add_TF_gene_correlation}}.
4048
+#' @return A data frame with the connections. Importantly, this function does **NOT** return a \code{\linkS4class{GRN}} object.
4042 4049
 #' @examples 
4043 4050
 #' # See the Workflow vignette on the GRaNIE website for examples
4044 4051
 #' GRN = loadExampleObject()
... ...
@@ -4220,8 +4227,7 @@ getGRNConnections <- function(GRN, type = "all.filtered",  permuted = FALSE, inc
4220 4227
 #' @examples 
4221 4228
 #' # See the Workflow vignette on the GRaNIE website for examples
4222 4229
 #' GRN = loadExampleObject()
4223
-#' # getParameters(GRN, type = "parameter", name = "all")
4224
-#' 
4230
+#' getParameters(GRN, type = "parameter", name = "all")
4225 4231
 getParameters <- function (GRN, type = "parameter", name = "all") {
4226 4232
   
4227 4233
   checkmate::assertClass(GRN, "GRN")
... ...
@@ -4266,7 +4272,7 @@ getParameters <- function (GRN, type = "parameter", name = "all") {
4266 4272
 }
4267 4273
 
4268 4274
 
4269
-#' Optional convenience function to delete intermediate data from the function \link{AR_classification_wrapper} and summary statistics that may occupy a lot of space
4275
+#' Optional convenience function to delete intermediate data from the function \code{\link{AR_classification_wrapper}} and summary statistics that may occupy a lot of space
4270 4276
 #' @export
4271 4277
 #' @template GRN
4272 4278
 #' @return The same \code{\linkS4class{GRN}} object, with some slots being deleted (\code{GRN@data$TFs$classification} as well as \code{GRN@stats$connectionDetails.l})
... ...
@@ -2,7 +2,7 @@
2 2
 #' Builds a graph out of a set of connections
3 3
 #'
4 4
 #' @template GRN
5
-#' @param model_TF_gene_nodes_separately \code{TRUE} or \code{FALSE}.  Default \code{FALSE}. TODO description follows 
5
+#' @param model_TF_gene_nodes_separately \code{TRUE} or \code{FALSE}.  Default \code{FALSE}. Should TF and gene nodes be modeled separately? If set to \code{TRUE},this may lead to unwanted effects in case of TF-TF connections (i.e., a TF regulating another TF)
6 6
 #' @param allowLoops \code{TRUE} or \code{FALSE}.  Default \code{FALSE}. Allow loops in the network (i.e., a TF that regulates itself)
7 7
 #' @param removeMultiple \code{TRUE} or \code{FALSE}.  Default \code{FALSE}. Remove loops with the same start and end point? This can happen if multiple TF originate from the same gene, for example.
8 8
 #' @param directed \code{TRUE} or \code{FALSE}.  Default \code{FALSE}. Should the network be directed?
... ...
@@ -285,11 +285,11 @@ performAllNetworkAnalyses <- function(GRN, ontology = c("GO_BP", "GO_MF"),
285 285
 #' 
286 286
 #' This function runs an enrichment analysis for the genes in the filtered network.
287 287
 #' @template GRN
288
-#' @param ontology Character vector of ontologies. Default c("GO_BP", "GO_MF"). Valid values are "GO_BP", "GO_MF", "GO_CC", "KEGG", "DO", and "Reactome", referring to GO Biological Process, GO Molecular Function, GO Cellular Component, KEGG, Disease Ontology, and Reactome Pathways. The GO enrichments 
289
-#' @param algorithm Character. Default "weight01". One of: "classic", "elim", "weight", "weight01", "lea", "parentchild." Only relevant if ontology is GO related (GO_BP, GO_MF, GO_CC), ignored otherwise. Name of the algorithm that handles the GO graph structures. Valid inputs are those supported by the topGO library.
290
-#' @param statistic Character. Default "fisher". One of: "fisher", "ks", "t", "globaltest", "sum", "ks.ties". Statistical test to be used. Only relevant if ontology is GO related (GO_BP, GO_MF, GO_CC), and valid inputs are those supported by the topGO library, ignored otherwise. For the other ontologies the test statistic is always Fisher. 
291
-#' @param background Character. Default "neighborhood". One of: "all_annotated", "all_RNA", "neighborhood". Set of genes to be used to construct the background for the enrichment analysis. This can either be all annotated genes in the reference genome (all_annotated), all differentially expressed genes (all_RNA), or all the genes that are within the neighbourhood of a peak in the GRN (neighbourhood)
292
-#' @param pAdjustMethod Character. Default "BH". One of: "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr". This parameter is only relevant for the following ontologies: KEGG, DO, Reactome. For the other ontologies, the algorithm serves as an adjustment.
288
+#' @param ontology Character vector of ontologies. Default \code{c("GO_BP", "GO_MF")}. Valid values are \code{"GO_BP"}, \code{"GO_MF"}, \code{"GO_CC"}, \code{"KEGG"}, \code{"DO"}, and \code{"Reactome"}, referring to GO Biological Process, GO Molecular Function, GO Cellular Component, KEGG, Disease Ontology, and Reactome Pathways.
289
+#' @param algorithm Character. Default \code{"weight01"}. One of: \code{"classic"}, \code{"elim"}, \code{"weight"}, \code{"weight01"}, \code{"lea"}, \code{"parentchild"}. Only relevant if ontology is GO related (GO_BP, GO_MF, GO_CC), ignored otherwise. Name of the algorithm that handles the GO graph structures. Valid inputs are those supported by the \code{topGO} library.
290
+#' @param statistic Character. Default \code{"fisher"}. One of: \code{"fisher"}, \code{"ks"}, \code{"t"}, \code{"globaltest"}, \code{"sum"}, \code{"ks.ties"}. Statistical test to be used. Only relevant if ontology is GO related (GO_BP, GO_MF, GO_CC), and valid inputs are those supported by the topGO library, ignored otherwise. For the other ontologies the test statistic is always Fisher. 
291
+#' @param background Character. Default \code{"neighborhood"}. One of: \code{"all_annotated"}, \code{"all_RNA"}, \code{"neighborhood"}. Set of genes to be used to construct the background for the enrichment analysis. This can either be all annotated genes in the reference genome (all_annotated), all differentially expressed genes (all_RNA), or all the genes that are within the neighborhood of a peak in the GRN (neighborhood)
292
+#' @param pAdjustMethod Character. Default \code{"BH"}. One of: \code{"holm"}, \code{"hochberg"}, \code{"hommel"}, \code{"bonferroni"}, \code{"BH"}, \code{"BY"}, \code{"fdr"}. This parameter is only relevant for the following ontologies: KEGG, DO, Reactome. For the other ontologies, the algorithm serves as an adjustment.
293 293
 #' @template forceRerun
294 294
 #' @return The same \code{\linkS4class{GRN}} object, with the enrichment results stored in the \code{stats$Enrichment$general} slot.
295 295
 #' @seealso \code{\link{plotGeneralEnrichment}}
... ...
@@ -802,8 +802,8 @@ calculateCommunitiesStats <- function(GRN, clustering = "louvain", forceRerun =
802 802
 #' 
803 803
 #' After the vertices of the filtered GRN are clustered into communities using \code{\link{calculateCommunitiesStats}}, this function will run a per-community enrichment analysis.
804 804
 #' @inheritParams calculateGeneralEnrichment
805
-#' @param selection Character. Default "byRank". One of: "byRank", "byLabel". Specify whether the communities enrichment will by calculated based on their rank, where the largest community (with most vertices) would have a rank of 1, or by their label. Note that the label is independent of the rank.
806
-#' @param communities Numeric vector. Default c(1:10). Depending on what was specified in the \code{display} parameter, this parameter would indicate either the rank or the label of the communities to be plotted. i.e. for \code{communities} = c(1,4), if \code{display} = "byRank" the GO enrichment for the first and fourth largest communities will be calculated if \code{display} = "byLabel", the results for the communities labeled "1", and "4" will be plotted.
805
+#' @param selection Character. Default \code{"byRank"}. One of: \code{"byRank"}, \code{"byLabel"}. Specify whether the communities enrichment will by calculated based on their rank, where the largest community (with most vertices) would have a rank of 1, or by their label. Note that the label is independent of the rank.
806
+#' @param communities Numeric vector. Default \code{c(1:10)}. Depending on what was specified in the \code{display} parameter, this parameter would indicate either the rank or the label of the communities to be plotted. i.e. for \code{communities = c(1,4)}, if \code{display = "byRank"} the GO enrichment for the first and fourth largest communities will be calculated if \code{display = "byLabel"}, the results for the communities labeled "1", and "4" will be plotted.
807 807
 #' @return The same \code{\linkS4class{GRN}} object, with the enrichment results stored in the \code{stats$Enrichment$byCommunity} slot.
808 808
 #' @seealso \code{\link{plotCommunitiesEnrichment}}
809 809
 #' @seealso \code{\link{plotGeneralEnrichment}}
... ...
@@ -930,8 +930,8 @@ calculateCommunitiesEnrichment <- function(GRN,
930 930
 #' Retrieve top Nodes in the filtered \code{\linkS4class{GRN}}
931 931
 #' 
932 932
 #' @template GRN
933
-#' @param nodeType Character. One of: "gene", "TF". 
934
-#' @param rankType Character. One of: "degree", "EV". This parameter will determine the criterion to be used to identify the "top" nodes. If set to "degree", the function will select top nodes based on the number of connections they have, i.e. based on their degree-centrality. If set to "EV" it will select the top nodes based on their eigenvector-centrality score in the network.
933
+#' @param nodeType Character. One of: \code{"gene"} or \code{"TF"}. Node type.
934
+#' @param rankType Character. One of: \code{"degree"}, \code{"EV"}. This parameter will determine the criterion to be used to identify the "top" nodes. If set to "degree", the function will select top nodes based on the number of connections they have, i.e. based on their degree-centrality. If set to "EV" it will select the top nodes based on their eigenvector-centrality score in the network.
935 935
 #' @param n Numeric. Default 0.1. If this parameter is passed as a value between (0,1), it is treated as a percentage of top nodes. If the value is passed as an integer it will be treated as the number of top nodes.
936 936
 #' @param use_TF_gene_network \code{TRUE} or \code{FALSE}. Default \code{TRUE}. Should the TF-gene network be used (\code{TRUE}) or the TF-peak-gene network (\code{FALSE})?
937 937
 #' @return A dataframe with the node names and the corresponding scores used to rank them
... ...
@@ -1017,15 +1017,14 @@ getTopNodes <- function(GRN, nodeType, rankType, n = 0.1, use_TF_gene_network =
1017 1017
 #' This function calculates the GO enrichment per TF, i.e. for the set of genes a given TF is connected to in the filtered \code{\linkS4class{GRN}}. 
1018 1018
 #' 
1019 1019
 #' @inheritParams calculateGeneralEnrichment
1020
-#' @param rankType Character. One of: "degree", "EV", "custom". This parameter will determine the criterion to be used to identify the "top" TFs. If set to "degree", the function will select top TFs based on the number of connections to genes they have, i.e. based on their degree-centrality. If set to "EV" it will select the top TFs based on their eigenvector-centrality score in the network. If set to custom, a set of TF names will have to be passed to the "TF.names" parameter.
1021
-#' @param n Numeric. Default 0.1. If this parameter is passed as a value between (0,1), it is treated as a percentage of top nodes. If the value is passed as an integer it will be treated as the number of top nodes. This parameter is not relevant if rankType = "custom".
1022
-#' @param TF.names Character vector. If the rank type is set to "custom", a vector of TF names for which the GO enrichment should be calculated should be passed to this parameter.
1020
+#' @param rankType Character. Default \code{"degree"}. One of: \code{"degree"}, \code{"EV"}, \code{"custom"}. This parameter will determine the criterion to be used to identify the "top" TFs. If set to "degree", the function will select top TFs based on the number of connections to genes they have, i.e. based on their degree-centrality. If set to \code{"EV"} it will select the top TFs based on their eigenvector-centrality score in the network. If set to custom, a set of TF names will have to be passed to the "TF.names" parameter.
1021
+#' @param n Numeric. Default 0.1. If this parameter is passed as a value between (0,1), it is treated as a percentage of top nodes. If the value is passed as an integer it will be treated as the number of top nodes. This parameter is not relevant if \code{rankType = "custom"}.
1022
+#' @param TF.names Character vector. Default \code{NULL}. If the rank type is set to \code{"custom"}, a vector of TF names for which the GO enrichment should be calculated should be passed to this parameter.
1023 1023
 #' @return The same \code{\linkS4class{GRN}} object, with the enrichment results stored in the \code{stats$Enrichment$byTF} slot.
1024 1024
 #' @examples 
1025 1025
 #' # See the Workflow vignette on the GRaNIE website for examples
1026 1026
 #' GRN =  loadExampleObject()
1027 1027
 #' GRN =  calculateTFEnrichment(GRN, n = 5, ontology = "GO_BP", forceRerun = FALSE)
1028
-#' GRN =  calculateTFEnrichment(GRN, n = 5, ontology = "GO_BP", forceRerun = FALSE)
1029 1028
 #' @export
1030 1029
 calculateTFEnrichment <- function(GRN, rankType = "degree", n = 0.1, TF.names = NULL,
1031 1030
                                   ontology = c("GO_BP", "GO_MF"), algorithm = "weight01", 
... ...
@@ -791,9 +791,10 @@ plotDiagnosticPlots_TFPeaks <- function(GRN,
791 791
 #' @return The same \code{\linkS4class{GRN}} object, with added data from this function.
792 792
 #' @examples 
793 793
 #' # See the Workflow vignette on the GRaNIE website for examples
794
-#' GRN = loadExampleObject()
795
-#' GRN = plotDiagnosticPlots_peakGene(GRN, forceRerun = FALSE)
794
+#' # GRN = loadExampleObject()
795
+#' # GRN = plotDiagnosticPlots_peakGene(GRN, forceRerun = FALSE)
796 796
 #' @export
797
+# TODO: implement forceRerun correctly
797 798
 plotDiagnosticPlots_peakGene <- function(GRN, 
798 799
                                          outputFolder = NULL, 
799 800
                                          basenameOutput = NULL, 
... ...
@@ -845,7 +846,8 @@ plotDiagnosticPlots_peakGene <- function(GRN,
845 846
                                     useFiltered = useFiltered, 
846 847
                                     plotPerTF = plotPerTF,
847 848
                                     pdf_width = pdf_width, pdf_height = pdf_height,
848
-                                    plotDetails = plotDetails)
849
+                                    plotDetails = plotDetails,
850
+                                    forceRerun = forceRerun)
849 851
   
850 852
   # Summarize all data, both real and random
851 853
   # filteredStr = dplyr::if_else(useFiltered, "_filtered", "")
... ...
@@ -871,564 +873,578 @@ plotDiagnosticPlots_peakGene <- function(GRN,
871 873
                                               plotPerTF = FALSE,
872 874
                                               fileBase = NULL,
873 875
                                               pdf_width = 12,
874
-                                              pdf_height = 12) {
876
+                                              pdf_height = 12,
877
+                                              forceRerun = FALSE) {
875 878
   
876 879
   start = Sys.time()
877
-  futile.logger::flog.info(paste0("Plotting diagnostic plots for peak-gene correlations", dplyr::if_else(is.null(fileBase), "", paste0(" to file(s) with basename ", fileBase))))
878
-  
879
-  cols_keep = c("peak.ID", "gene.ENSEMBL", "peak_gene.r", "peak_gene.p_raw", "peak_gene.distance")
880
-
881
-  # Change from 10 to 5 
882
-  nCategoriesBinning = 5
883
-  probs = seq(0,1,1/nCategoriesBinning)
884
-  
885
-  
886
-  range = GRN@config$parameters$promoterRange
887
-  
888
-  networkType_details = c(paste0("real_",range), paste0("random_",range))
889
-  
890
-  colors_vec = c("black", "darkgray")
891
-  networkType_vec = c("real", "permuted")
892
-  names(colors_vec) = names(networkType_vec) = networkType_details
893
-  
894
-  options(dplyr.summarise.inform = FALSE) 
895
-  
896
-  includeRobustCols = FALSE
897
-  #Either take all or the filtered set of connections
898
-  if (useFiltered) {
899
-    
900
-    # TODO: Robust columns filter
901
-    peakGeneCorrelations.all = 
902
-      rbind(dplyr::select(GRN@connections$all.filtered[["0"]], dplyr::everything()) %>%
903
-              dplyr::mutate(class = paste0("real_",range))) %>%
904
-      rbind(dplyr::select(GRN@connections$all.filtered[["1"]],  dplyr::everything()) %>% 
905
-              dplyr::mutate(class = paste0("random_",range)))
906
-    
907
-  } else {
908
-    
909
-    robustColumns = c("peak_gene.p_raw.robust", "peak_gene.bias_M_p.raw", "peak_gene.bias_LS_p.raw", "peak_gene.r_robust")
910
-    if (all(robustColumns %in% colnames(GRN@connections$peak_genes[["0"]]))) {
911
-      includeRobustCols = TRUE
912
-      cols_keep = c(cols_keep, robustColumns)
913
-    }
914
-    
915
-    class_levels = c(paste0("real_",range), paste0("random_",range))
916
-    
917
-    peakGeneCorrelations.all = 
918
-      rbind(
919
-          dplyr::select(GRN@connections$peak_genes[["0"]], tidyselect::all_of(cols_keep)) %>%
920
-                 dplyr::mutate(class = factor(paste0("real_",range), levels = class_levels)),
921
-          dplyr::select(GRN@connections$peak_genes[["1"]],  tidyselect::all_of(cols_keep)) %>% 
922
-                 dplyr::mutate(class = factor(paste0("random_",range), levels = class_levels))) %>%
923
-      dplyr::left_join(dplyr::select(GRN@annotation$genes, gene.ENSEMBL, gene.type, gene.mean, gene.median, gene.CV), by = "gene.ENSEMBL") %>%
924
-      dplyr::left_join(GRN@annotation$consensusPeaks %>% 
925
-                             dplyr::select(-dplyr::starts_with("peak.gene."), -peak.GC.perc), by = "peak.ID") %>%
926
-      dplyr::select(-gene.ENSEMBL)
927
-    
928
-  }
929
-  
930
-  if (!plotPerTF) {
931
-      peakGeneCorrelations.all = dplyr::select(peakGeneCorrelations.all, -peak.ID)
932
-  }
933
-  
934
-  nClasses_distance  = 10
935
-  
936
-  peakGeneCorrelations.all = peakGeneCorrelations.all %>%
937
-    dplyr::mutate(r_positive = peak_gene.r > 0,
938
-                  peak_gene.distance_class = 
939
-                    forcats::fct_explicit_na(addNA(cut(peak_gene.distance, breaks = nClasses_distance, include.lowest = TRUE)), "random"),
940
-                  peak_gene.distance_class_abs = forcats::fct_explicit_na(addNA(cut(abs(peak_gene.distance), 
941
-                                                                                    breaks = nClasses_distance, include.lowest = TRUE, ordered_result = TRUE)), "random"),
942
-                  peak_gene.p.raw.class = cut(peak_gene.p_raw, breaks = seq(0,1,0.05), include.lowest = TRUE, ordered_result = TRUE),
943
-                  peak_gene.r.class = cut(peak_gene.r, breaks = seq(-1,1,0.05), include.lowest = TRUE, ordered_result = TRUE)) %>%
944
-    dplyr::filter(!is.na(peak_gene.r)) # Eliminate rows with NA for peak_gene.r. This can happen if the normalized gene counts are identical across ALL samples, and due to the lack of any variation, the correlation cannot be computed
945
-  
946
-  
947
-  # Oddity of cut: When breaks is specified as a single number, the range of the data is divided into breaks pieces of equal length, and then the outer limits are moved away by 0.1% of the range to ensure that the extreme values both fall within the break intervals. 
948
-  levels(peakGeneCorrelations.all$peak_gene.distance_class_abs)[1] = 
949
-      gsub("(-\\d+)", "0", levels(peakGeneCorrelations.all$peak_gene.distance_class_abs)[1], perl = TRUE)
950
-  
951
-
952
-  if (includeRobustCols) {
953
-    peakGeneCorrelations.all = peakGeneCorrelations.all %>%
954
-      dplyr::mutate(peak_gene.p_raw.robust.class = 
955
-                      cut(peak_gene.p_raw.robust, breaks = seq(0,1,0.05), include.lowest = TRUE, ordered_result = TRUE))
956
-  }
957
-  
958
-
959
-  # Prepare plots #
960
-
961
-  colors_class = c("black", "black")
962
-  names(colors_class)= unique(peakGeneCorrelations.all$class)
963
-  colors_class[which(grepl("random", names(colors_class)))] = "darkgray"
964
-  
965
-  r_pos_class = c("black", "darkgray")
966
-  names(r_pos_class) =c("TRUE", "FALSE")
967
-  
968
-  dist_class = c("dark red", "#fc9c9c")
969
-  names(dist_class) = class_levels
970
-  
971
-  freqs= table(peakGeneCorrelations.all$class)
972
-  freq_class = paste0(gsub(names(freqs), pattern = "(.+)(_.*)", replacement = "\\1"), " (n=", .prettyNum(freqs) , ")")
973
-  # Change upstream and go with "permuted" everywhere
974
-  freq_class = gsub(freq_class, pattern = "random", replacement = "permuted")
975
-  names(freq_class) <- names(freqs)
976 880
   
881
+  # get list of all filenames that are going to be created
882
+  filenames.all = c()
977 883
   filteredStr = dplyr::if_else(useFiltered, "_filtered", "")
978
-  
979
-  xlabels_peakGene_r.class = levels(peakGeneCorrelations.all$peak_gene.r.class)
980
-  nCur = length(xlabels_peakGene_r.class)
981
-  xlabels_peakGene_r.class[setdiff(seq_len(nCur), c(1, floor(nCur/2), nCur))] <- ""
982
-  
983
-  # For the last plot, which is wider, we label a few more
984
-  xlabels_peakGene_r.class2 = levels(peakGeneCorrelations.all$peak_gene.r.class)
985
-  nCur = length(xlabels_peakGene_r.class2)
986
-  xlabels_peakGene_r.class2[setdiff(seq_len(nCur), c(1, floor(nCur/4), floor(nCur/2), floor(nCur/4*3), nCur))] <- ""
987
-  
988
-  xlabels_peakGene_praw.class = levels(peakGeneCorrelations.all$peak_gene.p.raw.class)
989
-  nCur = length(xlabels_peakGene_praw.class)
990
-  xlabels_peakGene_praw.class[setdiff(seq_len(nCur), c(1, floor(nCur/2), nCur))] <- ""
991
-  
992
-  
993 884
   for (geneTypesSelected in gene.types) {
994
-      
995
-      futile.logger::flog.info(paste0(" Gene type ", paste0(geneTypesSelected, collapse = "+")))
996
-
997
-      if ("all" %in% geneTypesSelected) {
998
-          indexCur = seq_len(nrow(peakGeneCorrelations.all))
999
-      } else {
1000
-          indexCur = which(peakGeneCorrelations.all$gene.type %in% geneTypesSelected)
1001
-      }
1002
-      
1003
-  
1004
-      # START PLOTTING #
1005
-    
1006
-      if (!is.null(fileBase)) {
1007
-          filenameCur = paste0(fileBase, paste0(geneTypesSelected, collapse = "+"), filteredStr, ".pdf")
1008
-           grDevices::pdf(file = filenameCur, width = pdf_width, height = pdf_height)
1009
-      }
1010
-      
1011
-      if (plotPerTF) {
1012
-          
1013
-          TF.nRows = rep(-1, length(GRN@config$allTF))
1014
-          #TF.nRows = rep(-1, 10)
1015
-          TF.peaks = list()
1016
-          names(TF.nRows) = GRN@config$allTF
1017
-          for (TFCur in GRN@config$allTF) {
1018
-              TF.peaks[[TFCur]] = names(which(GRN@data$TFs$TF_peak_overlap[,TFCur] == 1))
1019
-              TF.nRows[TFCur] = peakGeneCorrelations.all[indexCur,] %>% dplyr::filter(peak.ID %in% TF.peaks[[TFCur]]) %>% nrow()
1020
-          }
1021
-          
1022
-          TFs_sorted = names(sort(TF.nRows, decreasing = TRUE))
1023
-          
1024
-          allTF = c("all", TFs_sorted)
1025
-          
1026
-      } else {
1027
-          allTF = "all"
1028
-      }
1029
-      
1030
-      counter = 0
1031
-      for (TFCur in allTF) {
1032
-          
1033
-          counter = counter + 1
1034
-          
1035
-          if (length(allTF) > 1) {
1036
-              futile.logger::flog.info(paste0(" QC plots for TF ", TFCur, " (", counter, " of ", length(allTF), ")"))
1037
-          }
1038
-          
1039
-          
1040
-          if ("all" %in% geneTypesSelected) {
1041
-              indexCur = seq_len(nrow(peakGeneCorrelations.all))
1042
-          } else {
1043
-              indexCur = which(peakGeneCorrelations.all$gene.type %in% geneTypesSelected)
1044
-          }
1045
-          
1046
-          
1047
-          if (TFCur != "all") {
1048
-              indexCur = intersect(indexCur, which(peakGeneCorrelations.all$peak.ID %in% TF.peaks[[TFCur]]))
1049
-          }
1050
-          
1051
-          # Get subset also for just the real data
1052
-          indexCurReal = intersect(indexCur, which(peakGeneCorrelations.all$class == names(dist_class)[1]))
1053
-          
1054
-          
1055
-          xlabel = paste0("Correlation raw p-value")
1056
-          
1057
-          # DENSITY PLOTS P VALUE
1058
-          
1059
-          # TODO: Densities as ratio
1060
-          # https://stackoverflow.com/questions/58629959/how-can-i-extract-data-from-a-kernel-density-function-in-r-for-many-samples-at-o#:~:text=To%20compute%20the%20density%20you,can%20use%20the%20package%20spatstat%20.
1061
-          
1062
-          # Produce the labels for the class-specific subtitles
1063
-          customLabel_class = .customLabeler(table(peakGeneCorrelations.all[indexCur,]$class))
1064
-          
1065
-          r_pos_freq = table(peakGeneCorrelations.all[indexCur,]$r_positive)
1066
-          labeler_r_pos = labeller(r_positive = c("TRUE"  = paste0("r positive (", r_pos_freq["TRUE"], ")"), 
1067
-                                                  "FALSE" = paste0("r negative (", r_pos_freq["FALSE"], ")")) )
1068
-          theme_main = theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = rel(0.8)),
1069
-                             axis.text.y = element_text(size=rel(0.8)),
1070
-                             panel.grid.major = element_blank(), panel.grid.minor = element_blank())
1071
-          
1072
-
1073
-
1074
-          ## p-val density curves stratified by real/permuted ##
1075
-          
1076
-          gA2 = ggplot(peakGeneCorrelations.all[indexCur,], aes(peak_gene.p_raw, color = r_positive)) + geom_density()  +
1077
-              facet_wrap(~ class, labeller = labeller(class=freq_class) ) +
1078
-              xlab(xlabel) + ylab("Density") +  theme_bw() +
1079
-              scale_color_manual(labels = names(r_pos_class), values = r_pos_class) +
1080
-              theme(legend.position = "none", axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.8)))
1081
-          
1082
-          # Helper function to retrieve all tables and data aggregation steps for subsequent visualization
1083
-          tbl.l = .createTables_peakGeneQC(peakGeneCorrelations.all[indexCur,], networkType_details, colors_vec, range)
1084
-          
1085
-          
1086
-          xlabel = paste0("Correlation raw\np-value (binned)")
1087
-          
1088
-   
1089
-          xlabels = levels(tbl.l$d_merged$peak_gene.p.raw.class)
1090
-          xlabels[setdiff(seq_len(length(xlabels)), c(1, floor(length(xlabels)/2), length(xlabels)))] <- ""
1091
-          
1092
-          gB3 = ggplot(tbl.l$d_merged, aes(peak_gene.p.raw.class, ratio, fill = classAll)) + 
1093
-              geom_bar(stat = "identity", position="dodge", na.rm = TRUE, width = 0.5) + 
1094
-              geom_hline(yintercept = 1, linetype = "dotted") + 
1095
-              xlab(xlabel) + ylab("Ratio") +
1096
-              scale_fill_manual("Class", values = c(dist_class, r_pos_class), 
1097
-                                labels = c("real", "permuted", "r+ (r>0)", "r- (r<=0)"), 
1098
-                                ) + # labels vector can be kind of manually specified here because the levels were previosly defined in a certain order
1099
-              scale_x_discrete(labels = xlabels_peakGene_praw.class) +
1100
-              theme_bw() +  
1101
-              #theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8), strip.background = element_blank(), strip.placement = "outside", axis.title.y = element_blank()) +
1102
-              # theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8) , axis.title.y = element_blank()) +
1103
-              theme_main +
1104
-              facet_wrap(~ factor(set), nrow = 2, scales = "free_y", strip.position = "left") 
1105
-          
1106
-          #  plot two FDR plots as well: (fraction of negative / negative+positive) and (fraction of permuted / permuted + real)
1107
-          # that could give an indication of whether an FDR based on the permuted or based on the negative correlations would be more stringent
1108
-
1109
-          # R PEAK GENE #
1110
-          
1111
-          xlabel = paste0("Correlation coefficient r")
1112
-          
1113
-          sum_real = table(peakGeneCorrelations.all[indexCur,]$class)[names(dist_class)[1]]
1114
-          sum_rnd  = table(peakGeneCorrelations.all[indexCur,]$class)[names(dist_class)[2]]
1115
-          binData.r = peakGeneCorrelations.all[indexCur,] %>%
1116
-              dplyr::group_by(class) %>%
1117
-              dplyr::count(peak_gene.r.class) %>%
1118
-              dplyr::mutate(nnorm = dplyr::case_when(class == !! (names(dist_class)[1]) ~ .data$n / (sum_real / sum_rnd), 
1119
-                                              TRUE ~ as.double(.data$n)))
1120
-          
1121
-          xlabel = paste0("Correlation coefficient r (binned)")
1122
-
1123
-          gD = ggplot(binData.r, aes(peak_gene.r.class, nnorm, group = class, fill = class)) + 
1124
-              geom_bar(stat = "identity", position = position_dodge(preserve = "single"), na.rm = FALSE, width = 0.5) +
1125
-              geom_line(aes(peak_gene.r.class, nnorm, group = class, color= class), stat = "identity") +
1126
-              scale_fill_manual("Group", labels = names(dist_class), values = dist_class) +
1127
-              scale_color_manual("Group", labels = names(dist_class), values = dist_class) +
1128
-              scale_x_discrete(labels = xlabels_peakGene_r.class2, drop = FALSE) +
1129
-              theme_bw() + theme(legend.position = "none") +
1130
-              xlab(xlabel) + ylab("Abundance") +
1131
-              theme_main  +
1132
-              scale_y_continuous(labels = scales::scientific)
1133
-          
1134
-          
1135
-          mainTitle = paste0("Summary QC (TF: ", TFCur, ", gene type: ", paste0(geneTypesSelected, collapse = "+"), ",\n", .prettyNum(range), " bp promoter range)")
1136
-          
1137
-          plots_all = ( ((gA2 | gB3 ) + 
1138
-                             patchwork::plot_layout(widths = c(2.5,1.5))) / ((gD) + 
1139
-                                                                                 patchwork::plot_layout(widths = c(4))) ) + 
1140
-              patchwork::plot_layout(heights = c(2,1), guides = 'collect') +
1141
-              patchwork::plot_annotation(title = mainTitle, theme = theme(plot.title = element_text(hjust = 0.5)))
1142
-          
1143
-          plot(plots_all)
1144
-          
1145
-          
1146
-      } # end for each TF
1147
-      
1148
-      
1149
-      
1150
-      if (!is.null(GRN@annotation$consensusPeaks_obj) & is.installed("ChIPseeker")) {
1151
-          #plot(ChIPseeker::plotAnnoBar(GRN@annotation$consensusPeaks_obj))
1152
-          
1153
-          # no plot, as this is somehow just a list and no ggplot object
1154
-          ChIPseeker::plotAnnoPie(GRN@annotation$consensusPeaks_obj, 
1155
-                                  main = paste0("\nPeak annotation BEFORE filtering (n = ", nrow(GRN@annotation$consensusPeaks), ")"))
1156
-          plot(ChIPseeker::plotDistToTSS(GRN@annotation$consensusPeaks_obj))
1157
-      }
1158
-      
1159
-      # PEAK AND GENE PROPERTIES #
1160
-
1161
-      xlabel = paste0("Correlation raw\np-value (binned)")
1162
-      mainTitle = paste0("Summary QC (TF: ", TFCur, ", gene type: ", paste0(geneTypesSelected, collapse = "+"), ", ", .prettyNum(range), " bp promoter range)")
1163
-      
1164
-      allVars = c("peak.GC.class",  "peak.width", "peak.mean","peak.median",
1165
-                  "peak.CV", "gene.median",  "gene.mean", "gene.CV", "peak.gene.combined.CV")
1166
-      
1167
-      # If ChIPseeker is not installed, remove peak.annotation from it
1168
-      if ("peak.annotation" %in% colnames(peakGeneCorrelations.all)) {
1169
-          allVars = c(allVars, "peak.annotation")
1170
-      }
1171
-    
1172
-      
1173
-      for (varCur in allVars) {
1174
-          #next
1175
-          # Save memory and prune the table and add only the variable we need here
1176
-          if (varCur != "peak.gene.combined.CV") {
1177
-              dataCur = peakGeneCorrelations.all[indexCurReal,] %>%
1178
-                  dplyr::select(peak_gene.p_raw, tidyselect::all_of(varCur), class, r_positive, peak_gene.p.raw.class, peak_gene.distance) 
1179
-          } else {
1180
-              dataCur = peakGeneCorrelations.all[indexCurReal,] %>%
1181
-                  dplyr::select(peak_gene.p_raw, class, gene.CV, peak.CV, r_positive, peak_gene.p.raw.class, peak_gene.distance)
1182
-          }
885
+      filenameCur = paste0(fileBase, paste0(geneTypesSelected, collapse = "+"), filteredStr, ".pdf")
886
+      filenames.all = c(filenames.all, filenameCur)
887
+  }
888
+ 
889
+  if (!all(file.exists(filenames.all)) | forceRerun) {
890
+     
891
+     futile.logger::flog.info(paste0("Plotting diagnostic plots for peak-gene correlations", dplyr::if_else(is.null(fileBase), "", paste0(" to file(s) with basename ", fileBase))))
892
+     
893
+     cols_keep = c("peak.ID", "gene.ENSEMBL", "peak_gene.r", "peak_gene.p_raw", "peak_gene.distance")
894
+     
895
+     # Change from 10 to 5 
896
+     nCategoriesBinning = 5
897
+     probs = seq(0,1,1/nCategoriesBinning)
898
+     
899
+     
900
+     range = GRN@config$parameters$promoterRange
901
+     
902
+     networkType_details = c(paste0("real_",range), paste0("random_",range))
903
+     
904
+     colors_vec = c("black", "darkgray")
905
+     networkType_vec = c("real", "permuted")
906
+     names(colors_vec) = names(networkType_vec) = networkType_details
907
+     
908
+     options(dplyr.summarise.inform = FALSE) 
909
+     
910
+     includeRobustCols = FALSE
911
+     #Either take all or the filtered set of connections
912
+     if (useFiltered) {
1183 913
          
1184
-          
1185
-          # Choose colors depending on type of variable: gradient or not?
1186
-          
1187
-          if (varCur %in% c("peak.annotation","peak.GC.class")) {
914
+         # TODO: Robust columns filter
915
+         peakGeneCorrelations.all = 
916
+             rbind(dplyr::select(GRN@connections$all.filtered[["0"]], dplyr::everything()) %>%
917
+                       dplyr::mutate(class = paste0("real_",range))) %>%
918
+             rbind(dplyr::select(GRN@connections$all.filtered[["1"]],  dplyr::everything()) %>% 
919
+                       dplyr::mutate(class = paste0("random_",range)))
920
+         
921
+     } else {
922
+         
923
+         robustColumns = c("peak_gene.p_raw.robust", "peak_gene.bias_M_p.raw", "peak_gene.bias_LS_p.raw", "peak_gene.r_robust")
924
+         if (all(robustColumns %in% colnames(GRN@connections$peak_genes[["0"]]))) {
925
+             includeRobustCols = TRUE
926
+             cols_keep = c(cols_keep, robustColumns)
927
+         }
928
+         
929
+         class_levels = c(paste0("real_",range), paste0("random_",range))
930
+         
931
+         peakGeneCorrelations.all = 
932
+             rbind(
933
+                 dplyr::select(GRN@connections$peak_genes[["0"]], tidyselect::all_of(cols_keep)) %>%
934
+                     dplyr::mutate(class = factor(paste0("real_",range), levels = class_levels)),
935
+                 dplyr::select(GRN@connections$peak_genes[["1"]],  tidyselect::all_of(cols_keep)) %>% 
936
+                     dplyr::mutate(class = factor(paste0("random_",range), levels = class_levels))) %>%
937
+             dplyr::left_join(dplyr::select(GRN@annotation$genes, gene.ENSEMBL, gene.type, gene.mean, gene.median, gene.CV), by = "gene.ENSEMBL") %>%
938
+             dplyr::left_join(GRN@annotation$consensusPeaks %>% 
939
+                                  dplyr::select(-dplyr::starts_with("peak.gene."), -peak.GC.perc), by = "peak.ID") %>%
940
+             dplyr::select(-gene.ENSEMBL)
941
+         
942
+     }
1188 943
      
1189
-              newColName = varCur
1190
-              
1191
-          } else {
1192
-              
1193
-              newColName = paste0(varCur, ".class")
1194
-              
1195
-              if (varCur == "peak.gene.combined.CV") {
1196
-                  
1197
-                  dataCur = dataCur %>%
1198
-                      dplyr::mutate(!!(newColName) := dplyr::case_when(
1199
-                          gene.CV < 0.5                & peak.CV < 0.5                ~ "gene.CV+peak.CV<0.5", ##
1200
-                          # gene.CV >= 0.5 & gene.CV < 1 & peak.CV < 0.5                ~ "gene.CV<1_peak.CV<0.5",
1201
-                          # gene.CV >= 1                 & peak.CV < 0.5                ~ "gene.CV>1_peak.CV<0.5",
1202
-                          # 
1203
-                          # gene.CV < 0.5                & peak.CV >= 0.5 & peak.CV < 1 ~ "gene.CV<0.5_peak.CV<1",
1204
-                          # gene.CV >= 0.5 & gene.CV < 1 & peak.CV >= 0.5 & peak.CV < 1 ~ "gene.CV<1_peak.CV<1",
1205
-                          # gene.CV >= 1                 & peak.CV >= 0.5 & peak.CV < 1 ~ "gene.CV>1_peak.CV<1",
1206
-                          # 
1207
-                          # gene.CV < 0.5                & peak.CV >= 1                 ~ "gene.CV<0.5_peak.CV>1",
1208
-                          # gene.CV >= 0.5 & gene.CV < 1 & peak.CV >= 1                 ~ "gene.CV<1_peak.CV>1",
1209
-                          gene.CV >= 1                 & peak.CV >= 1                 ~ "gene.CV+peak.CV>1", ##
1210
-                          TRUE ~ "other"
1211
-                      )) %>%
1212
-                      dplyr::select(-gene.CV, -peak.CV)
1213
-                  
1214
-              } else {
1215
-                  
1216
-                  dataCur = dataCur %>%
1217
-                      dplyr::mutate(!!(newColName) := cut(.data[[varCur]], breaks = unique(quantile(.data[[varCur]], probs = probs, na.rm = TRUE)), 
1218
-                                                      include.lowest = TRUE, ordered_result = TRUE)) %>%
1219
-                      dplyr::select(-tidyselect::all_of(varCur))
1220
-              }
1221
-              
944
+     if (!plotPerTF) {
945
+         peakGeneCorrelations.all = dplyr::select(peakGeneCorrelations.all, -peak.ID)
946
+     }
947
+     
948
+     nClasses_distance  = 10
949
+     
950
+     peakGeneCorrelations.all = peakGeneCorrelations.all %>%
951
+         dplyr::mutate(r_positive = peak_gene.r > 0,
952
+                       peak_gene.distance_class = 
953
+                           forcats::fct_explicit_na(addNA(cut(peak_gene.distance, breaks = nClasses_distance, include.lowest = TRUE)), "random"),
954
+                       peak_gene.distance_class_abs = forcats::fct_explicit_na(addNA(cut(abs(peak_gene.distance), 
955
+                                                                                         breaks = nClasses_distance, include.lowest = TRUE, ordered_result = TRUE)), "random"),
956
+                       peak_gene.p.raw.class = cut(peak_gene.p_raw, breaks = seq(0,1,0.05), include.lowest = TRUE, ordered_result = TRUE),
957
+                       peak_gene.r.class = cut(peak_gene.r, breaks = seq(-1,1,0.05), include.lowest = TRUE, ordered_result = TRUE)) %>%
958
+         dplyr::filter(!is.na(peak_gene.r)) # Eliminate rows with NA for peak_gene.r. This can happen if the normalized gene counts are identical across ALL samples, and due to the lack of any variation, the correlation cannot be computed
959
+     
960
+     
961
+     # Oddity of cut: When breaks is specified as a single number, the range of the data is divided into breaks pieces of equal length, and then the outer limits are moved away by 0.1% of the range to ensure that the extreme values both fall within the break intervals. 
962
+     levels(peakGeneCorrelations.all$peak_gene.distance_class_abs)[1] = 
963
+         gsub("(-\\d+)", "0", levels(peakGeneCorrelations.all$peak_gene.distance_class_abs)[1], perl = TRUE)
964
+     
965
+     
966
+     if (includeRobustCols) {
967
+         peakGeneCorrelations.all = peakGeneCorrelations.all %>%
968
+             dplyr::mutate(peak_gene.p_raw.robust.class = 
969
+                               cut(peak_gene.p_raw.robust, breaks = seq(0,1,0.05), include.lowest = TRUE, ordered_result = TRUE))
970
+     }
971
+     
972
+     
973
+     # Prepare plots #
974
+     
975
+     colors_class = c("black", "black")
976
+     names(colors_class)= unique(peakGeneCorrelations.all$class)
977
+     colors_class[which(grepl("random", names(colors_class)))] = "darkgray"
978
+     
979
+     r_pos_class = c("black", "darkgray")
980
+     names(r_pos_class) =c("TRUE", "FALSE")
981
+     
982
+     dist_class = c("dark red", "#fc9c9c")
983
+     names(dist_class) = class_levels
984
+     
985
+     freqs= table(peakGeneCorrelations.all$class)
986
+     freq_class = paste0(gsub(names(freqs), pattern = "(.+)(_.*)", replacement = "\\1"), " (n=", .prettyNum(freqs) , ")")
987
+     # Change upstream and go with "permuted" everywhere
988
+     freq_class = gsub(freq_class, pattern = "random", replacement = "permuted")
989
+     names(freq_class) <- names(freqs)
990
+     
991
+     
992
+     
993
+     xlabels_peakGene_r.class = levels(peakGeneCorrelations.all$peak_gene.r.class)
994
+     nCur = length(xlabels_peakGene_r.class)
995
+     xlabels_peakGene_r.class[setdiff(seq_len(nCur), c(1, floor(nCur/2), nCur))] <- ""
996
+     
997
+     # For the last plot, which is wider, we label a few more
998
+     xlabels_peakGene_r.class2 = levels(peakGeneCorrelations.all$peak_gene.r.class)
999
+     nCur = length(xlabels_peakGene_r.class2)
1000
+     xlabels_peakGene_r.class2[setdiff(seq_len(nCur), c(1, floor(nCur/4), floor(nCur/2), floor(nCur/4*3), nCur))] <- ""
1001
+     
1002
+     xlabels_peakGene_praw.class = levels(peakGeneCorrelations.all$peak_gene.p.raw.class)
1003
+     nCur = length(xlabels_peakGene_praw.class)
1004
+     xlabels_peakGene_praw.class[setdiff(seq_len(nCur), c(1, floor(nCur/2), nCur))] <- ""
1005
+     
1006
+     
1007
+     for (geneTypesSelected in gene.types) {
1008
+         
1009
+         futile.logger::flog.info(paste0(" Gene type ", paste0(geneTypesSelected, collapse = "+")))
1010
+         
1011
+         if ("all" %in% geneTypesSelected) {
1012
+             indexCur = seq_len(nrow(peakGeneCorrelations.all))
1013
+         } else {
1014
+             indexCur = which(peakGeneCorrelations.all$gene.type %in% geneTypesSelected)
1015
+         }
1016
+         
1017
+         
1018
+         # START PLOTTING #
1019
+         
1020
+         if (!is.null(fileBase)) {
1021
+             filenameCur = paste0(fileBase, paste0(geneTypesSelected, collapse = "+"), filteredStr, ".pdf")
1022
+             grDevices::pdf(file = filenameCur, width = pdf_width, height = pdf_height)
1023
+         }
1024
+         
1025
+         if (plotPerTF) {
1026
+             
1027
+             TF.nRows = rep(-1, length(GRN@config$allTF))
1028
+             #TF.nRows = rep(-1, 10)
1029
+             TF.peaks = list()
1030
+             names(TF.nRows) = GRN@config$allTF
1031
+             for (TFCur in GRN@config$allTF) {
1032
+                 TF.peaks[[TFCur]] = names(which(GRN@data$TFs$TF_peak_overlap[,TFCur] == 1))
1033
+                 TF.nRows[TFCur] = peakGeneCorrelations.all[indexCur,] %>% dplyr::filter(peak.ID %in% TF.peaks[[TFCur]]) %>% nrow()
1034
+             }
1035
+             
1036
+             TFs_sorted = names(sort(TF.nRows, decreasing = TRUE))
1037
+             
1038
+             allTF = c("all", TFs_sorted)
1039
+             
1040
+         } else {
1041
+             allTF = "all"
1042
+         }
1043
+         
1044
+         counter = 0
1045
+         for (TFCur in allTF) {
1046
+             
1047
+             counter = counter + 1
1048
+             
1049
+             if (length(allTF) > 1) {
1050
+                 futile.logger::flog.info(paste0(" QC plots for TF ", TFCur, " (", counter, " of ", length(allTF), ")"))
1051
+             }
1052
+             
1053
+             
1054
+             if ("all" %in% geneTypesSelected) {
1055
+                 indexCur = seq_len(nrow(peakGeneCorrelations.all))
1056
+             } else {
1057
+                 indexCur = which(peakGeneCorrelations.all$gene.type %in% geneTypesSelected)
1058
+             }
1059
+             
1060
+             
1061
+             if (TFCur != "all") {
1062
+                 indexCur = intersect(indexCur, which(peakGeneCorrelations.all$peak.ID %in% TF.peaks[[TFCur]]))
1063
+             }
1064
+             
1065
+             # Get subset also for just the real data
1066
+             indexCurReal = intersect(indexCur, which(peakGeneCorrelations.all$class == names(dist_class)[1]))
1067
+             
1068
+             
1069
+             xlabel = paste0("Correlation raw p-value")
1070
+             
1071
+             # DENSITY PLOTS P VALUE
1072
+             
1073
+             # TODO: Densities as ratio
1074
+             # https://stackoverflow.com/questions/58629959/how-can-i-extract-data-from-a-kernel-density-function-in-r-for-many-samples-at-o#:~:text=To%20compute%20the%20density%20you,can%20use%20the%20package%20spatstat%20.
1075
+             
1076
+             # Produce the labels for the class-specific subtitles
1077
+             customLabel_class = .customLabeler(table(peakGeneCorrelations.all[indexCur,]$class))
1078
+             
1079
+             r_pos_freq = table(peakGeneCorrelations.all[indexCur,]$r_positive)
1080
+             labeler_r_pos = labeller(r_positive = c("TRUE"  = paste0("r positive (", r_pos_freq["TRUE"], ")"), 
1081
+                                                     "FALSE" = paste0("r negative (", r_pos_freq["FALSE"], ")")) )
1082
+             theme_main = theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = rel(0.8)),
1083
+                                axis.text.y = element_text(size=rel(0.8)),
1084
+                                panel.grid.major = element_blank(), panel.grid.minor = element_blank())
1085
+             
1086
+             
1087
+             
1088
+             ## p-val density curves stratified by real/permuted ##
1089
+             
1090
+             gA2 = ggplot(peakGeneCorrelations.all[indexCur,], aes(peak_gene.p_raw, color = r_positive)) + geom_density()  +
1091
+                 facet_wrap(~ class, labeller = labeller(class=freq_class) ) +
1092
+                 xlab(xlabel) + ylab("Density") +  theme_bw() +
1093
+                 scale_color_manual(labels = names(r_pos_class), values = r_pos_class) +
1094
+                 theme(legend.position = "none", axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.8)))
1095
+             
1096
+             # Helper function to retrieve all tables and data aggregation steps for subsequent visualization
1097
+             tbl.l = .createTables_peakGeneQC(peakGeneCorrelations.all[indexCur,], networkType_details, colors_vec, range)
1098
+             
1099
+             
1100
+             xlabel = paste0("Correlation raw\np-value (binned)")
1101
+             
1102
+             
1103
+             xlabels = levels(tbl.l$d_merged$peak_gene.p.raw.class)
1104
+             xlabels[setdiff(seq_len(length(xlabels)), c(1, floor(length(xlabels)/2), length(xlabels)))] <- ""
1105
+             
1106
+             gB3 = ggplot(tbl.l$d_merged, aes(peak_gene.p.raw.class, ratio, fill = classAll)) + 
1107
+                 geom_bar(stat = "identity", position="dodge", na.rm = TRUE, width = 0.5) + 
1108
+                 geom_hline(yintercept = 1, linetype = "dotted") + 
1109
+                 xlab(xlabel) + ylab("Ratio") +
1110
+                 scale_fill_manual("Class", values = c(dist_class, r_pos_class), 
1111
+                                   labels = c("real", "permuted", "r+ (r>0)", "r- (r<=0)"), 
1112
+                 ) + # labels vector can be kind of manually specified here because the levels were previosly defined in a certain order
1113
+                 scale_x_discrete(labels = xlabels_peakGene_praw.class) +
1114
+                 theme_bw() +  
1115
+                 #theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8), strip.background = element_blank(), strip.placement = "outside", axis.title.y = element_blank()) +
1116
+                 # theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8) , axis.title.y = element_blank()) +
1117
+                 theme_main +
1118
+                 facet_wrap(~ factor(set), nrow = 2, scales = "free_y", strip.position = "left") 
1119
+             
1120
+             #  plot two FDR plots as well: (fraction of negative / negative+positive) and (fraction of permuted / permuted + real)
1121
+             # that could give an indication of whether an FDR based on the permuted or based on the negative correlations would be more stringent
1122
+             
1123
+             # R PEAK GENE #
1124
+             
1125
+             xlabel = paste0("Correlation coefficient r")
1126
+             
1127
+             sum_real = table(peakGeneCorrelations.all[indexCur,]$class)[names(dist_class)[1]]
1128
+             sum_rnd  = table(peakGeneCorrelations.all[indexCur,]$class)[names(dist_class)[2]]
1129
+             binData.r = peakGeneCorrelations.all[indexCur,] %>%
1130
+                 dplyr::group_by(class) %>%
1131
+                 dplyr::count(peak_gene.r.class) %>%
1132
+                 dplyr::mutate(nnorm = dplyr::case_when(class == !! (names(dist_class)[1]) ~ .data$n / (sum_real / sum_rnd), 
1133
+                                                        TRUE ~ as.double(.data$n)))
1134
+             
1135
+             xlabel = paste0("Correlation coefficient r (binned)")
1136
+             
1137
+             gD = ggplot(binData.r, aes(peak_gene.r.class, nnorm, group = class, fill = class)) + 
1138
+                 geom_bar(stat = "identity", position = position_dodge(preserve = "single"), na.rm = FALSE, width = 0.5) +
1139
+                 geom_line(aes(peak_gene.r.class, nnorm, group = class, color= class), stat = "identity") +
1140
+                 scale_fill_manual("Group", labels = names(dist_class), values = dist_class) +
1141
+                 scale_color_manual("Group", labels = names(dist_class), values = dist_class) +
1142
+                 scale_x_discrete(labels = xlabels_peakGene_r.class2, drop = FALSE) +
1143
+                 theme_bw() + theme(legend.position = "none") +
1144
+                 xlab(xlabel) + ylab("Abundance") +
1145
+                 theme_main  +
1146
+                 scale_y_continuous(labels = scales::scientific)
1147
+             
1148
+             
1149
+             mainTitle = paste0("Summary QC (TF: ", TFCur, ", gene type: ", paste0(geneTypesSelected, collapse = "+"), ",\n", .prettyNum(range), " bp promoter range)")
1150
+             
1151
+             plots_all = ( ((gA2 | gB3 ) + 
1152
+                                patchwork::plot_layout(widths = c(2.5,1.5))) / ((gD) + 
1153
+                                                                                    patchwork::plot_layout(widths = c(4))) ) + 
1154
+                 patchwork::plot_layout(heights = c(2,1), guides = 'collect') +
1155
+                 patchwork::plot_annotation(title = mainTitle, theme = theme(plot.title = element_text(hjust = 0.5)))
1156
+             
1157
+             plot(plots_all)
1158
+             
1159
+             
1160
+         } # end for each TF
1161
+         
1162
+         
1163
+         
1164
+         if (!is.null(GRN@annotation$consensusPeaks_obj) & is.installed("ChIPseeker")) {
1165
+             #plot(ChIPseeker::plotAnnoBar(GRN@annotation$consensusPeaks_obj))
1166
+             
1167
+             # no plot, as this is somehow just a list and no ggplot object
1168
+             ChIPseeker::plotAnnoPie(GRN@annotation$consensusPeaks_obj, 
1169
+                                     main = paste0("\nPeak annotation BEFORE filtering (n = ", nrow(GRN@annotation$consensusPeaks), ")"))
1170
+             plot(ChIPseeker::plotDistToTSS(GRN@annotation$consensusPeaks_obj))
1171
+         }
1172
+         
1173
+         # PEAK AND GENE PROPERTIES #
1174
+         
1175
+         xlabel = paste0("Correlation raw\np-value (binned)")
1176
+         mainTitle = paste0("Summary QC (TF: ", TFCur, ", gene type: ", paste0(geneTypesSelected, collapse = "+"), ", ", .prettyNum(range), " bp promoter range)")
1177
+         
1178
+         allVars = c("peak.GC.class",  "peak.width", "peak.mean","peak.median",
1179
+                     "peak.CV", "gene.median",  "gene.mean", "gene.CV", "peak.gene.combined.CV")
1180
+         
1181
+         # If ChIPseeker is not installed, remove peak.annotation from it
1182
+         if ("peak.annotation" %in% colnames(peakGeneCorrelations.all)) {
1183
+             allVars = c(allVars, "peak.annotation")
1184
+         }
1185
+         
1186
+         
1187
+         for (varCur in allVars) {
1188
+             #next
1189
+             # Save memory and prune the table and add only the variable we need here
1190
+             if (varCur != "peak.gene.combined.CV") {
1191
+                 dataCur = peakGeneCorrelations.all[indexCurReal,] %>%
1192
+                     dplyr::select(peak_gene.p_raw, tidyselect::all_of(varCur), class, r_positive, peak_gene.p.raw.class, peak_gene.distance) 
1193
+             } else {
1194
+                 dataCur = peakGeneCorrelations.all[indexCurReal,] %>%
1195
+                     dplyr::select(peak_gene.p_raw, class, gene.CV, peak.CV, r_positive, peak_gene.p.raw.class, peak_gene.distance)
1196
+             }
1197
+             
1198
+             
1199
+             # Choose colors depending on type of variable: gradient or not?
1200
+             
1201
+             if (varCur %in% c("peak.annotation","peak.GC.class")) {
1222 1202
                  
1223
-          }
1203
+                 newColName = varCur
1204
+                 
1205
+             } else {
1206
+                 
1207
+                 newColName = paste0(varCur, ".class")
1208
+                 
1209
+                 if (varCur == "peak.gene.combined.CV") {
1210
+                     
1211
+                     dataCur = dataCur %>%
1212
+                         dplyr::mutate(!!(newColName) := dplyr::case_when(
1213
+                             gene.CV < 0.5                & peak.CV < 0.5                ~ "gene.CV+peak.CV<0.5", ##
1214
+                             # gene.CV >= 0.5 & gene.CV < 1 & peak.CV < 0.5                ~ "gene.CV<1_peak.CV<0.5",
1215
+                             # gene.CV >= 1                 & peak.CV < 0.5                ~ "gene.CV>1_peak.CV<0.5",
1216
+                             # 
1217
+                             # gene.CV < 0.5                & peak.CV >= 0.5 & peak.CV < 1 ~ "gene.CV<0.5_peak.CV<1",
1218
+                             # gene.CV >= 0.5 & gene.CV < 1 & peak.CV >= 0.5 & peak.CV < 1 ~ "gene.CV<1_peak.CV<1",
1219
+                             # gene.CV >= 1                 & peak.CV >= 0.5 & peak.CV < 1 ~ "gene.CV>1_peak.CV<1",
1220
+                             # 
1221
+                             # gene.CV < 0.5                & peak.CV >= 1                 ~ "gene.CV<0.5_peak.CV>1",
1222
+                             # gene.CV >= 0.5 & gene.CV < 1 & peak.CV >= 1                 ~ "gene.CV<1_peak.CV>1",
1223
+                             gene.CV >= 1                 & peak.CV >= 1                 ~ "gene.CV+peak.CV>1", ##
1224
+                             TRUE ~ "other"
1225
+                         )) %>%
1226
+                         dplyr::select(-gene.CV, -peak.CV)
1227
+                     
1228
+                 } else {
1229
+                     
1230
+                     dataCur = dataCur %>%
1231
+                         dplyr::mutate(!!(newColName) := cut(.data[[varCur]], breaks = unique(quantile(.data[[varCur]], probs = probs, na.rm = TRUE)), 
1232
+                                                             include.lowest = TRUE, ordered_result = TRUE)) %>%
1233
+                         dplyr::select(-tidyselect::all_of(varCur))
1234
+                 }
1235
+                 
1236
+                 
1237
+             }
1238
+             
1239
+             # Filter groups with fewer than 100 observations
1240
+             nGroupsMin = 100
1241
+             dataCur = dataCur %>%
1242
+                 dplyr::group_by(.data[[newColName]]) %>%  
1243
+                 dplyr::filter(dplyr::n() >= nGroupsMin) %>%
1244
+                 dplyr::ungroup()
1245
+             
1246
+             var.label = .prettyNum(table(dataCur[, newColName]))
1247
+             var.label = paste0(names(var.label), "\n(n=", var.label , ")\n")
1248
+             
1249
+             # Set colors
1250
+             if (varCur != "peak.annotation") {
1251
+                 mycolors <- viridis::viridis(length(var.label))
1252
+             } else {
1253
+                 # Only here, we want to have colors that are not a gradient
1254
+                 mycolors = var.label
1255
+                 downstream  = which(grepl("downstream", var.label, ignore.case = TRUE))
1256
+                 promoter    = which(grepl("promoter", var.label, ignore.case = TRUE))
1257
+                 
1258
+                 # Remove the first, white-like color from the 2 palettes
1259
+                 mycolors[downstream] = RColorBrewer::brewer.pal(length(downstream) + 1, "Greens")[-1]
1260
+                 mycolors[promoter]   = RColorBrewer::brewer.pal(length(promoter) + 1, "Purples")[-1]
1261
+                 mycolors[which(grepl("3' UTR", mycolors))] = "yellow"
1262
+                 mycolors[which(grepl("5' UTR", mycolors))] = "orange"
1263
+                 mycolors[which(grepl("Distal Intergenic", mycolors))] = "red"
1264
+                 mycolors[which(grepl("Exon", mycolors))] = "maroon"
1265
+                 mycolors[which(grepl("Intron", mycolors))] = "lightblue"
1266
+                 
1267
+             }
1268
+             
1269
+             r_pos_tbl = dataCur %>% dplyr::group_by(r_positive) %>% dplyr::pull(r_positive) %>% table()
1270
+             r_positive_label = c("TRUE"   = paste0("r+(r>0, n=", .prettyNum(r_pos_tbl[["TRUE"]]), ")"), 
1271
+                                  "FALSE" = paste0("r-(r<=0, n=" ,.prettyNum(r_pos_tbl[["FALSE"]]), ")"))
1272
+             
1273
+             # Class in the facet_wrap has been removed, as this is only for real data here
1274
+             gA3 = ggplot(dataCur, aes(peak_gene.p_raw, color = .data[[newColName]])) + geom_density(size = 0.2)  +
1275
+                 facet_wrap(~r_positive, labeller = labeller(r_positive = r_positive_label), nrow = 2) +
1276
+                 geom_density(aes(color = classNew), color = "black",  linetype = "dotted", alpha = 1) + 
1277
+                 xlab(xlabel) + ylab("Density (for real data only)") +  theme_bw() +
1278
+                 scale_color_manual(newColName, values = mycolors, labels = var.label, drop = FALSE ) +
1279
+                 theme(axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.8)), 
1280
+                       legend.text=element_text(size=rel(0.6)), legend.position = "none")
1281
+             
1282
+             # The following causes an error with patchwork
1283
+             # gA3 = ggplot(dataCur, aes(peak_gene.p_raw, color = .data[[newColName]])) + geom_density(size = 0.2)  +
1284
+             #   facet_wrap(~ class + r_positive, labeller = labeller(class=freq_class, r_positive = r_positive_label), nrow = 2) +
1285
+             #   geom_density(aes(color = classNew), color = "black",  linetype = "dotted", alpha = 1) + 
1286
+             #   xlab(xlabel) + ylab("Density") +  theme_bw() +
1287
+             #   scale_color_manual(newColName, values = mycolors, labels = var.label, drop = FALSE ) +
1288
+             #   theme(axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.8)), 
1289
+             #         legend.text=element_text(size=rel(0.6)), legend.position = "none")
1290
+             
1291
+             
1292
+             # Ratios for r+ / r-
1293
+             freq =  dataCur %>%
1294
+                 dplyr::group_by(class, .data[[newColName]], peak_gene.p.raw.class, r_positive) %>%
1295
+                 dplyr::summarise(n = dplyr::n()) %>%
1296
+                 dplyr::ungroup() %>%
1297
+                 tidyr::complete(class, .data[[newColName]], peak_gene.p.raw.class, r_positive, fill = list(n = 0)) %>% # SOme cases might be missing
1298
+                 dplyr::group_by(class, .data[[newColName]], peak_gene.p.raw.class) %>% # dont group by r_positive because we want to calculate the ratio within each group
1299
+                 dplyr::mutate(  
1300
+                     n = .data$n + 1, # to allow ratios to be computed even for 0 counts
1301
+                     ratio_pos_raw = .data$n[r_positive] / .data$n[!r_positive]) %>%
1302
+                 dplyr::filter(r_positive, class == names(dist_class)[1])# Keep only one r_positive row per grouping as we operate via the ratio and this data is duplicated otherwise. Remove random data also because these have been filtered out before are only back due to the complete invokation.
1303
+             
1304
+             # Cap ratios > 10 at 10 to avoid visual artefacts
1305
+             freq$ratio_pos_raw[which(freq$ratio_pos_raw > 10)] = 10
1306
+             
1307
+             # Without proper colors for now, this will be added after the next plot
1308
+             gB3 = ggplot(freq, aes(peak_gene.p.raw.class, ratio_pos_raw, fill = .data[[newColName]])) + 
1309
+                 geom_bar(stat = "identity", position="dodge", na.rm = TRUE, width = 0.8) + 
1310
+                 geom_hline(yintercept = 1, linetype = "dotted") + 
1311
+                 xlab(xlabel) + ylab("Ratio r+ / r- (capped at 10)") +
1312
+                 scale_x_discrete(labels = xlabels_peakGene_praw.class) +
1313
+                 scale_fill_manual (varCur, values = mycolors, labels = var.label, drop = FALSE )  +
1314
+                 theme_bw() +  
1315
+                 #theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8), strip.background = element_blank(), strip.placement = "outside", axis.title.y = element_blank()) +
1316
+                 # theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8) , axis.title.y = element_blank()) +
1317
+                 theme_main +
1318
+                 facet_wrap(~ factor(class), nrow = 2, scales = "free_y", strip.position = "left", , labeller = labeller(class=freq_class)) 
1319
+             
1320
+             
1321
+             plots_all = ( ((gA3 | gB3 ) + 
1322
+                                patchwork::plot_layout(widths = c(2.5,1.5))) ) + 
1323
+                 patchwork::plot_layout(guides = 'collect') +
1324
+                 patchwork::plot_annotation(title = mainTitle, theme = theme(plot.title = element_text(hjust = 0.5)))
1325
+             
1326
+             plot(plots_all)
1327
+             
1328
+             
1329
+             # VERSION JUDITH: simplified peak.gene distance + another variable as histogram, no pemuted data
1330
+             
1331
+             datasetName = ""
1332
+             if (!is.null(GRN@config$metadata$name)) {
1333
+                 datasetName = GRN@config$metadata$name
1334
+             }
1335
+             
1336
+             mainTitle2 = paste0(datasetName, "\nSummary QC (TF: ", TFCur, ", gene type: ", paste0(geneTypesSelected, collapse = "+"), ", ",
1337
+                                 .prettyNum(range), " bp promoter range, stratified by distance + ", varCur, ")")
1338
+             
1339
+             # r+ and r-
1340
+             binwidth = 0.1
1341
+             mycolors <- viridis::viridis(2)
1342
+             xlabel = paste0("Correlation raw p-value (", binwidth, " bins)")
1343
+             
1344
+             dataCur = dataCur %>%
1345
+                 dplyr::mutate(peak_gene.distance.class250k = factor(dplyr::if_else(peak_gene.distance <= 250000, "<=250k", ">250k"))) %>%
1346
+                 dplyr::select(-peak_gene.distance)
1347
+             
1348
+             # closed = "left", boundary = 0, ensure correct numbers. See https://github.com/tidyverse/ggplot2/issues/1739
1349
+             # Without boundary = 0, counts are actually wrong
1350
+             
1351
+             
1352
+             nrows_plot = 2
1353
+             if (length(unique(dataCur[[newColName]])) > 9) {
1354
+                 nrows_plot = 3
1355
+             }
1356
+             
1357
+             gA5 = ggplot(dataCur, aes(peak_gene.p_raw, fill = r_positive)) + 
1358
+                 geom_histogram(binwidth = binwidth, position="dodge", closed = "left", boundary = 0)  +
1359
+                 facet_wrap(~ peak_gene.distance.class250k  + .data[[newColName]], nrow = nrows_plot, scales = "free_y") +
1360
+                 xlab(xlabel) + ylab(paste0("Abundance for classes with n>=", nGroupsMin)) +  theme_bw() +
1361
+                 ggtitle(mainTitle2) + 
1362
+                 scale_fill_manual("Class for r", values = mycolors, labels = r_positive_label, drop = FALSE ) +
1363
+                 theme(axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.6)), 
1364
+                       legend.text=element_text(size=rel(0.7)))
1365
+             
1366
+             
1367
+             plot(gA5)
1368
+             
1369
+             
1370
+             
1371
+         } #end for each variable
1224 1372
          
1225
-          # Filter groups with fewer than 100 observations
1226
-          nGroupsMin = 100
1227
-          dataCur = dataCur %>%
1228
-              dplyr::group_by(.data[[newColName]]) %>%  
1229
-              dplyr::filter(dplyr::n() >= nGroupsMin) %>%
1230
-              dplyr::ungroup()
1231
-          
1232
-          var.label = .prettyNum(table(dataCur[, newColName]))
1233
-          var.label = paste0(names(var.label), "\n(n=", var.label , ")\n")
1234
-          
1235
-          # Set colors
1236
-          if (varCur != "peak.annotation") {
1237
-              mycolors <- viridis::viridis(length(var.label))
1238
-          } else {
1239
-              # Only here, we want to have colors that are not a gradient
1240
-              mycolors = var.label
1241
-              downstream  = which(grepl("downstream", var.label, ignore.case = TRUE))
1242
-              promoter    = which(grepl("promoter", var.label, ignore.case = TRUE))
1243
-              
1244
-              # Remove the first, white-like color from the 2 palettes
1245
-              mycolors[downstream] = RColorBrewer::brewer.pal(length(downstream) + 1, "Greens")[-1]
1246
-              mycolors[promoter]   = RColorBrewer::brewer.pal(length(promoter) + 1, "Purples")[-1]
1247
-              mycolors[which(grepl("3' UTR", mycolors))] = "yellow"
1248
-              mycolors[which(grepl("5' UTR", mycolors))] = "orange"
1249
-              mycolors[which(grepl("Distal Intergenic", mycolors))] = "red"
1250
-              mycolors[which(grepl("Exon", mycolors))] = "maroon"
1251
-              mycolors[which(grepl("Intron", mycolors))] = "lightblue"
1252
-              
1253
-          }
1254
-          
1255
-          r_pos_tbl = dataCur %>% dplyr::group_by(r_positive) %>% dplyr::pull(r_positive) %>% table()
1256
-          r_positive_label = c("TRUE"   = paste0("r+(r>0, n=", .prettyNum(r_pos_tbl[["TRUE"]]), ")"), 
1257
-                               "FALSE" = paste0("r-(r<=0, n=" ,.prettyNum(r_pos_tbl[["FALSE"]]), ")"))
1258 1373
          
1259
-          # Class in the facet_wrap has been removed, as this is only for real data here
1260
-          gA3 = ggplot(dataCur, aes(peak_gene.p_raw, color = .data[[newColName]])) + geom_density(size = 0.2)  +
1261
-              facet_wrap(~r_positive, labeller = labeller(r_positive = r_positive_label), nrow = 2) +
1262
-              geom_density(aes(color = classNew), color = "black",  linetype = "dotted", alpha = 1) + 
1263
-              xlab(xlabel) + ylab("Density (for real data only)") +  theme_bw() +
1264
-              scale_color_manual(newColName, values = mycolors, labels = var.label, drop = FALSE ) +
1265
-              theme(axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.8)), 
1266
-                    legend.text=element_text(size=rel(0.6)), legend.position = "none")
1267
-          
1268
-          # The following causes an error with patchwork
1269
-          # gA3 = ggplot(dataCur, aes(peak_gene.p_raw, color = .data[[newColName]])) + geom_density(size = 0.2)  +
1270
-          #   facet_wrap(~ class + r_positive, labeller = labeller(class=freq_class, r_positive = r_positive_label), nrow = 2) +
1271
-          #   geom_density(aes(color = classNew), color = "black",  linetype = "dotted", alpha = 1) + 
1272
-          #   xlab(xlabel) + ylab("Density") +  theme_bw() +
1273
-          #   scale_color_manual(newColName, values = mycolors, labels = var.label, drop = FALSE ) +
1274
-          #   theme(axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.8)), 
1275
-          #         legend.text=element_text(size=rel(0.6)), legend.position = "none")
1276
-          
1277
-          
1278
-          # Ratios for r+ / r-
1279
-          freq =  dataCur %>%
1280
-              dplyr::group_by(class, .data[[newColName]], peak_gene.p.raw.class, r_positive) %>%
1281
-              dplyr::summarise(n = dplyr::n()) %>%
1282
-              dplyr::ungroup() %>%
1283
-              tidyr::complete(class, .data[[newColName]], peak_gene.p.raw.class, r_positive, fill = list(n = 0)) %>% # SOme cases might be missing
1284
-              dplyr::group_by(class, .data[[newColName]], peak_gene.p.raw.class) %>% # dont group by r_positive because we want to calculate the ratio within each group
1285
-              dplyr::mutate(  
1286
-                  n = .data$n + 1, # to allow ratios to be computed even for 0 counts
1287
-                  ratio_pos_raw = .data$n[r_positive] / .data$n[!r_positive]) %>%
1288
-              dplyr::filter(r_positive, class == names(dist_class)[1])# Keep only one r_positive row per grouping as we operate via the ratio and this data is duplicated otherwise. Remove random data also because these have been filtered out before are only back due to the complete invokation.
1289
-          
1290
-          # Cap ratios > 10 at 10 to avoid visual artefacts
1291
-          freq$ratio_pos_raw[which(freq$ratio_pos_raw > 10)] = 10
1292
-          
1293
-          # Without proper colors for now, this will be added after the next plot
1294
-          gB3 = ggplot(freq, aes(peak_gene.p.raw.class, ratio_pos_raw, fill = .data[[newColName]])) + 
1295
-              geom_bar(stat = "identity", position="dodge", na.rm = TRUE, width = 0.8) + 
1296
-              geom_hline(yintercept = 1, linetype = "dotted") + 
1297
-              xlab(xlabel) + ylab("Ratio r+ / r- (capped at 10)") +
1298
-              scale_x_discrete(labels = xlabels_peakGene_praw.class) +
1299
-              scale_fill_manual (varCur, values = mycolors, labels = var.label, drop = FALSE )  +
1300
-              theme_bw() +  
1301
-              #theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8), strip.background = element_blank(), strip.placement = "outside", axis.title.y = element_blank()) +
1302
-              # theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8) , axis.title.y = element_blank()) +
1303
-              theme_main +
1304
-              facet_wrap(~ factor(class), nrow = 2, scales = "free_y", strip.position = "left", , labeller = labeller(class=freq_class)) 
1305
-          
1306
-          
1307
-          plots_all = ( ((gA3 | gB3 ) + 
1308
-              patchwork::plot_layout(widths = c(2.5,1.5))) ) + 
1309
-              patchwork::plot_layout(guides = 'collect') +
1310
-              patchwork::plot_annotation(title = mainTitle, theme = theme(plot.title = element_text(hjust = 0.5)))
1311
-          
1312
-          plot(plots_all)
1313
-          
1314
-          
1315
-          # VERSION JUDITH: simplified peak.gene distance + another variable as histogram, no pemuted data
1316
-
1317
-          datasetName = ""
1318
-          if (!is.null(GRN@config$metadata$name)) {
1319
-              datasetName = GRN@config$metadata$name
1320
-          }
1321
-          
1322
-          mainTitle2 = paste0(datasetName, "\nSummary QC (TF: ", TFCur, ", gene type: ", paste0(geneTypesSelected, collapse = "+"), ", ",
1323
-                              .prettyNum(range), " bp promoter range, stratified by distance + ", varCur, ")")
1324
-          
1325
-          # r+ and r-
1326
-          binwidth = 0.1
1327
-          mycolors <- viridis::viridis(2)
1328
-          xlabel = paste0("Correlation raw p-value (", binwidth, " bins)")
1329
-          
1330
-          dataCur = dataCur %>%
1331
-              dplyr::mutate(peak_gene.distance.class250k = factor(dplyr::if_else(peak_gene.distance <= 250000, "<=250k", ">250k"))) %>%
1332
-              dplyr::select(-peak_gene.distance)
1333
-          
1334
-          # closed = "left", boundary = 0, ensure correct numbers. See https://github.com/tidyverse/ggplot2/issues/1739
1335
-          # Without boundary = 0, counts are actually wrong
1336
-          
1337
-          
1338
-          nrows_plot = 2
1339
-          if (length(unique(dataCur[[newColName]])) > 9) {
1340
-              nrows_plot = 3
1341
-          }
1342
-
1343
-          gA5 = ggplot(dataCur, aes(peak_gene.p_raw, fill = r_positive)) + 
1344
-              geom_histogram(binwidth = binwidth, position="dodge", closed = "left", boundary = 0)  +
1345
-              facet_wrap(~ peak_gene.distance.class250k  + .data[[newColName]], nrow = nrows_plot, scales = "free_y") +
1346
-              xlab(xlabel) + ylab(paste0("Abundance for classes with n>=", nGroupsMin)) +  theme_bw() +
1347
-              ggtitle(mainTitle2) + 
1348
-              scale_fill_manual("Class for r", values = mycolors, labels = r_positive_label, drop = FALSE ) +
1349
-              theme(axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.6)), 
1350
-                    legend.text=element_text(size=rel(0.7)))
1351
-    
1352
-          
1353
-          plot(gA5)
1354
-     
1355
-          
1356
-          
1357
-      } #end for each variable
1358
-      
1359
-
1360
-      # DISTANCE FOCUSED #
1361
-
1362
-      
1363
-      # Here, we focus on distance and exclude distance classes with too few points and create a new subset of the data
1364
-      
1365
-      # Filter distance classes with too few points
1366
-      distance_class_abund = table(peakGeneCorrelations.all[indexCur,]$peak_gene.distance_class_abs)
1367
-      indexFilt = which(peakGeneCorrelations.all$peak_gene.distance_class_abs %in% 
1368
-                            names(distance_class_abund)[which(distance_class_abund > 50)])
1369
-      indexFilt = intersect(indexFilt, indexCur)
1370
-      
1371
-      if (length(indexFilt) > 0) {
1372
-          g = ggplot(peakGeneCorrelations.all[indexFilt,], aes(peak_gene.p_raw, color = peak_gene.distance_class_abs)) + geom_density() + 
1373
-              ggtitle(paste0("Density of the raw p-value distributions")) + 
1374
-              facet_wrap(~ r_positive, ncol = 2, labeller = labeler_r_pos) + 
1375
-              scale_color_viridis_d(labels = .classFreq_label(table(peakGeneCorrelations.all[indexFilt,]$peak_gene.distance_class_abs))) +
1376
-              theme_bw()
1377
-          plot(g)
1374
+         # DISTANCE FOCUSED #
1378 1375
          
1379
-  
1380
-          if (includeRobustCols) {
1381
-              g = ggplot(peakGeneCorrelations.all[indexFilt,], aes(peak_gene.p_raw.robust, color = peak_gene.distance_class_abs)) + geom_density() + 
1382
-                  ggtitle(paste0("Density of the raw p-value distributions\n(stratified by whether r is positive)")) + 
1383
-                  facet_wrap(~ r_positive, ncol = 2, labeller = labeler_r_pos) + 
1384
-                  scale_color_viridis_d(labels = .classFreq_label(table(peakGeneCorrelations.all[indexFilt,]$peak_gene.distance_class_abs))) +
1385
-                  theme_bw()
1386
-              plot(g)
1376
+         
1377
+         # Here, we focus on distance and exclude distance classes with too few points and create a new subset of the data
1378
+         
1379
+         # Filter distance classes with too few points
1380
+         distance_class_abund = table(peakGeneCorrelations.all[indexCur,]$peak_gene.distance_class_abs)
1381
+         indexFilt = which(peakGeneCorrelations.all$peak_gene.distance_class_abs %in% 
1382
+                               names(distance_class_abund)[which(distance_class_abund > 50)])
1383
+         indexFilt = intersect(indexFilt, indexCur)
1384
+         
1385
+         if (length(indexFilt) > 0) {
1386
+             g = ggplot(peakGeneCorrelations.all[indexFilt,], aes(peak_gene.p_raw, color = peak_gene.distance_class_abs)) + geom_density() + 
1387
+                 ggtitle(paste0("Density of the raw p-value distributions")) + 
1388
+                 facet_wrap(~ r_positive, ncol = 2, labeller = labeler_r_pos) + 
1389
+                 scale_color_viridis_d(labels = .classFreq_label(table(peakGeneCorrelations.all[indexFilt,]$peak_gene.distance_class_abs))) +
1390
+                 theme_bw()
1391
+             plot(g)
1387 1392
              
1388
- 
1389
-          }
1390
-          
1391
-      } # end if (length(indexFilt) > 0)
1392
-      
1393
-      
1394
-      
1395
-      
1396
-      if (length(indexFilt) > 0) {
1397
-          g = ggplot(peakGeneCorrelations.all[indexFilt,], aes(peak_gene.p_raw, color = r_positive)) + 
1398
-              geom_density() + 
1399
-              ggtitle(paste0("Density of the raw p-value distributions")) + 
1400
-              facet_wrap(~ peak_gene.distance_class_abs,  ncol = 2, labeller = .customLabeler(table(peakGeneCorrelations.all$peak_gene.distance_class_abs))) +
1401
-              scale_color_manual(labels = .classFreq_label(table(peakGeneCorrelations.all[indexFilt,]$r_positive)), values = r_pos_class) +
1402
-              theme_bw()
1403
-          plot(g)
1404
-  
1405
-          
1406
-      }
1407
-      
1408
-      
1409
-      #
1410
-      # Focus on peak_gene.r
1411
-      #
1412
-      
1413
-      g = ggplot(peakGeneCorrelations.all[indexCur,], aes(peak_gene.r, color = peak_gene.distance_class_abs)) + geom_density() + 
1414
-          geom_density(data = peakGeneCorrelations.all[indexCur,], aes(peak_gene.r), color = "black") + 
1415
-          ggtitle(paste0("Density of the correlation coefficients")) + 
1416
-          scale_color_viridis_d(labels = .classFreq_label(table(peakGeneCorrelations.all[indexCur,]$peak_gene.distance_class_abs))) +
1417
-          theme_bw()
1418
-      plot(g)
1419
-      
1420
- 
1421
-      if (!is.null(fileBase)) {
1422
-           grDevices::dev.off()
1423
-      }
1424
-      
1425
-      
1426
-      
1427
-  }
1428
-  
1429
-  
1430
-  
1393
+             
1394
+             if (includeRobustCols) {
1395
+                 g = ggplot(peakGeneCorrelations.all[indexFilt,], aes(peak_gene.p_raw.robust, color = peak_gene.distance_class_abs)) + geom_density() + 
1396
+                     ggtitle(paste0("Density of the raw p-value distributions\n(stratified by whether r is positive)")) + 
1397
+                     facet_wrap(~ r_positive, ncol = 2, labeller = labeler_r_pos) + 
1398
+                     scale_color_viridis_d(labels = .classFreq_label(table(peakGeneCorrelations.all[indexFilt,]$peak_gene.distance_class_abs))) +
1399
+                     theme_bw()
1400
+                 plot(g)
1401
+                 
1402
+                 
1403
+             }
1404
+             
1405
+         } # end if (length(indexFilt) > 0)
1406
+         
1407
+         
1408
+         
1409
+         
1410
+         if (length(indexFilt) > 0) {
1411
+             g = ggplot(peakGeneCorrelations.all[indexFilt,], aes(peak_gene.p_raw, color = r_positive)) + 
1412
+                 geom_density() + 
1413
+                 ggtitle(paste0("Density of the raw p-value distributions")) + 
1414
+                 facet_wrap(~ peak_gene.distance_class_abs,  ncol = 2, labeller = .customLabeler(table(peakGeneCorrelations.all$peak_gene.distance_class_abs))) +
1415
+                 scale_color_manual(labels = .classFreq_label(table(peakGeneCorrelations.all[indexFilt,]$r_positive)), values = r_pos_class) +
1416
+                 theme_bw()
1417
+             plot(g)
1418
+             
1419
+             
1420
+         }
1421
+         
1422
+         
1423
+         #
1424
+         # Focus on peak_gene.r
1425
+         #
1426
+         
1427
+         g = ggplot(peakGeneCorrelations.all[indexCur,], aes(peak_gene.r, color = peak_gene.distance_class_abs)) + geom_density() + 
1428
+             geom_density(data = peakGeneCorrelations.all[indexCur,], aes(peak_gene.r), color = "black") + 
1429
+             ggtitle(paste0("Density of the correlation coefficients")) + 
1430
+             scale_color_viridis_d(labels = .classFreq_label(table(peakGeneCorrelations.all[indexCur,]$peak_gene.distance_class_abs))) +
1431
+             theme_bw()
1432
+         plot(g)
1433
+         
1434
+         
1435
+         if (!is.null(fileBase)) {
1436
+             grDevices::dev.off()
1437
+         }
1438
+         
1439
+         
1440
+     }
1441
+     
1442
+ } else {
1443
+     futile.logger::flog.info(paste0("All output files already exist. Set forceRerun = TRUE to regenerate and overwrite."))
1444
+     
1445
+ }
1431 1446
   
1447
+
1432 1448
   
1433 1449
   .printExecutionTime(start)
1434 1450
   
... ...
@@ -1554,14 +1570,14 @@ plotDiagnosticPlots_peakGene <- function(GRN,
1554 1570
 #' Plot various network connectivity summaries for a \code{\linkS4class{GRN}} object
1555 1571
 #'
1556 1572
 #' @template GRN 
1557
-#' @param type Character. Either "heatmap" or "boxplot". Default "heatmap". Which plot type to produce?
1573
+#' @param type Character. Either \code{"heatmap"} or \code{"boxplot"}. Default \code{"heatmap"}. Which plot type to produce?
1558 1574
 #' @template outputFolder
1559 1575
 #' @template basenameOutput
1560 1576
 #' @template plotAsPDF
1561 1577
 #' @template pdf_width
1562 1578
 #' @template pdf_height
1563 1579
 #' @template forceRerun
1564
-#' @return The same \code{\linkS4class{GRN}} object, without modifications. In addition, for the specified \code{type}, a PDF file (default filename is GRN.connectionSummary_{type}.pdf) is produced with a connection summary. We refer to the Vignettes for details and further explanations.
1580
+#' @return The same \code{\linkS4class{GRN}} object, without modifications. In addition, for the specified \code{type}, a PDF file (default filename is \code{GRN.connectionSummary_{type}.pdf}) is produced with a connection summary.
1565 1581
 #' @examples 
1566 1582
 #' # See the Workflow vignette on the GRaNIE website for examples
1567 1583
 #' GRN = loadExampleObject()
... ...
@@ -2051,10 +2067,10 @@ plotGeneralGraphStats <- function(GRN, outputFolder = NULL, basenameOutput = NUL
2051 2067
 #' @template pdf_width
2052 2068
 #' @template pdf_height
2053 2069
 #' @template forceRerun
2054
-#' @param ontology Character. NULL or vector of ontology names. Default NULL. Vector of ontologies to plot. The results must have been previously calculated otherwise an error is thrown.
2070
+#' @param ontology Character. \code{NULL} or vector of ontology names. Default \code{NULL}. Vector of ontologies to plot. The results must have been previously calculated otherwise an error is thrown.
2055 2071
 #' @param p Numeric. Default 0.05. p-value threshold to determine significance.
2056 2072
 #' @param topn_pvalue Numeric. Default 30. Maximum number of ontology terms that meet the p-value significance threshold to display in the enrichment dot plot
2057
-#' @param display_pAdj Boolean. Default FALSE. Is the p-value being displayed in the plots the adjusted p-value? This parameter is relevant for KEGG, Disease Ontology, and Reactome enrichments, and does not affect GO enrichments.
2073
+#' @param display_pAdj \code{TRUE} or \code{FALSE}. Default \code{FALSE}. Is the p-value being displayed in the plots the adjusted p-value? This parameter is relevant for KEGG, Disease Ontology, and Reactome enrichments, and does not affect GO enrichments.
2058 2074
 #' @return The same \code{\linkS4class{GRN}} object, without modifications. A single PDF file is produced with the results.
2059 2075
 #' @examples 
2060 2076
 #' # See the Workflow vignette on the GRaNIE website for examples
... ...
@@ -2225,8 +2241,8 @@ plotGeneralEnrichment <- function(GRN, outputFolder = NULL, basenameOutput = NUL
2225 2241
 #' @template pdf_width
2226 2242
 #' @template pdf_height
2227 2243
 #' @template basenameOutput
2228
-#' @param display Character. Default "byRank". One of: "byRank", "byLabel". Specify whether the communities will by displayed based on their rank, where the largest community (with most vertices) would have a rank of 1, or by their label. Note that the label is independent of the rank.
2229
-#' @param communities Numeric vector. Default c(1:10). Depending on what was specified in the \code{display} parameter, this parameter would indicate either the rank or the label of the communities to be plotted. i.e. for \code{communities} = c(1,4), if \code{display} = "byRank" the results for the first and fourth largest communities will be plotted. if \code{display} = "byLabel", the results for the communities labeled "1", and "4" will be plotted. If set to NULL, all communities will be plotted
2244
+#' @inheritParams plotCommunitiesEnrichment
2245
+#' @param communities Numeric vector. Default \code{seq_len(10)}. Depending on what was specified in the \code{display} parameter, this parameter would indicate either the rank or the label of the communities to be plotted. i.e. for \code{communities = c(1,4)}, if \code{display = "byRank"} the results for the first and fourth largest communities will be plotted. if \code{display = "byLabel"}, the results for the communities labeled \code{"1"}, and \code{"4"} will be plotted. If set to \code{NULL}, all communities will be plotted
2230 2246
 #' @template forceRerun
2231 2247
 #' @param topnGenes Integer. Default 20. Number of genes to plot, sorted by their rank or label.
2232 2248
 #' @param topnTFs Integer. Default 20. Number of TFs to plot, sorted by their rank or label.
... ...
@@ -2397,8 +2413,8 @@ plotCommunitiesStats <- function(GRN, outputFolder = NULL, basenameOutput = NULL
2397 2413
 #' Similarly to \code{\link{plotGeneralEnrichment}}, the results of the community-based enrichment analysis are plotted.. By default, the results for the 10 largest communities are displayed. Additionally, if a general enrichment analysis was previously generated, this function plots an additional heatmap to compare the general enrichment with the community based enrichment. A reduced version of this heatmap is also produced where terms are filtered out to improve visibility and display and highlight the most significant terms.
2398 2414
 #' 
2399 2415
 #' @inheritParams plotGeneralEnrichment
2400
-#' @param display Character. Default "byRank". One of: "byRank", "byLabel". Specify whether the communities will by displayed based on their rank, where the largest community (with most vertices) would have a rank of 1, or by their label. Note that the label is independent of the rank.
2401
-#' @param communities NULL or Numeric vector. Default NULL. If set to NULL, the default, all communities enrichments that have been calculated before are plotted. If a numeric vector is specified: Depending on what was specified in the \code{display} parameter, this parameter indicates either the rank or the label of the communities to be plotted. i.e. for \code{communities} = c(1,4), if \code{display} = "byRank" the results for the first and fourth largest communities are plotted. if \code{display} = "byLabel", the results for the communities labeled "1", and "4" are plotted. 
2416
+#' @param display Character. Default \code{"byRank"}. One of: \code{"byRank"}, \code{"byLabel"}. Specify whether the communities will by displayed based on their rank, where the largest community (with most vertices) would have a rank of 1, or by their label. Note that the label is independent of the rank.
2417
+#' @param communities \code{NULL} or numeric vector. Default \code{NULL}. If set to \code{NULL}, the default, all communities enrichments that have been calculated before are plotted. If a numeric vector is specified: Depending on what was specified in the \code{display} parameter, this parameter indicates either the rank or the label of the communities to be plotted. i.e. for \code{communities = c(1,4)}, if \code{display = "byRank"} the results for the first and fourth largest communities are plotted. if \code{display = "byLabel"}, the results for the communities labeled \code{"1"}, and \code{"4"} are plotted. 
2402 2418
 #' @param nSignificant Numeric. Default 3. Threshold to filter out an ontology term with less than \code{nSignificant} overlapping genes. 
2403 2419
 #' @param nID Numeric. Default 10. For the reduced heatmap, number of top terms to select per community.
2404 2420
 #' @param maxWidth_nchar_plot Integer (>=10). Default 100. Maximum number of characters for a term before it is truncated.
... ...
@@ -18,7 +18,7 @@ AR_classification_wrapper(
18 18
 \arguments{
19 19
 \item{GRN}{Object of class \code{\linkS4class{GRN}}}
20 20
 
21
-\item{significanceThreshold_Wilcoxon}{Numeric between 0 and 1. Default 0.05. Significance threshold for Wilcoxon test that is run in the end for the final classification. See the Vignette and diffTF paper for details.}
21
+\item{significanceThreshold_Wilcoxon}{Numeric between 0 and 1. Default 0.05. Significance threshold for Wilcoxon test that is run in the end for the final classification. See the Vignette and *diffTF* paper for details.}
22 22
 
23 23
 \item{plot_minNoTFBS_heatmap}{Integer. Default 100. Minimum number of TFBS for a TF to be included in the heatmap that is part of the output of this function.}
24 24
 
... ...
@@ -33,7 +33,7 @@ AR_classification_wrapper(
33 33
 \item{forceRerun}{\code{TRUE} or \code{FALSE}. Default \code{FALSE}. Force execution, even if the GRN object already contains the result. Overwrites the old results.}
34 34
 }
35 35
 \value{
36
-The same \code{\linkS4class{GRN}} object, with added data from this function.  TF_classification_densityPlotsForegroundBackground_expression_perm{0,1}.pdf, TF_classification_stringencyThresholds_expression_perm0.pdf, TF_classification_summaryHeatmap_expression_perm0.pdf,
36
+The same \code{\linkS4class{GRN}} object, with added data from this function.
37 37
 }
38 38
 \description{
39 39
 Run the activator-repressor classification for the TFs for a \code{\linkS4class{GRN}} object
... ...
@@ -30,9 +30,9 @@ addConnections_TF_peak(
30 30
 
31 31
 \item{corMethod}{Character. \code{pearson} or \code{spearman}. Default \code{pearson}. Method for calculating the correlation coefficient. See \link{cor} for details.}
32 32
 
33
-\item{connectionTypes}{Character vector. Default \code{expression}. Vector of connection types to include for the TF-peak connections. If an additional connection type is specified here, it has to be available already within the object (EXPERIMENTAL). See the function \code{addData_TFActivity} for details.}
33
+\item{connectionTypes}{Character vector. Default \code{expression}. Vector of connection types to include for the TF-peak connections. If an additional connection type is specified here, it has to be available already within the object (EXPERIMENTAL). See the function \code{\link{addData_TFActivity}} for details.}
34 34
 
35
-\item{removeNegativeCorrelation}{\code{TRUE} or \code{FALSE} (vector). Default \code{FALSE}. EXPERIMENTAL. Must be a logical vector of the same length as the parameter \code{connectionType}. Should negatively correlated TF-peak connections be removed for the specific connection type? For connection type expression, the default is FALSE, while for any TF Activity related connection type, we recommend setting this to \code{TRUE}.}
35
+\item{removeNegativeCorrelation}{Vector of \code{TRUE} or \code{FALSE}. Default \code{FALSE}. EXPERIMENTAL. Must be a logical vector of the same length as the parameter \code{connectionType}. Should negatively correlated TF-peak connections be removed for the specific connection type? For connection type expression, the default is \code{FALSE}, while for any TF Activity related connection type, we recommend setting this to \code{TRUE}.}
36 36
 
37 37
 \item{maxFDRToStore}{Numeric. Default 0.3. Maximum TF-peak FDR value to permanently store a particular TF-peak connection in the object? This parameter has a large influence on the overall memory size of the object, and we recommend not storing connections with a high FDR due to their sheer number.}
38 38
 
... ...
@@ -45,7 +45,7 @@ addConnections_TF_peak(
45 45
 \item{forceRerun}{\code{TRUE} or \code{FALSE}. Default \code{FALSE}. Force execution, even if the GRN object already contains the result. Overwrites the old results.}
46 46
 }
47 47
 \value{
48
-The same \code{\linkS4class{GRN}} object, with added data from this function.  TF_peak.fdrCurves_perm{o,1}.pdf
48
+The same \code{\linkS4class{GRN}} object, with added data from this function.
49 49
 }
50 50
 \description{
51 51
 Add TF-peak connections to a \code{\linkS4class{GRN}} object
... ...
@@ -21,19 +21,19 @@ addConnections_peak_gene(
21 21
 \arguments{
22 22
 \item{GRN}{Object of class \code{\linkS4class{GRN}}}
23 23
 
24
-\item{overlapTypeGene}{Character. "TSS" or "full". Default "TSS". If set to "TSS", only the TSS of the gene is used as reference for finding genes in the neighborhood of a peak. If set to "full", the whole annotated gene (including all exons and introns) is used instead.}
24
+\item{overlapTypeGene}{Character. \code{"TSS"} or \code{"full"}. Default \code{"TSS"}. If set to \code{"TSS"}, only the TSS of the gene is used as reference for finding genes in the neighborhood of a peak. If set to \code{"full"}, the whole annotated gene (including all exons and introns) is used instead.}
25 25
 
26 26
 \item{corMethod}{Character. \code{pearson} or \code{spearman}. Default \code{pearson}. Method for calculating the correlation coefficient. See \link{cor} for details.}
27 27
 
28 28
 \item{promoterRange}{Integer >=0. Default 250000. The size of the neighborhood in bp to correlate peaks and genes in vicinity. Only peak-gene pairs will be correlated if they are within the specified range. Increasing