... | ... |
@@ -3,9 +3,9 @@ |
3 | 3 |
#' Initialize a \code{\linkS4class{GRN}} object |
4 | 4 |
#' |
5 | 5 |
#' @export |
6 |
-#' @param objectMetadata List. Default list(). Optional (named) list with an arbitrary number of elements, all of which capture metadata for the object. This is mainly used to distinguish GRN objects from one another by storing object-specific metadata along with the data. |
|
6 |
+#' @param objectMetadata List. Default \code{list()}. Optional (named) list with an arbitrary number of elements, all of which capture metadata for the object. This is mainly used to distinguish GRN objects from one another by storing object-specific metadata along with the data. |
|
7 | 7 |
#' @param outputFolder Absolute file path. No default. Default output folder where all pipeline output will be put unless specified otherwise. We recommend specifying an absolute path. Note that for Windows-based systems, the path must be correctly specified with "/" as path separator. |
8 |
-#' @param genomeAssembly Character. No default. The genome assembly of all data that to be used within this object. Currently, supported genomes are: \code{hg19}, \code{hg38}, and \code{mm10} |
|
8 |
+#' @param genomeAssembly Character. No default. The genome assembly of all data that to be used within this object. Currently, supported genomes are: \code{hg19}, \code{hg38}, and \code{mm10}. |
|
9 | 9 |
#' @return Empty \code{\linkS4class{GRN}} object |
10 | 10 |
#' @examples |
11 | 11 |
#' meta.l = list(name = "exampleName", date = "01.03.22") |
... | ... |
@@ -101,14 +101,14 @@ initializeGRN <- function(objectMetadata = list(), |
101 | 101 |
#' |
102 | 102 |
#' @export |
103 | 103 |
#' @template GRN |
104 |
-#' @param counts_peaks Data frame. No default. Counts for the peaks, with raw or normalized counts for each peak (rows) across all samples (columns). In addition to the count data, it must also contain one ID column with a particular format, see the argument *idColumn_peaks* below. Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table. |
|
105 |
-#' @param normalization_peaks Character. Default "DESeq_sizeFactor". Normalization procedure for peak data. Must be one of "DESeq_sizeFactor", "none", or "quantile" |
|
106 |
-#' @param idColumn_peaks Character. Default "peakID". Name of the column in the counts_peaks data frame that contains peak IDs. The required format must be "{chr}:{start}-{end}", with {chr} denoting the abbreviated chromosome name, and {start} and {end} the begin and end of the peak coordinates, respectively. End must be bigger than start. Examples for valid peak IDs are chr1:400-800 or chrX:20-25. |
|
107 |
-#' @param counts_rna Data frame. No default. Counts for the RNA-seq data, with raw or normalized counts for each gene (rows) across all samples (columns). In addition to the count data, it must also contain one ID column with a particular format, see the argument *idColumn_rna* below. Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table. |
|
108 |
-#' @param normalization_rna Character. Default "quantile". Normalization procedure for peak data. Must be one of "DESeq_sizeFactor", "none", or "quantile" |
|
109 |
-#' @param idColumn_RNA Character. Default "ENSEMBL". Name of the column in the counts_rna data frame that contains Ensembl IDs. |
|
110 |
-#' @param sampleMetadata Data frame. Default NULL. Optional, additional metadata for the samples, such as age, sex, gender etc. If provided, the @seealso [plotPCA_all()] function can then incorporate and plot it. Sample names must match with those from both peak and RNA-Seq data. The first column is expected to contain the sample IDs, the actual column name is irrelevant. |
|
111 |
-#' @param allowOverlappingPeaks \code{TRUE} or \code{FALSE}. Default \code{FALSE}. Should overlapping peaks be allowed (then only a warning is issued when overlapping peaks are found) or (the default) should an error be raised? |
|
104 |
+#' @param counts_peaks Data frame. No default. Counts for the peaks, with raw or normalized counts for each peak (rows) across all samples (columns). In addition to the count data, it must also contain one ID column with a particular format, see the argument \code{idColumn_peaks} below. Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table. |
|
105 |
+#' @param normalization_peaks Character. Default \code{DESeq_sizeFactor}. Normalization procedure for peak data. Must be one of \code{DESeq_sizeFactor}, \code{none}, or \code{quantile}. |
|
106 |
+#' @param idColumn_peaks Character. Default \code{peakID}. Name of the column in the counts_peaks data frame that contains peak IDs. The required format must be {chr}:{start}-{end}", with {chr} denoting the abbreviated chromosome name, and {start} and {end} the begin and end of the peak coordinates, respectively. End must be bigger than start. Examples for valid peak IDs are \code{chr1:400-800} or \code{chrX:20-25}. |
|
107 |
+#' @param counts_rna Data frame. No default. Counts for the RNA-seq data, with raw or normalized counts for each gene (rows) across all samples (columns). In addition to the count data, it must also contain one ID column with a particular format, see the argument \code{idColumn_rna} below. Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table. |
|
108 |
+#' @param normalization_rna Character. Default \code{quantile}. Normalization procedure for peak data. Must be one of "DESeq_sizeFactor", "none", or "quantile" |
|
109 |
+#' @param idColumn_RNA Character. Default \code{ENSEMBL}. Name of the column in the \code{counts_rna} data frame that contains Ensembl IDs. |
|
110 |
+#' @param sampleMetadata Data frame. Default \code{NULL}. Optional, additional metadata for the samples, such as age, sex, gender etc. If provided, the @seealso [plotPCA_all()] function can then incorporate and plot it. Sample names must match with those from both peak and RNA-Seq data. The first column is expected to contain the sample IDs, the actual column name is irrelevant. |
|
111 |
+#' @param allowOverlappingPeaks \code{TRUE} or \code{FALSE}. Default \code{FALSE}. Should overlapping peaks be allowed (then only a warning is issued when overlapping peaks are found) or (the default) should an error be raised? |
|
112 | 112 |
#' @template forceRerun |
113 | 113 |
#' @return The same \code{\linkS4class{GRN}} object, with added data from this function. |
114 | 114 |
#' @examples |
... | ... |
@@ -117,9 +117,9 @@ initializeGRN <- function(objectMetadata = list(), |
117 | 117 |
#' # rna.df = read_tsv("https://www.embl.de/download/zaugg/GRaNIE/rna.tsv.gz") |
118 | 118 |
#' # peaks.df = read_tsv("https://www.embl.de/download/zaugg/GRaNIE/peaks.tsv.gz") |
119 | 119 |
#' # meta.df = read_tsv("https://www.embl.de/download/zaugg/GRaNIE/sampleMetadata.tsv.gz") |
120 |
-#' GRN = loadExampleObject() |
|
121 |
-#' # We omit sampleMetadata = meta.df here, lines becomes too long otherwise |
|
122 |
-#' GRN = addData(GRN, counts_peaks = peaks.df, counts_rna = rna.df, forceRerun = FALSE) |
|
120 |
+#' # GRN = loadExampleObject() |
|
121 |
+#' # We omit sampleMetadata = meta.df in the following line, becomes too long otherwise |
|
122 |
+#' # GRN = addData(GRN, counts_peaks = peaks.df, counts_rna = rna.df, forceRerun = FALSE) |
|
123 | 123 |
|
124 | 124 |
addData <- function(GRN, counts_peaks, normalization_peaks = "DESeq_sizeFactor", idColumn_peaks = "peakID", |
125 | 125 |
counts_rna, normalization_rna = "quantile", idColumn_RNA = "ENSEMBL", sampleMetadata = NULL, |
... | ... |
@@ -591,7 +591,7 @@ addData <- function(GRN, counts_peaks, normalization_peaks = "DESeq_sizeFactor", |
591 | 591 |
#' @param maxNormalizedMean_peaks Numeric or \code{NULL}. Default \code{NULL}. Maximum mean across all samples for a peak to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter. |
592 | 592 |
#' @param minNormalizedMeanRNA Numeric or \code{NULL}. Default 5. Minimum mean across all samples for a gene to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter. |
593 | 593 |
#' @param maxNormalizedMeanRNA Numeric or \code{NULL}. Default \code{NULL}. Maximum mean across all samples for a gene to be retained for the normalized counts table. Set to \code{NULL} for not applying the filter. |
594 |
-#' @param chrToKeep_peaks Character vector. Default c(paste0("chr", 1:22), "chrX", "chrY"). Vector of chromosomes that peaks are allowed to come from. This filter can be used to filter sex chromosomes from the peaks, for example. |
|
594 |
+#' @param chrToKeep_peaks Character vector. Default \code{c(paste0("chr", 1:22), "chrX", "chrY")}. Vector of chromosomes that peaks are allowed to come from. This filter can be used to filter sex chromosomes from the peaks, for example. |
|
595 | 595 |
#' @param minSize_peaks Integer or \code{NULL}. Default \code{NULL}. Minimum peak size (width, end - start) for a peak to be retained. Set to \code{NULL} for not applying the filter. |
596 | 596 |
#' @param maxSize_peaks Integer or \code{NULL}. Default 10000. Maximum peak size (width, end - start) for a peak to be retained. Set to \code{NULL} for not applying the filter. |
597 | 597 |
#' @param minCV_peaks Numeric or \code{NULL}. Default \code{NULL}. Minimum CV (coefficient of variation, a unitless measure of variation) for a peak to be retained. Set to \code{NULL} for not applying the filter. |
... | ... |
@@ -845,9 +845,9 @@ filterData <- function (GRN, |
845 | 845 |
#' |
846 | 846 |
#' @template GRN |
847 | 847 |
#' @param motifFolder Character. No default. Path to the folder that contains the TFBS predictions. The files must be in BED format, 6 columns, one file per TF. See the other parameters for more details. |
848 |
-#' @param TFs Character vector. Default "all". Vector of TF names to include. The special keyword "all" can be used to include all TF found in the folder as specified by motifFolder. If "all" is specified anywhere, all TFs will be included. TF names must otherwise match the file names that are found in the folder, without the file suffix. |
|
849 |
-#' @param filesTFBSPattern Character. Default "_TFBS". Suffix for the file names in the TFBS folder that is not part of the TF name. Can be empty. For example, for the TF CTCF, if the file is called CTCF.all.TFBS.bed, set this parameter to ".all.TFBS". |
|
850 |
-#' @param fileEnding Character. Default ".bed". File ending for the files from the motif folder. |
|
848 |
+#' @param TFs Character vector. Default \code{all}. Vector of TF names to include. The special keyword \code{all} can be used to include all TF found in the folder as specified by \code{motifFolder}. If \code{all} is specified anywhere, all TFs will be included. TF names must otherwise match the file names that are found in the folder, without the file suffix. |
|
849 |
+#' @param filesTFBSPattern Character. Default \code{"_TFBS"}. Suffix for the file names in the TFBS folder that is not part of the TF name. Can be empty. For example, for the TF CTCF, if the file is called \code{CTCF.all.TFBS.bed}, set this parameter to \code{".all.TFBS"}. |
|
850 |
+#' @param fileEnding Character. Default \code{".bed"}. File ending for the files from the motif folder. |
|
851 | 851 |
#' @param nTFMax \code{NULL} or integer. Default \code{NULL}. Maximal number of TFs to import. Can be used for testing purposes, e.g., setting to 5 only imports 5 TFs even though the whole \code{motifFolder} has many more TFs defined. |
852 | 852 |
#' @template forceRerun |
853 | 853 |
#' @return The same \code{\linkS4class{GRN}} object, with added data from this function. |
... | ... |
@@ -1496,14 +1496,14 @@ importTFData <- function(GRN, data, name, idColumn = "ENSEMBL", nameColumn = "TF |
1496 | 1496 |
#' Run the activator-repressor classification for the TFs for a \code{\linkS4class{GRN}} object |
1497 | 1497 |
#' |
1498 | 1498 |
#' @template GRN |
1499 |
-#' @param significanceThreshold_Wilcoxon Numeric between 0 and 1. Default 0.05. Significance threshold for Wilcoxon test that is run in the end for the final classification. See the Vignette and diffTF paper for details. |
|
1499 |
+#' @param significanceThreshold_Wilcoxon Numeric between 0 and 1. Default 0.05. Significance threshold for Wilcoxon test that is run in the end for the final classification. See the Vignette and *diffTF* paper for details. |
|
1500 | 1500 |
#' @param plot_minNoTFBS_heatmap Integer. Default 100. Minimum number of TFBS for a TF to be included in the heatmap that is part of the output of this function. |
1501 | 1501 |
#' @param deleteIntermediateData \code{TRUE} or \code{FALSE}. Default \code{TRUE}. Should intermediate data be deleted before returning the object after a successful run? Due to the size of the produced intermediate data, we recommend setting this to \code{TRUE}, but if memory or object size are not an issue, the information can also be kept. |
1502 | 1502 |
#' @template plotDiagnosticPlots |
1503 | 1503 |
#' @template outputFolder |
1504 | 1504 |
#' @template corMethod |
1505 | 1505 |
#' @template forceRerun |
1506 |
-#' @return The same \code{\linkS4class{GRN}} object, with added data from this function. TF_classification_densityPlotsForegroundBackground_expression_perm{0,1}.pdf, TF_classification_stringencyThresholds_expression_perm0.pdf, TF_classification_summaryHeatmap_expression_perm0.pdf, |
|
1506 |
+#' @return The same \code{\linkS4class{GRN}} object, with added data from this function. |
|
1507 | 1507 |
#' @examples |
1508 | 1508 |
#' # See the Workflow vignette on the GRaNIE website for examples |
1509 | 1509 |
#' GRN = loadExampleObject() |
... | ... |
@@ -1631,7 +1631,7 @@ AR_classification_wrapper<- function (GRN, significanceThreshold_Wilcoxon = 0.05 |
1631 | 1631 |
corMethod, |
1632 | 1632 |
fileCur, width = 5, height = 5) |
1633 | 1633 |
} else { |
1634 |
- futile.logger::flog.info(paste0("File ", fileCur, " already exists, not overwriting since forceRerun = FALSE")) |
|
1634 |
+ futile.logger::flog.info(paste0(" File ", fileCur, " already exists, not overwriting since forceRerun = FALSE")) |
|
1635 | 1635 |
} |
1636 | 1636 |
|
1637 | 1637 |
fileCur = paste0(outputFolder, .getOutputFileName("plot_class_medianClass"), "_", connectionTypeCur, suffixFile) |
... | ... |
@@ -1644,7 +1644,7 @@ AR_classification_wrapper<- function (GRN, significanceThreshold_Wilcoxon = 0.05 |
1644 | 1644 |
corMethod = corMethod, |
1645 | 1645 |
file = fileCur, width = 4, height = 8) |
1646 | 1646 |
} else { |
1647 |
- futile.logger::flog.info(paste0("File ", fileCur, " already exists, not overwriting since forceRerun = FALSE")) |
|
1647 |
+ futile.logger::flog.info(paste0(" File ", fileCur, " already exists, not overwriting since forceRerun = FALSE")) |
|
1648 | 1648 |
} |
1649 | 1649 |
|
1650 | 1650 |
fileCur = paste0(outputFolder, .getOutputFileName("plot_class_densityClass"), "_", connectionTypeCur, suffixFile) |
... | ... |
@@ -1661,7 +1661,7 @@ AR_classification_wrapper<- function (GRN, significanceThreshold_Wilcoxon = 0.05 |
1661 | 1661 |
finalClassification = GRN@data$TFs$classification[[permIndex]] [[connectionTypeCur]]$TF.classification, |
1662 | 1662 |
file = fileCur, width = 8, height = 15) |
1663 | 1663 |
} else { |
1664 |
- futile.logger::flog.info(paste0("File ", fileCur, " already exists, not overwriting since forceRerun = FALSE")) |
|
1664 |
+ futile.logger::flog.info(paste0(" File ", fileCur, " already exists, not overwriting since forceRerun = FALSE")) |
|
1665 | 1665 |
} |
1666 | 1666 |
} |
1667 | 1667 |
|
... | ... |
@@ -1714,14 +1714,14 @@ AR_classification_wrapper<- function (GRN, significanceThreshold_Wilcoxon = 0.05 |
1714 | 1714 |
#' @template plotDetails |
1715 | 1715 |
#' @template outputFolder |
1716 | 1716 |
#' @template corMethod |
1717 |
-#' @param connectionTypes Character vector. Default \code{expression}. Vector of connection types to include for the TF-peak connections. If an additional connection type is specified here, it has to be available already within the object (EXPERIMENTAL). See the function \code{addData_TFActivity} for details. |
|
1718 |
-#' @param removeNegativeCorrelation \code{TRUE} or \code{FALSE} (vector). Default \code{FALSE}. EXPERIMENTAL. Must be a logical vector of the same length as the parameter \code{connectionType}. Should negatively correlated TF-peak connections be removed for the specific connection type? For connection type expression, the default is FALSE, while for any TF Activity related connection type, we recommend setting this to \code{TRUE}. |
|
1717 |
+#' @param connectionTypes Character vector. Default \code{expression}. Vector of connection types to include for the TF-peak connections. If an additional connection type is specified here, it has to be available already within the object (EXPERIMENTAL). See the function \code{\link{addData_TFActivity}} for details. |
|
1718 |
+#' @param removeNegativeCorrelation Vector of \code{TRUE} or \code{FALSE}. Default \code{FALSE}. EXPERIMENTAL. Must be a logical vector of the same length as the parameter \code{connectionType}. Should negatively correlated TF-peak connections be removed for the specific connection type? For connection type expression, the default is \code{FALSE}, while for any TF Activity related connection type, we recommend setting this to \code{TRUE}. |
|
1719 | 1719 |
#' @param maxFDRToStore Numeric. Default 0.3. Maximum TF-peak FDR value to permanently store a particular TF-peak connection in the object? This parameter has a large influence on the overall memory size of the object, and we recommend not storing connections with a high FDR due to their sheer number. |
1720 | 1720 |
#' @param useGCCorrection \code{TRUE} or \code{FALSE}. Default \code{FALSE}. EXPERIMENTAL. Should a GC-matched background be used when calculating FDRs? |
1721 | 1721 |
#' @param percBackground_size Numeric (0 to 100). Default 75. EXPERIMENTAL. Description will follow. Only relevant if \code{useGCCorrection} is set to \code{TRUE}, ignored otherwise. |
1722 | 1722 |
#' @param percBackground_resample \code{TRUE} or \code{FALSE}. Default \code{TRUE}. EXPERIMENTAL. Should resampling be enabled for those GC bins for which not enough background peaks are available?. Only relevant if \code{useGCCorrection} is set to \code{TRUE}, ignored otherwise. |
1723 | 1723 |
#' @template forceRerun |
1724 |
-#' @return The same \code{\linkS4class{GRN}} object, with added data from this function. TF_peak.fdrCurves_perm{o,1}.pdf |
|
1724 |
+#' @return The same \code{\linkS4class{GRN}} object, with added data from this function. |
|
1725 | 1725 |
#' @examples |
1726 | 1726 |
#' # See the Workflow vignette on the GRaNIE website for examples |
1727 | 1727 |
#' GRN = loadExampleObject() |
... | ... |
@@ -2257,13 +2257,13 @@ addConnections_TF_peak <- function (GRN, plotDiagnosticPlots = TRUE, plotDetails |
2257 | 2257 |
#' |
2258 | 2258 |
#' @export |
2259 | 2259 |
#' @template GRN |
2260 |
-#' @param overlapTypeGene Character. "TSS" or "full". Default "TSS". If set to "TSS", only the TSS of the gene is used as reference for finding genes in the neighborhood of a peak. If set to "full", the whole annotated gene (including all exons and introns) is used instead. |
|
2260 |
+#' @param overlapTypeGene Character. \code{"TSS"} or \code{"full"}. Default \code{"TSS"}. If set to \code{"TSS"}, only the TSS of the gene is used as reference for finding genes in the neighborhood of a peak. If set to \code{"full"}, the whole annotated gene (including all exons and introns) is used instead. |
|
2261 | 2261 |
#' @template corMethod |
2262 | 2262 |
#' @param promoterRange Integer >=0. Default 250000. The size of the neighborhood in bp to correlate peaks and genes in vicinity. Only peak-gene pairs will be correlated if they are within the specified range. Increasing this value leads to higher running times and more peak-gene pairs to be associated, while decreasing results in the opposite. |
2263 |
-#' @param TADs Data frame with TAD domains. Default NULL. If provided, the neighborhood of a peak is defined by the TAD domain the peak is in rather than a fixed-sized neighborhood. The expected format is a BED-like data frame with at least 3 columns in this particular order: chromosome, start, end, the 4th column is optional and will be taken as ID column. All additional columns as well as column names are ignored. For the first 3 columns, the type is checked as part of a data integrity check. |
|
2263 |
+#' @param TADs Data frame with TAD domains. Default \code{NULL}. If provided, the neighborhood of a peak is defined by the TAD domain the peak is in rather than a fixed-sized neighborhood. The expected format is a BED-like data frame with at least 3 columns in this particular order: chromosome, start, end, the 4th column is optional and will be taken as ID column. All additional columns as well as column names are ignored. For the first 3 columns, the type is checked as part of a data integrity check. |
|
2264 | 2264 |
#' @template nCores |
2265 | 2265 |
#' @template plotDiagnosticPlots |
2266 |
-#' @param plotGeneTypes List of character vectors. Default list(c("all"), c("protein_coding"), c("protein_coding", "lincRNA")). Each list element may consist of one or multiple gene types that are plotted collectively in one PDF. The special keyword "all" denotes all gene types that are found (be aware: this typically contains 20+ gene types, see https://www.gencodegenes.org/pages/biotypes.html for details). |
|
2266 |
+#' @param plotGeneTypes List of character vectors. Default \code{list(c("all"), c("protein_coding"), c("protein_coding", "lincRNA"))}. Each list element may consist of one or multiple gene types that are plotted collectively in one PDF. The special keyword \code{"all"} denotes all gene types that are found (be aware: this typically contains 20+ gene types, see \url{https://www.gencodegenes.org/pages/biotypes.html} for details). |
|
2267 | 2267 |
#' @template outputFolder |
2268 | 2268 |
#' @template addRobustRegression |
2269 | 2269 |
#' @template forceRerun |
... | ... |
@@ -2799,17 +2799,17 @@ addConnections_peak_gene <- function(GRN, overlapTypeGene = "TSS", corMethod = " |
2799 | 2799 |
#' @template gene.types |
2800 | 2800 |
#' @param allowMissingTFs \code{TRUE} or \code{FALSE}. Default \code{FALSE}. Should connections be returned for which the TF is NA (i.e., connections consisting only of peak-gene links?). If set to \code{TRUE}, this generally greatly increases the number of connections but it may not be what you aim for. |
2801 | 2801 |
#' @param allowMissingGenes \code{TRUE} or \code{FALSE}. Default \code{TRUE}. Should connections be returned for which the gene is NA (i.e., connections consisting only of TF-peak links?). If set to \code{TRUE}, this generally increases the number of connections. |
2802 |
-#' @param peak_gene.r_range Numeric(2) . Default c(0,1). Filter for lower and upper limit for the peak-gene links. Only links will be retained if the correlation coefficient is within the specified interval. This filter is usually used to filter out negatively correlated peak-gene links. |
|
2803 |
-#' @param peak_gene.selection "all" or "closest". Default "all". Filter for the selection of genes for each peak. If set to "all", all previously identified peak-gene are used, while "closest" only retains the closest gene for each peak that is retained until the point the filter is applied. |
|
2804 |
-#' @param peak_gene.maxDistance Integer >0. Default NULL. Maximum peak-gene distance to retain a peak-gene connection. |
|
2805 |
-#' @param filterTFs Character vector. Default NULL. Vector of TFs (as named in the GRN object) to retain. All TFs not listed will be filtered out. |
|
2806 |
-#' @param filterGenes Character vector. Default NULL. Vector of gene IDs (as named in the GRN object) to retain. All genes not listed will be filtered out. |
|
2807 |
-#' @param filterPeaks Character vector. Default NULL. Vector of peak IDs (as named in the GRN object) to retain. All peaks not listed will be filtered out. |
|
2802 |
+#' @param peak_gene.r_range Numeric(2). Default \code{c(0,1)}. Filter for lower and upper limit for the peak-gene links. Only links will be retained if the correlation coefficient is within the specified interval. This filter is usually used to filter out negatively correlated peak-gene links. |
|
2803 |
+#' @param peak_gene.selection \code{"all"} or \code{"closest"}. Default \code{"all"}. Filter for the selection of genes for each peak. If set to \code{"all"}, all previously identified peak-gene are used, while \code{"closest"} only retains the closest gene for each peak that is retained until the point the filter is applied. |
|
2804 |
+#' @param peak_gene.maxDistance Integer >0. Default \code{NULL}. Maximum peak-gene distance to retain a peak-gene connection. |
|
2805 |
+#' @param filterTFs Character vector. Default \code{NULL}. Vector of TFs (as named in the GRN object) to retain. All TFs not listed will be filtered out. |
|
2806 |
+#' @param filterGenes Character vector. Default \code{NULL}. Vector of gene IDs (as named in the GRN object) to retain. All genes not listed will be filtered out. |
|
2807 |
+#' @param filterPeaks Character vector. Default \code{NULL}. Vector of peak IDs (as named in the GRN object) to retain. All peaks not listed will be filtered out. |
|
2808 | 2808 |
#' @param TF_peak_FDR_selectViaCorBins \code{TRUE} or \code{FALSE}. Default \code{FALSE}. Use a modified procedure for selecting TF-peak links that is based on the user-specified FDR but that retains also links that may have a higher FDR but a more extreme correlation. |
2809 | 2809 |
#' @param silent \code{TRUE} or \code{FALSE}. Default \code{FALSE}. Print progress messages and filter statistics. |
2810 | 2810 |
#' @param filterLoops \code{TRUE} or \code{FALSE}. Default \code{TRUE}. If a TF regulates itself (i.e., the TF and the gene are the same entity), should such loops be filtered from the GRN? |
2811 | 2811 |
#' @template outputFolder |
2812 |
-#' @return The same \code{\linkS4class{GRN}} object, with the filtered and merged TF-peak and peak-gene connections in the slot connections$all.filtered. The filtered |
|
2812 |
+#' @return The same \code{\linkS4class{GRN}} object, with the filtered and merged TF-peak and peak-gene connections in the slot connections$all.filtered. |
|
2813 | 2813 |
#' @examples |
2814 | 2814 |
#' # See the Workflow vignette on the GRaNIE website for examples |
2815 | 2815 |
#' GRN = loadExampleObject() |
... | ... |
@@ -3621,18 +3621,18 @@ addSNPOverlap <- function(grn, SNPData, col_chr = "chr", col_pos = "pos", col_pe |
3621 | 3621 |
|
3622 | 3622 |
#' Generate a summary PDF for the number of connections for a \code{\linkS4class{GRN}} object. |
3623 | 3623 |
#' |
3624 |
-#' Essentially, this functions calls \code{filterGRNAndConnectGenes} repeatedly and stores the total number of connections and other statistics each time to summarize them afterwards. All arguments are identical to the ones in \code{filterGRNAndConnectGenes}, see the help for this function for details. |
|
3624 |
+#' Essentially, this functions calls \code{\link{filterGRNAndConnectGenes}} repeatedly and stores the total number of connections and other statistics each time to summarize them afterwards. All arguments are identical to the ones in \code{\link{filterGRNAndConnectGenes}}, see the help for this function for details. |
|
3625 | 3625 |
#' |
3626 | 3626 |
#' @export |
3627 | 3627 |
#' @template GRN |
3628 |
-#' @param TF_peak.fdr Numeric vector. Default c(0.001, 0.01, 0.05, 0.1, 0.2). TF-peak FDR values to iterate over. |
|
3628 |
+#' @param TF_peak.fdr Numeric vector. Default \code{c(0.001, 0.01, 0.05, 0.1, 0.2)}. TF-peak FDR values to iterate over. |
|
3629 | 3629 |
#' @template TF_peak.connectionTypes |
3630 |
-#' @param peak_gene.fdr Numeric vector. Default c(0.001, 0.01, 0.05, 0.1, 0.2). Peak-gene FDR values to iterate over. |
|
3631 |
-#' @param peak_gene.p_raw Numeric vector. Default NULL. Peak-gene raw p-value values to iterate over. Skipepd if set to NULL. |
|
3632 |
-#' @param peak_gene.r_range Numeric vector of length 2 (minimum -1, maximum 1). Default c(0,1). The correlation range of peak-gene connections to keep. |
|
3630 |
+#' @param peak_gene.fdr Numeric vector. Default \code{c(0.001, 0.01, 0.05, 0.1, 0.2)}. Peak-gene FDR values to iterate over. |
|
3631 |
+#' @param peak_gene.p_raw Numeric vector. Default \code{NULL}. Peak-gene raw p-value values to iterate over. Skipped if set to NULL. |
|
3632 |
+#' @param peak_gene.r_range Numeric vector of length 2 (minimum -1, maximum 1). Default \code{c(0,1)}. The correlation range of peak-gene connections to keep. |
|
3633 | 3633 |
#' @template gene.types |
3634 |
-#' @param allowMissingGenes Logical vector. Default c(FALSE, TRUE). Allow genes to be missing for peak-gene connections? |
|
3635 |
-#' @param allowMissingTFs Logical vector. Default c(FALSE). Allow TFs to be missing for TF-peak connections? |
|
3634 |
+#' @param allowMissingGenes Logical vector. Default \code{c(FALSE, TRUE)}. Allow genes to be missing for peak-gene connections? |
|
3635 |
+#' @param allowMissingTFs Logical vector. Default \code{c(FALSE)}. Allow TFs to be missing for TF-peak connections? |
|
3636 | 3636 |
#' @template forceRerun |
3637 | 3637 |
#' @return The same \code{\linkS4class{GRN}} object, with added data from this function. |
3638 | 3638 |
#' @examples |
... | ... |
@@ -3666,128 +3666,135 @@ generateStatsSummary <- function(GRN, |
3666 | 3666 |
|
3667 | 3667 |
checkmate::assertFlag(forceRerun) |
3668 | 3668 |
|
3669 |
- GRN@stats$connections = .initializeStatsDF() |
|
3670 |
- |
|
3671 |
- if (TF_peak.connectionTypes == "all") { |
|
3672 |
- TF_peak.connectionTypesAllComb =.getAll_TF_peak_connectionTypes(GRN) |
|
3673 |
- } else { |
|
3674 |
- TF_peak.connectionTypesAllComb = unique(TF_peak.connectionTypes) |
|
3675 |
- } |
|
3676 |
- |
|
3677 |
- |
|
3678 |
- futile.logger::flog.info(paste0("Generating summary. This may take a while...")) |
|
3679 |
- |
|
3680 |
- |
|
3681 |
- for (permutationCur in 0:.getMaxPermutation(GRN)) { |
|
3682 |
- |
|
3683 |
- futile.logger::flog.info(paste0("\n", .getPermStr(permutationCur), "...\n")) |
|
3684 |
- |
|
3685 |
- permIndex = as.character(permutationCur) |
|
3686 |
- GRN@stats$connectionDetails.l[[permIndex]] = list() |
|
3687 |
- |
|
3688 |
- # Iterate over different stringency thresholds and collect statistics |
|
3689 |
- |
|
3690 |
- for (TF_peak.fdr_cur in TF_peak.fdr) { |
|
3669 |
+ if (is.null(GRN@stats$connections) | is.null(GRN@stats$connectionDetails.l) | forceRerun) { |
|
3691 | 3670 |
|
3692 |
- TF_peak.fdr_cur.str = as.character(TF_peak.fdr_cur) |
|
3671 |
+ GRN@stats$connections = .initializeStatsDF() |
|
3693 | 3672 |
|
3694 |
- futile.logger::flog.info(paste0("Calculate network stats for TF-peak FDR of ", TF_peak.fdr_cur)) |
|
3695 |
- GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] =list() |
|
3673 |
+ if (TF_peak.connectionTypes == "all") { |
|
3674 |
+ TF_peak.connectionTypesAllComb =.getAll_TF_peak_connectionTypes(GRN) |
|
3675 |
+ } else { |
|
3676 |
+ TF_peak.connectionTypesAllComb = unique(TF_peak.connectionTypes) |
|
3677 |
+ } |
|
3696 | 3678 |
|
3697 |
- futile.logger::flog.debug(paste0("Iterating over different peak-gene FDR thresholds...")) |
|
3698 |
- for (peak_gene.fdr_cur in peak_gene.fdr) { |
|
3699 |
- |
|
3700 |
- futile.logger::flog.debug(paste0("Peak-gene FDR = ", peak_gene.fdr_cur)) |
|
3701 |
- peak_gene.fdr_cur.str = as.character(peak_gene.fdr_cur) |
|
3702 |
- GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] =list() |
|
3703 |
- |
|
3704 |
- for (allowMissingTFsCur in allowMissingTFs) { |
|
3679 |
+ |
|
3680 |
+ futile.logger::flog.info(paste0("Generating summary. This may take a while...")) |
|
3681 |
+ |
|
3682 |
+ |
|
3683 |
+ for (permutationCur in 0:.getMaxPermutation(GRN)) { |
|
3705 | 3684 |
|
3706 |
- futile.logger::flog.debug(paste0(" allowMissingTFs = ", allowMissingTFsCur)) |
|
3707 |
- allowMissingTFsCur.str = as.character(allowMissingTFsCur) |
|
3708 |
- GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] = list() |
|
3685 |
+ futile.logger::flog.info(paste0("\n", .getPermStr(permutationCur), "...\n")) |
|
3709 | 3686 |
|
3710 |
- for (allowMissingGenesCur in allowMissingGenes) { |
|
3711 |
- |
|
3712 |
- futile.logger::flog.debug(paste0(" allowMissingGenes = ", allowMissingGenesCur)) |
|
3713 |
- allowMissingGenesCur.str = as.character(allowMissingGenesCur) |
|
3714 |
- GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] [[allowMissingGenesCur.str]] = list() |
|
3715 |
- |
|
3716 |
- for (TF_peak.connectionTypeCur in TF_peak.connectionTypesAllComb) { |
|
3687 |
+ permIndex = as.character(permutationCur) |
|
3688 |
+ GRN@stats$connectionDetails.l[[permIndex]] = list() |
|
3689 |
+ |
|
3690 |
+ # Iterate over different stringency thresholds and collect statistics |
|
3691 |
+ |
|
3692 |
+ for (TF_peak.fdr_cur in TF_peak.fdr) { |
|
3717 | 3693 |
|
3718 |
- GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] [[allowMissingGenesCur.str]] [[TF_peak.connectionTypeCur]] = list() |
|
3694 |
+ TF_peak.fdr_cur.str = as.character(TF_peak.fdr_cur) |
|
3719 | 3695 |
|
3720 |
- futile.logger::flog.debug(paste0(" TF_peak.connectionType = ", TF_peak.connectionTypeCur)) |
|
3696 |
+ futile.logger::flog.info(paste0("Calculate network stats for TF-peak FDR of ", TF_peak.fdr_cur)) |
|
3697 |
+ GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] =list() |
|
3721 | 3698 |
|
3722 |
- GRN = filterGRNAndConnectGenes(GRN, |
|
3723 |
- TF_peak.fdr.threshold = TF_peak.fdr_cur, |
|
3724 |
- TF_peak.connectionTypes = TF_peak.connectionTypeCur, |
|
3725 |
- peak_gene.p_raw.threshold = NULL, |
|
3726 |
- peak_gene.fdr.threshold= peak_gene.fdr_cur, |
|
3727 |
- gene.types = gene.types, |
|
3728 |
- allowMissingGenes = allowMissingGenesCur, |
|
3729 |
- allowMissingTFs = allowMissingTFsCur, |
|
3730 |
- peak_gene.r_range = peak_gene.r_range, |
|
3731 |
- filterTFs = NULL, filterGenes = NULL, filterPeaks = NULL, |
|
3732 |
- silent = TRUE) |
|
3699 |
+ futile.logger::flog.debug(paste0("Iterating over different peak-gene FDR thresholds...")) |
|
3700 |
+ for (peak_gene.fdr_cur in peak_gene.fdr) { |
|
3701 |
+ |
|
3702 |
+ futile.logger::flog.debug(paste0("Peak-gene FDR = ", peak_gene.fdr_cur)) |
|
3703 |
+ peak_gene.fdr_cur.str = as.character(peak_gene.fdr_cur) |
|
3704 |
+ GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] =list() |
|
3705 |
+ |
|
3706 |
+ for (allowMissingTFsCur in allowMissingTFs) { |
|
3707 |
+ |
|
3708 |
+ futile.logger::flog.debug(paste0(" allowMissingTFs = ", allowMissingTFsCur)) |
|
3709 |
+ allowMissingTFsCur.str = as.character(allowMissingTFsCur) |
|
3710 |
+ GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] = list() |
|
3711 |
+ |
|
3712 |
+ for (allowMissingGenesCur in allowMissingGenes) { |
|
3713 |
+ |
|
3714 |
+ futile.logger::flog.debug(paste0(" allowMissingGenes = ", allowMissingGenesCur)) |
|
3715 |
+ allowMissingGenesCur.str = as.character(allowMissingGenesCur) |
|
3716 |
+ GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] [[allowMissingGenesCur.str]] = list() |
|
3717 |
+ |
|
3718 |
+ for (TF_peak.connectionTypeCur in TF_peak.connectionTypesAllComb) { |
|
3719 |
+ |
|
3720 |
+ GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] [[allowMissingGenesCur.str]] [[TF_peak.connectionTypeCur]] = list() |
|
3721 |
+ |
|
3722 |
+ futile.logger::flog.debug(paste0(" TF_peak.connectionType = ", TF_peak.connectionTypeCur)) |
|
3723 |
+ |
|
3724 |
+ GRN = filterGRNAndConnectGenes(GRN, |
|
3725 |
+ TF_peak.fdr.threshold = TF_peak.fdr_cur, |
|
3726 |
+ TF_peak.connectionTypes = TF_peak.connectionTypeCur, |
|
3727 |
+ peak_gene.p_raw.threshold = NULL, |
|
3728 |
+ peak_gene.fdr.threshold= peak_gene.fdr_cur, |
|
3729 |
+ gene.types = gene.types, |
|
3730 |
+ allowMissingGenes = allowMissingGenesCur, |
|
3731 |
+ allowMissingTFs = allowMissingTFsCur, |
|
3732 |
+ peak_gene.r_range = peak_gene.r_range, |
|
3733 |
+ filterTFs = NULL, filterGenes = NULL, filterPeaks = NULL, |
|
3734 |
+ silent = TRUE) |
|
3735 |
+ |
|
3736 |
+ results.l = .addStats(GRN@stats$connections, GRN@connections$all.filtered[[permIndex]], |
|
3737 |
+ perm = permutationCur, |
|
3738 |
+ TF_peak.fdr = TF_peak.fdr_cur, TF_peak.connectionType = TF_peak.connectionTypeCur, |
|
3739 |
+ peak_gene.p_raw = NA, |
|
3740 |
+ peak_gene.fdr = peak_gene.fdr_cur, |
|
3741 |
+ peak_gene.r_range = paste0(peak_gene.r_range, collapse = ","), |
|
3742 |
+ gene.types = paste0(gene.types, collapse = ","), |
|
3743 |
+ allowMissingGenes = allowMissingGenesCur, |
|
3744 |
+ allowMissingTFs = allowMissingTFsCur) |
|
3745 |
+ |
|
3746 |
+ GRN@stats$connections = results.l[["summary"]] |
|
3747 |
+ |
|
3748 |
+ GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] [[allowMissingGenesCur.str]] [[TF_peak.connectionTypeCur]] = |
|
3749 |
+ results.l[["details"]] |
|
3750 |
+ |
|
3751 |
+ } # end of for (TF_peak.connectionTypeCur in TF_peak.connectionTypesAllComb) |
|
3752 |
+ |
|
3753 |
+ } # end of for (allowMissingGenesCur in allowMissingGenes) |
|
3754 |
+ |
|
3755 |
+ } # end of for (allowMissingTFsCur in allowMissingTFs) |
|
3756 |
+ |
|
3757 |
+ } # end of for (peak_gene.fdr_cur in peak_gene.fdr) |
|
3733 | 3758 |
|
3734 |
- results.l = .addStats(GRN@stats$connections, GRN@connections$all.filtered[[permIndex]], |
|
3735 |
- perm = permutationCur, |
|
3736 |
- TF_peak.fdr = TF_peak.fdr_cur, TF_peak.connectionType = TF_peak.connectionTypeCur, |
|
3737 |
- peak_gene.p_raw = NA, |
|
3738 |
- peak_gene.fdr = peak_gene.fdr_cur, |
|
3739 |
- peak_gene.r_range = paste0(peak_gene.r_range, collapse = ","), |
|
3740 |
- gene.types = paste0(gene.types, collapse = ","), |
|
3741 |
- allowMissingGenes = allowMissingGenesCur, |
|
3742 |
- allowMissingTFs = allowMissingTFsCur) |
|
3743 | 3759 |
|
3744 |
- GRN@stats$connections = results.l[["summary"]] |
|
3760 |
+ if (!is.null(peak_gene.p_raw)) { |
|
3761 |
+ |
|
3762 |
+ futile.logger::flog.info(paste0(" Iterating over different peak-gene raw p-value thresholds...")) |
|
3763 |
+ for (peak_gene.p_raw_cur in peak_gene.p_raw) { |
|
3764 |
+ |
|
3765 |
+ futile.logger::flog.info(paste0(" Peak-gene raw p-value = ", peak_gene.p_raw_cur)) |
|
3766 |
+ |
|
3767 |
+ # TODO: Add the other ones here also |
|
3768 |
+ for (allowMissingGenesCur in allowMissingGenes) { |
|
3769 |
+ |
|
3770 |
+ GRN = filterGRNAndConnectGenes(GRN, |
|
3771 |
+ TF_peak.fdr.threshold = TF_peak.fdr_cur, peak_gene.p_raw.threshold = peak_gene.p_raw_cur, |
|
3772 |
+ peak_gene.fdr.threshold= NULL, |
|
3773 |
+ gene.types = gene.types, |
|
3774 |
+ allowMissingGenes = allowMissingGenesCur, peak_gene.r_range = peak_gene.r_range, |
|
3775 |
+ filterTFs = NULL, filterGenes = NULL, filterPeaks = NULL, |
|
3776 |
+ silent = TRUE) |
|
3777 |
+ |
|
3778 |
+ |
|
3779 |
+ GRN@stats$connections = .addStats(GRN@stats$connections, GRN@connections$all.filtered[[permIndex]], |
|
3780 |
+ perm = permutationCur, TF_peak.fdr = TF_peak.fdr_cur, peak_gene.fdr = NA, |
|
3781 |
+ allowMissingGenes = allowMissingGenesCur, peak_gene.p_raw = peak_gene.p_raw_cur, |
|
3782 |
+ peak_gene.r_range = paste0(peak_gene.r_range, collapse = ","), |
|
3783 |
+ gene.types = paste0(gene.types, collapse = ",")) |
|
3784 |
+ |
|
3785 |
+ } # end of for (allowMissingGenesCur in allowMissingGenes) |
|
3786 |
+ |
|
3787 |
+ } # end of for (peak_gene.p_raw_cur in peak_gene.p_raw) |
|
3788 |
+ } # end of if (!is.null(peak_gene.p_raw)) |
|
3745 | 3789 |
|
3746 |
- GRN@stats$connectionDetails.l[[permIndex]] [[TF_peak.fdr_cur.str]] [[peak_gene.fdr_cur.str]] [[allowMissingTFsCur.str]] [[allowMissingGenesCur.str]] [[TF_peak.connectionTypeCur]] = |
|
3747 |
- results.l[["details"]] |
|
3748 | 3790 |
|
3749 |
- } # end of for (TF_peak.connectionTypeCur in TF_peak.connectionTypesAllComb) |
|
3750 |
- |
|
3751 |
- } # end of for (allowMissingGenesCur in allowMissingGenes) |
|
3752 |
- |
|
3753 |
- } # end of for (allowMissingTFsCur in allowMissingTFs) |
|
3754 |
- |
|
3755 |
- } # end of for (peak_gene.fdr_cur in peak_gene.fdr) |
|
3756 |
- |
|
3757 |
- |
|
3758 |
- if (!is.null(peak_gene.p_raw)) { |
|
3759 |
- |
|
3760 |
- futile.logger::flog.info(paste0(" Iterating over different peak-gene raw p-value thresholds...")) |
|
3761 |
- for (peak_gene.p_raw_cur in peak_gene.p_raw) { |
|
3762 |
- |
|
3763 |
- futile.logger::flog.info(paste0(" Peak-gene raw p-value = ", peak_gene.p_raw_cur)) |
|
3764 |
- |
|
3765 |
- # TODO: Add the other ones here also |
|
3766 |
- for (allowMissingGenesCur in allowMissingGenes) { |
|
3767 |
- |
|
3768 |
- GRN = filterGRNAndConnectGenes(GRN, |
|
3769 |
- TF_peak.fdr.threshold = TF_peak.fdr_cur, peak_gene.p_raw.threshold = peak_gene.p_raw_cur, |
|
3770 |
- peak_gene.fdr.threshold= NULL, |
|
3771 |
- gene.types = gene.types, |
|
3772 |
- allowMissingGenes = allowMissingGenesCur, peak_gene.r_range = peak_gene.r_range, |
|
3773 |
- filterTFs = NULL, filterGenes = NULL, filterPeaks = NULL, |
|
3774 |
- silent = TRUE) |
|
3775 |
- |
|
3776 |
- |
|
3777 |
- GRN@stats$connections = .addStats(GRN@stats$connections, GRN@connections$all.filtered[[permIndex]], |
|
3778 |
- perm = permutationCur, TF_peak.fdr = TF_peak.fdr_cur, peak_gene.fdr = NA, |
|
3779 |
- allowMissingGenes = allowMissingGenesCur, peak_gene.p_raw = peak_gene.p_raw_cur, |
|
3780 |
- peak_gene.r_range = paste0(peak_gene.r_range, collapse = ","), |
|
3781 |
- gene.types = paste0(gene.types, collapse = ",")) |
|
3782 |
- |
|
3783 |
- } # end of for (allowMissingGenesCur in allowMissingGenes) |
|
3784 |
- |
|
3785 |
- } # end of for (peak_gene.p_raw_cur in peak_gene.p_raw) |
|
3786 |
- } # end of if (!is.null(peak_gene.p_raw)) |
|
3787 |
- |
|
3791 |
+ } |
|
3792 |
+ } # end for each permutation |
|
3788 | 3793 |
|
3789 |
- } |
|
3790 |
- } # end for each permutation |
|
3794 |
+ } else { |
|
3795 |
+ futile.logger::flog.info(paste0("Data already exists in object. Set forceRerun = TRUE to regenerate and overwrite.")) |
|
3796 |
+ } |
|
3797 |
+ |
|
3791 | 3798 |
|
3792 | 3799 |
|
3793 | 3800 |
GRN |
... | ... |
@@ -3906,7 +3913,7 @@ generateStatsSummary <- function(GRN, |
3906 | 3913 |
#' |
3907 | 3914 |
#' @export |
3908 | 3915 |
#' @param forceDownload \code{TRUE} or \code{FALSE}. Default \code{FALSE}. Should the download be enforced even if the local cached file is already present? |
3909 |
-#' @param fileURL Character. Default https://www.embl.de/download/zaugg/GRaNIE/GRN.rds. URL to the GRN example object in rds format. |
|
3916 |
+#' @param fileURL Character. Default \url{https://www.embl.de/download/zaugg/GRaNIE/GRN.rds}. URL to the GRN example object in rds format. |
|
3910 | 3917 |
#' @examples |
3911 | 3918 |
#' GRN = loadExampleObject() |
3912 | 3919 |
#' @return |
... | ... |
@@ -3943,7 +3950,7 @@ loadExampleObject <- function(forceDownload = FALSE, fileURL = "https://www.embl |
3943 | 3950 |
#' Get counts for the various data defined in a \code{\linkS4class{GRN}} object. |
3944 | 3951 |
#' |
3945 | 3952 |
#' @template GRN |
3946 |
-#' @param type Character. Either "peaks" or "rna". "peaks" corresponds to the counts for the open chromatin data, while "rna" refers to th RNA-seq counts. If set to "rna", both permuted and non-permuted data can be retrieved, while for "peaks", only the non-permuted one (i.e., 0) can be retrieved. |
|
3953 |
+#' @param type Character. Either \code{peaks} or \code{rna}. \code{peaks} corresponds to the counts for the open chromatin data, while \code{rna} refers to th RNA-seq counts. If set to \code{rna}, both permuted and non-permuted data can be retrieved, while for \code{peaks}, only the non-permuted one (i.e., 0) can be retrieved. |
|
3947 | 3954 |
#' @param norm Logical. \code{TRUE} or \code{FALSE}. Should original (often raw, but this may not necessarily be the case) or normalized counts be returned? |
3948 | 3955 |
#' @template permuted |
3949 | 3956 |
#' @export |
... | ... |
@@ -3951,9 +3958,9 @@ loadExampleObject <- function(forceDownload = FALSE, fileURL = "https://www.embl |
3951 | 3958 |
#' @examples |
3952 | 3959 |
#' # See the Workflow vignette on the GRaNIE website for examples |
3953 | 3960 |
#' GRN = loadExampleObject() |
3954 |
-#' GRN = getCounts(GRN, type = "peaks", norm = TRUE, permuted = FALSE) |
|
3961 |
+#' counts.df = getCounts(GRN, type = "peaks", norm = TRUE, permuted = FALSE) |
|
3955 | 3962 |
#' @return Data frame of counts, with the type as indicated by the function parameters. |
3956 |
-#' getCounts(GRN, type = "peaks", norm = TRUE) |
|
3963 |
+ |
|
3957 | 3964 |
getCounts <- function(GRN, type, norm, permuted = FALSE) { |
3958 | 3965 |
|
3959 | 3966 |
GRN = .addFunctionLogToObject(GRN) |
... | ... |
@@ -4036,9 +4043,9 @@ getCounts <- function(GRN, type, norm, permuted = FALSE) { |
4036 | 4043 |
#' @export |
4037 | 4044 |
#' @template GRN |
4038 | 4045 |
#' @template permuted |
4039 |
-#' @param type Character. Default "all.filtered". Must be one of "TF_peaks", "peak_genes", "all.filtered". The type of connections to retrieve. |
|
4040 |
-#' @param include_TF_gene_correlations Logical. \code{TRUE} or \code{FALSE}. Should TFs and gene correlations be returned as well? If set to \code{TRUE}, they must have been computed beforehand with \code{\link{add_TF_gene_correlation}}. |
|
4041 |
-#' @return A data frame with the connections. Importantly, this function does NOT return a \code{\linkS4class{GRN}} object. |
|
4046 |
+#' @param type Character. Default \code{all.filtered}. Must be one of \code{TF_peaks}, \code{peak_genes}, \code{all.filtered}. The type of connections to retrieve. |
|
4047 |
+#' @param include_TF_gene_correlations Logical. \code{TRUE} or \code{FALSE}. Should TFs and gene correlations be returned as well? If set to \code{TRUE}, they must have been computed beforehand with \code{\link{add_TF_gene_correlation}}. |
|
4048 |
+#' @return A data frame with the connections. Importantly, this function does **NOT** return a \code{\linkS4class{GRN}} object. |
|
4042 | 4049 |
#' @examples |
4043 | 4050 |
#' # See the Workflow vignette on the GRaNIE website for examples |
4044 | 4051 |
#' GRN = loadExampleObject() |
... | ... |
@@ -4220,8 +4227,7 @@ getGRNConnections <- function(GRN, type = "all.filtered", permuted = FALSE, inc |
4220 | 4227 |
#' @examples |
4221 | 4228 |
#' # See the Workflow vignette on the GRaNIE website for examples |
4222 | 4229 |
#' GRN = loadExampleObject() |
4223 |
-#' # getParameters(GRN, type = "parameter", name = "all") |
|
4224 |
-#' |
|
4230 |
+#' getParameters(GRN, type = "parameter", name = "all") |
|
4225 | 4231 |
getParameters <- function (GRN, type = "parameter", name = "all") { |
4226 | 4232 |
|
4227 | 4233 |
checkmate::assertClass(GRN, "GRN") |
... | ... |
@@ -4266,7 +4272,7 @@ getParameters <- function (GRN, type = "parameter", name = "all") { |
4266 | 4272 |
} |
4267 | 4273 |
|
4268 | 4274 |
|
4269 |
-#' Optional convenience function to delete intermediate data from the function \link{AR_classification_wrapper} and summary statistics that may occupy a lot of space |
|
4275 |
+#' Optional convenience function to delete intermediate data from the function \code{\link{AR_classification_wrapper}} and summary statistics that may occupy a lot of space |
|
4270 | 4276 |
#' @export |
4271 | 4277 |
#' @template GRN |
4272 | 4278 |
#' @return The same \code{\linkS4class{GRN}} object, with some slots being deleted (\code{GRN@data$TFs$classification} as well as \code{GRN@stats$connectionDetails.l}) |
... | ... |
@@ -2,7 +2,7 @@ |
2 | 2 |
#' Builds a graph out of a set of connections |
3 | 3 |
#' |
4 | 4 |
#' @template GRN |
5 |
-#' @param model_TF_gene_nodes_separately \code{TRUE} or \code{FALSE}. Default \code{FALSE}. TODO description follows |
|
5 |
+#' @param model_TF_gene_nodes_separately \code{TRUE} or \code{FALSE}. Default \code{FALSE}. Should TF and gene nodes be modeled separately? If set to \code{TRUE},this may lead to unwanted effects in case of TF-TF connections (i.e., a TF regulating another TF) |
|
6 | 6 |
#' @param allowLoops \code{TRUE} or \code{FALSE}. Default \code{FALSE}. Allow loops in the network (i.e., a TF that regulates itself) |
7 | 7 |
#' @param removeMultiple \code{TRUE} or \code{FALSE}. Default \code{FALSE}. Remove loops with the same start and end point? This can happen if multiple TF originate from the same gene, for example. |
8 | 8 |
#' @param directed \code{TRUE} or \code{FALSE}. Default \code{FALSE}. Should the network be directed? |
... | ... |
@@ -285,11 +285,11 @@ performAllNetworkAnalyses <- function(GRN, ontology = c("GO_BP", "GO_MF"), |
285 | 285 |
#' |
286 | 286 |
#' This function runs an enrichment analysis for the genes in the filtered network. |
287 | 287 |
#' @template GRN |
288 |
-#' @param ontology Character vector of ontologies. Default c("GO_BP", "GO_MF"). Valid values are "GO_BP", "GO_MF", "GO_CC", "KEGG", "DO", and "Reactome", referring to GO Biological Process, GO Molecular Function, GO Cellular Component, KEGG, Disease Ontology, and Reactome Pathways. The GO enrichments |
|
289 |
-#' @param algorithm Character. Default "weight01". One of: "classic", "elim", "weight", "weight01", "lea", "parentchild." Only relevant if ontology is GO related (GO_BP, GO_MF, GO_CC), ignored otherwise. Name of the algorithm that handles the GO graph structures. Valid inputs are those supported by the topGO library. |
|
290 |
-#' @param statistic Character. Default "fisher". One of: "fisher", "ks", "t", "globaltest", "sum", "ks.ties". Statistical test to be used. Only relevant if ontology is GO related (GO_BP, GO_MF, GO_CC), and valid inputs are those supported by the topGO library, ignored otherwise. For the other ontologies the test statistic is always Fisher. |
|
291 |
-#' @param background Character. Default "neighborhood". One of: "all_annotated", "all_RNA", "neighborhood". Set of genes to be used to construct the background for the enrichment analysis. This can either be all annotated genes in the reference genome (all_annotated), all differentially expressed genes (all_RNA), or all the genes that are within the neighbourhood of a peak in the GRN (neighbourhood) |
|
292 |
-#' @param pAdjustMethod Character. Default "BH". One of: "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr". This parameter is only relevant for the following ontologies: KEGG, DO, Reactome. For the other ontologies, the algorithm serves as an adjustment. |
|
288 |
+#' @param ontology Character vector of ontologies. Default \code{c("GO_BP", "GO_MF")}. Valid values are \code{"GO_BP"}, \code{"GO_MF"}, \code{"GO_CC"}, \code{"KEGG"}, \code{"DO"}, and \code{"Reactome"}, referring to GO Biological Process, GO Molecular Function, GO Cellular Component, KEGG, Disease Ontology, and Reactome Pathways. |
|
289 |
+#' @param algorithm Character. Default \code{"weight01"}. One of: \code{"classic"}, \code{"elim"}, \code{"weight"}, \code{"weight01"}, \code{"lea"}, \code{"parentchild"}. Only relevant if ontology is GO related (GO_BP, GO_MF, GO_CC), ignored otherwise. Name of the algorithm that handles the GO graph structures. Valid inputs are those supported by the \code{topGO} library. |
|
290 |
+#' @param statistic Character. Default \code{"fisher"}. One of: \code{"fisher"}, \code{"ks"}, \code{"t"}, \code{"globaltest"}, \code{"sum"}, \code{"ks.ties"}. Statistical test to be used. Only relevant if ontology is GO related (GO_BP, GO_MF, GO_CC), and valid inputs are those supported by the topGO library, ignored otherwise. For the other ontologies the test statistic is always Fisher. |
|
291 |
+#' @param background Character. Default \code{"neighborhood"}. One of: \code{"all_annotated"}, \code{"all_RNA"}, \code{"neighborhood"}. Set of genes to be used to construct the background for the enrichment analysis. This can either be all annotated genes in the reference genome (all_annotated), all differentially expressed genes (all_RNA), or all the genes that are within the neighborhood of a peak in the GRN (neighborhood) |
|
292 |
+#' @param pAdjustMethod Character. Default \code{"BH"}. One of: \code{"holm"}, \code{"hochberg"}, \code{"hommel"}, \code{"bonferroni"}, \code{"BH"}, \code{"BY"}, \code{"fdr"}. This parameter is only relevant for the following ontologies: KEGG, DO, Reactome. For the other ontologies, the algorithm serves as an adjustment. |
|
293 | 293 |
#' @template forceRerun |
294 | 294 |
#' @return The same \code{\linkS4class{GRN}} object, with the enrichment results stored in the \code{stats$Enrichment$general} slot. |
295 | 295 |
#' @seealso \code{\link{plotGeneralEnrichment}} |
... | ... |
@@ -802,8 +802,8 @@ calculateCommunitiesStats <- function(GRN, clustering = "louvain", forceRerun = |
802 | 802 |
#' |
803 | 803 |
#' After the vertices of the filtered GRN are clustered into communities using \code{\link{calculateCommunitiesStats}}, this function will run a per-community enrichment analysis. |
804 | 804 |
#' @inheritParams calculateGeneralEnrichment |
805 |
-#' @param selection Character. Default "byRank". One of: "byRank", "byLabel". Specify whether the communities enrichment will by calculated based on their rank, where the largest community (with most vertices) would have a rank of 1, or by their label. Note that the label is independent of the rank. |
|
806 |
-#' @param communities Numeric vector. Default c(1:10). Depending on what was specified in the \code{display} parameter, this parameter would indicate either the rank or the label of the communities to be plotted. i.e. for \code{communities} = c(1,4), if \code{display} = "byRank" the GO enrichment for the first and fourth largest communities will be calculated if \code{display} = "byLabel", the results for the communities labeled "1", and "4" will be plotted. |
|
805 |
+#' @param selection Character. Default \code{"byRank"}. One of: \code{"byRank"}, \code{"byLabel"}. Specify whether the communities enrichment will by calculated based on their rank, where the largest community (with most vertices) would have a rank of 1, or by their label. Note that the label is independent of the rank. |
|
806 |
+#' @param communities Numeric vector. Default \code{c(1:10)}. Depending on what was specified in the \code{display} parameter, this parameter would indicate either the rank or the label of the communities to be plotted. i.e. for \code{communities = c(1,4)}, if \code{display = "byRank"} the GO enrichment for the first and fourth largest communities will be calculated if \code{display = "byLabel"}, the results for the communities labeled "1", and "4" will be plotted. |
|
807 | 807 |
#' @return The same \code{\linkS4class{GRN}} object, with the enrichment results stored in the \code{stats$Enrichment$byCommunity} slot. |
808 | 808 |
#' @seealso \code{\link{plotCommunitiesEnrichment}} |
809 | 809 |
#' @seealso \code{\link{plotGeneralEnrichment}} |
... | ... |
@@ -930,8 +930,8 @@ calculateCommunitiesEnrichment <- function(GRN, |
930 | 930 |
#' Retrieve top Nodes in the filtered \code{\linkS4class{GRN}} |
931 | 931 |
#' |
932 | 932 |
#' @template GRN |
933 |
-#' @param nodeType Character. One of: "gene", "TF". |
|
934 |
-#' @param rankType Character. One of: "degree", "EV". This parameter will determine the criterion to be used to identify the "top" nodes. If set to "degree", the function will select top nodes based on the number of connections they have, i.e. based on their degree-centrality. If set to "EV" it will select the top nodes based on their eigenvector-centrality score in the network. |
|
933 |
+#' @param nodeType Character. One of: \code{"gene"} or \code{"TF"}. Node type. |
|
934 |
+#' @param rankType Character. One of: \code{"degree"}, \code{"EV"}. This parameter will determine the criterion to be used to identify the "top" nodes. If set to "degree", the function will select top nodes based on the number of connections they have, i.e. based on their degree-centrality. If set to "EV" it will select the top nodes based on their eigenvector-centrality score in the network. |
|
935 | 935 |
#' @param n Numeric. Default 0.1. If this parameter is passed as a value between (0,1), it is treated as a percentage of top nodes. If the value is passed as an integer it will be treated as the number of top nodes. |
936 | 936 |
#' @param use_TF_gene_network \code{TRUE} or \code{FALSE}. Default \code{TRUE}. Should the TF-gene network be used (\code{TRUE}) or the TF-peak-gene network (\code{FALSE})? |
937 | 937 |
#' @return A dataframe with the node names and the corresponding scores used to rank them |
... | ... |
@@ -1017,15 +1017,14 @@ getTopNodes <- function(GRN, nodeType, rankType, n = 0.1, use_TF_gene_network = |
1017 | 1017 |
#' This function calculates the GO enrichment per TF, i.e. for the set of genes a given TF is connected to in the filtered \code{\linkS4class{GRN}}. |
1018 | 1018 |
#' |
1019 | 1019 |
#' @inheritParams calculateGeneralEnrichment |
1020 |
-#' @param rankType Character. One of: "degree", "EV", "custom". This parameter will determine the criterion to be used to identify the "top" TFs. If set to "degree", the function will select top TFs based on the number of connections to genes they have, i.e. based on their degree-centrality. If set to "EV" it will select the top TFs based on their eigenvector-centrality score in the network. If set to custom, a set of TF names will have to be passed to the "TF.names" parameter. |
|
1021 |
-#' @param n Numeric. Default 0.1. If this parameter is passed as a value between (0,1), it is treated as a percentage of top nodes. If the value is passed as an integer it will be treated as the number of top nodes. This parameter is not relevant if rankType = "custom". |
|
1022 |
-#' @param TF.names Character vector. If the rank type is set to "custom", a vector of TF names for which the GO enrichment should be calculated should be passed to this parameter. |
|
1020 |
+#' @param rankType Character. Default \code{"degree"}. One of: \code{"degree"}, \code{"EV"}, \code{"custom"}. This parameter will determine the criterion to be used to identify the "top" TFs. If set to "degree", the function will select top TFs based on the number of connections to genes they have, i.e. based on their degree-centrality. If set to \code{"EV"} it will select the top TFs based on their eigenvector-centrality score in the network. If set to custom, a set of TF names will have to be passed to the "TF.names" parameter. |
|
1021 |
+#' @param n Numeric. Default 0.1. If this parameter is passed as a value between (0,1), it is treated as a percentage of top nodes. If the value is passed as an integer it will be treated as the number of top nodes. This parameter is not relevant if \code{rankType = "custom"}. |
|
1022 |
+#' @param TF.names Character vector. Default \code{NULL}. If the rank type is set to \code{"custom"}, a vector of TF names for which the GO enrichment should be calculated should be passed to this parameter. |
|
1023 | 1023 |
#' @return The same \code{\linkS4class{GRN}} object, with the enrichment results stored in the \code{stats$Enrichment$byTF} slot. |
1024 | 1024 |
#' @examples |
1025 | 1025 |
#' # See the Workflow vignette on the GRaNIE website for examples |
1026 | 1026 |
#' GRN = loadExampleObject() |
1027 | 1027 |
#' GRN = calculateTFEnrichment(GRN, n = 5, ontology = "GO_BP", forceRerun = FALSE) |
1028 |
-#' GRN = calculateTFEnrichment(GRN, n = 5, ontology = "GO_BP", forceRerun = FALSE) |
|
1029 | 1028 |
#' @export |
1030 | 1029 |
calculateTFEnrichment <- function(GRN, rankType = "degree", n = 0.1, TF.names = NULL, |
1031 | 1030 |
ontology = c("GO_BP", "GO_MF"), algorithm = "weight01", |
... | ... |
@@ -791,9 +791,10 @@ plotDiagnosticPlots_TFPeaks <- function(GRN, |
791 | 791 |
#' @return The same \code{\linkS4class{GRN}} object, with added data from this function. |
792 | 792 |
#' @examples |
793 | 793 |
#' # See the Workflow vignette on the GRaNIE website for examples |
794 |
-#' GRN = loadExampleObject() |
|
795 |
-#' GRN = plotDiagnosticPlots_peakGene(GRN, forceRerun = FALSE) |
|
794 |
+#' # GRN = loadExampleObject() |
|
795 |
+#' # GRN = plotDiagnosticPlots_peakGene(GRN, forceRerun = FALSE) |
|
796 | 796 |
#' @export |
797 |
+# TODO: implement forceRerun correctly |
|
797 | 798 |
plotDiagnosticPlots_peakGene <- function(GRN, |
798 | 799 |
outputFolder = NULL, |
799 | 800 |
basenameOutput = NULL, |
... | ... |
@@ -845,7 +846,8 @@ plotDiagnosticPlots_peakGene <- function(GRN, |
845 | 846 |
useFiltered = useFiltered, |
846 | 847 |
plotPerTF = plotPerTF, |
847 | 848 |
pdf_width = pdf_width, pdf_height = pdf_height, |
848 |
- plotDetails = plotDetails) |
|
849 |
+ plotDetails = plotDetails, |
|
850 |
+ forceRerun = forceRerun) |
|
849 | 851 |
|
850 | 852 |
# Summarize all data, both real and random |
851 | 853 |
# filteredStr = dplyr::if_else(useFiltered, "_filtered", "") |
... | ... |
@@ -871,564 +873,578 @@ plotDiagnosticPlots_peakGene <- function(GRN, |
871 | 873 |
plotPerTF = FALSE, |
872 | 874 |
fileBase = NULL, |
873 | 875 |
pdf_width = 12, |
874 |
- pdf_height = 12) { |
|
876 |
+ pdf_height = 12, |
|
877 |
+ forceRerun = FALSE) { |
|
875 | 878 |
|
876 | 879 |
start = Sys.time() |
877 |
- futile.logger::flog.info(paste0("Plotting diagnostic plots for peak-gene correlations", dplyr::if_else(is.null(fileBase), "", paste0(" to file(s) with basename ", fileBase)))) |
|
878 |
- |
|
879 |
- cols_keep = c("peak.ID", "gene.ENSEMBL", "peak_gene.r", "peak_gene.p_raw", "peak_gene.distance") |
|
880 |
- |
|
881 |
- # Change from 10 to 5 |
|
882 |
- nCategoriesBinning = 5 |
|
883 |
- probs = seq(0,1,1/nCategoriesBinning) |
|
884 |
- |
|
885 |
- |
|
886 |
- range = GRN@config$parameters$promoterRange |
|
887 |
- |
|
888 |
- networkType_details = c(paste0("real_",range), paste0("random_",range)) |
|
889 |
- |
|
890 |
- colors_vec = c("black", "darkgray") |
|
891 |
- networkType_vec = c("real", "permuted") |
|
892 |
- names(colors_vec) = names(networkType_vec) = networkType_details |
|
893 |
- |
|
894 |
- options(dplyr.summarise.inform = FALSE) |
|
895 |
- |
|
896 |
- includeRobustCols = FALSE |
|
897 |
- #Either take all or the filtered set of connections |
|
898 |
- if (useFiltered) { |
|
899 |
- |
|
900 |
- # TODO: Robust columns filter |
|
901 |
- peakGeneCorrelations.all = |
|
902 |
- rbind(dplyr::select(GRN@connections$all.filtered[["0"]], dplyr::everything()) %>% |
|
903 |
- dplyr::mutate(class = paste0("real_",range))) %>% |
|
904 |
- rbind(dplyr::select(GRN@connections$all.filtered[["1"]], dplyr::everything()) %>% |
|
905 |
- dplyr::mutate(class = paste0("random_",range))) |
|
906 |
- |
|
907 |
- } else { |
|
908 |
- |
|
909 |
- robustColumns = c("peak_gene.p_raw.robust", "peak_gene.bias_M_p.raw", "peak_gene.bias_LS_p.raw", "peak_gene.r_robust") |
|
910 |
- if (all(robustColumns %in% colnames(GRN@connections$peak_genes[["0"]]))) { |
|
911 |
- includeRobustCols = TRUE |
|
912 |
- cols_keep = c(cols_keep, robustColumns) |
|
913 |
- } |
|
914 |
- |
|
915 |
- class_levels = c(paste0("real_",range), paste0("random_",range)) |
|
916 |
- |
|
917 |
- peakGeneCorrelations.all = |
|
918 |
- rbind( |
|
919 |
- dplyr::select(GRN@connections$peak_genes[["0"]], tidyselect::all_of(cols_keep)) %>% |
|
920 |
- dplyr::mutate(class = factor(paste0("real_",range), levels = class_levels)), |
|
921 |
- dplyr::select(GRN@connections$peak_genes[["1"]], tidyselect::all_of(cols_keep)) %>% |
|
922 |
- dplyr::mutate(class = factor(paste0("random_",range), levels = class_levels))) %>% |
|
923 |
- dplyr::left_join(dplyr::select(GRN@annotation$genes, gene.ENSEMBL, gene.type, gene.mean, gene.median, gene.CV), by = "gene.ENSEMBL") %>% |
|
924 |
- dplyr::left_join(GRN@annotation$consensusPeaks %>% |
|
925 |
- dplyr::select(-dplyr::starts_with("peak.gene."), -peak.GC.perc), by = "peak.ID") %>% |
|
926 |
- dplyr::select(-gene.ENSEMBL) |
|
927 |
- |
|
928 |
- } |
|
929 |
- |
|
930 |
- if (!plotPerTF) { |
|
931 |
- peakGeneCorrelations.all = dplyr::select(peakGeneCorrelations.all, -peak.ID) |
|
932 |
- } |
|
933 |
- |
|
934 |
- nClasses_distance = 10 |
|
935 |
- |
|
936 |
- peakGeneCorrelations.all = peakGeneCorrelations.all %>% |
|
937 |
- dplyr::mutate(r_positive = peak_gene.r > 0, |
|
938 |
- peak_gene.distance_class = |
|
939 |
- forcats::fct_explicit_na(addNA(cut(peak_gene.distance, breaks = nClasses_distance, include.lowest = TRUE)), "random"), |
|
940 |
- peak_gene.distance_class_abs = forcats::fct_explicit_na(addNA(cut(abs(peak_gene.distance), |
|
941 |
- breaks = nClasses_distance, include.lowest = TRUE, ordered_result = TRUE)), "random"), |
|
942 |
- peak_gene.p.raw.class = cut(peak_gene.p_raw, breaks = seq(0,1,0.05), include.lowest = TRUE, ordered_result = TRUE), |
|
943 |
- peak_gene.r.class = cut(peak_gene.r, breaks = seq(-1,1,0.05), include.lowest = TRUE, ordered_result = TRUE)) %>% |
|
944 |
- dplyr::filter(!is.na(peak_gene.r)) # Eliminate rows with NA for peak_gene.r. This can happen if the normalized gene counts are identical across ALL samples, and due to the lack of any variation, the correlation cannot be computed |
|
945 |
- |
|
946 |
- |
|
947 |
- # Oddity of cut: When breaks is specified as a single number, the range of the data is divided into breaks pieces of equal length, and then the outer limits are moved away by 0.1% of the range to ensure that the extreme values both fall within the break intervals. |
|
948 |
- levels(peakGeneCorrelations.all$peak_gene.distance_class_abs)[1] = |
|
949 |
- gsub("(-\\d+)", "0", levels(peakGeneCorrelations.all$peak_gene.distance_class_abs)[1], perl = TRUE) |
|
950 |
- |
|
951 |
- |
|
952 |
- if (includeRobustCols) { |
|
953 |
- peakGeneCorrelations.all = peakGeneCorrelations.all %>% |
|
954 |
- dplyr::mutate(peak_gene.p_raw.robust.class = |
|
955 |
- cut(peak_gene.p_raw.robust, breaks = seq(0,1,0.05), include.lowest = TRUE, ordered_result = TRUE)) |
|
956 |
- } |
|
957 |
- |
|
958 |
- |
|
959 |
- # Prepare plots # |
|
960 |
- |
|
961 |
- colors_class = c("black", "black") |
|
962 |
- names(colors_class)= unique(peakGeneCorrelations.all$class) |
|
963 |
- colors_class[which(grepl("random", names(colors_class)))] = "darkgray" |
|
964 |
- |
|
965 |
- r_pos_class = c("black", "darkgray") |
|
966 |
- names(r_pos_class) =c("TRUE", "FALSE") |
|
967 |
- |
|
968 |
- dist_class = c("dark red", "#fc9c9c") |
|
969 |
- names(dist_class) = class_levels |
|
970 |
- |
|
971 |
- freqs= table(peakGeneCorrelations.all$class) |
|
972 |
- freq_class = paste0(gsub(names(freqs), pattern = "(.+)(_.*)", replacement = "\\1"), " (n=", .prettyNum(freqs) , ")") |
|
973 |
- # Change upstream and go with "permuted" everywhere |
|
974 |
- freq_class = gsub(freq_class, pattern = "random", replacement = "permuted") |
|
975 |
- names(freq_class) <- names(freqs) |
|
976 | 880 |
|
881 |
+ # get list of all filenames that are going to be created |
|
882 |
+ filenames.all = c() |
|
977 | 883 |
filteredStr = dplyr::if_else(useFiltered, "_filtered", "") |
978 |
- |
|
979 |
- xlabels_peakGene_r.class = levels(peakGeneCorrelations.all$peak_gene.r.class) |
|
980 |
- nCur = length(xlabels_peakGene_r.class) |
|
981 |
- xlabels_peakGene_r.class[setdiff(seq_len(nCur), c(1, floor(nCur/2), nCur))] <- "" |
|
982 |
- |
|
983 |
- # For the last plot, which is wider, we label a few more |
|
984 |
- xlabels_peakGene_r.class2 = levels(peakGeneCorrelations.all$peak_gene.r.class) |
|
985 |
- nCur = length(xlabels_peakGene_r.class2) |
|
986 |
- xlabels_peakGene_r.class2[setdiff(seq_len(nCur), c(1, floor(nCur/4), floor(nCur/2), floor(nCur/4*3), nCur))] <- "" |
|
987 |
- |
|
988 |
- xlabels_peakGene_praw.class = levels(peakGeneCorrelations.all$peak_gene.p.raw.class) |
|
989 |
- nCur = length(xlabels_peakGene_praw.class) |
|
990 |
- xlabels_peakGene_praw.class[setdiff(seq_len(nCur), c(1, floor(nCur/2), nCur))] <- "" |
|
991 |
- |
|
992 |
- |
|
993 | 884 |
for (geneTypesSelected in gene.types) { |
994 |
- |
|
995 |
- futile.logger::flog.info(paste0(" Gene type ", paste0(geneTypesSelected, collapse = "+"))) |
|
996 |
- |
|
997 |
- if ("all" %in% geneTypesSelected) { |
|
998 |
- indexCur = seq_len(nrow(peakGeneCorrelations.all)) |
|
999 |
- } else { |
|
1000 |
- indexCur = which(peakGeneCorrelations.all$gene.type %in% geneTypesSelected) |
|
1001 |
- } |
|
1002 |
- |
|
1003 |
- |
|
1004 |
- # START PLOTTING # |
|
1005 |
- |
|
1006 |
- if (!is.null(fileBase)) { |
|
1007 |
- filenameCur = paste0(fileBase, paste0(geneTypesSelected, collapse = "+"), filteredStr, ".pdf") |
|
1008 |
- grDevices::pdf(file = filenameCur, width = pdf_width, height = pdf_height) |
|
1009 |
- } |
|
1010 |
- |
|
1011 |
- if (plotPerTF) { |
|
1012 |
- |
|
1013 |
- TF.nRows = rep(-1, length(GRN@config$allTF)) |
|
1014 |
- #TF.nRows = rep(-1, 10) |
|
1015 |
- TF.peaks = list() |
|
1016 |
- names(TF.nRows) = GRN@config$allTF |
|
1017 |
- for (TFCur in GRN@config$allTF) { |
|
1018 |
- TF.peaks[[TFCur]] = names(which(GRN@data$TFs$TF_peak_overlap[,TFCur] == 1)) |
|
1019 |
- TF.nRows[TFCur] = peakGeneCorrelations.all[indexCur,] %>% dplyr::filter(peak.ID %in% TF.peaks[[TFCur]]) %>% nrow() |
|
1020 |
- } |
|
1021 |
- |
|
1022 |
- TFs_sorted = names(sort(TF.nRows, decreasing = TRUE)) |
|
1023 |
- |
|
1024 |
- allTF = c("all", TFs_sorted) |
|
1025 |
- |
|
1026 |
- } else { |
|
1027 |
- allTF = "all" |
|
1028 |
- } |
|
1029 |
- |
|
1030 |
- counter = 0 |
|
1031 |
- for (TFCur in allTF) { |
|
1032 |
- |
|
1033 |
- counter = counter + 1 |
|
1034 |
- |
|
1035 |
- if (length(allTF) > 1) { |
|
1036 |
- futile.logger::flog.info(paste0(" QC plots for TF ", TFCur, " (", counter, " of ", length(allTF), ")")) |
|
1037 |
- } |
|
1038 |
- |
|
1039 |
- |
|
1040 |
- if ("all" %in% geneTypesSelected) { |
|
1041 |
- indexCur = seq_len(nrow(peakGeneCorrelations.all)) |
|
1042 |
- } else { |
|
1043 |
- indexCur = which(peakGeneCorrelations.all$gene.type %in% geneTypesSelected) |
|
1044 |
- } |
|
1045 |
- |
|
1046 |
- |
|
1047 |
- if (TFCur != "all") { |
|
1048 |
- indexCur = intersect(indexCur, which(peakGeneCorrelations.all$peak.ID %in% TF.peaks[[TFCur]])) |
|
1049 |
- } |
|
1050 |
- |
|
1051 |
- # Get subset also for just the real data |
|
1052 |
- indexCurReal = intersect(indexCur, which(peakGeneCorrelations.all$class == names(dist_class)[1])) |
|
1053 |
- |
|
1054 |
- |
|
1055 |
- xlabel = paste0("Correlation raw p-value") |
|
1056 |
- |
|
1057 |
- # DENSITY PLOTS P VALUE |
|
1058 |
- |
|
1059 |
- # TODO: Densities as ratio |
|
1060 |
- # https://stackoverflow.com/questions/58629959/how-can-i-extract-data-from-a-kernel-density-function-in-r-for-many-samples-at-o#:~:text=To%20compute%20the%20density%20you,can%20use%20the%20package%20spatstat%20. |
|
1061 |
- |
|
1062 |
- # Produce the labels for the class-specific subtitles |
|
1063 |
- customLabel_class = .customLabeler(table(peakGeneCorrelations.all[indexCur,]$class)) |
|
1064 |
- |
|
1065 |
- r_pos_freq = table(peakGeneCorrelations.all[indexCur,]$r_positive) |
|
1066 |
- labeler_r_pos = labeller(r_positive = c("TRUE" = paste0("r positive (", r_pos_freq["TRUE"], ")"), |
|
1067 |
- "FALSE" = paste0("r negative (", r_pos_freq["FALSE"], ")")) ) |
|
1068 |
- theme_main = theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = rel(0.8)), |
|
1069 |
- axis.text.y = element_text(size=rel(0.8)), |
|
1070 |
- panel.grid.major = element_blank(), panel.grid.minor = element_blank()) |
|
1071 |
- |
|
1072 |
- |
|
1073 |
- |
|
1074 |
- ## p-val density curves stratified by real/permuted ## |
|
1075 |
- |
|
1076 |
- gA2 = ggplot(peakGeneCorrelations.all[indexCur,], aes(peak_gene.p_raw, color = r_positive)) + geom_density() + |
|
1077 |
- facet_wrap(~ class, labeller = labeller(class=freq_class) ) + |
|
1078 |
- xlab(xlabel) + ylab("Density") + theme_bw() + |
|
1079 |
- scale_color_manual(labels = names(r_pos_class), values = r_pos_class) + |
|
1080 |
- theme(legend.position = "none", axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.8))) |
|
1081 |
- |
|
1082 |
- # Helper function to retrieve all tables and data aggregation steps for subsequent visualization |
|
1083 |
- tbl.l = .createTables_peakGeneQC(peakGeneCorrelations.all[indexCur,], networkType_details, colors_vec, range) |
|
1084 |
- |
|
1085 |
- |
|
1086 |
- xlabel = paste0("Correlation raw\np-value (binned)") |
|
1087 |
- |
|
1088 |
- |
|
1089 |
- xlabels = levels(tbl.l$d_merged$peak_gene.p.raw.class) |
|
1090 |
- xlabels[setdiff(seq_len(length(xlabels)), c(1, floor(length(xlabels)/2), length(xlabels)))] <- "" |
|
1091 |
- |
|
1092 |
- gB3 = ggplot(tbl.l$d_merged, aes(peak_gene.p.raw.class, ratio, fill = classAll)) + |
|
1093 |
- geom_bar(stat = "identity", position="dodge", na.rm = TRUE, width = 0.5) + |
|
1094 |
- geom_hline(yintercept = 1, linetype = "dotted") + |
|
1095 |
- xlab(xlabel) + ylab("Ratio") + |
|
1096 |
- scale_fill_manual("Class", values = c(dist_class, r_pos_class), |
|
1097 |
- labels = c("real", "permuted", "r+ (r>0)", "r- (r<=0)"), |
|
1098 |
- ) + # labels vector can be kind of manually specified here because the levels were previosly defined in a certain order |
|
1099 |
- scale_x_discrete(labels = xlabels_peakGene_praw.class) + |
|
1100 |
- theme_bw() + |
|
1101 |
- #theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8), strip.background = element_blank(), strip.placement = "outside", axis.title.y = element_blank()) + |
|
1102 |
- # theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8) , axis.title.y = element_blank()) + |
|
1103 |
- theme_main + |
|
1104 |
- facet_wrap(~ factor(set), nrow = 2, scales = "free_y", strip.position = "left") |
|
1105 |
- |
|
1106 |
- # plot two FDR plots as well: (fraction of negative / negative+positive) and (fraction of permuted / permuted + real) |
|
1107 |
- # that could give an indication of whether an FDR based on the permuted or based on the negative correlations would be more stringent |
|
1108 |
- |
|
1109 |
- # R PEAK GENE # |
|
1110 |
- |
|
1111 |
- xlabel = paste0("Correlation coefficient r") |
|
1112 |
- |
|
1113 |
- sum_real = table(peakGeneCorrelations.all[indexCur,]$class)[names(dist_class)[1]] |
|
1114 |
- sum_rnd = table(peakGeneCorrelations.all[indexCur,]$class)[names(dist_class)[2]] |
|
1115 |
- binData.r = peakGeneCorrelations.all[indexCur,] %>% |
|
1116 |
- dplyr::group_by(class) %>% |
|
1117 |
- dplyr::count(peak_gene.r.class) %>% |
|
1118 |
- dplyr::mutate(nnorm = dplyr::case_when(class == !! (names(dist_class)[1]) ~ .data$n / (sum_real / sum_rnd), |
|
1119 |
- TRUE ~ as.double(.data$n))) |
|
1120 |
- |
|
1121 |
- xlabel = paste0("Correlation coefficient r (binned)") |
|
1122 |
- |
|
1123 |
- gD = ggplot(binData.r, aes(peak_gene.r.class, nnorm, group = class, fill = class)) + |
|
1124 |
- geom_bar(stat = "identity", position = position_dodge(preserve = "single"), na.rm = FALSE, width = 0.5) + |
|
1125 |
- geom_line(aes(peak_gene.r.class, nnorm, group = class, color= class), stat = "identity") + |
|
1126 |
- scale_fill_manual("Group", labels = names(dist_class), values = dist_class) + |
|
1127 |
- scale_color_manual("Group", labels = names(dist_class), values = dist_class) + |
|
1128 |
- scale_x_discrete(labels = xlabels_peakGene_r.class2, drop = FALSE) + |
|
1129 |
- theme_bw() + theme(legend.position = "none") + |
|
1130 |
- xlab(xlabel) + ylab("Abundance") + |
|
1131 |
- theme_main + |
|
1132 |
- scale_y_continuous(labels = scales::scientific) |
|
1133 |
- |
|
1134 |
- |
|
1135 |
- mainTitle = paste0("Summary QC (TF: ", TFCur, ", gene type: ", paste0(geneTypesSelected, collapse = "+"), ",\n", .prettyNum(range), " bp promoter range)") |
|
1136 |
- |
|
1137 |
- plots_all = ( ((gA2 | gB3 ) + |
|
1138 |
- patchwork::plot_layout(widths = c(2.5,1.5))) / ((gD) + |
|
1139 |
- patchwork::plot_layout(widths = c(4))) ) + |
|
1140 |
- patchwork::plot_layout(heights = c(2,1), guides = 'collect') + |
|
1141 |
- patchwork::plot_annotation(title = mainTitle, theme = theme(plot.title = element_text(hjust = 0.5))) |
|
1142 |
- |
|
1143 |
- plot(plots_all) |
|
1144 |
- |
|
1145 |
- |
|
1146 |
- } # end for each TF |
|
1147 |
- |
|
1148 |
- |
|
1149 |
- |
|
1150 |
- if (!is.null(GRN@annotation$consensusPeaks_obj) & is.installed("ChIPseeker")) { |
|
1151 |
- #plot(ChIPseeker::plotAnnoBar(GRN@annotation$consensusPeaks_obj)) |
|
1152 |
- |
|
1153 |
- # no plot, as this is somehow just a list and no ggplot object |
|
1154 |
- ChIPseeker::plotAnnoPie(GRN@annotation$consensusPeaks_obj, |
|
1155 |
- main = paste0("\nPeak annotation BEFORE filtering (n = ", nrow(GRN@annotation$consensusPeaks), ")")) |
|
1156 |
- plot(ChIPseeker::plotDistToTSS(GRN@annotation$consensusPeaks_obj)) |
|
1157 |
- } |
|
1158 |
- |
|
1159 |
- # PEAK AND GENE PROPERTIES # |
|
1160 |
- |
|
1161 |
- xlabel = paste0("Correlation raw\np-value (binned)") |
|
1162 |
- mainTitle = paste0("Summary QC (TF: ", TFCur, ", gene type: ", paste0(geneTypesSelected, collapse = "+"), ", ", .prettyNum(range), " bp promoter range)") |
|
1163 |
- |
|
1164 |
- allVars = c("peak.GC.class", "peak.width", "peak.mean","peak.median", |
|
1165 |
- "peak.CV", "gene.median", "gene.mean", "gene.CV", "peak.gene.combined.CV") |
|
1166 |
- |
|
1167 |
- # If ChIPseeker is not installed, remove peak.annotation from it |
|
1168 |
- if ("peak.annotation" %in% colnames(peakGeneCorrelations.all)) { |
|
1169 |
- allVars = c(allVars, "peak.annotation") |
|
1170 |
- } |
|
1171 |
- |
|
1172 |
- |
|
1173 |
- for (varCur in allVars) { |
|
1174 |
- #next |
|
1175 |
- # Save memory and prune the table and add only the variable we need here |
|
1176 |
- if (varCur != "peak.gene.combined.CV") { |
|
1177 |
- dataCur = peakGeneCorrelations.all[indexCurReal,] %>% |
|
1178 |
- dplyr::select(peak_gene.p_raw, tidyselect::all_of(varCur), class, r_positive, peak_gene.p.raw.class, peak_gene.distance) |
|
1179 |
- } else { |
|
1180 |
- dataCur = peakGeneCorrelations.all[indexCurReal,] %>% |
|
1181 |
- dplyr::select(peak_gene.p_raw, class, gene.CV, peak.CV, r_positive, peak_gene.p.raw.class, peak_gene.distance) |
|
1182 |
- } |
|
885 |
+ filenameCur = paste0(fileBase, paste0(geneTypesSelected, collapse = "+"), filteredStr, ".pdf") |
|
886 |
+ filenames.all = c(filenames.all, filenameCur) |
|
887 |
+ } |
|
888 |
+ |
|
889 |
+ if (!all(file.exists(filenames.all)) | forceRerun) { |
|
890 |
+ |
|
891 |
+ futile.logger::flog.info(paste0("Plotting diagnostic plots for peak-gene correlations", dplyr::if_else(is.null(fileBase), "", paste0(" to file(s) with basename ", fileBase)))) |
|
892 |
+ |
|
893 |
+ cols_keep = c("peak.ID", "gene.ENSEMBL", "peak_gene.r", "peak_gene.p_raw", "peak_gene.distance") |
|
894 |
+ |
|
895 |
+ # Change from 10 to 5 |
|
896 |
+ nCategoriesBinning = 5 |
|
897 |
+ probs = seq(0,1,1/nCategoriesBinning) |
|
898 |
+ |
|
899 |
+ |
|
900 |
+ range = GRN@config$parameters$promoterRange |
|
901 |
+ |
|
902 |
+ networkType_details = c(paste0("real_",range), paste0("random_",range)) |
|
903 |
+ |
|
904 |
+ colors_vec = c("black", "darkgray") |
|
905 |
+ networkType_vec = c("real", "permuted") |
|
906 |
+ names(colors_vec) = names(networkType_vec) = networkType_details |
|
907 |
+ |
|
908 |
+ options(dplyr.summarise.inform = FALSE) |
|
909 |
+ |
|
910 |
+ includeRobustCols = FALSE |
|
911 |
+ #Either take all or the filtered set of connections |
|
912 |
+ if (useFiltered) { |
|
1183 | 913 |
|
1184 |
- |
|
1185 |
- # Choose colors depending on type of variable: gradient or not? |
|
1186 |
- |
|
1187 |
- if (varCur %in% c("peak.annotation","peak.GC.class")) { |
|
914 |
+ # TODO: Robust columns filter |
|
915 |
+ peakGeneCorrelations.all = |
|
916 |
+ rbind(dplyr::select(GRN@connections$all.filtered[["0"]], dplyr::everything()) %>% |
|
917 |
+ dplyr::mutate(class = paste0("real_",range))) %>% |
|
918 |
+ rbind(dplyr::select(GRN@connections$all.filtered[["1"]], dplyr::everything()) %>% |
|
919 |
+ dplyr::mutate(class = paste0("random_",range))) |
|
920 |
+ |
|
921 |
+ } else { |
|
922 |
+ |
|
923 |
+ robustColumns = c("peak_gene.p_raw.robust", "peak_gene.bias_M_p.raw", "peak_gene.bias_LS_p.raw", "peak_gene.r_robust") |
|
924 |
+ if (all(robustColumns %in% colnames(GRN@connections$peak_genes[["0"]]))) { |
|
925 |
+ includeRobustCols = TRUE |
|
926 |
+ cols_keep = c(cols_keep, robustColumns) |
|
927 |
+ } |
|
928 |
+ |
|
929 |
+ class_levels = c(paste0("real_",range), paste0("random_",range)) |
|
930 |
+ |
|
931 |
+ peakGeneCorrelations.all = |
|
932 |
+ rbind( |
|
933 |
+ dplyr::select(GRN@connections$peak_genes[["0"]], tidyselect::all_of(cols_keep)) %>% |
|
934 |
+ dplyr::mutate(class = factor(paste0("real_",range), levels = class_levels)), |
|
935 |
+ dplyr::select(GRN@connections$peak_genes[["1"]], tidyselect::all_of(cols_keep)) %>% |
|
936 |
+ dplyr::mutate(class = factor(paste0("random_",range), levels = class_levels))) %>% |
|
937 |
+ dplyr::left_join(dplyr::select(GRN@annotation$genes, gene.ENSEMBL, gene.type, gene.mean, gene.median, gene.CV), by = "gene.ENSEMBL") %>% |
|
938 |
+ dplyr::left_join(GRN@annotation$consensusPeaks %>% |
|
939 |
+ dplyr::select(-dplyr::starts_with("peak.gene."), -peak.GC.perc), by = "peak.ID") %>% |
|
940 |
+ dplyr::select(-gene.ENSEMBL) |
|
941 |
+ |
|
942 |
+ } |
|
1188 | 943 |
|
1189 |
- newColName = varCur |
|
1190 |
- |
|
1191 |
- } else { |
|
1192 |
- |
|
1193 |
- newColName = paste0(varCur, ".class") |
|
1194 |
- |
|
1195 |
- if (varCur == "peak.gene.combined.CV") { |
|
1196 |
- |
|
1197 |
- dataCur = dataCur %>% |
|
1198 |
- dplyr::mutate(!!(newColName) := dplyr::case_when( |
|
1199 |
- gene.CV < 0.5 & peak.CV < 0.5 ~ "gene.CV+peak.CV<0.5", ## |
|
1200 |
- # gene.CV >= 0.5 & gene.CV < 1 & peak.CV < 0.5 ~ "gene.CV<1_peak.CV<0.5", |
|
1201 |
- # gene.CV >= 1 & peak.CV < 0.5 ~ "gene.CV>1_peak.CV<0.5", |
|
1202 |
- # |
|
1203 |
- # gene.CV < 0.5 & peak.CV >= 0.5 & peak.CV < 1 ~ "gene.CV<0.5_peak.CV<1", |
|
1204 |
- # gene.CV >= 0.5 & gene.CV < 1 & peak.CV >= 0.5 & peak.CV < 1 ~ "gene.CV<1_peak.CV<1", |
|
1205 |
- # gene.CV >= 1 & peak.CV >= 0.5 & peak.CV < 1 ~ "gene.CV>1_peak.CV<1", |
|
1206 |
- # |
|
1207 |
- # gene.CV < 0.5 & peak.CV >= 1 ~ "gene.CV<0.5_peak.CV>1", |
|
1208 |
- # gene.CV >= 0.5 & gene.CV < 1 & peak.CV >= 1 ~ "gene.CV<1_peak.CV>1", |
|
1209 |
- gene.CV >= 1 & peak.CV >= 1 ~ "gene.CV+peak.CV>1", ## |
|
1210 |
- TRUE ~ "other" |
|
1211 |
- )) %>% |
|
1212 |
- dplyr::select(-gene.CV, -peak.CV) |
|
1213 |
- |
|
1214 |
- } else { |
|
1215 |
- |
|
1216 |
- dataCur = dataCur %>% |
|
1217 |
- dplyr::mutate(!!(newColName) := cut(.data[[varCur]], breaks = unique(quantile(.data[[varCur]], probs = probs, na.rm = TRUE)), |
|
1218 |
- include.lowest = TRUE, ordered_result = TRUE)) %>% |
|
1219 |
- dplyr::select(-tidyselect::all_of(varCur)) |
|
1220 |
- } |
|
1221 |
- |
|
944 |
+ if (!plotPerTF) { |
|
945 |
+ peakGeneCorrelations.all = dplyr::select(peakGeneCorrelations.all, -peak.ID) |
|
946 |
+ } |
|
947 |
+ |
|
948 |
+ nClasses_distance = 10 |
|
949 |
+ |
|
950 |
+ peakGeneCorrelations.all = peakGeneCorrelations.all %>% |
|
951 |
+ dplyr::mutate(r_positive = peak_gene.r > 0, |
|
952 |
+ peak_gene.distance_class = |
|
953 |
+ forcats::fct_explicit_na(addNA(cut(peak_gene.distance, breaks = nClasses_distance, include.lowest = TRUE)), "random"), |
|
954 |
+ peak_gene.distance_class_abs = forcats::fct_explicit_na(addNA(cut(abs(peak_gene.distance), |
|
955 |
+ breaks = nClasses_distance, include.lowest = TRUE, ordered_result = TRUE)), "random"), |
|
956 |
+ peak_gene.p.raw.class = cut(peak_gene.p_raw, breaks = seq(0,1,0.05), include.lowest = TRUE, ordered_result = TRUE), |
|
957 |
+ peak_gene.r.class = cut(peak_gene.r, breaks = seq(-1,1,0.05), include.lowest = TRUE, ordered_result = TRUE)) %>% |
|
958 |
+ dplyr::filter(!is.na(peak_gene.r)) # Eliminate rows with NA for peak_gene.r. This can happen if the normalized gene counts are identical across ALL samples, and due to the lack of any variation, the correlation cannot be computed |
|
959 |
+ |
|
960 |
+ |
|
961 |
+ # Oddity of cut: When breaks is specified as a single number, the range of the data is divided into breaks pieces of equal length, and then the outer limits are moved away by 0.1% of the range to ensure that the extreme values both fall within the break intervals. |
|
962 |
+ levels(peakGeneCorrelations.all$peak_gene.distance_class_abs)[1] = |
|
963 |
+ gsub("(-\\d+)", "0", levels(peakGeneCorrelations.all$peak_gene.distance_class_abs)[1], perl = TRUE) |
|
964 |
+ |
|
965 |
+ |
|
966 |
+ if (includeRobustCols) { |
|
967 |
+ peakGeneCorrelations.all = peakGeneCorrelations.all %>% |
|
968 |
+ dplyr::mutate(peak_gene.p_raw.robust.class = |
|
969 |
+ cut(peak_gene.p_raw.robust, breaks = seq(0,1,0.05), include.lowest = TRUE, ordered_result = TRUE)) |
|
970 |
+ } |
|
971 |
+ |
|
972 |
+ |
|
973 |
+ # Prepare plots # |
|
974 |
+ |
|
975 |
+ colors_class = c("black", "black") |
|
976 |
+ names(colors_class)= unique(peakGeneCorrelations.all$class) |
|
977 |
+ colors_class[which(grepl("random", names(colors_class)))] = "darkgray" |
|
978 |
+ |
|
979 |
+ r_pos_class = c("black", "darkgray") |
|
980 |
+ names(r_pos_class) =c("TRUE", "FALSE") |
|
981 |
+ |
|
982 |
+ dist_class = c("dark red", "#fc9c9c") |
|
983 |
+ names(dist_class) = class_levels |
|
984 |
+ |
|
985 |
+ freqs= table(peakGeneCorrelations.all$class) |
|
986 |
+ freq_class = paste0(gsub(names(freqs), pattern = "(.+)(_.*)", replacement = "\\1"), " (n=", .prettyNum(freqs) , ")") |
|
987 |
+ # Change upstream and go with "permuted" everywhere |
|
988 |
+ freq_class = gsub(freq_class, pattern = "random", replacement = "permuted") |
|
989 |
+ names(freq_class) <- names(freqs) |
|
990 |
+ |
|
991 |
+ |
|
992 |
+ |
|
993 |
+ xlabels_peakGene_r.class = levels(peakGeneCorrelations.all$peak_gene.r.class) |
|
994 |
+ nCur = length(xlabels_peakGene_r.class) |
|
995 |
+ xlabels_peakGene_r.class[setdiff(seq_len(nCur), c(1, floor(nCur/2), nCur))] <- "" |
|
996 |
+ |
|
997 |
+ # For the last plot, which is wider, we label a few more |
|
998 |
+ xlabels_peakGene_r.class2 = levels(peakGeneCorrelations.all$peak_gene.r.class) |
|
999 |
+ nCur = length(xlabels_peakGene_r.class2) |
|
1000 |
+ xlabels_peakGene_r.class2[setdiff(seq_len(nCur), c(1, floor(nCur/4), floor(nCur/2), floor(nCur/4*3), nCur))] <- "" |
|
1001 |
+ |
|
1002 |
+ xlabels_peakGene_praw.class = levels(peakGeneCorrelations.all$peak_gene.p.raw.class) |
|
1003 |
+ nCur = length(xlabels_peakGene_praw.class) |
|
1004 |
+ xlabels_peakGene_praw.class[setdiff(seq_len(nCur), c(1, floor(nCur/2), nCur))] <- "" |
|
1005 |
+ |
|
1006 |
+ |
|
1007 |
+ for (geneTypesSelected in gene.types) { |
|
1008 |
+ |
|
1009 |
+ futile.logger::flog.info(paste0(" Gene type ", paste0(geneTypesSelected, collapse = "+"))) |
|
1010 |
+ |
|
1011 |
+ if ("all" %in% geneTypesSelected) { |
|
1012 |
+ indexCur = seq_len(nrow(peakGeneCorrelations.all)) |
|
1013 |
+ } else { |
|
1014 |
+ indexCur = which(peakGeneCorrelations.all$gene.type %in% geneTypesSelected) |
|
1015 |
+ } |
|
1016 |
+ |
|
1017 |
+ |
|
1018 |
+ # START PLOTTING # |
|
1019 |
+ |
|
1020 |
+ if (!is.null(fileBase)) { |
|
1021 |
+ filenameCur = paste0(fileBase, paste0(geneTypesSelected, collapse = "+"), filteredStr, ".pdf") |
|
1022 |
+ grDevices::pdf(file = filenameCur, width = pdf_width, height = pdf_height) |
|
1023 |
+ } |
|
1024 |
+ |
|
1025 |
+ if (plotPerTF) { |
|
1026 |
+ |
|
1027 |
+ TF.nRows = rep(-1, length(GRN@config$allTF)) |
|
1028 |
+ #TF.nRows = rep(-1, 10) |
|
1029 |
+ TF.peaks = list() |
|
1030 |
+ names(TF.nRows) = GRN@config$allTF |
|
1031 |
+ for (TFCur in GRN@config$allTF) { |
|
1032 |
+ TF.peaks[[TFCur]] = names(which(GRN@data$TFs$TF_peak_overlap[,TFCur] == 1)) |
|
1033 |
+ TF.nRows[TFCur] = peakGeneCorrelations.all[indexCur,] %>% dplyr::filter(peak.ID %in% TF.peaks[[TFCur]]) %>% nrow() |
|
1034 |
+ } |
|
1035 |
+ |
|
1036 |
+ TFs_sorted = names(sort(TF.nRows, decreasing = TRUE)) |
|
1037 |
+ |
|
1038 |
+ allTF = c("all", TFs_sorted) |
|
1039 |
+ |
|
1040 |
+ } else { |
|
1041 |
+ allTF = "all" |
|
1042 |
+ } |
|
1043 |
+ |
|
1044 |
+ counter = 0 |
|
1045 |
+ for (TFCur in allTF) { |
|
1046 |
+ |
|
1047 |
+ counter = counter + 1 |
|
1048 |
+ |
|
1049 |
+ if (length(allTF) > 1) { |
|
1050 |
+ futile.logger::flog.info(paste0(" QC plots for TF ", TFCur, " (", counter, " of ", length(allTF), ")")) |
|
1051 |
+ } |
|
1052 |
+ |
|
1053 |
+ |
|
1054 |
+ if ("all" %in% geneTypesSelected) { |
|
1055 |
+ indexCur = seq_len(nrow(peakGeneCorrelations.all)) |
|
1056 |
+ } else { |
|
1057 |
+ indexCur = which(peakGeneCorrelations.all$gene.type %in% geneTypesSelected) |
|
1058 |
+ } |
|
1059 |
+ |
|
1060 |
+ |
|
1061 |
+ if (TFCur != "all") { |
|
1062 |
+ indexCur = intersect(indexCur, which(peakGeneCorrelations.all$peak.ID %in% TF.peaks[[TFCur]])) |
|
1063 |
+ } |
|
1064 |
+ |
|
1065 |
+ # Get subset also for just the real data |
|
1066 |
+ indexCurReal = intersect(indexCur, which(peakGeneCorrelations.all$class == names(dist_class)[1])) |
|
1067 |
+ |
|
1068 |
+ |
|
1069 |
+ xlabel = paste0("Correlation raw p-value") |
|
1070 |
+ |
|
1071 |
+ # DENSITY PLOTS P VALUE |
|
1072 |
+ |
|
1073 |
+ # TODO: Densities as ratio |
|
1074 |
+ # https://stackoverflow.com/questions/58629959/how-can-i-extract-data-from-a-kernel-density-function-in-r-for-many-samples-at-o#:~:text=To%20compute%20the%20density%20you,can%20use%20the%20package%20spatstat%20. |
|
1075 |
+ |
|
1076 |
+ # Produce the labels for the class-specific subtitles |
|
1077 |
+ customLabel_class = .customLabeler(table(peakGeneCorrelations.all[indexCur,]$class)) |
|
1078 |
+ |
|
1079 |
+ r_pos_freq = table(peakGeneCorrelations.all[indexCur,]$r_positive) |
|
1080 |
+ labeler_r_pos = labeller(r_positive = c("TRUE" = paste0("r positive (", r_pos_freq["TRUE"], ")"), |
|
1081 |
+ "FALSE" = paste0("r negative (", r_pos_freq["FALSE"], ")")) ) |
|
1082 |
+ theme_main = theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = rel(0.8)), |
|
1083 |
+ axis.text.y = element_text(size=rel(0.8)), |
|
1084 |
+ panel.grid.major = element_blank(), panel.grid.minor = element_blank()) |
|
1085 |
+ |
|
1086 |
+ |
|
1087 |
+ |
|
1088 |
+ ## p-val density curves stratified by real/permuted ## |
|
1089 |
+ |
|
1090 |
+ gA2 = ggplot(peakGeneCorrelations.all[indexCur,], aes(peak_gene.p_raw, color = r_positive)) + geom_density() + |
|
1091 |
+ facet_wrap(~ class, labeller = labeller(class=freq_class) ) + |
|
1092 |
+ xlab(xlabel) + ylab("Density") + theme_bw() + |
|
1093 |
+ scale_color_manual(labels = names(r_pos_class), values = r_pos_class) + |
|
1094 |
+ theme(legend.position = "none", axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.8))) |
|
1095 |
+ |
|
1096 |
+ # Helper function to retrieve all tables and data aggregation steps for subsequent visualization |
|
1097 |
+ tbl.l = .createTables_peakGeneQC(peakGeneCorrelations.all[indexCur,], networkType_details, colors_vec, range) |
|
1098 |
+ |
|
1099 |
+ |
|
1100 |
+ xlabel = paste0("Correlation raw\np-value (binned)") |
|
1101 |
+ |
|
1102 |
+ |
|
1103 |
+ xlabels = levels(tbl.l$d_merged$peak_gene.p.raw.class) |
|
1104 |
+ xlabels[setdiff(seq_len(length(xlabels)), c(1, floor(length(xlabels)/2), length(xlabels)))] <- "" |
|
1105 |
+ |
|
1106 |
+ gB3 = ggplot(tbl.l$d_merged, aes(peak_gene.p.raw.class, ratio, fill = classAll)) + |
|
1107 |
+ geom_bar(stat = "identity", position="dodge", na.rm = TRUE, width = 0.5) + |
|
1108 |
+ geom_hline(yintercept = 1, linetype = "dotted") + |
|
1109 |
+ xlab(xlabel) + ylab("Ratio") + |
|
1110 |
+ scale_fill_manual("Class", values = c(dist_class, r_pos_class), |
|
1111 |
+ labels = c("real", "permuted", "r+ (r>0)", "r- (r<=0)"), |
|
1112 |
+ ) + # labels vector can be kind of manually specified here because the levels were previosly defined in a certain order |
|
1113 |
+ scale_x_discrete(labels = xlabels_peakGene_praw.class) + |
|
1114 |
+ theme_bw() + |
|
1115 |
+ #theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8), strip.background = element_blank(), strip.placement = "outside", axis.title.y = element_blank()) + |
|
1116 |
+ # theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8) , axis.title.y = element_blank()) + |
|
1117 |
+ theme_main + |
|
1118 |
+ facet_wrap(~ factor(set), nrow = 2, scales = "free_y", strip.position = "left") |
|
1119 |
+ |
|
1120 |
+ # plot two FDR plots as well: (fraction of negative / negative+positive) and (fraction of permuted / permuted + real) |
|
1121 |
+ # that could give an indication of whether an FDR based on the permuted or based on the negative correlations would be more stringent |
|
1122 |
+ |
|
1123 |
+ # R PEAK GENE # |
|
1124 |
+ |
|
1125 |
+ xlabel = paste0("Correlation coefficient r") |
|
1126 |
+ |
|
1127 |
+ sum_real = table(peakGeneCorrelations.all[indexCur,]$class)[names(dist_class)[1]] |
|
1128 |
+ sum_rnd = table(peakGeneCorrelations.all[indexCur,]$class)[names(dist_class)[2]] |
|
1129 |
+ binData.r = peakGeneCorrelations.all[indexCur,] %>% |
|
1130 |
+ dplyr::group_by(class) %>% |
|
1131 |
+ dplyr::count(peak_gene.r.class) %>% |
|
1132 |
+ dplyr::mutate(nnorm = dplyr::case_when(class == !! (names(dist_class)[1]) ~ .data$n / (sum_real / sum_rnd), |
|
1133 |
+ TRUE ~ as.double(.data$n))) |
|
1134 |
+ |
|
1135 |
+ xlabel = paste0("Correlation coefficient r (binned)") |
|
1136 |
+ |
|
1137 |
+ gD = ggplot(binData.r, aes(peak_gene.r.class, nnorm, group = class, fill = class)) + |
|
1138 |
+ geom_bar(stat = "identity", position = position_dodge(preserve = "single"), na.rm = FALSE, width = 0.5) + |
|
1139 |
+ geom_line(aes(peak_gene.r.class, nnorm, group = class, color= class), stat = "identity") + |
|
1140 |
+ scale_fill_manual("Group", labels = names(dist_class), values = dist_class) + |
|
1141 |
+ scale_color_manual("Group", labels = names(dist_class), values = dist_class) + |
|
1142 |
+ scale_x_discrete(labels = xlabels_peakGene_r.class2, drop = FALSE) + |
|
1143 |
+ theme_bw() + theme(legend.position = "none") + |
|
1144 |
+ xlab(xlabel) + ylab("Abundance") + |
|
1145 |
+ theme_main + |
|
1146 |
+ scale_y_continuous(labels = scales::scientific) |
|
1147 |
+ |
|
1148 |
+ |
|
1149 |
+ mainTitle = paste0("Summary QC (TF: ", TFCur, ", gene type: ", paste0(geneTypesSelected, collapse = "+"), ",\n", .prettyNum(range), " bp promoter range)") |
|
1150 |
+ |
|
1151 |
+ plots_all = ( ((gA2 | gB3 ) + |
|
1152 |
+ patchwork::plot_layout(widths = c(2.5,1.5))) / ((gD) + |
|
1153 |
+ patchwork::plot_layout(widths = c(4))) ) + |
|
1154 |
+ patchwork::plot_layout(heights = c(2,1), guides = 'collect') + |
|
1155 |
+ patchwork::plot_annotation(title = mainTitle, theme = theme(plot.title = element_text(hjust = 0.5))) |
|
1156 |
+ |
|
1157 |
+ plot(plots_all) |
|
1158 |
+ |
|
1159 |
+ |
|
1160 |
+ } # end for each TF |
|
1161 |
+ |
|
1162 |
+ |
|
1163 |
+ |
|
1164 |
+ if (!is.null(GRN@annotation$consensusPeaks_obj) & is.installed("ChIPseeker")) { |
|
1165 |
+ #plot(ChIPseeker::plotAnnoBar(GRN@annotation$consensusPeaks_obj)) |
|
1166 |
+ |
|
1167 |
+ # no plot, as this is somehow just a list and no ggplot object |
|
1168 |
+ ChIPseeker::plotAnnoPie(GRN@annotation$consensusPeaks_obj, |
|
1169 |
+ main = paste0("\nPeak annotation BEFORE filtering (n = ", nrow(GRN@annotation$consensusPeaks), ")")) |
|
1170 |
+ plot(ChIPseeker::plotDistToTSS(GRN@annotation$consensusPeaks_obj)) |
|
1171 |
+ } |
|
1172 |
+ |
|
1173 |
+ # PEAK AND GENE PROPERTIES # |
|
1174 |
+ |
|
1175 |
+ xlabel = paste0("Correlation raw\np-value (binned)") |
|
1176 |
+ mainTitle = paste0("Summary QC (TF: ", TFCur, ", gene type: ", paste0(geneTypesSelected, collapse = "+"), ", ", .prettyNum(range), " bp promoter range)") |
|
1177 |
+ |
|
1178 |
+ allVars = c("peak.GC.class", "peak.width", "peak.mean","peak.median", |
|
1179 |
+ "peak.CV", "gene.median", "gene.mean", "gene.CV", "peak.gene.combined.CV") |
|
1180 |
+ |
|
1181 |
+ # If ChIPseeker is not installed, remove peak.annotation from it |
|
1182 |
+ if ("peak.annotation" %in% colnames(peakGeneCorrelations.all)) { |
|
1183 |
+ allVars = c(allVars, "peak.annotation") |
|
1184 |
+ } |
|
1185 |
+ |
|
1186 |
+ |
|
1187 |
+ for (varCur in allVars) { |
|
1188 |
+ #next |
|
1189 |
+ # Save memory and prune the table and add only the variable we need here |
|
1190 |
+ if (varCur != "peak.gene.combined.CV") { |
|
1191 |
+ dataCur = peakGeneCorrelations.all[indexCurReal,] %>% |
|
1192 |
+ dplyr::select(peak_gene.p_raw, tidyselect::all_of(varCur), class, r_positive, peak_gene.p.raw.class, peak_gene.distance) |
|
1193 |
+ } else { |
|
1194 |
+ dataCur = peakGeneCorrelations.all[indexCurReal,] %>% |
|
1195 |
+ dplyr::select(peak_gene.p_raw, class, gene.CV, peak.CV, r_positive, peak_gene.p.raw.class, peak_gene.distance) |
|
1196 |
+ } |
|
1197 |
+ |
|
1198 |
+ |
|
1199 |
+ # Choose colors depending on type of variable: gradient or not? |
|
1200 |
+ |
|
1201 |
+ if (varCur %in% c("peak.annotation","peak.GC.class")) { |
|
1222 | 1202 |
|
1223 |
- } |
|
1203 |
+ newColName = varCur |
|
1204 |
+ |
|
1205 |
+ } else { |
|
1206 |
+ |
|
1207 |
+ newColName = paste0(varCur, ".class") |
|
1208 |
+ |
|
1209 |
+ if (varCur == "peak.gene.combined.CV") { |
|
1210 |
+ |
|
1211 |
+ dataCur = dataCur %>% |
|
1212 |
+ dplyr::mutate(!!(newColName) := dplyr::case_when( |
|
1213 |
+ gene.CV < 0.5 & peak.CV < 0.5 ~ "gene.CV+peak.CV<0.5", ## |
|
1214 |
+ # gene.CV >= 0.5 & gene.CV < 1 & peak.CV < 0.5 ~ "gene.CV<1_peak.CV<0.5", |
|
1215 |
+ # gene.CV >= 1 & peak.CV < 0.5 ~ "gene.CV>1_peak.CV<0.5", |
|
1216 |
+ # |
|
1217 |
+ # gene.CV < 0.5 & peak.CV >= 0.5 & peak.CV < 1 ~ "gene.CV<0.5_peak.CV<1", |
|
1218 |
+ # gene.CV >= 0.5 & gene.CV < 1 & peak.CV >= 0.5 & peak.CV < 1 ~ "gene.CV<1_peak.CV<1", |
|
1219 |
+ # gene.CV >= 1 & peak.CV >= 0.5 & peak.CV < 1 ~ "gene.CV>1_peak.CV<1", |
|
1220 |
+ # |
|
1221 |
+ # gene.CV < 0.5 & peak.CV >= 1 ~ "gene.CV<0.5_peak.CV>1", |
|
1222 |
+ # gene.CV >= 0.5 & gene.CV < 1 & peak.CV >= 1 ~ "gene.CV<1_peak.CV>1", |
|
1223 |
+ gene.CV >= 1 & peak.CV >= 1 ~ "gene.CV+peak.CV>1", ## |
|
1224 |
+ TRUE ~ "other" |
|
1225 |
+ )) %>% |
|
1226 |
+ dplyr::select(-gene.CV, -peak.CV) |
|
1227 |
+ |
|
1228 |
+ } else { |
|
1229 |
+ |
|
1230 |
+ dataCur = dataCur %>% |
|
1231 |
+ dplyr::mutate(!!(newColName) := cut(.data[[varCur]], breaks = unique(quantile(.data[[varCur]], probs = probs, na.rm = TRUE)), |
|
1232 |
+ include.lowest = TRUE, ordered_result = TRUE)) %>% |
|
1233 |
+ dplyr::select(-tidyselect::all_of(varCur)) |
|
1234 |
+ } |
|
1235 |
+ |
|
1236 |
+ |
|
1237 |
+ } |
|
1238 |
+ |
|
1239 |
+ # Filter groups with fewer than 100 observations |
|
1240 |
+ nGroupsMin = 100 |
|
1241 |
+ dataCur = dataCur %>% |
|
1242 |
+ dplyr::group_by(.data[[newColName]]) %>% |
|
1243 |
+ dplyr::filter(dplyr::n() >= nGroupsMin) %>% |
|
1244 |
+ dplyr::ungroup() |
|
1245 |
+ |
|
1246 |
+ var.label = .prettyNum(table(dataCur[, newColName])) |
|
1247 |
+ var.label = paste0(names(var.label), "\n(n=", var.label , ")\n") |
|
1248 |
+ |
|
1249 |
+ # Set colors |
|
1250 |
+ if (varCur != "peak.annotation") { |
|
1251 |
+ mycolors <- viridis::viridis(length(var.label)) |
|
1252 |
+ } else { |
|
1253 |
+ # Only here, we want to have colors that are not a gradient |
|
1254 |
+ mycolors = var.label |
|
1255 |
+ downstream = which(grepl("downstream", var.label, ignore.case = TRUE)) |
|
1256 |
+ promoter = which(grepl("promoter", var.label, ignore.case = TRUE)) |
|
1257 |
+ |
|
1258 |
+ # Remove the first, white-like color from the 2 palettes |
|
1259 |
+ mycolors[downstream] = RColorBrewer::brewer.pal(length(downstream) + 1, "Greens")[-1] |
|
1260 |
+ mycolors[promoter] = RColorBrewer::brewer.pal(length(promoter) + 1, "Purples")[-1] |
|
1261 |
+ mycolors[which(grepl("3' UTR", mycolors))] = "yellow" |
|
1262 |
+ mycolors[which(grepl("5' UTR", mycolors))] = "orange" |
|
1263 |
+ mycolors[which(grepl("Distal Intergenic", mycolors))] = "red" |
|
1264 |
+ mycolors[which(grepl("Exon", mycolors))] = "maroon" |
|
1265 |
+ mycolors[which(grepl("Intron", mycolors))] = "lightblue" |
|
1266 |
+ |
|
1267 |
+ } |
|
1268 |
+ |
|
1269 |
+ r_pos_tbl = dataCur %>% dplyr::group_by(r_positive) %>% dplyr::pull(r_positive) %>% table() |
|
1270 |
+ r_positive_label = c("TRUE" = paste0("r+(r>0, n=", .prettyNum(r_pos_tbl[["TRUE"]]), ")"), |
|
1271 |
+ "FALSE" = paste0("r-(r<=0, n=" ,.prettyNum(r_pos_tbl[["FALSE"]]), ")")) |
|
1272 |
+ |
|
1273 |
+ # Class in the facet_wrap has been removed, as this is only for real data here |
|
1274 |
+ gA3 = ggplot(dataCur, aes(peak_gene.p_raw, color = .data[[newColName]])) + geom_density(size = 0.2) + |
|
1275 |
+ facet_wrap(~r_positive, labeller = labeller(r_positive = r_positive_label), nrow = 2) + |
|
1276 |
+ geom_density(aes(color = classNew), color = "black", linetype = "dotted", alpha = 1) + |
|
1277 |
+ xlab(xlabel) + ylab("Density (for real data only)") + theme_bw() + |
|
1278 |
+ scale_color_manual(newColName, values = mycolors, labels = var.label, drop = FALSE ) + |
|
1279 |
+ theme(axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.8)), |
|
1280 |
+ legend.text=element_text(size=rel(0.6)), legend.position = "none") |
|
1281 |
+ |
|
1282 |
+ # The following causes an error with patchwork |
|
1283 |
+ # gA3 = ggplot(dataCur, aes(peak_gene.p_raw, color = .data[[newColName]])) + geom_density(size = 0.2) + |
|
1284 |
+ # facet_wrap(~ class + r_positive, labeller = labeller(class=freq_class, r_positive = r_positive_label), nrow = 2) + |
|
1285 |
+ # geom_density(aes(color = classNew), color = "black", linetype = "dotted", alpha = 1) + |
|
1286 |
+ # xlab(xlabel) + ylab("Density") + theme_bw() + |
|
1287 |
+ # scale_color_manual(newColName, values = mycolors, labels = var.label, drop = FALSE ) + |
|
1288 |
+ # theme(axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.8)), |
|
1289 |
+ # legend.text=element_text(size=rel(0.6)), legend.position = "none") |
|
1290 |
+ |
|
1291 |
+ |
|
1292 |
+ # Ratios for r+ / r- |
|
1293 |
+ freq = dataCur %>% |
|
1294 |
+ dplyr::group_by(class, .data[[newColName]], peak_gene.p.raw.class, r_positive) %>% |
|
1295 |
+ dplyr::summarise(n = dplyr::n()) %>% |
|
1296 |
+ dplyr::ungroup() %>% |
|
1297 |
+ tidyr::complete(class, .data[[newColName]], peak_gene.p.raw.class, r_positive, fill = list(n = 0)) %>% # SOme cases might be missing |
|
1298 |
+ dplyr::group_by(class, .data[[newColName]], peak_gene.p.raw.class) %>% # dont group by r_positive because we want to calculate the ratio within each group |
|
1299 |
+ dplyr::mutate( |
|
1300 |
+ n = .data$n + 1, # to allow ratios to be computed even for 0 counts |
|
1301 |
+ ratio_pos_raw = .data$n[r_positive] / .data$n[!r_positive]) %>% |
|
1302 |
+ dplyr::filter(r_positive, class == names(dist_class)[1])# Keep only one r_positive row per grouping as we operate via the ratio and this data is duplicated otherwise. Remove random data also because these have been filtered out before are only back due to the complete invokation. |
|
1303 |
+ |
|
1304 |
+ # Cap ratios > 10 at 10 to avoid visual artefacts |
|
1305 |
+ freq$ratio_pos_raw[which(freq$ratio_pos_raw > 10)] = 10 |
|
1306 |
+ |
|
1307 |
+ # Without proper colors for now, this will be added after the next plot |
|
1308 |
+ gB3 = ggplot(freq, aes(peak_gene.p.raw.class, ratio_pos_raw, fill = .data[[newColName]])) + |
|
1309 |
+ geom_bar(stat = "identity", position="dodge", na.rm = TRUE, width = 0.8) + |
|
1310 |
+ geom_hline(yintercept = 1, linetype = "dotted") + |
|
1311 |
+ xlab(xlabel) + ylab("Ratio r+ / r- (capped at 10)") + |
|
1312 |
+ scale_x_discrete(labels = xlabels_peakGene_praw.class) + |
|
1313 |
+ scale_fill_manual (varCur, values = mycolors, labels = var.label, drop = FALSE ) + |
|
1314 |
+ theme_bw() + |
|
1315 |
+ #theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8), strip.background = element_blank(), strip.placement = "outside", axis.title.y = element_blank()) + |
|
1316 |
+ # theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8) , axis.title.y = element_blank()) + |
|
1317 |
+ theme_main + |
|
1318 |
+ facet_wrap(~ factor(class), nrow = 2, scales = "free_y", strip.position = "left", , labeller = labeller(class=freq_class)) |
|
1319 |
+ |
|
1320 |
+ |
|
1321 |
+ plots_all = ( ((gA3 | gB3 ) + |
|
1322 |
+ patchwork::plot_layout(widths = c(2.5,1.5))) ) + |
|
1323 |
+ patchwork::plot_layout(guides = 'collect') + |
|
1324 |
+ patchwork::plot_annotation(title = mainTitle, theme = theme(plot.title = element_text(hjust = 0.5))) |
|
1325 |
+ |
|
1326 |
+ plot(plots_all) |
|
1327 |
+ |
|
1328 |
+ |
|
1329 |
+ # VERSION JUDITH: simplified peak.gene distance + another variable as histogram, no pemuted data |
|
1330 |
+ |
|
1331 |
+ datasetName = "" |
|
1332 |
+ if (!is.null(GRN@config$metadata$name)) { |
|
1333 |
+ datasetName = GRN@config$metadata$name |
|
1334 |
+ } |
|
1335 |
+ |
|
1336 |
+ mainTitle2 = paste0(datasetName, "\nSummary QC (TF: ", TFCur, ", gene type: ", paste0(geneTypesSelected, collapse = "+"), ", ", |
|
1337 |
+ .prettyNum(range), " bp promoter range, stratified by distance + ", varCur, ")") |
|
1338 |
+ |
|
1339 |
+ # r+ and r- |
|
1340 |
+ binwidth = 0.1 |
|
1341 |
+ mycolors <- viridis::viridis(2) |
|
1342 |
+ xlabel = paste0("Correlation raw p-value (", binwidth, " bins)") |
|
1343 |
+ |
|
1344 |
+ dataCur = dataCur %>% |
|
1345 |
+ dplyr::mutate(peak_gene.distance.class250k = factor(dplyr::if_else(peak_gene.distance <= 250000, "<=250k", ">250k"))) %>% |
|
1346 |
+ dplyr::select(-peak_gene.distance) |
|
1347 |
+ |
|
1348 |
+ # closed = "left", boundary = 0, ensure correct numbers. See https://github.com/tidyverse/ggplot2/issues/1739 |
|
1349 |
+ # Without boundary = 0, counts are actually wrong |
|
1350 |
+ |
|
1351 |
+ |
|
1352 |
+ nrows_plot = 2 |
|
1353 |
+ if (length(unique(dataCur[[newColName]])) > 9) { |
|
1354 |
+ nrows_plot = 3 |
|
1355 |
+ } |
|
1356 |
+ |
|
1357 |
+ gA5 = ggplot(dataCur, aes(peak_gene.p_raw, fill = r_positive)) + |
|
1358 |
+ geom_histogram(binwidth = binwidth, position="dodge", closed = "left", boundary = 0) + |
|
1359 |
+ facet_wrap(~ peak_gene.distance.class250k + .data[[newColName]], nrow = nrows_plot, scales = "free_y") + |
|
1360 |
+ xlab(xlabel) + ylab(paste0("Abundance for classes with n>=", nGroupsMin)) + theme_bw() + |
|
1361 |
+ ggtitle(mainTitle2) + |
|
1362 |
+ scale_fill_manual("Class for r", values = mycolors, labels = r_positive_label, drop = FALSE ) + |
|
1363 |
+ theme(axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.6)), |
|
1364 |
+ legend.text=element_text(size=rel(0.7))) |
|
1365 |
+ |
|
1366 |
+ |
|
1367 |
+ plot(gA5) |
|
1368 |
+ |
|
1369 |
+ |
|
1370 |
+ |
|
1371 |
+ } #end for each variable |
|
1224 | 1372 |
|
1225 |
- # Filter groups with fewer than 100 observations |
|
1226 |
- nGroupsMin = 100 |
|
1227 |
- dataCur = dataCur %>% |
|
1228 |
- dplyr::group_by(.data[[newColName]]) %>% |
|
1229 |
- dplyr::filter(dplyr::n() >= nGroupsMin) %>% |
|
1230 |
- dplyr::ungroup() |
|
1231 |
- |
|
1232 |
- var.label = .prettyNum(table(dataCur[, newColName])) |
|
1233 |
- var.label = paste0(names(var.label), "\n(n=", var.label , ")\n") |
|
1234 |
- |
|
1235 |
- # Set colors |
|
1236 |
- if (varCur != "peak.annotation") { |
|
1237 |
- mycolors <- viridis::viridis(length(var.label)) |
|
1238 |
- } else { |
|
1239 |
- # Only here, we want to have colors that are not a gradient |
|
1240 |
- mycolors = var.label |
|
1241 |
- downstream = which(grepl("downstream", var.label, ignore.case = TRUE)) |
|
1242 |
- promoter = which(grepl("promoter", var.label, ignore.case = TRUE)) |
|
1243 |
- |
|
1244 |
- # Remove the first, white-like color from the 2 palettes |
|
1245 |
- mycolors[downstream] = RColorBrewer::brewer.pal(length(downstream) + 1, "Greens")[-1] |
|
1246 |
- mycolors[promoter] = RColorBrewer::brewer.pal(length(promoter) + 1, "Purples")[-1] |
|
1247 |
- mycolors[which(grepl("3' UTR", mycolors))] = "yellow" |
|
1248 |
- mycolors[which(grepl("5' UTR", mycolors))] = "orange" |
|
1249 |
- mycolors[which(grepl("Distal Intergenic", mycolors))] = "red" |
|
1250 |
- mycolors[which(grepl("Exon", mycolors))] = "maroon" |
|
1251 |
- mycolors[which(grepl("Intron", mycolors))] = "lightblue" |
|
1252 |
- |
|
1253 |
- } |
|
1254 |
- |
|
1255 |
- r_pos_tbl = dataCur %>% dplyr::group_by(r_positive) %>% dplyr::pull(r_positive) %>% table() |
|
1256 |
- r_positive_label = c("TRUE" = paste0("r+(r>0, n=", .prettyNum(r_pos_tbl[["TRUE"]]), ")"), |
|
1257 |
- "FALSE" = paste0("r-(r<=0, n=" ,.prettyNum(r_pos_tbl[["FALSE"]]), ")")) |
|
1258 | 1373 |
|
1259 |
- # Class in the facet_wrap has been removed, as this is only for real data here |
|
1260 |
- gA3 = ggplot(dataCur, aes(peak_gene.p_raw, color = .data[[newColName]])) + geom_density(size = 0.2) + |
|
1261 |
- facet_wrap(~r_positive, labeller = labeller(r_positive = r_positive_label), nrow = 2) + |
|
1262 |
- geom_density(aes(color = classNew), color = "black", linetype = "dotted", alpha = 1) + |
|
1263 |
- xlab(xlabel) + ylab("Density (for real data only)") + theme_bw() + |
|
1264 |
- scale_color_manual(newColName, values = mycolors, labels = var.label, drop = FALSE ) + |
|
1265 |
- theme(axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.8)), |
|
1266 |
- legend.text=element_text(size=rel(0.6)), legend.position = "none") |
|
1267 |
- |
|
1268 |
- # The following causes an error with patchwork |
|
1269 |
- # gA3 = ggplot(dataCur, aes(peak_gene.p_raw, color = .data[[newColName]])) + geom_density(size = 0.2) + |
|
1270 |
- # facet_wrap(~ class + r_positive, labeller = labeller(class=freq_class, r_positive = r_positive_label), nrow = 2) + |
|
1271 |
- # geom_density(aes(color = classNew), color = "black", linetype = "dotted", alpha = 1) + |
|
1272 |
- # xlab(xlabel) + ylab("Density") + theme_bw() + |
|
1273 |
- # scale_color_manual(newColName, values = mycolors, labels = var.label, drop = FALSE ) + |
|
1274 |
- # theme(axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.8)), |
|
1275 |
- # legend.text=element_text(size=rel(0.6)), legend.position = "none") |
|
1276 |
- |
|
1277 |
- |
|
1278 |
- # Ratios for r+ / r- |
|
1279 |
- freq = dataCur %>% |
|
1280 |
- dplyr::group_by(class, .data[[newColName]], peak_gene.p.raw.class, r_positive) %>% |
|
1281 |
- dplyr::summarise(n = dplyr::n()) %>% |
|
1282 |
- dplyr::ungroup() %>% |
|
1283 |
- tidyr::complete(class, .data[[newColName]], peak_gene.p.raw.class, r_positive, fill = list(n = 0)) %>% # SOme cases might be missing |
|
1284 |
- dplyr::group_by(class, .data[[newColName]], peak_gene.p.raw.class) %>% # dont group by r_positive because we want to calculate the ratio within each group |
|
1285 |
- dplyr::mutate( |
|
1286 |
- n = .data$n + 1, # to allow ratios to be computed even for 0 counts |
|
1287 |
- ratio_pos_raw = .data$n[r_positive] / .data$n[!r_positive]) %>% |
|
1288 |
- dplyr::filter(r_positive, class == names(dist_class)[1])# Keep only one r_positive row per grouping as we operate via the ratio and this data is duplicated otherwise. Remove random data also because these have been filtered out before are only back due to the complete invokation. |
|
1289 |
- |
|
1290 |
- # Cap ratios > 10 at 10 to avoid visual artefacts |
|
1291 |
- freq$ratio_pos_raw[which(freq$ratio_pos_raw > 10)] = 10 |
|
1292 |
- |
|
1293 |
- # Without proper colors for now, this will be added after the next plot |
|
1294 |
- gB3 = ggplot(freq, aes(peak_gene.p.raw.class, ratio_pos_raw, fill = .data[[newColName]])) + |
|
1295 |
- geom_bar(stat = "identity", position="dodge", na.rm = TRUE, width = 0.8) + |
|
1296 |
- geom_hline(yintercept = 1, linetype = "dotted") + |
|
1297 |
- xlab(xlabel) + ylab("Ratio r+ / r- (capped at 10)") + |
|
1298 |
- scale_x_discrete(labels = xlabels_peakGene_praw.class) + |
|
1299 |
- scale_fill_manual (varCur, values = mycolors, labels = var.label, drop = FALSE ) + |
|
1300 |
- theme_bw() + |
|
1301 |
- #theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8), strip.background = element_blank(), strip.placement = "outside", axis.title.y = element_blank()) + |
|
1302 |
- # theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8) , axis.title.y = element_blank()) + |
|
1303 |
- theme_main + |
|
1304 |
- facet_wrap(~ factor(class), nrow = 2, scales = "free_y", strip.position = "left", , labeller = labeller(class=freq_class)) |
|
1305 |
- |
|
1306 |
- |
|
1307 |
- plots_all = ( ((gA3 | gB3 ) + |
|
1308 |
- patchwork::plot_layout(widths = c(2.5,1.5))) ) + |
|
1309 |
- patchwork::plot_layout(guides = 'collect') + |
|
1310 |
- patchwork::plot_annotation(title = mainTitle, theme = theme(plot.title = element_text(hjust = 0.5))) |
|
1311 |
- |
|
1312 |
- plot(plots_all) |
|
1313 |
- |
|
1314 |
- |
|
1315 |
- # VERSION JUDITH: simplified peak.gene distance + another variable as histogram, no pemuted data |
|
1316 |
- |
|
1317 |
- datasetName = "" |
|
1318 |
- if (!is.null(GRN@config$metadata$name)) { |
|
1319 |
- datasetName = GRN@config$metadata$name |
|
1320 |
- } |
|
1321 |
- |
|
1322 |
- mainTitle2 = paste0(datasetName, "\nSummary QC (TF: ", TFCur, ", gene type: ", paste0(geneTypesSelected, collapse = "+"), ", ", |
|
1323 |
- .prettyNum(range), " bp promoter range, stratified by distance + ", varCur, ")") |
|
1324 |
- |
|
1325 |
- # r+ and r- |
|
1326 |
- binwidth = 0.1 |
|
1327 |
- mycolors <- viridis::viridis(2) |
|
1328 |
- xlabel = paste0("Correlation raw p-value (", binwidth, " bins)") |
|
1329 |
- |
|
1330 |
- dataCur = dataCur %>% |
|
1331 |
- dplyr::mutate(peak_gene.distance.class250k = factor(dplyr::if_else(peak_gene.distance <= 250000, "<=250k", ">250k"))) %>% |
|
1332 |
- dplyr::select(-peak_gene.distance) |
|
1333 |
- |
|
1334 |
- # closed = "left", boundary = 0, ensure correct numbers. See https://github.com/tidyverse/ggplot2/issues/1739 |
|
1335 |
- # Without boundary = 0, counts are actually wrong |
|
1336 |
- |
|
1337 |
- |
|
1338 |
- nrows_plot = 2 |
|
1339 |
- if (length(unique(dataCur[[newColName]])) > 9) { |
|
1340 |
- nrows_plot = 3 |
|
1341 |
- } |
|
1342 |
- |
|
1343 |
- gA5 = ggplot(dataCur, aes(peak_gene.p_raw, fill = r_positive)) + |
|
1344 |
- geom_histogram(binwidth = binwidth, position="dodge", closed = "left", boundary = 0) + |
|
1345 |
- facet_wrap(~ peak_gene.distance.class250k + .data[[newColName]], nrow = nrows_plot, scales = "free_y") + |
|
1346 |
- xlab(xlabel) + ylab(paste0("Abundance for classes with n>=", nGroupsMin)) + theme_bw() + |
|
1347 |
- ggtitle(mainTitle2) + |
|
1348 |
- scale_fill_manual("Class for r", values = mycolors, labels = r_positive_label, drop = FALSE ) + |
|
1349 |
- theme(axis.text=element_text(size=rel(0.6)), strip.text.x = element_text(size = rel(0.6)), |
|
1350 |
- legend.text=element_text(size=rel(0.7))) |
|
1351 |
- |
|
1352 |
- |
|
1353 |
- plot(gA5) |
|
1354 |
- |
|
1355 |
- |
|
1356 |
- |
|
1357 |
- } #end for each variable |
|
1358 |
- |
|
1359 |
- |
|
1360 |
- # DISTANCE FOCUSED # |
|
1361 |
- |
|
1362 |
- |
|
1363 |
- # Here, we focus on distance and exclude distance classes with too few points and create a new subset of the data |
|
1364 |
- |
|
1365 |
- # Filter distance classes with too few points |
|
1366 |
- distance_class_abund = table(peakGeneCorrelations.all[indexCur,]$peak_gene.distance_class_abs) |
|
1367 |
- indexFilt = which(peakGeneCorrelations.all$peak_gene.distance_class_abs %in% |
|
1368 |
- names(distance_class_abund)[which(distance_class_abund > 50)]) |
|
1369 |
- indexFilt = intersect(indexFilt, indexCur) |
|
1370 |
- |
|
1371 |
- if (length(indexFilt) > 0) { |
|
1372 |
- g = ggplot(peakGeneCorrelations.all[indexFilt,], aes(peak_gene.p_raw, color = peak_gene.distance_class_abs)) + geom_density() + |
|
1373 |
- ggtitle(paste0("Density of the raw p-value distributions")) + |
|
1374 |
- facet_wrap(~ r_positive, ncol = 2, labeller = labeler_r_pos) + |
|
1375 |
- scale_color_viridis_d(labels = .classFreq_label(table(peakGeneCorrelations.all[indexFilt,]$peak_gene.distance_class_abs))) + |
|
1376 |
- theme_bw() |
|
1377 |
- plot(g) |
|
1374 |
+ # DISTANCE FOCUSED # |
|
1378 | 1375 |
|
1379 |
- |
|
1380 |
- if (includeRobustCols) { |
|
1381 |
- g = ggplot(peakGeneCorrelations.all[indexFilt,], aes(peak_gene.p_raw.robust, color = peak_gene.distance_class_abs)) + geom_density() + |
|
1382 |
- ggtitle(paste0("Density of the raw p-value distributions\n(stratified by whether r is positive)")) + |
|
1383 |
- facet_wrap(~ r_positive, ncol = 2, labeller = labeler_r_pos) + |
|
1384 |
- scale_color_viridis_d(labels = .classFreq_label(table(peakGeneCorrelations.all[indexFilt,]$peak_gene.distance_class_abs))) + |
|
1385 |
- theme_bw() |
|
1386 |
- plot(g) |
|
1376 |
+ |
|
1377 |
+ # Here, we focus on distance and exclude distance classes with too few points and create a new subset of the data |
|
1378 |
+ |
|
1379 |
+ # Filter distance classes with too few points |
|
1380 |
+ distance_class_abund = table(peakGeneCorrelations.all[indexCur,]$peak_gene.distance_class_abs) |
|
1381 |
+ indexFilt = which(peakGeneCorrelations.all$peak_gene.distance_class_abs %in% |
|
1382 |
+ names(distance_class_abund)[which(distance_class_abund > 50)]) |
|
1383 |
+ indexFilt = intersect(indexFilt, indexCur) |
|
1384 |
+ |
|
1385 |
+ if (length(indexFilt) > 0) { |
|
1386 |
+ g = ggplot(peakGeneCorrelations.all[indexFilt,], aes(peak_gene.p_raw, color = peak_gene.distance_class_abs)) + geom_density() + |
|
1387 |
+ ggtitle(paste0("Density of the raw p-value distributions")) + |
|
1388 |
+ facet_wrap(~ r_positive, ncol = 2, labeller = labeler_r_pos) + |
|
1389 |
+ scale_color_viridis_d(labels = .classFreq_label(table(peakGeneCorrelations.all[indexFilt,]$peak_gene.distance_class_abs))) + |
|
1390 |
+ theme_bw() |
|
1391 |
+ plot(g) |
|
1387 | 1392 |
|
1388 |
- |
|
1389 |
- } |
|
1390 |
- |
|
1391 |
- } # end if (length(indexFilt) > 0) |
|
1392 |
- |
|
1393 |
- |
|
1394 |
- |
|
1395 |
- |
|
1396 |
- if (length(indexFilt) > 0) { |
|
1397 |
- g = ggplot(peakGeneCorrelations.all[indexFilt,], aes(peak_gene.p_raw, color = r_positive)) + |
|
1398 |
- geom_density() + |
|
1399 |
- ggtitle(paste0("Density of the raw p-value distributions")) + |
|
1400 |
- facet_wrap(~ peak_gene.distance_class_abs, ncol = 2, labeller = .customLabeler(table(peakGeneCorrelations.all$peak_gene.distance_class_abs))) + |
|
1401 |
- scale_color_manual(labels = .classFreq_label(table(peakGeneCorrelations.all[indexFilt,]$r_positive)), values = r_pos_class) + |
|
1402 |
- theme_bw() |
|
1403 |
- plot(g) |
|
1404 |
- |
|
1405 |
- |
|
1406 |
- } |
|
1407 |
- |
|
1408 |
- |
|
1409 |
- # |
|
1410 |
- # Focus on peak_gene.r |
|
1411 |
- # |
|
1412 |
- |
|
1413 |
- g = ggplot(peakGeneCorrelations.all[indexCur,], aes(peak_gene.r, color = peak_gene.distance_class_abs)) + geom_density() + |
|
1414 |
- geom_density(data = peakGeneCorrelations.all[indexCur,], aes(peak_gene.r), color = "black") + |
|
1415 |
- ggtitle(paste0("Density of the correlation coefficients")) + |
|
1416 |
- scale_color_viridis_d(labels = .classFreq_label(table(peakGeneCorrelations.all[indexCur,]$peak_gene.distance_class_abs))) + |
|
1417 |
- theme_bw() |
|
1418 |
- plot(g) |
|
1419 |
- |
|
1420 |
- |
|
1421 |
- if (!is.null(fileBase)) { |
|
1422 |
- grDevices::dev.off() |
|
1423 |
- } |
|
1424 |
- |
|
1425 |
- |
|
1426 |
- |
|
1427 |
- } |
|
1428 |
- |
|
1429 |
- |
|
1430 |
- |
|
1393 |
+ |
|
1394 |
+ if (includeRobustCols) { |
|
1395 |
+ g = ggplot(peakGeneCorrelations.all[indexFilt,], aes(peak_gene.p_raw.robust, color = peak_gene.distance_class_abs)) + geom_density() + |
|
1396 |
+ ggtitle(paste0("Density of the raw p-value distributions\n(stratified by whether r is positive)")) + |
|
1397 |
+ facet_wrap(~ r_positive, ncol = 2, labeller = labeler_r_pos) + |
|
1398 |
+ scale_color_viridis_d(labels = .classFreq_label(table(peakGeneCorrelations.all[indexFilt,]$peak_gene.distance_class_abs))) + |
|
1399 |
+ theme_bw() |
|
1400 |
+ plot(g) |
|
1401 |
+ |
|
1402 |
+ |
|
1403 |
+ } |
|
1404 |
+ |
|
1405 |
+ } # end if (length(indexFilt) > 0) |
|
1406 |
+ |
|
1407 |
+ |
|
1408 |
+ |
|
1409 |
+ |
|
1410 |
+ if (length(indexFilt) > 0) { |
|
1411 |
+ g = ggplot(peakGeneCorrelations.all[indexFilt,], aes(peak_gene.p_raw, color = r_positive)) + |
|
1412 |
+ geom_density() + |
|
1413 |
+ ggtitle(paste0("Density of the raw p-value distributions")) + |
|
1414 |
+ facet_wrap(~ peak_gene.distance_class_abs, ncol = 2, labeller = .customLabeler(table(peakGeneCorrelations.all$peak_gene.distance_class_abs))) + |
|
1415 |
+ scale_color_manual(labels = .classFreq_label(table(peakGeneCorrelations.all[indexFilt,]$r_positive)), values = r_pos_class) + |
|
1416 |
+ theme_bw() |
|
1417 |
+ plot(g) |
|
1418 |
+ |
|
1419 |
+ |
|
1420 |
+ } |
|
1421 |
+ |
|
1422 |
+ |
|
1423 |
+ # |
|
1424 |
+ # Focus on peak_gene.r |
|
1425 |
+ # |
|
1426 |
+ |
|
1427 |
+ g = ggplot(peakGeneCorrelations.all[indexCur,], aes(peak_gene.r, color = peak_gene.distance_class_abs)) + geom_density() + |
|
1428 |
+ geom_density(data = peakGeneCorrelations.all[indexCur,], aes(peak_gene.r), color = "black") + |
|
1429 |
+ ggtitle(paste0("Density of the correlation coefficients")) + |
|
1430 |
+ scale_color_viridis_d(labels = .classFreq_label(table(peakGeneCorrelations.all[indexCur,]$peak_gene.distance_class_abs))) + |
|
1431 |
+ theme_bw() |
|
1432 |
+ plot(g) |
|
1433 |
+ |
|
1434 |
+ |
|
1435 |
+ if (!is.null(fileBase)) { |
|
1436 |
+ grDevices::dev.off() |
|
1437 |
+ } |
|
1438 |
+ |
|
1439 |
+ |
|
1440 |
+ } |
|
1441 |
+ |
|
1442 |
+ } else { |
|
1443 |
+ futile.logger::flog.info(paste0("All output files already exist. Set forceRerun = TRUE to regenerate and overwrite.")) |
|
1444 |
+ |
|
1445 |
+ } |
|
1431 | 1446 |
|
1447 |
+ |
|
1432 | 1448 |
|
1433 | 1449 |
.printExecutionTime(start) |
1434 | 1450 |
|
... | ... |
@@ -1554,14 +1570,14 @@ plotDiagnosticPlots_peakGene <- function(GRN, |
1554 | 1570 |
#' Plot various network connectivity summaries for a \code{\linkS4class{GRN}} object |
1555 | 1571 |
#' |
1556 | 1572 |
#' @template GRN |
1557 |
-#' @param type Character. Either "heatmap" or "boxplot". Default "heatmap". Which plot type to produce? |
|
1573 |
+#' @param type Character. Either \code{"heatmap"} or \code{"boxplot"}. Default \code{"heatmap"}. Which plot type to produce? |
|
1558 | 1574 |
#' @template outputFolder |
1559 | 1575 |
#' @template basenameOutput |
1560 | 1576 |
#' @template plotAsPDF |
1561 | 1577 |
#' @template pdf_width |
1562 | 1578 |
#' @template pdf_height |
1563 | 1579 |
#' @template forceRerun |
1564 |
-#' @return The same \code{\linkS4class{GRN}} object, without modifications. In addition, for the specified \code{type}, a PDF file (default filename is GRN.connectionSummary_{type}.pdf) is produced with a connection summary. We refer to the Vignettes for details and further explanations. |
|
1580 |
+#' @return The same \code{\linkS4class{GRN}} object, without modifications. In addition, for the specified \code{type}, a PDF file (default filename is \code{GRN.connectionSummary_{type}.pdf}) is produced with a connection summary. |
|
1565 | 1581 |
#' @examples |
1566 | 1582 |
#' # See the Workflow vignette on the GRaNIE website for examples |
1567 | 1583 |
#' GRN = loadExampleObject() |
... | ... |
@@ -2051,10 +2067,10 @@ plotGeneralGraphStats <- function(GRN, outputFolder = NULL, basenameOutput = NUL |
2051 | 2067 |
#' @template pdf_width |
2052 | 2068 |
#' @template pdf_height |
2053 | 2069 |
#' @template forceRerun |
2054 |
-#' @param ontology Character. NULL or vector of ontology names. Default NULL. Vector of ontologies to plot. The results must have been previously calculated otherwise an error is thrown. |
|
2070 |
+#' @param ontology Character. \code{NULL} or vector of ontology names. Default \code{NULL}. Vector of ontologies to plot. The results must have been previously calculated otherwise an error is thrown. |
|
2055 | 2071 |
#' @param p Numeric. Default 0.05. p-value threshold to determine significance. |
2056 | 2072 |
#' @param topn_pvalue Numeric. Default 30. Maximum number of ontology terms that meet the p-value significance threshold to display in the enrichment dot plot |
2057 |
-#' @param display_pAdj Boolean. Default FALSE. Is the p-value being displayed in the plots the adjusted p-value? This parameter is relevant for KEGG, Disease Ontology, and Reactome enrichments, and does not affect GO enrichments. |
|
2073 |
+#' @param display_pAdj \code{TRUE} or \code{FALSE}. Default \code{FALSE}. Is the p-value being displayed in the plots the adjusted p-value? This parameter is relevant for KEGG, Disease Ontology, and Reactome enrichments, and does not affect GO enrichments. |
|
2058 | 2074 |
#' @return The same \code{\linkS4class{GRN}} object, without modifications. A single PDF file is produced with the results. |
2059 | 2075 |
#' @examples |
2060 | 2076 |
#' # See the Workflow vignette on the GRaNIE website for examples |
... | ... |
@@ -2225,8 +2241,8 @@ plotGeneralEnrichment <- function(GRN, outputFolder = NULL, basenameOutput = NUL |
2225 | 2241 |
#' @template pdf_width |
2226 | 2242 |
#' @template pdf_height |
2227 | 2243 |
#' @template basenameOutput |
2228 |
-#' @param display Character. Default "byRank". One of: "byRank", "byLabel". Specify whether the communities will by displayed based on their rank, where the largest community (with most vertices) would have a rank of 1, or by their label. Note that the label is independent of the rank. |
|
2229 |
-#' @param communities Numeric vector. Default c(1:10). Depending on what was specified in the \code{display} parameter, this parameter would indicate either the rank or the label of the communities to be plotted. i.e. for \code{communities} = c(1,4), if \code{display} = "byRank" the results for the first and fourth largest communities will be plotted. if \code{display} = "byLabel", the results for the communities labeled "1", and "4" will be plotted. If set to NULL, all communities will be plotted |
|
2244 |
+#' @inheritParams plotCommunitiesEnrichment |
|
2245 |
+#' @param communities Numeric vector. Default \code{seq_len(10)}. Depending on what was specified in the \code{display} parameter, this parameter would indicate either the rank or the label of the communities to be plotted. i.e. for \code{communities = c(1,4)}, if \code{display = "byRank"} the results for the first and fourth largest communities will be plotted. if \code{display = "byLabel"}, the results for the communities labeled \code{"1"}, and \code{"4"} will be plotted. If set to \code{NULL}, all communities will be plotted |
|
2230 | 2246 |
#' @template forceRerun |
2231 | 2247 |
#' @param topnGenes Integer. Default 20. Number of genes to plot, sorted by their rank or label. |
2232 | 2248 |
#' @param topnTFs Integer. Default 20. Number of TFs to plot, sorted by their rank or label. |
... | ... |
@@ -2397,8 +2413,8 @@ plotCommunitiesStats <- function(GRN, outputFolder = NULL, basenameOutput = NULL |
2397 | 2413 |
#' Similarly to \code{\link{plotGeneralEnrichment}}, the results of the community-based enrichment analysis are plotted.. By default, the results for the 10 largest communities are displayed. Additionally, if a general enrichment analysis was previously generated, this function plots an additional heatmap to compare the general enrichment with the community based enrichment. A reduced version of this heatmap is also produced where terms are filtered out to improve visibility and display and highlight the most significant terms. |
2398 | 2414 |
#' |
2399 | 2415 |
#' @inheritParams plotGeneralEnrichment |
2400 |
-#' @param display Character. Default "byRank". One of: "byRank", "byLabel". Specify whether the communities will by displayed based on their rank, where the largest community (with most vertices) would have a rank of 1, or by their label. Note that the label is independent of the rank. |
|
2401 |
-#' @param communities NULL or Numeric vector. Default NULL. If set to NULL, the default, all communities enrichments that have been calculated before are plotted. If a numeric vector is specified: Depending on what was specified in the \code{display} parameter, this parameter indicates either the rank or the label of the communities to be plotted. i.e. for \code{communities} = c(1,4), if \code{display} = "byRank" the results for the first and fourth largest communities are plotted. if \code{display} = "byLabel", the results for the communities labeled "1", and "4" are plotted. |
|
2416 |
+#' @param display Character. Default \code{"byRank"}. One of: \code{"byRank"}, \code{"byLabel"}. Specify whether the communities will by displayed based on their rank, where the largest community (with most vertices) would have a rank of 1, or by their label. Note that the label is independent of the rank. |
|
2417 |
+#' @param communities \code{NULL} or numeric vector. Default \code{NULL}. If set to \code{NULL}, the default, all communities enrichments that have been calculated before are plotted. If a numeric vector is specified: Depending on what was specified in the \code{display} parameter, this parameter indicates either the rank or the label of the communities to be plotted. i.e. for \code{communities = c(1,4)}, if \code{display = "byRank"} the results for the first and fourth largest communities are plotted. if \code{display = "byLabel"}, the results for the communities labeled \code{"1"}, and \code{"4"} are plotted. |
|
2402 | 2418 |
#' @param nSignificant Numeric. Default 3. Threshold to filter out an ontology term with less than \code{nSignificant} overlapping genes. |
2403 | 2419 |
#' @param nID Numeric. Default 10. For the reduced heatmap, number of top terms to select per community. |
2404 | 2420 |
#' @param maxWidth_nchar_plot Integer (>=10). Default 100. Maximum number of characters for a term before it is truncated. |
... | ... |
@@ -18,7 +18,7 @@ AR_classification_wrapper( |
18 | 18 |
\arguments{ |
19 | 19 |
\item{GRN}{Object of class \code{\linkS4class{GRN}}} |
20 | 20 |
|
21 |
-\item{significanceThreshold_Wilcoxon}{Numeric between 0 and 1. Default 0.05. Significance threshold for Wilcoxon test that is run in the end for the final classification. See the Vignette and diffTF paper for details.} |
|
21 |
+\item{significanceThreshold_Wilcoxon}{Numeric between 0 and 1. Default 0.05. Significance threshold for Wilcoxon test that is run in the end for the final classification. See the Vignette and *diffTF* paper for details.} |
|
22 | 22 |
|
23 | 23 |
\item{plot_minNoTFBS_heatmap}{Integer. Default 100. Minimum number of TFBS for a TF to be included in the heatmap that is part of the output of this function.} |
24 | 24 |
|
... | ... |
@@ -33,7 +33,7 @@ AR_classification_wrapper( |
33 | 33 |
\item{forceRerun}{\code{TRUE} or \code{FALSE}. Default \code{FALSE}. Force execution, even if the GRN object already contains the result. Overwrites the old results.} |
34 | 34 |
} |
35 | 35 |
\value{ |
36 |
-The same \code{\linkS4class{GRN}} object, with added data from this function. TF_classification_densityPlotsForegroundBackground_expression_perm{0,1}.pdf, TF_classification_stringencyThresholds_expression_perm0.pdf, TF_classification_summaryHeatmap_expression_perm0.pdf, |
|
36 |
+The same \code{\linkS4class{GRN}} object, with added data from this function. |
|
37 | 37 |
} |
38 | 38 |
\description{ |
39 | 39 |
Run the activator-repressor classification for the TFs for a \code{\linkS4class{GRN}} object |
... | ... |
@@ -30,9 +30,9 @@ addConnections_TF_peak( |
30 | 30 |
|
31 | 31 |
\item{corMethod}{Character. \code{pearson} or \code{spearman}. Default \code{pearson}. Method for calculating the correlation coefficient. See \link{cor} for details.} |
32 | 32 |
|
33 |
-\item{connectionTypes}{Character vector. Default \code{expression}. Vector of connection types to include for the TF-peak connections. If an additional connection type is specified here, it has to be available already within the object (EXPERIMENTAL). See the function \code{addData_TFActivity} for details.} |
|
33 |
+\item{connectionTypes}{Character vector. Default \code{expression}. Vector of connection types to include for the TF-peak connections. If an additional connection type is specified here, it has to be available already within the object (EXPERIMENTAL). See the function \code{\link{addData_TFActivity}} for details.} |
|
34 | 34 |
|
35 |
-\item{removeNegativeCorrelation}{\code{TRUE} or \code{FALSE} (vector). Default \code{FALSE}. EXPERIMENTAL. Must be a logical vector of the same length as the parameter \code{connectionType}. Should negatively correlated TF-peak connections be removed for the specific connection type? For connection type expression, the default is FALSE, while for any TF Activity related connection type, we recommend setting this to \code{TRUE}.} |
|
35 |
+\item{removeNegativeCorrelation}{Vector of \code{TRUE} or \code{FALSE}. Default \code{FALSE}. EXPERIMENTAL. Must be a logical vector of the same length as the parameter \code{connectionType}. Should negatively correlated TF-peak connections be removed for the specific connection type? For connection type expression, the default is \code{FALSE}, while for any TF Activity related connection type, we recommend setting this to \code{TRUE}.} |
|
36 | 36 |
|
37 | 37 |
\item{maxFDRToStore}{Numeric. Default 0.3. Maximum TF-peak FDR value to permanently store a particular TF-peak connection in the object? This parameter has a large influence on the overall memory size of the object, and we recommend not storing connections with a high FDR due to their sheer number.} |
38 | 38 |
|
... | ... |
@@ -45,7 +45,7 @@ addConnections_TF_peak( |
45 | 45 |
\item{forceRerun}{\code{TRUE} or \code{FALSE}. Default \code{FALSE}. Force execution, even if the GRN object already contains the result. Overwrites the old results.} |
46 | 46 |
} |
47 | 47 |
\value{ |
48 |
-The same \code{\linkS4class{GRN}} object, with added data from this function. TF_peak.fdrCurves_perm{o,1}.pdf |
|
48 |
+The same \code{\linkS4class{GRN}} object, with added data from this function. |
|
49 | 49 |
} |
50 | 50 |
\description{ |
51 | 51 |
Add TF-peak connections to a \code{\linkS4class{GRN}} object |
... | ... |
@@ -21,19 +21,19 @@ addConnections_peak_gene( |
21 | 21 |
\arguments{ |
22 | 22 |
\item{GRN}{Object of class \code{\linkS4class{GRN}}} |
23 | 23 |
|
24 |
-\item{overlapTypeGene}{Character. "TSS" or "full". Default "TSS". If set to "TSS", only the TSS of the gene is used as reference for finding genes in the neighborhood of a peak. If set to "full", the whole annotated gene (including all exons and introns) is used instead.} |
|
24 |
+\item{overlapTypeGene}{Character. \code{"TSS"} or \code{"full"}. Default \code{"TSS"}. If set to \code{"TSS"}, only the TSS of the gene is used as reference for finding genes in the neighborhood of a peak. If set to \code{"full"}, the whole annotated gene (including all exons and introns) is used instead.} |
|
25 | 25 |
|
26 | 26 |
\item{corMethod}{Character. \code{pearson} or \code{spearman}. Default \code{pearson}. Method for calculating the correlation coefficient. See \link{cor} for details.} |
27 | 27 |
|
28 | 28 |
\item{promoterRange}{Integer >=0. Default 250000. The size of the neighborhood in bp to correlate peaks and genes in vicinity. Only peak-gene pairs will be correlated if they are within the specified range. Increasing |