⚖<code>EpiCompare</code>⚖<br>QC and Benchmarking of Epigenomic Datasets
Authors: <i>Sera Choi, Brian Schilder, Leyla Abbasova, Alan Murphy,
Nathan Skene</i>
<i>Updated</i>: Mar-08-2023
# Introduction
`EpiCompare` is an R package for comparing multiple epigenomic datasets
for quality control and benchmarking purposes. The function outputs a
report in HTML format consisting of three sections:
1. General Metrics: Metrics on peaks (percentage of blacklisted and
non-standard peaks, and peak widths) and fragments (duplication
rate) of samples.
2. Peak Overlap: Frequency, percentage, statistical significance of
overlapping and non-overlapping peaks. This also includes Upset,
precision-recall and correlation plots.
3. Functional Annotation: Functional annotation (ChromHMM, ChIPseeker
and enrichment analysis) of peaks. Also includes peak enrichment
around Transcription Start Site.
*Note*: Peaks located in blacklisted regions and non-standard
chromosomes are removed from the files prior to analysis.
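
To make the note concrete, here is a minimal sketch of what this kind of filtering looks like with `GenomicRanges`. It is an illustration only, not `EpiCompare`'s internal code: the toy peaks and the one-region blacklist are made up, and the two filtering calls are one common way to express the idea.

``` r
library(GenomicRanges)

# Toy peaks (hypothetical): one ordinary peak, one on a non-standard
# contig, and one that falls inside a blacklisted region.
peaks <- GRanges(
  seqnames = c("chr1", "chr1_gl000191_random", "chr2"),
  ranges   = IRanges(start = c(100, 500, 1000), end = c(200, 600, 1100))
)

# Hypothetical blacklist with a single region covering the chr2 peak.
blacklist <- GRanges("chr2", IRanges(start = 900, end = 1200))

# 1. Drop peaks on non-standard chromosomes (e.g. unplaced contigs).
peaks_std <- GenomeInfoDb::keepStandardChromosomes(peaks,
                                                   pruning.mode = "coarse")

# 2. Drop peaks that overlap any blacklisted region.
peaks_clean <- IRanges::subsetByOverlaps(peaks_std, blacklist, invert = TRUE)

peaks_clean
```

Only the `chr1` peak survives both steps in this toy example.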
# Installation
## Standard
To install `EpiCompare` use:
``` r
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("EpiCompare")
```
## All dependencies
Installing all *Imports* and *Suggests* will allow you to use the full
functionality of `EpiCompare` right away, without having to stop and
install extra dependencies later on.
To install these packages as well, use:
``` r
BiocManager::install("EpiCompare", dependencies=TRUE)
```
Note that this will increase installation time, but it means that you
won’t have to install any further R packages later when using functions
that rely on suggested dependencies.
## Development
To install the development version of `EpiCompare`, use:
``` r
if (!require("remotes")) install.packages("remotes")

remotes::install_github("neurogenomics/EpiCompare")
```
## Citation
If you use `EpiCompare`, please cite:
> EpiCompare: R package for the comparison and quality control of
> epigenomic peak files (2022) Sera Choi, Brian M. Schilder, Leyla
> Abbasova, Alan E. Murphy, Nathan G. Skene, bioRxiv, 2022.07.22.501149;
> doi: <https://doi.org/10.1101/2022.07.22.501149>
# Documentation
## [EpiCompare website](https://neurogenomics.github.io/EpiCompare)
## [Docker/Singularity container](https://neurogenomics.github.io/EpiCompare/articles/docker)
## [Bioconductor page](https://doi.org/doi:10.18129/B9.bioc.EpiCompare)
### :warning: Note on documentation versioning
The documentation in this README and the [GitHub Pages
website](https://neurogenomics.github.io/EpiCompare/) pertains to the
*development* version of `EpiCompare`. Older versions of `EpiCompare`
may have slightly different documentation (e.g. available functions,
parameters). For documentation in older versions of `EpiCompare`, please
see the **Documentation** section of the relevant version on Bioconductor.
# Usage
Load package and example datasets.
``` r
library(EpiCompare)
data("encode_H3K27ac") # example peakfile
data("CnT_H3K27ac") # example peakfile
data("CnR_H3K27ac") # example peakfile
data("CnT_H3K27ac_picard") # example Picard summary output
data("CnR_H3K27ac_picard") # example Picard summary output
```
Prepare input files:
``` r
# create named list of peakfiles
peakfiles <- list("CnT"=CnT_H3K27ac,
                  "CnR"=CnR_H3K27ac)
# set ref file and name
reference <- list("ENCODE_H3K27ac" = encode_H3K27ac)
# create named list of Picard summary
picard_files <- list("CnT"=CnT_H3K27ac_picard,
                     "CnR"=CnR_H3K27ac_picard)
```
`EpiCompare::gather_files` helps identify and import peak or Picard
files.
``` r
# To import BED files as GRanges object
peakfiles <- EpiCompare::gather_files(dir = "path/to/peaks/",
                                      type = "peaks.stringent")
# EpiCompare alternatively accepts paths (to BED files) as input
# (placeholder paths)
peakfiles <- list(sample1="/path/to/peaks/file1_peaks.stringent.bed",
                  sample2="/path/to/peaks/file2_peaks.stringent.bed")
# To import Picard summary output txt file as data frame
picard_files <- EpiCompare::gather_files(dir = "path/to/peaks",
                                         type = "picard")
```
Run `EpiCompare()`:
``` r
EpiCompare::EpiCompare(peakfiles = peakfiles,
                       genome_build = list(peakfiles="hg19",
                                           reference="hg19"),
                       genome_build_output = "hg19",
                       picard_files = picard_files,
                       reference = reference,
                       run_all = TRUE,
                       output_dir = tempdir())
```
#### Required Inputs
These input parameters must be provided:
- `peakfiles` : Peakfiles you want to analyse. EpiCompare accepts
peakfiles as GRanges object and/or as paths to BED files. Files must
be listed and named using `list()`. E.g.
`list("name1"=peakfile1, "name2"=peakfile2)`.
- `genome_build` : A named list indicating the human genome build used
  to generate each of the following inputs:
  - `peakfiles` : Genome build for the `peakfiles` input. Assumes genome
    build is the same for each element in the `peakfiles` list.
  - `reference` : Genome build for the `reference` input.
  - `blacklist` : Genome build for the `blacklist` input. <br> E.g.
    `genome_build = list(peakfiles="hg38", reference="hg19", blacklist="hg19")`
- `genome_build_output` : Genome build to standardise all inputs to.
Liftovers will be performed automatically as needed. Default is
`"hg19"`.
- `blacklist` : Peakfile as GRanges object specifying genomic regions
that have anomalous and/or unstructured signals independent of the
cell-line or experiment. For human hg19 and hg38 genome, use built-in
data `data(hg19_blacklist)` and `data(hg38_blacklist)` respectively.
For mouse mm10 genome, use built-in data `data(mm10_blacklist)`.
- `output_dir` : Path to the directory where all `EpiCompare` outputs
will be saved.
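
As a rough sketch of how the required inputs fit together when the builds differ, the snippet below assembles `genome_build` and runs a small pre-flight check. The check is an illustrative convenience and not part of `EpiCompare`. The commented call assumes `peakfiles` and `reference` objects already exist (for example from the example-data step) and that the package is installed.

``` r
# Genome build of each input; the names mirror the argument names.
genome_build <- list(peakfiles = "hg38",
                     reference = "hg19",
                     blacklist = "hg19")

# Illustrative sanity check (not an EpiCompare function): catch
# misspelled names before starting a long run.
valid_inputs <- c("peakfiles", "reference", "blacklist")
stopifnot(all(names(genome_build) %in% valid_inputs))

# With EpiCompare installed, the matching built-in blacklist can then be
# passed alongside (left commented so this sketch runs on its own):
# data("hg19_blacklist", package = "EpiCompare")
# EpiCompare::EpiCompare(peakfiles = peakfiles,
#                        genome_build = genome_build,
#                        genome_build_output = "hg19",
#                        blacklist = hg19_blacklist,
#                        reference = reference,
#                        output_dir = tempdir())
```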
#### Optional Inputs
The following input files are optional:
- `picard_files` : A list of summary metrics output from
[Picard](https://broadinstitute.github.io/picard/). *Picard
MarkDuplicates* identifies duplicate reads in an alignment and writes
a summary file, normally ending in
*.markdup.MarkDuplicates.metrics.txt*. If this input is provided,
metrics on fragments (e.g. mapped fragments and duplication rate) will
be included in the report. Each file must be a data frame, and the
files must be collected in a named list (`list()` with `names()`). To
import a Picard duplication metrics file (.txt) into R as a data frame,
use
`picard <- read.table("/path/to/picard/output", header = TRUE, fill = TRUE)`.
- `reference` : Reference peak file(s) used in `stat_plot` and
`chromHMM_plot`. Each file must be a `GRanges` object, listed and named
using `list("reference_name" = GRanges_object)`. If more than one
reference is specified, `EpiCompare` outputs an individual report for
each reference. Please note that this can take a while.
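
For several samples, the `read.table()` call above can be applied to a named vector of paths. The sketch below uses two made-up, heavily simplified metrics files written to temporary paths so that it runs on its own. Real Picard output has more columns and a histogram section, and the sample names here are placeholders.

``` r
# Write a hypothetical, simplified stand-in for a Picard MarkDuplicates
# metrics file and return its path.
write_mock_metrics <- function(dup_rate) {
  path <- tempfile(fileext = ".MarkDuplicates.metrics.txt")
  writeLines(c(
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics",
    "LIBRARY\tREAD_PAIRS_EXAMINED\tREAD_PAIR_DUPLICATES\tPERCENT_DUPLICATION",
    paste("lib1", 1000, 1000 * dup_rate, dup_rate, sep = "\t")
  ), path)
  path
}

metrics_paths <- c(sample1 = write_mock_metrics(0.1),
                   sample2 = write_mock_metrics(0.25))

# Read each file into a data frame. lapply() keeps the names of
# `metrics_paths`, so the result is already a named list of data frames.
# Lines starting with "#" are skipped by read.table()'s default
# comment.char.
picard_files <- lapply(metrics_paths, read.table, header = TRUE, fill = TRUE)

names(picard_files)
```

The resulting `picard_files` list has the shape described for the `picard_files` argument.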
#### Optional Plots
By default, these plots will not be included in the report unless set to
`TRUE`. To turn on all features at once, simply use the `run_all=TRUE`
argument.
- `upset_plot` : Upset plot of overlapping peaks between samples.
- `stat_plot` : included only if a `reference` dataset is provided. The
plot shows statistical significance (p/q-values) of sample peaks that
are overlapping/non-overlapping with the `reference` dataset.
- `chromHMM_plot` : ChromHMM annotation of peaks. If a `reference`
dataset is provided, ChromHMM annotation of overlapping and
non-overlapping peaks with the `reference` is also included in the
report.
- `chipseeker_plot` : ChIPseeker annotation of peaks.
- `enrichment_plot` : KEGG pathway and GO enrichment analysis of peaks.
- `tss_plot` : Peak frequency around (+/- 3000bp) transcription start
sites. Note that it may take a while to generate this plot for large
sample sizes.
- `precision_recall_plot` : Plot showing the precision-recall score
across the peak calling stringency thresholds.
- `corr_plot` : Plot showing the correlation between the quantiles when
the genome is binned at a set size. These quantiles are based on the
intensity of the peak, dependent on the peak caller used (q-value for
MACS2).
#### Other Options
- `chromHMM_annotation` : Cell-line annotation for ChromHMM. Default is
  K562. Options are:
  - “K562” = K-562 cells
  - “Gm12878” = Cellosaurus cell-line GM12878
  - “H1hesc” = H1 Human Embryonic Stem Cell
  - “Hepg2” = Hep G2 cell
  - “Hmec” = Human Mammary Epithelial Cell
  - “Hsmm” = Human Skeletal Muscle Myoblasts
  - “Huvec” = Human Umbilical Vein Endothelial Cells
  - “Nhek” = Normal Human Epidermal Keratinocytes
  - “Nhlf” = Normal Human Lung Fibroblasts
- `interact` : By default, all heatmaps (percentage overlap and ChromHMM
heatmaps) in the report will be interactive. If set to `FALSE`, all
heatmaps will be static. N.B. If `interact=TRUE`, interactive heatmaps
will be saved as html files, which may take time for larger sample
sizes.
- `output_filename` : By default, the report is named *EpiCompare.html*.
You can specify the file name of the report here.
- `output_timestamp` : By default FALSE. If TRUE, the filename of the
report includes the date.
#### Outputs
`EpiCompare` outputs the following:
1. **HTML report**: A summary of all analyses, saved in the specified
`output_dir`.
2. **EpiCompare_file**: If `save_output=TRUE`, all plots generated by
`EpiCompare` will be saved in the *EpiCompare_file* directory, also in
the specified `output_dir`.
An example report comparing ATAC-seq and DNase-seq can be found
## Datasets
`EpiCompare` includes several built-in datasets:
- `encode_H3K27ac`: Human H3K27ac peak file generated with ChIP-seq
using K562 cell-line. Taken from
[ENCODE](https://www.encodeproject.org/files/ENCFF044JNJ/) project.
For more information, run `?encode_H3K27ac`.
- `CnT_H3K27ac`: Human H3K27ac peak file generated with CUT&Tag using
K562 cell-line from Kaya-Okur et al.
For more information, run `?CnT_H3K27ac`.
- `CnR_H3K27ac`: Human H3K27ac peak file generated with CUT&Run using
K562 cell-line from Meers et al.
For more details, run `?CnR_H3K27ac`.
## Session Info
``` r
utils::sessionInfo()
```
## Contact
### [Neurogenomics Lab](https://www.neurogenomics.co.uk/inst/report/EpiCompare.html)
UK Dementia Research Institute
Department of Brain Sciences
Faculty of Medicine
Imperial College London