Name Mode Size
.github 040000
R 040000
inst 040000
man 040000
pkgdown 040000
tests 040000
vignettes 040000
.Rbuildignore 100644 0 kb
.gitignore 100644 0 kb
DESCRIPTION 100644 3 kb
Makefile 100644 0 kb
NAMESPACE 100644 15 kb
NEWS.md 100644 4 kb
README.Rmd 100644 8 kb
README.md 100644 14 kb
README.md
<!-- README.md is generated from README.Rmd. Please edit that file --> # plyranges: fluent genomic data analysis <img id="plyranges_logo" src="man/figures/logo.png" align="right" width = "125" /> <!-- badges: start --> [![R-CMD-check-bioc](https://github.com/tidyomics/plyranges/workflows/R-CMD-check-bioc/badge.svg)](https://github.com/tidyomics/plyranges/actions?query=workflow%3AR-CMD-check-bioc) [![BioC status](http://www.bioconductor.org/shields/build/release/bioc/plyranges.svg)](https://bioconductor.org/checkResults/release/bioc-LATEST/plyranges) <!-- badges: end --> [plyranges](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html) provides a consistent interface for importing and wrangling genomics data from a variety of sources. The package defines a grammar of genomic data transformation based on `dplyr` and the Bioconductor packages `IRanges`, `GenomicRanges`, and `rtracklayer`. It does this by providing a set of verbs for developing analysis pipelines based on *Ranges* objects that represent genomic regions: - Modify genomic regions with the `mutate()` and `stretch()` functions. - Modify genomic regions while fixing the start/end/center coordinates with the `anchor_` family of functions. - Sort genomic ranges with `arrange()`. - Modify, subset, and aggregate genomic data with the `mutate()`, `filter()`, and `summarise()`functions. - Any of the above operations can be performed on partitions of the data with `group_by()`. - Find nearest neighbour genomic regions with the `join_nearest_` family of functions. - Find overlaps between ranges with the `join_overlaps_` family of functions. - Merge all overlapping and adjacent genomic regions with `reduce_ranges()`. - Merge the end points of all genomic regions with `disjoin_ranges()`. - Import and write common genomic data formats with the `read_/write_` family of functions. For more details on the features of plyranges, read the [vignette](https://tidyomics.github.io/plyranges/articles/an-introduction.html). For a complete case-study on using plyranges to combine ATAC-seq and RNA-seq results read the [*fluentGenomics* workflow](https://tidyomics.github.io/fluentGenomics). plyranges is part of the [tidyomics](https://github.com/tidyomics) project, providing a `dplyr`-based interface for many types of genomics datasets represented in Bioconductor. # Installation [plyranges](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html) can be installed from the latest Bioconductor release: ``` r # install.packages("BiocManager") BiocManager::install("plyranges") ``` To install the development version from GitHub: ``` r BiocManager::install("tidyomics/plyranges") ``` # Quick overview ## About `Ranges` `Ranges` objects can either represent sets of integers as `IRanges` (which have start, end and width attributes) or represent genomic intervals (which have additional attributes, sequence name, and strand) as `GRanges`. In addition, both types of `Ranges` can store information about their intervals as metadata columns (for example GC content over a genomic interval). `Ranges` objects follow the tidy data principle: each row of a `Ranges` object corresponds to an interval, while each column will represent a variable about that interval, and generally each object will represent a single unit of observation (like gene annotations). We can construct a `IRanges` object from a `data.frame` with a `start` or `width` using the `as_iranges()` method. ``` r library(plyranges) df <- data.frame(start = 1:5, width = 5) as_iranges(df) #> IRanges object with 5 ranges and 0 metadata columns: #> start end width #> <integer> <integer> <integer> #> [1] 1 5 5 #> [2] 2 6 5 #> [3] 3 7 5 #> [4] 4 8 5 #> [5] 5 9 5 # alternatively with end df <- data.frame(start = 1:5, end = 5:9) as_iranges(df) #> IRanges object with 5 ranges and 0 metadata columns: #> start end width #> <integer> <integer> <integer> #> [1] 1 5 5 #> [2] 2 6 5 #> [3] 3 7 5 #> [4] 4 8 5 #> [5] 5 9 5 ``` We can also construct a `GRanges` object in a similar manner. Note that a `GRanges` object requires at least a seqnames column to be present in the data.frame (but not necessarily a strand column). ``` r df <- data.frame(seqnames = c("chr1", "chr2", "chr2", "chr1", "chr2"), start = 1:5, width = 5) as_granges(df) #> GRanges object with 5 ranges and 0 metadata columns: #> seqnames ranges strand #> <Rle> <IRanges> <Rle> #> [1] chr1 1-5 * #> [2] chr2 2-6 * #> [3] chr2 3-7 * #> [4] chr1 4-8 * #> [5] chr2 5-9 * #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths # strand can be specified with `+`, `*` (mising) and `-` df$strand <- c("+", "+", "-", "-", "*") as_granges(df) #> GRanges object with 5 ranges and 0 metadata columns: #> seqnames ranges strand #> <Rle> <IRanges> <Rle> #> [1] chr1 1-5 + #> [2] chr2 2-6 + #> [3] chr2 3-7 - #> [4] chr1 4-8 - #> [5] chr2 5-9 * #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` # Example: finding GWAS hits that overlap known exons Let’s look at a more a realistic example (taken from HelloRanges vignette). Suppose we have two *GRanges* objects: one containing coordinates of known exons and another containing SNPs from a GWAS. The first and last 5 exons are printed below, there are two additional columns corresponding to the exon name, and a score. We could check the number of exons per chromosome using `group_by` and `summarise`. ``` r exons #> GRanges object with 459752 ranges and 2 metadata columns: #> seqnames ranges strand | name score #> <Rle> <IRanges> <Rle> | <character> <numeric> #> [1] chr1 11874-12227 + | NR_046018_exon_0_0_c.. 0 #> [2] chr1 12613-12721 + | NR_046018_exon_1_0_c.. 0 #> [3] chr1 13221-14409 + | NR_046018_exon_2_0_c.. 0 #> [4] chr1 14362-14829 - | NR_024540_exon_0_0_c.. 0 #> [5] chr1 14970-15038 - | NR_024540_exon_1_0_c.. 0 #> ... ... ... ... . ... ... #> [459748] chrY 59338754-59338859 + | NM_002186_exon_6_0_c.. 0 #> [459749] chrY 59338754-59338859 + | NM_176786_exon_7_0_c.. 0 #> [459750] chrY 59340194-59340278 + | NM_002186_exon_7_0_c.. 0 #> [459751] chrY 59342487-59343488 + | NM_002186_exon_8_0_c.. 0 #> [459752] chrY 59342487-59343488 + | NM_176786_exon_8_0_c.. 0 #> ------- #> seqinfo: 93 sequences from an unspecified genome; no seqlengths exons %>% group_by(seqnames) %>% summarise(n = n()) #> DataFrame with 49 rows and 2 columns #> seqnames n #> <Rle> <integer> #> 1 chr1 43366 #> 2 chr10 19420 #> 3 chr11 24476 #> 4 chr12 24949 #> 5 chr13 7974 #> ... ... ... #> 45 chrUn_gl000222 20 #> 46 chrUn_gl000223 22 #> 47 chrUn_gl000228 85 #> 48 chrX 18173 #> 49 chrY 4128 ``` Next we create a column representing the transcript_id with `mutate`: ``` r exons <- exons %>% mutate(tx_id = sub("_exon.*", "", name)) ``` To find all GWAS SNPs that overlap exons, we use `join_overlap_inner`. This will create a new *GRanges* with the coordinates of SNPs that overlap exons, as well as metadata from both objects. ``` r olap <- join_overlap_inner(gwas, exons) olap #> GRanges object with 3439 ranges and 4 metadata columns: #> seqnames ranges strand | name.x name.y score #> <Rle> <IRanges> <Rle> | <character> <character> <numeric> #> [1] chr1 1079198 * | rs11260603 NR_038869_exon_2_0_c.. 0 #> [2] chr1 1247494 * | rs12103 NM_001256456_exon_1_.. 0 #> [3] chr1 1247494 * | rs12103 NM_001256460_exon_1_.. 0 #> [4] chr1 1247494 * | rs12103 NM_001256462_exon_1_.. 0 #> [5] chr1 1247494 * | rs12103 NM_001256463_exon_1_.. 0 #> ... ... ... ... . ... ... ... #> [3435] chrX 153764217 * | rs1050828 NM_001042351_exon_9_.. 0 #> [3436] chrX 153764217 * | rs1050828 NM_000402_exon_9_0_c.. 0 #> [3437] chrX 153764217 * | rs1050828 NM_001042351_exon_9_.. 0 #> [3438] chrX 153764217 * | rs1050828 NM_000402_exon_9_0_c.. 0 #> [3439] chrX 153764217 * | rs1050828 NM_001042351_exon_9_.. 0 #> tx_id #> <character> #> [1] NR_038869 #> [2] NM_001256456 #> [3] NM_001256460 #> [4] NM_001256462 #> [5] NM_001256463 #> ... ... #> [3435] NM_001042351 #> [3436] NM_000402 #> [3437] NM_001042351 #> [3438] NM_000402 #> [3439] NM_001042351 #> ------- #> seqinfo: 93 sequences from an unspecified genome; no seqlengths ``` For each SNP we can count the number of times it overlaps a transcript. ``` r olap %>% group_by(name.x, tx_id) %>% summarise(n = n()) #> DataFrame with 1619 rows and 3 columns #> name.x tx_id n #> <character> <character> <integer> #> 1 rs10043775 NM_001271723 1 #> 2 rs10043775 NM_030793 1 #> 3 rs10078 NM_001242412 1 #> 4 rs10078 NM_020731 1 #> 5 rs10089 NM_001046 1 #> ... ... ... ... #> 1615 rs9906595 NM_001008777 1 #> 1616 rs9948 NM_017623 1 #> 1617 rs9948 NM_199078 1 #> 1618 rs995030 NM_000899 4 #> 1619 rs995030 NM_003994 4 ``` We can also generate 2bp splice sites on either side of the exon using `flank_left` and `flank_right`. We add a column indicating the side of flanking for illustrative purposes. The `interweave` function pairs the left and right ranges objects. ``` r left_ss <- flank_left(exons, 2L) right_ss <- flank_right(exons, 2L) all_ss <- interweave(left_ss, right_ss, .id = "side") all_ss #> GRanges object with 919504 ranges and 4 metadata columns: #> seqnames ranges strand | name score #> <Rle> <IRanges> <Rle> | <character> <numeric> #> [1] chr1 11872-11873 + | NR_046018_exon_0_0_c.. 0 #> [2] chr1 12228-12229 + | NR_046018_exon_0_0_c.. 0 #> [3] chr1 12611-12612 + | NR_046018_exon_1_0_c.. 0 #> [4] chr1 12722-12723 + | NR_046018_exon_1_0_c.. 0 #> [5] chr1 13219-13220 + | NR_046018_exon_2_0_c.. 0 #> ... ... ... ... . ... ... #> [919500] chrY 59340279-59340280 + | NM_002186_exon_7_0_c.. 0 #> [919501] chrY 59342485-59342486 + | NM_002186_exon_8_0_c.. 0 #> [919502] chrY 59343489-59343490 + | NM_002186_exon_8_0_c.. 0 #> [919503] chrY 59342485-59342486 + | NM_176786_exon_8_0_c.. 0 #> [919504] chrY 59343489-59343490 + | NM_176786_exon_8_0_c.. 0 #> tx_id side #> <character> <character> #> [1] NR_046018 left #> [2] NR_046018 right #> [3] NR_046018 left #> [4] NR_046018 right #> [5] NR_046018 left #> ... ... ... #> [919500] NM_002186 right #> [919501] NM_002186 left #> [919502] NM_002186 right #> [919503] NM_176786 left #> [919504] NM_176786 right #> ------- #> seqinfo: 93 sequences from an unspecified genome; no seqlengths ``` # Learning more - The [*fluentGenomics* workflow](https://sa-lee.github.io/fluentGenomics) package shows you how to combine differential expression genes and differential chromatin accessibility peaks using plyranges. It extends the [case study](https://github.com/mikelove/plyrangesTximetaCaseStudy) by Michael Love for using plyranges with [tximeta](https://bioconductor.org/packages/release/bioc/html/tximeta.html). - The [extended vignette in the plyrangesWorkshops package](https://github.com/sa-lee/plyrangesWorkshops) has a detailed walk through of using plyranges for coverage analysis. - The [Bioc 2018 Workshop book](https://bioconductor.github.io/BiocWorkshops/fluent-genomic-data-analysis-with-plyranges.html) has worked examples of using `plyranges` to analyse publicly available genomics data. # Citation If you found `plyranges` useful for your work please cite our [paper](http://dx.doi.org/10.1186/s13059-018-1597-8): @ARTICLE{Lee2019, title = "plyranges: a grammar of genomic data transformation", author = "Lee, Stuart and Cook, Dianne and Lawrence, Michael", journal = "Genome Biol.", volume = 20, number = 1, pages = "4", month = jan, year = 2019, url = "http://dx.doi.org/10.1186/s13059-018-1597-8", doi = "10.1186/s13059-018-1597-8", pmc = "PMC6320618" } # Contributing We welcome contributions from the R/Bioconductor community. We ask that contributors follow the [code of conduct](.github/CODE_OF_CONDUCT.md) and the guide outlined [here](.github/CONTRIBUTING.md).