Name Mode Size
.github 040000
R 040000
data 040000
inst 040000
man 040000
src 040000
vignettes 040000
.gitignore 100644 0 kb
DESCRIPTION 100644 1 kb
NAMESPACE 100644 2 kb
NEWS.md 100644 14 kb
README.md 100644 8 kb
README.md
# SynExtend **Table of contents:** - [Introduction](#introduction) - [Installation](#installation) - [Usage](#usage) - [Reporting Bugs and Errors](#Reporting-Bugs-and-Errors) ## Introduction SynExtend is a package of tools for working with synteny objects generated from the Bioconductor Package DECIPHER. ## Installation SynExtend is present in the package repository Bioconductor and the current `release` version can be installed via: ```r if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install() BiocManager::install("SynExtend") library("SynExtend") ``` The `devel` version in Bioconductor can be installed via: ```r BiocManager::install("SynExtend", version = "devel") library("SynExtend") ``` The version in this repository (which should always be up to date with `devel` and frequently ahead of `release`) can be installed with devtools via: ```r devtools::install_github(repo = "npcooley/synextend") library("SynExtend") ``` Additionally, SynExtend is maintained in a Docker container on [Dockerhub](https://hub.docker.com/repository/docker/npcooley/synextend) that is [here](https://github.com/npcooley/SynContainer). ## Usage ### Prediction of Orthologous Gene Pairs SynExtend facilitates the prediction of orthologous gene pairs from synteny maps. The functions herein rely on synteny maps built by the `FindSynteny()` function in the package `DECIPHER`. The process is relatively straightforward and only requires assemblies and genecalls, which can be conveniently pulled from the NCBI using [Entrez Direct](https://www.ncbi.nlm.nih.gov/books/NBK179288/). However, as long as given assemblies have associated gene calls, or can have gene calls generated, they will work just fine. `DECIPHER` now contains the function `FindGenes()` that produces high quality de novo gene calls. Using Entrez Direct, this process would start like this: ```r library(SynExtend) EntrezQuery <- paste("esearch -db assembly ", "-query '", "kitasatospora[organism] ", 'AND "complete genome"[filter] ', # only complete genomes 'AND "refseq has annotation"[properties] ', # only genomes with annotations 'AND "latest refseq"[filter] ', # only latest "NOT anomalous[filter]' ", '| ', 'esummary ', '| ', 'xtract -pattern DocumentSummary -element FtpPath_RefSeq', sep = "") FtPPaths <- system(command = EntrezQuery, intern = TRUE, timeout = 300L) # timeout argument is required for RStudio only FNAs <- unname(sapply(FtPPaths, function(x) paste(x, "/", strsplit(x, split = "/", fixed = TRUE)[[1]][10], "_genomic.fna.gz", sep = ""))) GFFs <- unname(sapply(FtPPaths, function(x) paste(x, "/", strsplit(x, split = "/", fixed = TRUE)[[1]][10], "_genomic.gff.gz", sep = ""))) ``` With these matched character vectors describing files on the NCBI FTP site, we can pull given assemblies, and their associated `GFF` files. Assemblies are pulled using functions within the package `Biostrings` and placed in a sqlite database using the package `DECIPHER`. Note, `DBPATH` can be a specific local file location, though a tempfile works fine for this example. The tempfile in this example will be destroyed upon closure of the active R session, so specific file paths are preferable for most user activity. The code chunk below provides an example of pulling down assemblies from the NCBI and using `DECIPHER`'s built in gene finder to generate gene calls. ```r DBPATH <- tempfile() GC01 <- vector(mode = "list", length = length(FNAs)) for (m1 in seq_along(FNAs)) { X <- readDNAStringSet(FNAs[m1]) Seqs2DB(seqs = X, type = "XStringSet", dbFile = DBPATH, identifier = as.character(m1), verbose = TRUE) GC01[[m1]] <- FindGenes(myDNAStringSet = X) } names(GC01) <- seq(length(FNAs)) ``` A current quirk to this process is that genecalls must be present in a named list, where the names can be matched to integer identifiers in the `DECIPHER` database used to construct a synteny object. Users have the option of using the function `FindGenes` in `DECIPHER`, using the `rtracklayer` function `import` to import a valid `GFF` file, or importing a `GFF` with the `SynExtend` function `gffToDataFrame` which performs some data munging to convert the `GFF` into a `DataFrame` for user convenience. ```r GC02 <- vector(mode = "list", length = length(GFFs)) for (m1 in seq_along(GFFs)) { GC02[[m1]] <- gffToDataFrame(GFF = GFFs[m1], Verbose = TRUE) } names(GC02) <- seq(length(GC02)) ``` With assemblies and genecalls now in a user's workspace, users can build a synteny map with `FindSynteny`. See `?FindSynteny` in R for further reading. Finding and tabulating where genes are connected by syntenic k-mers is then performed with `NucleotideOverlap`, and that raw data is converted into a convenient `data.frame` with the function `PairSummariers`. ```r Syn <- FindSynteny(dbFile = DBPATH) Links <- NucleotideOverlap(SyntenyObject = Syn, GeneCalls = GC01, Verbose = TRUE) Pairs01 <- PairSummaries(SyntenyLinks = Links, PIDs = TRUE, Verbose = TRUE) Clusters01 <- DisjointSet(Pairs = Pairs01, Verbose = TRUE) ``` Pairs where predictions create conflicts can be removed by a simple subsetting function. ```r Pairs02 <- SubSetPairs(CurrentPairs = Pairs01, Verbose = TRUE) Clusters02 <- DisjointSet(Pairs = Pairs02, Verbose = TRUE) ``` A more self-contained example is maintained in this package's Bioconductor vignette, and can be found [here](https://www.bioconductor.org/packages/release/bioc/html/SynExtend.html) ### Creating Functional Association Networks SynExtend also allows for prediction of functional association networks from evidence of correlated selective pressures between of pairs of clusters of orthologous genes (COGs). Given a list of gene trees or presence/absence data, the `ProtWeaver` class in SynExtend uses multiple algorithms to compute likelihood of pairwise functional association. This list can be generated with `DisjointSet(...)`, or can be created by the user. An example of using `ProtWeaver` can be done using the built-in example *Streptomyces* dataset. This will predict functional association of 200 genes from 301 genomes. ```r exData <- get(data("ExampleStreptomycesData")) pw <- ProtWeaver(exData$Genes) # For faster runtime, set subset as follows: # predictions <- predict(pw, mySpeciesTree=exData$Tree, subset=1:20) predictions <- predict(pw, mySpeciesTree=exData$Tree) # View number of genes and number of predictions: # Print out the adjacency matrix: predictions # Print out pairwise associations: as.data.frame(predictions) # Plot genes using force-directed embedding: plot(predictions) ``` Current algorithms used for prediction are Jaccard distance, Hamming distance, Mutual Information, Direct Coupling Analysis, Phylogenetic Gain/Loss events, Co-localization, MirrorTree, and ContextTree. See `?ProtWeaver` for more details. ### Planned Updates SynExtend currently has an extensive TODO list that includes: 1. Parsing communities of predicted pairs. 2. User friendly accessor functions to help with printing and plotting data. If users have specific requests I would love to incorporate them. ## Reporting Bugs and Errors Feel free to email me at npc19@pitt.edu or submit issues here on GitHub. Questions related to `ProtWeaver` can be sent to ahl27@pitt.edu or submitted similarly on GitHub.