title: "Tree Data Import"
author: "\\

	Guangchuang Yu (<guangchuangyu@gmail.com>) and Tommy Tsan-Yuk Lam (<ttylam@hku.hk>)\\

        School of Public Health, The University of Hong Kong"
date: "`r Sys.Date()`"
bibliography: ggtree.bib
csl: nature.csl
    toc: true
    toc: true
vignette: >
  %\VignetteIndexEntry{01 Tree Data Import}

```{r style, echo=FALSE, results="asis", message=FALSE}
knitr::opts_chunk$set(tidy = FALSE,
		   message = FALSE)

```{r echo=FALSE, results="hide", message=FALSE}

The `ggtree` package should not be viewed solely as a standalone software. While it is useful for viewing, annotating and manipulating phylogenetic trees, it is also an infrastructure that enables evolutionary evidences that inferred by commonly used software packages in the field to be used in `R`. For instance, *dN/dS* values or ancestral sequences inferred by [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html)[@yang_paml_2007], clade support values (posterior) inferred by [BEAST](http://beast2.org/)[@bouckaert_beast_2014] and short read placement by [EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html)[@berger_EPA_2011] and [pplacer](http://matsen.fhcrc.org/pplacer/)[@matsen_pplacer_2010]. These evolutionary evidences are not only used in annotating phylogenetic tree in  `ggtree` but can also be further analyzed in `R`.

# Supported File Formats

Most of the tree viewer software (including `R` packages) focus on `Newick` and `Nexus` file formats, while there are file formats from different evolution analysis software that contain supporting evidences within the file that are ready for annotating a phylogenetic tree.
The `ggtree` package define several parser functions and `S4` classes to store statistical evidences inferred by commonly used software packages. It supports several file formats, including:

+ Newick (via `ape`)
+ Nexus (via `ape`)
+ New Hampshire eXtended format (NHX)
+ Jplace
+ Phylip

and software output from:

+ [BEAST](http://beast2.org/)[@bouckaert_beast_2014] 
+ [EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html)[@berger_EPA_2011]
+ [HYPHY](http://hyphy.org/w/index.php/Main_Page)[@pond_hyphy_2005]
+ [PAML](http://abacus.gene.ucl.ac.uk/software/paml.html)[@yang_paml_2007]
+ [PHYLDOG](http://pbil.univ-lyon1.fr/software/phyldog/)[@boussau_genome-scale_2013]
+ [pplacer](http://matsen.fhcrc.org/pplacer/)[@matsen_pplacer_2010]
+ [r8s](http://loco.biosci.arizona.edu/r8s/)[@marazzi_locating_2012]
+ [RAxML](http://sco.h-its.org/exelixis/web/software/raxml/)[@stamatakis_raxml_2014]
+ [RevBayes](http://revbayes.github.io/intro.html)[@hohna_probabilistic_2014]

# Parser functions

The `ggtree` package implement several parser functions, including:

+ `read.beast` for parsing output of [BEASE](http://beast2.org/)
+ `read.codeml` for parsing output of [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) (`rst` and `mlc` files)
+ `read.codeml_mlc` for parsing `mlc` file (output of `CODEML`)
+ `read.hyphy` for parsing output of [HYPHY](http://hyphy.org/w/index.php/Main_Page)
+ `read.jplace` for parsing `jplace` file including output from [EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html) and [pplacer](http://matsen.fhcrc.org/pplacer/)
+ `read.nhx` for parsing `NHX` file including output from [PHYLODOG](http://pbil.univ-lyon1.fr/software/phyldog/) and [RevBayes](http://revbayes.github.io/intro.html)
+ `read.paml_rst` for parsing `rst` file (output of `BASEML` and `CODEML`)
+ `read.r8s` for parsing output of [r8s](loco.biosci.arizona.edu/r8s/)
+ `read.raxml` for parsing output of [RAxML](http://sco.h-its.org/exelixis/web/software/raxml/)

# S4 classes

Correspondingly, `ggtree` defines several `S4` classes to store evolutionary evidences inferred by these software packages, including:

+ _`apeBootstrap`_ for bootstrap analysis of `ape::boot.phylo()`[@paradis_ape_2004], output of `apeBoot()` defined in `ggtree`
+ _`beast`_ for storing output of `read.beast()`
+ _`codeml`_ for storing output of `read.codeml()`
+ _`codeml_mlc`_ for storing output of `read.codeml_mlc()`
+ _`hyphy`_ for storing output of `read.hyphy()`
+ _`jplace`_ for storing output of `read.jplace()`
+ _`nhx`_ for storing output of `read.nhx()`
+ _`paml_rst`_ for _`rst`_ file obtained by [PAML](http://abacus.gene.ucl.ac.uk/software/paml.html), including _`BASEML`_ and _`CODEML`_.
+ _`phangorn`_ for storing ancestral sequences inferred by `R` package `phangorn`[@schliep_phangorn_2011], output of `phyPML()` defined in `ggtree`
+ _`r8s`_ for storing output of `read.r8s()`
+ _`raxml`_ for storing output of `read.raxml()`

The _`jplace`_ class is also designed to store user specified annotation data.

Here is an overview of these `S4` classes:


In addition, `ggtree` also supports _`phylo`_, _`multiPhylo`_ (defined by `ape`[@paradis_ape_2004]), _`phylo4`_, _`phylo4d`_ (defined by `phylobase`) _`obkData`_ (defined in `OutbreakTools`) and _`phyloseq`_ (defined in `phyloseq`).

In `ggtree`, tree objects can be merged and evidences inferred from different phylogenetic analyses can be combined or compared and visualized.

Viewing a phylogenetic tree in `ggtree` is easy by using the command `ggtree(tree_object)` and annotating a phylogenetic tree is simple by adding graphic layers using the grammar of graphics.

For each class, we defined _`get.fields`_ method to get the annotation features that available in the object that can be used to annotate a phylogenetic tree directly in `ggtree`. A _`get.tree`_ method can be used to convert tree object to `phylo` (or `multiPhylo` for `r8s`) object that are widely supported by other `R` packages. 

The _`groupOTU`_ method is used for clustering related OTUs (from tips to their most recent common ancestor). Related OTUs are not necessarily within a clade, they can be distantly related. _`groupOTU`_ works fine for monophyletic (clade), polyphyletic and paraphyletic, while _`groupClade`_ only works for clade (monophyletic). These methods are useful for clustering related OTUs or clades.

The _`fortify`_ method is used to convert tree object to a `data.frame` which is familiar by `R` users and easy to manipulate. The output `data.frame` contains tree information and all evolutionary evidences (if available, e.g. _*dN/dS*_ in `codeml` object).

Detail descriptions of `slots` defined in each class are documented in class man pages. Users can use `class?className` (e.g. `class?beast`) to access man page of a class.

# Getting Tree Data into R

## Parsing BEAST output

file <- system.file("extdata/BEAST", "beast_mcc.tree", package="ggtree")
beast <- read.beast(file)

Since _`%`_ is not a valid character in _`names`_, all the feature names that contain _`x%`_ will convert to _`0.x`_. For example, _`length_95%_HPD`_ will be changed to _`length_0.95_HPD`_.

The _`get.fields`_ method return all available features that can be used for annotation.

Users can use `ggtree(beast)` to visualize the tree and add layer to annotate it.

```{r warning=FALSE, fig.width=10, fig.height=10}
ggtree(beast, ndigits=2, branch.length = 'none') + geom_text(aes(x=branch, label=length_0.95_HPD), vjust=-.5, color='firebrick')

`ggtree` provides `geom_range` layer to display uncertainty of branch length.

```{r warning=FALSE, fig.width=10, fig.height=10}
ggtree(beast) + geom_range(range='length_0.95_HPD', color='red', alpha=.6, size=2)

In `FigTree`, only `heigh_0.95_HPD` is meaningful since the branch is scaled by `height`. In `ggtree` we can display HPD of `rate`, `height` or other variable if available since `ggtree` can rescale a tree using `rescale_tree` function or by specifing `branch.length` in `ggtree` function.

```{r warning=FALSE, fig.width=10, fig.height=10}
ggtree(beast, branch.length = 'rate') + geom_range(range='rate_0.95_HPD', color='red', alpha=.6, size=2)

With `ggtree`, evolutionary evidences inferred by commonly used software packages (`BEAST` in this example) can be easily transformed to a tidy `data.frame` by `fortify` method.

beast_data <- fortify(beast)

## Parsing PAML output

### _`rst`_ file

The _`read.paml_rst`_ function can parse `rst` file from `BASEML` and `CODEML`. The only difference is the space in the sequences. For `BASEML`, each ten bases are separated by one space, while for `CODEML`, each three bases (triplet) are separated by one space.

```{r fig.width=12, fig.height=10, warning=FALSE, fig.align="center"}
brstfile <- system.file("extdata/PAML_Baseml", "rst", package="ggtree")
brst <- read.paml_rst(brstfile)
p <- ggtree(brst) +
    geom_text(aes(x=branch, label=marginal_AA_subs), vjust=-.5, color='steelblue') +

Similarly, we can parse the `rst` file from `CODEML` and visualize the tree with amino acid substitution.
crstfile <- system.file("extdata/PAML_Codeml", "rst", package="ggtree")
crst <- read.paml_rst(crstfile)

`ggtree` defines the `update` operator, `%<%`, that can update a tree view (applying the same pattern of visualization and annotation) with another tree object.

```{r fig.width=12, fig.height=10, warning=FALSE, fig.align="center"}
p %<% crst

We can find that these two figures have different evolution distances and subsitutions inferred by `BASEML` and `CODEML` are slightly different.


#### mlc file

The `mlc` file output by `CODEML` contains _`dN/dS`_ estimation.

mlcfile <- system.file("extdata/PAML_Codeml", "mlc", package="ggtree")
mlc <- read.codeml_mlc(mlcfile)

Please aware that _`/`_ and _`*`_ are not valid characters in _`names`_, they were changed to _`_vs_`_ and _`_x_`_ respectively.

So _`dN_vs_dS`_ is _`dN/dS`_, _`N_x_dN`_ is _`N*dN`_, and _`S_x_dS`_ is _`S*dS`_.

```{r fig.width=8, fig.height=8, warning=FALSE, fig.align="center"}
ggtree(mlc) + geom_text(aes(x=branch, label=dN_vs_dS), color='blue', vjust=-.2)

The paramter _`branch.length`_ can be one of available annotations:

For example, if we set _`branch.length`_ to _`dN_vs_dS`_, the tree view will be re-scaled based on $\omega$ (_`dN/dS`_).

```{r fig.width=8, width=60, warning=FALSE, fig.align="center"}
ggtree(mlc, branch.length = "dN_vs_dS", aes(color=dN_vs_dS)) +
    scale_color_continuous(name='dN/dS', limits=c(0, 1.5),
                           oob=scales::squish, low="darkgreen", high="red")+
    theme_tree2(legend.position=c(.9, .5))

We can also plot the _`dN`_ or _`dS`_ tree and others. More examples (including evidences inferred by BEAST) can be found in the [Tree Annotation](treeAnnotation.html) vignette.  

#### _`CODEML`_ output: rst and mlc files\

We annotate the tree with information presented in _`rst`_ and _`mlc`_ file separately as demonstrated in previous sessions.

We can also use both of them which is highly recommended. 

ml <- read.codeml(crstfile, mlcfile)

All the features in both files are available for annotation. For example, we can annotate _`dN/dS`_ with the tree in _`rstfile`_ and amino acid substitutions with the tree in _`mlcfile`_.

## Parsing HYPHY output

```{r warning=FALSE}
nwk <- system.file("extdata/HYPHY", "labelledtree.tree", package="ggtree")
ancseq <- system.file("extdata/HYPHY", "ancseq.nex", package="ggtree")
tipfas <- system.file("extdata", "pa.fas", package="ggtree")
hy <- read.hyphy(nwk, ancseq, tipfas)

```{r fig.width=16, fig.height=10, warning=FALSE, fig.align="center"}
ggtree(hy) + geom_text(aes(x=branch, label=AA_subs), vjust=-.5)

## Parsing r8s output

[r8s](http://loco.biosci.arizona.edu/r8s/) output contains 3 trees, namely `TREE`, `RATO` and `PHYLO` for time tree, rate tree and absolute substitution tree respectively.

```{r fig.width=4, fig.height=6, width=60, warning=FALSE, fig.align="center"}
r8s <- read.r8s(system.file("extdata/r8s", "H3_r8s_output.log", package="ggtree"))
ggtree(r8s, branch.length="TREE", mrsd="2014-01-01") + theme_tree2()

`branch.length` should be one of `TREE`, `RATO` or `PHYLO` for selecting the corresponding tree.

User can also view 3 trees simultaneously.

```{r fig.width=16, fig.height=10, width=60, warning=FALSE, fig.align="center"}
ggtree(get.tree(r8s), aes(color=.id)) + facet_wrap(~.id, scales="free_x")

## Parsing output of RAxML bootstraping analysis

```{r fig.width=12, fig.height=10, width=60, warning=FALSE, fig.align="center"}
raxml_file <- system.file("extdata/RAxML", "RAxML_bipartitionsBranchLabels.H3", package="ggtree")
raxml <- read.raxml(raxml_file)
ggtree(raxml) + geom_label(aes(label=bootstrap, fill=bootstrap)) + geom_tiplab() +
    scale_fill_continuous(low='darkgreen', high='red') + theme_tree2(legend.position='right')

## Parsing NHX tree

NHX (New Hampshire eXtended) format is commonly used in phylogenetics software (e.g. [PHYLDOG](http://pbil.univ-lyon1.fr/software/phyldog/)[@boussau_genome-scale_2013], [RevBayes](http://revbayes.github.io/intro.html)[@hohna_probabilistic_2014], *etc*.)

nhxfile <- system.file("extdata", "ADH.nhx", package="ggtree")
nhx <- read.nhx(nhxfile)
ggtree(nhx) + geom_tiplab() + geom_point(aes(color=S), size=5, alpha=.5) + 
    theme(legend.position="right") +
    geom_text(aes(label=branch.length, x=branch), vjust=-.5) + 
    xlim(NA, 0.3)

## Parsing Phylip tree

Phylip format contains taxa sequences with Newick tree text.

phyfile <- system.file("extdata", "sample.phy", package="ggtree")
phylip <- read.phylip(phyfile)
ggtree(phylip) + geom_tiplab()

User can view sequence alignment with the tree via `msaplot()` function.

```{r fig.width=10, fig.height=6}
msaplot(phylip, offset=1)

## Parsing EPA and pplacer output

[EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html)[@berger_EPA_2011] and [PPLACER](http://matsen.fhcrc.org/pplacer/)[@matsen_pplacer_2010] have common output file format, `jplace`, which can be parsed by `read.jplace()` function. 

jpf <- system.file("extdata/sample.jplace",  package="ggtree")
jp <- read.jplace(jpf)

In `ggtree`, we provide _`get.placements`_ method to access the placement.

## get only best hit
get.placements(jp, by="best")
## get all placement
get.placements(jp, by="all")

This is only a tiny sample file. In reality, [EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html) and [PPLACER](http://matsen.fhcrc.org/pplacer/) may place thousands of short reads on a reference tree.

We may, for example, count the number of placement and annotate this information in the tree. 

# merging tree objects

In ggtree, tree objects can be merged and evidences inferred from different phylogenetic analyses can be combined or compared and visualized.

User can use the command like:
tree <- merge_tree(tree_object_1, tree_object_2)
## it's chainable, and serveral tree objects can be merged into one tree object
tree <- merge_tree(tree_object_1, tree_object_2) %>% merge_tree(tree_object_3) %>% merge_tree(tree_object_4)

Here we use a tree analyzed by BEAST and CodeML and merge them into one.

beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
beast_tree <- read.beast(beast_file)

rst_file <- system.file("examples/rst", package="ggtree")
mlc_file <- system.file("examples/mlc", package="ggtree")
codeml_tree <- read.codeml(rst_file, mlc_file)

merged_tree <- merge_tree(beast_tree, codeml_tree)


After merging, all evidences inferred from different tools can be used to annotate the tree simultaneously. In this example, we used `dN/dS` inferred by CodeML to color the tree and annotate the tree with `posterior` inferred by BEAST.

```{r fig.width=20, fig.height=26, warning=FALSE}
ggtree(merged_tree, aes(color=dN_vs_dS), mrsd="2013-01-01", ndigits = 3) +
    geom_text(aes(label=posterior), vjust=.1, hjust=-.1, size=5, color="black") + 
    scale_color_continuous(name='dN/dS', limits=c(0, 1.5),
                           oob=scales::squish, low="green", high="red")+

# `jplace` file format
The _`jplace`_ file format was defined by Masten[@matsen_format_2012] for phylogenetic placements. We employed this file format to store phylogenetic tree and user specified annotation data. Suppose we have a tree, and the associated data as shown below:

tree <- system.file("extdata", "pa.nwk", package="ggtree")
data <- read.csv(system.file("extdata", "pa_subs.csv", package="ggtree"), stringsAsFactor=FALSE)

The _`data`_ contains amino acid substitutions from parent node to child node and GC contents of each node. We can annotate the tree as demonstrated in [User specified annotation](treeAnnotation.html#user-specified-annotation) session of [Tree Annotation](treeAnnotation.html) vignette.

`ggtree` provides a function, _`write.jplace`_, to combine a tree and an associated data and store them to a single _`jplace`_ file.
outfile <- tempfile()
write.jplace(tree, data, outfile)

The _`read.jplace`_ function was designed to read the _`jplace`_ file and store the information to a _`jplace`_ object.
jp <- read.jplace(outfile)

Now we know the _`jp`_ object that stores the tree and the associated amino acid substitution and GC content information, we can view the tree and display the associated annotation data on it directly by _`ggtree`_.

```{r fig.width=12, fig.height=12, warning=FALSE, fig.align="center"}
ggtree(jp) + 
    geom_text(aes(x=branch, label=subs), color="purple", vjust=-1, size=3) + 
    geom_text(aes(label=gc), color="steelblue", hjust=-.6, size=3) +

# References