# Unifrac and Tree Roots

Unifrac distances have the attraction of including phylogenetic relatedness, based on a tree of the representative sequences, in the distances among samples calculated from an OTU table. FastTree is the usual method of choice in generating the tree, although USEARCH also provides a method. Both methods calculate unrooted trees, and calculation of Unifrac distances requires a rooted tree. The problem arises in how to best root the tree. I found a discussion of the problem on the` phyloseq` GitHub site (https://github.com/joey711/phyloseq/issues/597).

For the smaller data sets of old, I relied on the `midpoint` function in the `phangorn` package to root my trees, but have since discovered that it is prone to failure with the large data sets generated with newer sequencing platforms. `Phyloseq` assigns a root randomly, which is certainly fast, but is that the best practice? How much difference does it make between runs? Assigning a seed only enables you to get the same result again. I decided to investigate using a large data set containing 31,938 OTUs.

As a first step, I simply made side-by-side plots for two PCoA ordinations made based on weighted Unifrac distances. I used pre-computed distance matrices for the two ordinations.

````set.seed(123)` ```
dw.pd.1 <- phyloseq::distance(expt, method = "wunifrac")``` `## Warning in UniFrac(physeq, weighted = TRUE, ...): Randomly assigning root` ```
## as -- New.CleanUp.ReferenceOTU27288 -- in the phylogenetic tree in the data``` ```
## you provided.``` `set.seed(957)` ```
dw.pd.2 <- phyloseq::distance(expt, method = "wunifrac")``` ```
## Warning in UniFrac(physeq, weighted = TRUE, ...): Randomly assigning root``` ```
## as -- 61057 -- in the phylogenetic tree in the data you provided.``````

The warning messages show that different OTUs were chosen for each of the runs, and while we expect the ordinations to be similar, visual inspection reveals some differences.

``````ord.pf.1 <- ordinate(expt, method = "PCoA", distance = dw.pd.1)
plt.1 <- plot_ordination(expt, ord.pf.1, color = "P_Location", title = "Trial 1")
ord.pf.2 <- ordinate(expt, method = "PCoA", distance = dw.pd.2)
plt.2 <- plot_ordination(expt, ord.pf.2, color = "P_Location", title = "Trial 2")
ggarrange(plt.1, plt.2, ncol = 2, legend = "bottom", common.legend = TRUE`````` Fig. 1: Two PCoA ordinations of the same data based on weighted Unifrac distances. Only the OTUs chosen to toot the tree differed.

Next I made a Procrustes analysis comparing the two ordinations.

``````scores.1 <- ord.pf.1\$vectors[ , 1:2]
scores.2 <- ord.pf.2\$vectors[ , 1:2]
pro <- procrustes(scores.1, scores.2, symmetric = TRUE)
plot(pro)``````

Fig. 2: Plot of Procrustes errors between the two ordinations in Fig. 1.The plot shows relative placement of samples according to the two distance matrices, with the greatest differences in the lower left quadrant.

````protest(scores.1, scores.2, permutations = 99999)` ```##
## Call:  ## protest(X = dw.pd.1, Y = dw.pd.2, permutations = 99999, choices = c(1, 2))
##
## Procrustes Sum of Squares (m12 squared): 0.02374
## Correlation in a symmetric Procrustes rotation: 0.9881
## Significance: 1e-05
##
## Permutation: free
## Number of permutations: 99999``````

The correlation between the two distance matrices is very high, but not perfect. It is unlikely that such small differences would make a difference in analysis of community differences, by PERMANOVA, for example.

In the discussion mentioned above, Sebastian Schmidt proposed choosing an out group based on the longest tree branch terminating in a tip. His function for finding such an out group is:

``````pick_new_outgroup <- function(tree.unrooted){
require("magrittr")
require("data.table")
require("ape") # ape::Ntip
# tablify parts of tree that we need.
treeDT <-
``` ```    cbind(
``` ```        data.table(tree.unrooted\$edge),
``` ```        data.table(length = tree.unrooted\$edge.length)
``` ```    )[1:Ntip(tree.unrooted)] %>%
``` ```cbind(data.table(id = tree.unrooted\$tip.label))
``` ```# Take the longest terminal branch as outgroup
``` ```new.outgroup <- treeDT[which.max(length)]\$id
``` `return(new.outgroup)` ```}
``````

Applying the function to our data gives us the out group to use to root our tree:

``````my.tree <- phy_tree(expt)
out.group <- pick_new_outgroup(my.tree)
out.group``` `##  "New.CleanUp.ReferenceOTU8001"````

Then we can use this out group to root the tree in expt.

``````new.tree <- ape::root(my.tree, outgroup=out.group, resolve.root=TRUE)
phy_tree(expt) <- new.tree
phy_tree(expt)```
```##
## Phylogenetic tree with 31938 tips and 31937 internal nodes.
##
## Tip labels:
## 82502, New.CleanUp.ReferenceOTU38100, 70168, 100112, 113933, 13153, ...
## Node labels:
## Root, , 0.449, 0.000, 0.400, 0.000, ...
##
## Rooted; includes branch lengths.``````

Now that the tree is rooted, we can calculate the corresponding distance matrix and compare the ordination based on it to one we obtained with `phyloseq`‘s random root assignment method.

``````dw.og <- phyloseq::distance(expt, method = "wunifrac")
``````ord.og<- ordinate(expt, method = "PCoA", distance = dw.og)
``````plt.3 <- plot_ordination(expt, ord.pf.3, color = "P_Location", title = "By pick_new_outgroup")
``````plot.3 <- ggarrange(plt.1, plt.3, ncol = 2, legend = "bottom", common.legend = TRUE)
````plot.3```` Fig. 3: PCoA ordinations comparing result with randomly rooted tree (left) with tree rooted by longest terminal branch (right).

Again, visible inspection reveals some small differences between the ordinations. The overall difference between them can be determined with a Procrustes analysis.

``````scores.3 <- ord.og\$vectors[ , 1:2]
pro <- procrustes(scores.1, scores.3, symmetric = TRUE)
protest(scores.1, scores.3, permutations = 99999)``` ```##
## Call
## protest(X = scores.1, Y = scores.3, permutations = 99999)
## Procrustes Sum of Squares (m12 squared): 0.008256
## Correlation in a symmetric Procrustes rotation: 0.9959
## Significance: 1e-05
## Permutation: free
## Number of permutations: 99999
``` `plot(pro)````