Rooting Unifrac Trees in phyloseq Objects

The method of rooting trees described in the post “Unifrac and Tree Roots” is now included in QsRutils beginning with version 0.3.2 as function root_phyloseq_tree. Given a phyloseq object with an unrooted tree, it returns the same type of phyloseq object with the tree rooted by the longest terminal branch.

Unifrac and Tree Roots

 

Unifrac distances have the attraction of including phylogenetic relatedness, based on a tree of the representative sequences, in the distances among samples calculated from an OTU table. FastTree is the usual method of choice in generating the tree, although USEARCH also provides a method. Both methods calculate unrooted trees, and calculation of Unifrac distances requires a rooted tree. The problem arises in how to best root the tree. I found a discussion of the problem on the phyloseq GitHub site (https://github.com/joey711/phyloseq/issues/597).

For the smaller data sets of old, I relied on the midpoint function in the phangorn package to root my trees, but have since discovered that it is prone to failure with the large data sets generated with newer sequencing platforms. Phyloseq assigns a root randomly, which is certainly fast, but is that the best practice? How much difference does it make between runs? Assigning a seed only enables you to get the same result again. I decided to investigate using a large data set containing 31,938 OTUs.

As a first step, I simply made side-by-side plots for two PCoA ordinations made based on weighted Unifrac distances. I used pre-computed distance matrices for the two ordinations.

set.seed(123)
dw.pd.1 <- phyloseq::distance(expt, method = "wunifrac")
## Warning in UniFrac(physeq, weighted = TRUE, ...): Randomly assigning root
## as -- New.CleanUp.ReferenceOTU27288 -- in the phylogenetic tree in the data
## you provided.
set.seed(957)
dw.pd.2 <- phyloseq::distance(expt, method = "wunifrac")

## Warning in UniFrac(physeq, weighted = TRUE, ...): Randomly assigning root
## as -- 61057 -- in the phylogenetic tree in the data you provided.

The warning messages show that different OTUs were chosen for each of the runs, and while we expect the ordinations to be similar, visual inspection reveals some differences.

ord.pf.1 <- ordinate(expt, method = "PCoA", distance = dw.pd.1)
plt.1 <- plot_ordination(expt, ord.pf.1, color = "P_Location", title = "Trial 1")
ord.pf.2 <- ordinate(expt, method = "PCoA", distance = dw.pd.2)
plt.2 <- plot_ordination(expt, ord.pf.2, color = "P_Location", title = "Trial 2")
ggarrange(plt.1, plt.2, ncol = 2, legend = "bottom", common.legend = TRUE

Fig. 1: Two PCoA ordinations of the same data based on weighted Unifrac distances. Only the OTUs chosen to toot the tree differed.

Next I made a Procrustes analysis comparing the two ordinations.

scores.1 <- ord.pf.1$vectors[ , 1:2]
scores.2 <- ord.pf.2$vectors[ , 1:2]
pro <- procrustes(scores.1, scores.2, symmetric = TRUE)
plot(pro)

Fig. 2: Plot of Procrustes errors between the two ordinations in Fig. 1.

The plot shows relative placement of samples according to the two distance matrices, with the greatest differences in the lower left quadrant.

protest(scores.1, scores.2, permutations = 99999)
##
## Call:  ## protest(X = dw.pd.1, Y = dw.pd.2, permutations = 99999, choices = c(1, 2)) 
## 
## Procrustes Sum of Squares (m12 squared): 0.02374 
## Correlation in a symmetric Procrustes rotation: 0.9881 
## Significance: 1e-05 
## 
## Permutation: free 
## Number of permutations: 99999

The correlation between the two distance matrices is very high, but not perfect. It is unlikely that such small differences would make a difference in analysis of community differences, by PERMANOVA, for example.

In the discussion mentioned above, Sebastian Schmidt proposed choosing an out group based on the longest tree branch terminating in a tip. His function for finding such an out group is:

pick_new_outgroup <- function(tree.unrooted){
require("magrittr")
require("data.table")
require("ape") # ape::Ntip
# tablify parts of tree that we need.
treeDT <- 
     cbind(
         data.table(tree.unrooted$edge),
         data.table(length = tree.unrooted$edge.length)
     )[1:Ntip(tree.unrooted)] %>% 
 cbind(data.table(id = tree.unrooted$tip.label))
 # Take the longest terminal branch as outgroup
 new.outgroup <- treeDT[which.max(length)]$id
 return(new.outgroup) }

Applying the function to our data gives us the out group to use to root our tree:

my.tree <- phy_tree(expt)
out.group <- pick_new_outgroup(my.tree)
out.group
## [1] "New.CleanUp.ReferenceOTU8001"

Then we can use this out group to root the tree in expt.

new.tree <- ape::root(my.tree, outgroup=out.group, resolve.root=TRUE)
phy_tree(expt) <- new.tree
phy_tree(expt)
## 
## Phylogenetic tree with 31938 tips and 31937 internal nodes.
## 
## Tip labels:
## 82502, New.CleanUp.ReferenceOTU38100, 70168, 100112, 113933, 13153, ...
## Node labels:
## Root, , 0.449, 0.000, 0.400, 0.000, ...
## 
## Rooted; includes branch lengths.

Now that the tree is rooted, we can calculate the corresponding distance matrix and compare the ordination based on it to one we obtained with phyloseq's random root assignment method.

dw.og <- phyloseq::distance(expt, method = "wunifrac")
ord.og<- ordinate(expt, method = "PCoA", distance = dw.og)
plt.3 <- plot_ordination(expt, ord.pf.3, color = "P_Location", title = "By pick_new_outgroup")
plot.3 <- ggarrange(plt.1, plt.3, ncol = 2, legend = "bottom", common.legend = TRUE)
plot.3

Fig. 3: PCoA ordinations comparing result with randomly rooted tree (left) with tree rooted by longest terminal branch (right).

Again, visible inspection reveals some small differences between the ordinations. The overall difference between them can be determined with a Procrustes analysis.

scores.3 <- ord.og$vectors[ , 1:2]
pro <- procrustes(scores.1, scores.3, symmetric = TRUE)
protest(scores.1, scores.3, permutations = 99999)

 

## 
## Call
## protest(X = scores.1, Y = scores.3, permutations = 99999)
## Procrustes Sum of Squares (m12 squared): 0.008256
## Correlation in a symmetric Procrustes rotation: 0.9959
## Significance: 1e-05
## Permutation: free
## Number of permutations: 99999
plot(pro)

Fig. 4: Plot of Procrustes errors between ordinations in Fig. 3.

The difference between these two ordinations is certainly no greater than between those obtained with multiple runs taking random root assignments.

Conclusion: I was somewhat surprised that differences between multiple Unifrac calculations with random root assignment could be detected. I had surmised that distances among tree tips would stay the same, and therefore Unifrac distances among samples would stay the same. Perhaps they do, within rounding error.  The differences are certainly too small to impact discrimination of treatments in this data set. Still, I prefer the method based on rooting the tree by the longest branch terminating in a tip because it is not random, and by the function above is easily and quickly done.

Update R in Ubuntu

If you installed R from the Ubuntu repository with

sudo apt-get install r-base

you most likely got an out of date version. In February 2018, that method still gave me R version 3.2.3 (2015-12-10). To get the latest versions of R and its packages, you need to add CRAN to the apt-get repositories. Do this with the code below. Enter one line at a time. Cut and paste to prevent errors.

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu xenial/'
sudo apt-get update

Now if you open Synaptic Package Manager and search for r-base, you should see the installed versions and the latest versions of several packages (e.g.  r-base, r-base-core, r-base-core-dbg, r-base-dev, r-base.html). Mark the packages for upgrade and click on apply. This will most likely result in a major update of R so that the next time you run R only the base packages will be available. You will have to re-install any additional R packages that you use to match the updated version of R.

Adapted from https://www.digitalocean.com/community/tutorials/how-to-install-r-on-ubuntu-16-04-2

Rotatable 3D Plots

The R package vegan includes the function ordiplot for making ordination plots using R’s base graphics. Additionally vegan provides several functions for enhancing the plots with spiders, hulls, and ellipses. It is even possible to overlay an ordination plot with a cluster diagram. (See my package ggordiplots on GitHub for making the same kinds of plots with ggplot graphics.) Vegan’s functions for adding these layers begin with “ordi”: ordispider, ordihull, ordiellipse, ordicluster. Earlier this year I discovered the package vegan3d. It makes use of rgl graphics which means that the plots it generates can be scaled and rotated with the mouse. This is not just fun – it allows you to see how well separated treatment groups are in the ordination space.

Vegan3d’s function ordirgl is similar to ordiplot – it draws the ordination plot depicting samples and/or species. Other functions beginning with “orgl” are similar to the ones beginning with “ordi” in vegan: orglpoints, orglspider, orglellipse, orglcluster. See the vegan3d documentation for full instructions on how to use these.

I made an interactive html page of a slightly enhanced version of the example given in the vegan3d documentation, viewable if you CLICK HERE. Some explanation of the last code block of the Rmarkdown file reproduced here is necessary.

open3d() # Opens a graphic device for the plot.
## wgl 
##   8
par3d(windowRect = c(20, 40, 1000, 1200)) # left, top, right, bottom in pixels
my.color <- c("red", "blue")
ordirgl(ord, size=4, col = my.color[as.numeric(sam$Type)]) # plot points
with(sam, orglspider(ord, Type, col = my.color[1:2], scaling = "sites")) # add spiders
with(sam, orglellipse(ord, Type, col = my.color[1:2], kind = "se", conf = 0.95, scaling = "sites")) # add ellipses
rglwidget() # You cannot run this line in RStudio.

 

The open3d command opens an rgl graphics window. If you are using RStudio, this window will open outside of RStudio. The next 5 lines build the plot in the graphics window. At any step after the line beginning with ordirgl you can zoom and rotate the plot in the graphics window. The last line, rglwidget(), is necessary to knit the html file. If you try to run it in RStudio, it will cause RStudio to freeze. If you try to use the RStudio menu to knit the html document, RStudio will freeze. So DO NOT do either of these things!

To knit the Rmarkdown file into an html file, you must enter the command in the console window. In this case the command is:

rmarkdown::render(input = "vegan3d_example.Rmd", output_format = "html_document", output_file = "vegan3d_example.html")