Import DADA2 ASV Tables into phyloseq

ASV Tables Created in R

ASV tables created using the Bioconductor/R version of DADA2 are matrix files with samples as rows and taxa as columns. The taxa names are the sequences themselves. Because these matrices can be quite large they are most conveniently saved as compressed rds files. Read these files into R and create an experiment level phyloseq object containing an OTU or ASV table and representative sequences with the following R script:

# Load libraries:
library(phyloseq)
library(Biostrings)
library(RDPutils

# Read in the ASV file:
otu <- readRDS("seqtab_collapsed_nochim.rds")

# Get the representative sequences:
rep.seqs <- colnames(otu)
rep.seqs <- Biostrings::DNAStringSet(rep.seqs)

# Generate taxa names and assign them to the representative
# sequences and the ASV table taxa names (i.e. column names):
otu.names <- RDPutils::make_otu_names(1:length(rep.seqs))
colnames(otu) <- otu.names
names(rep.seqs) <- otu.names

# Import into phyloseq:
otu.table <- otu_table(otu, taxa_are_rows = FALSE)
expt <- phyloseq(otu.table, rep.seqs)
expt

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 19891 taxa and 75 samples ]
refseq()      DNAStringSet:      [ 19891 reference sequences ]

The representative sequences can then be exported to a fasta file, classified by your favorite method, treed if appropriate, and the results read into R and combined with the phyloseq object. Export the representative sequences with the R code:

Biostrings::writeXStringSet(rep.seqs,  file = "rep_seqs.fasta", format = "fasta")

ASV Tables Created in QIIME2

QIIME saves its objects termed “artifacts” as qza files. These are actually zip files containing some extra information about the object. It is possible to extract the OTU (or ASV) table by simply unzipping the table object, or you can use QIIME2 commands to export a text version of the object.  If you use the dada2 plug-in, the taxa names for the ASV table are hashes that encode the sequences, rather than the sequences themselves. Therefore if you want to include representative sequences in your phyloseq object, you will have to extract or export them separately. Here are the QIIME2 commands  I use to put the required files in the sub-directory phyloseq:

# Export OTU table:
mkdir phyloseq
qiime tools export \
--input-path table.qza \
--output-path phyloseq

# Convert biom format to tab-separated text format:
biom convert \
-i phyloseq/feature-table.biom \
-o phyloseq/otu_table.tsv \
--to-tsv

# Modify otu_table.txt to make it easier to read into R.
# Use sed to delete the first line and "#OTU ID" from the
# second line.
cd phyloseq
sed -i '1d' otu_table.tsv
sed -i 's/#OTU ID//' otu_table.tsv
cd ../

# Export representative sequences:
qiime tools export \
--input-path rep-seqs.qza \
--output-path phyloseq

The text files are then readily read into R and combined into a phyloseq object. Just remember that in this case taxa are rows of the OTU table.:

# Load libraries:
library(phyloseq)
library(Biostrings)
otu <- read.table("otu_table.tsv", row.names = 1, header = TRUE, sep = "\t")
otu.table <- phyloseq::otu_table(otu, taxa_are_rows = TRUE)
rep.seqs <- Biostrings::readDNAStringSet("dna-sequences.fasta", format = "fasta")
expt <- phyloseq::phyloseq(otu.table, rep.seqs)
expt

phyloseq-class experiment-level object
otu_table() OTU Table: [ 19887 taxa and 75 samples ]
refseq() DNAStringSet: [ 19887 reference sequences ]

 

Compact Letter Displays

Compact letter displays (CBDs) are letters that show which treatment groups are not significantly different by some statistical test. It is often desirable to include CBDs on graphs. Here I show how to add them to a box plot created with ggplot2.

First, make an example plot using the iris data:

library(ggplot2)
library(car)
library(QsRutils)
data("iris")
plt1 <- ggplot(data = iris, aes(x=Species, y=Petal.Length)) +     geom_boxplot()
plt1

I want to add CBDs just above the tops of the box whiskers or highest outlier in the plot. I can use the base boxplot function to capture the coordinates for these points.

box.rslt <- with(iris, graphics::boxplot(Petal.Length ~ Species, plot = FALSE))
str(box.rslt)
List of 6
 $ stats: num [1:5, 1:3] 1.1 1.4 1.5 1.6 1.9 3.3 4 4.35 4.6 5.1 ...
 $ n    : num [1:3] 50 50 50
 $ conf : num [1:2, 1:3] 1.46 1.54 4.22 4.48 5.37 ...
 $ out  : num [1:2] 1 3
 $ group: num [1:2] 1 2
 $ names: chr [1:3] "setosa" "versicolor" "virginica"

The fifth row of box.rslt$stats gives the y coordinates for the tops of the whiskers or the largest outlier if present.

box.rslt$stats
[,1] [,2] [,3]
[1,] 1.1 3.30 4.50
[2,] 1.4 4.00 5.10
[3,] 1.5 4.35 5.55
[4,] 1.6 4.60 5.90
[5,] 1.9 5.10 6.90

Next I have to get a vector of the letters to add to the plot. For this example, I will make a pairwise t-test which outputs a matrix of p-values.

ptt.rslt <- with(iris, pairwise.t.test(Petal.Length, Species, pool.sd = FALSE))

Looking at the structure of ptt.rslt, we see that ptt.rslt$p.value gives a matrix of p.values:

str(ptt.rslt)

List of 4
$ method : chr "t tests with non-pooled SD"
$ data.name : chr "Petal.Length and Species"
$ p.value : num [1:2, 1:2] 1.99e-45 2.78e-49 NA 4.90e-22
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "versicolor" "virginica"
.. ..$ : chr [1:2] "setosa" "versicolor"
$ p.adjust.method: chr "holm"
- attr(*, "class")= chr "pairwise.htest"

ptt.rslt$p.value

           setosa      versicolor
versicolor 1.986887e-45 NA
virginica 2.780888e-49 4.900288e-22

From this matrix we can use QsRutils::make_letter_assignments to get a vector of letters for our CBDs.

ltrs <- make_letter_assignments(ptt.rslt)
str(ltrs)

List of 3
 $ Letters          : Named chr [1:3] "a" "b" "c"
  ..- attr(*, "names")= chr [1:3] "setosa" "versicolor" "virginica"
 $ monospacedLetters: Named chr [1:3] "a  " " b " "  c"
  ..- attr(*, "names")= chr [1:3] "setosa" "versicolor" "virginica"
 $ LetterMatrix     : logi [1:3, 1:3] TRUE FALSE FALSE FALSE TRUE FALSE ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "setosa" "versicolor" "virginica"
  .. ..$ : chr [1:3] "a" "b" "c"
 - attr(*, "class")= chr "multcompLetters"

ltrs$Letters
    setosa versicolor  virginica 
       "a"        "b"        "c"

ltrs$Letters gives the vector of letters that we want. Now we can make a data frame to add the CBDs to the box plot.

x <- c(1:length(ltrs$Letters))
y <- box.rslt$stats[5, ]
cbd <- ltrs$Letters
ltr_df <- data.frame(x, y, cbd)
ltr_df
           x   y cbd
setosa     1 1.9   a
versicolor 2 5.1   b
virginica  3 6.9   c

If we plot the CBDs at the coordinates in ltr_df, they will over plot the tops of the whiskers or the highest outlier if present. We need to nudge the CBDs upward to avoid the overlap. To determine how much to nudge, I will get the range of the Y-axis and nudge upward 5% of this range.

lmts <- get_plot_limits(plt1)
y.range <- lmts$ymax - lmts$ymin
y.nudge <- 0.05 * y.range
plt1 + 
    geom_text(data = ltr_df, aes(x=x, y=y, label=cbd), nudge_y = y.nudge)

The CBDs are perfectly positioned, and without any trial and error.

Get ggplot plot panel limits

I have added a new function (get_plot_limits) to my package QsRutils.  It extracts the minimum and maximum X and Y values for a ggplot panel. This is useful in formatting ggplots. For example, you may wish to expand the panel to avoid text running out of the panel, or nudge text relative to some point. For an example, see my post on adding compact letter displays to box plots created with ggplot2.

Rooting Unifrac Trees in phyloseq Objects

The method of rooting trees described in the post “Unifrac and Tree Roots” is now included in QsRutils beginning with version 0.3.2 as function root_phyloseq_tree. Given a phyloseq object with an unrooted tree, it returns the same type of phyloseq object with the tree rooted by the longest terminal branch.

Unifrac and Tree Roots

 

Unifrac distances have the attraction of including phylogenetic relatedness, based on a tree of the representative sequences, in the distances among samples calculated from an OTU table. FastTree is the usual method of choice in generating the tree, although USEARCH also provides a method. Both methods calculate unrooted trees, and calculation of Unifrac distances requires a rooted tree. The problem arises in how to best root the tree. I found a discussion of the problem on the phyloseq GitHub site (https://github.com/joey711/phyloseq/issues/597).

For the smaller data sets of old, I relied on the midpoint function in the phangorn package to root my trees, but have since discovered that it is prone to failure with the large data sets generated with newer sequencing platforms. Phyloseq assigns a root randomly, which is certainly fast, but is that the best practice? How much difference does it make between runs? Assigning a seed only enables you to get the same result again. I decided to investigate using a large data set containing 31,938 OTUs.

As a first step, I simply made side-by-side plots for two PCoA ordinations made based on weighted Unifrac distances. I used pre-computed distance matrices for the two ordinations.

set.seed(123)
dw.pd.1 <- phyloseq::distance(expt, method = "wunifrac")
## Warning in UniFrac(physeq, weighted = TRUE, ...): Randomly assigning root
## as -- New.CleanUp.ReferenceOTU27288 -- in the phylogenetic tree in the data
## you provided.
set.seed(957)
dw.pd.2 <- phyloseq::distance(expt, method = "wunifrac")

## Warning in UniFrac(physeq, weighted = TRUE, ...): Randomly assigning root
## as -- 61057 -- in the phylogenetic tree in the data you provided.

The warning messages show that different OTUs were chosen for each of the runs, and while we expect the ordinations to be similar, visual inspection reveals some differences.

ord.pf.1 <- ordinate(expt, method = "PCoA", distance = dw.pd.1)
plt.1 <- plot_ordination(expt, ord.pf.1, color = "P_Location", title = "Trial 1")
ord.pf.2 <- ordinate(expt, method = "PCoA", distance = dw.pd.2)
plt.2 <- plot_ordination(expt, ord.pf.2, color = "P_Location", title = "Trial 2")
ggarrange(plt.1, plt.2, ncol = 2, legend = "bottom", common.legend = TRUE

Fig. 1: Two PCoA ordinations of the same data based on weighted Unifrac distances. Only the OTUs chosen to toot the tree differed.

Next I made a Procrustes analysis comparing the two ordinations.

scores.1 <- ord.pf.1$vectors[ , 1:2]
scores.2 <- ord.pf.2$vectors[ , 1:2]
pro <- procrustes(scores.1, scores.2, symmetric = TRUE)
plot(pro)

Fig. 2: Plot of Procrustes errors between the two ordinations in Fig. 1.

The plot shows relative placement of samples according to the two distance matrices, with the greatest differences in the lower left quadrant.

protest(scores.1, scores.2, permutations = 99999)
##
## Call:  ## protest(X = dw.pd.1, Y = dw.pd.2, permutations = 99999, choices = c(1, 2)) 
## 
## Procrustes Sum of Squares (m12 squared): 0.02374 
## Correlation in a symmetric Procrustes rotation: 0.9881 
## Significance: 1e-05 
## 
## Permutation: free 
## Number of permutations: 99999

The correlation between the two distance matrices is very high, but not perfect. It is unlikely that such small differences would make a difference in analysis of community differences, by PERMANOVA, for example.

In the discussion mentioned above, Sebastian Schmidt proposed choosing an out group based on the longest tree branch terminating in a tip. His function for finding such an out group is:

pick_new_outgroup <- function(tree.unrooted){
require("magrittr")
require("data.table")
require("ape") # ape::Ntip
# tablify parts of tree that we need.
treeDT <- 
     cbind(
         data.table(tree.unrooted$edge),
         data.table(length = tree.unrooted$edge.length)
     )[1:Ntip(tree.unrooted)] %>% 
 cbind(data.table(id = tree.unrooted$tip.label))
 # Take the longest terminal branch as outgroup
 new.outgroup <- treeDT[which.max(length)]$id
 return(new.outgroup) }

Applying the function to our data gives us the out group to use to root our tree:

my.tree <- phy_tree(expt)
out.group <- pick_new_outgroup(my.tree)
out.group
## [1] "New.CleanUp.ReferenceOTU8001"

Then we can use this out group to root the tree in expt.

new.tree <- ape::root(my.tree, outgroup=out.group, resolve.root=TRUE)
phy_tree(expt) <- new.tree
phy_tree(expt)
## 
## Phylogenetic tree with 31938 tips and 31937 internal nodes.
## 
## Tip labels:
## 82502, New.CleanUp.ReferenceOTU38100, 70168, 100112, 113933, 13153, ...
## Node labels:
## Root, , 0.449, 0.000, 0.400, 0.000, ...
## 
## Rooted; includes branch lengths.

Now that the tree is rooted, we can calculate the corresponding distance matrix and compare the ordination based on it to one we obtained with phyloseq's random root assignment method.

dw.og <- phyloseq::distance(expt, method = "wunifrac")
ord.og<- ordinate(expt, method = "PCoA", distance = dw.og)
plt.3 <- plot_ordination(expt, ord.pf.3, color = "P_Location", title = "By pick_new_outgroup")
plot.3 <- ggarrange(plt.1, plt.3, ncol = 2, legend = "bottom", common.legend = TRUE)
plot.3

Fig. 3: PCoA ordinations comparing result with randomly rooted tree (left) with tree rooted by longest terminal branch (right).

Again, visible inspection reveals some small differences between the ordinations. The overall difference between them can be determined with a Procrustes analysis.

scores.3 <- ord.og$vectors[ , 1:2]
pro <- procrustes(scores.1, scores.3, symmetric = TRUE)
protest(scores.1, scores.3, permutations = 99999)

 

## 
## Call
## protest(X = scores.1, Y = scores.3, permutations = 99999)
## Procrustes Sum of Squares (m12 squared): 0.008256
## Correlation in a symmetric Procrustes rotation: 0.9959
## Significance: 1e-05
## Permutation: free
## Number of permutations: 99999
plot(pro)

Fig. 4: Plot of Procrustes errors between ordinations in Fig. 3.

The difference between these two ordinations is certainly no greater than between those obtained with multiple runs taking random root assignments.

Conclusion: I was somewhat surprised that differences between multiple Unifrac calculations with random root assignment could be detected. I had surmised that distances among tree tips would stay the same, and therefore Unifrac distances among samples would stay the same. Perhaps they do, within rounding error.  The differences are certainly too small to impact discrimination of treatments in this data set. Still, I prefer the method based on rooting the tree by the longest branch terminating in a tip because it is not random, and by the function above is easily and quickly done.

Update R in Ubuntu

If you installed R from the Ubuntu repository with

sudo apt-get install r-base

you most likely got an out of date version. In February 2018, that method still gave me R version 3.2.3 (2015-12-10). To get the latest versions of R and its packages, you need to add CRAN to the apt-get repositories. Do this with the code below. Enter one line at a time. Cut and paste to prevent errors.

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu xenial/'
sudo apt-get update

Now if you open Synaptic Package Manager and search for r-base, you should see the installed versions and the latest versions of several packages (e.g.  r-base, r-base-core, r-base-core-dbg, r-base-dev, r-base.html). Mark the packages for upgrade and click on apply. This will most likely result in a major update of R so that the next time you run R only the base packages will be available. You will have to re-install any additional R packages that you use to match the updated version of R.

Adapted from https://www.digitalocean.com/community/tutorials/how-to-install-r-on-ubuntu-16-04-2

Rotatable 3D Plots

The R package vegan includes the function ordiplot for making ordination plots using R’s base graphics. Additionally vegan provides several functions for enhancing the plots with spiders, hulls, and ellipses. It is even possible to overlay an ordination plot with a cluster diagram. (See my package ggordiplots on GitHub for making the same kinds of plots with ggplot graphics.) Vegan’s functions for adding these layers begin with “ordi”: ordispider, ordihull, ordiellipse, ordicluster. Earlier this year I discovered the package vegan3d. It makes use of rgl graphics which means that the plots it generates can be scaled and rotated with the mouse. This is not just fun – it allows you to see how well separated treatment groups are in the ordination space.

Vegan3d’s function ordirgl is similar to ordiplot – it draws the ordination plot depicting samples and/or species. Other functions beginning with “orgl” are similar to the ones beginning with “ordi” in vegan: orglpoints, orglspider, orglellipse, orglcluster. See the vegan3d documentation for full instructions on how to use these.

I made an interactive html page of a slightly enhanced version of the example given in the vegan3d documentation, viewable if you CLICK HERE. Some explanation of the last code block of the Rmarkdown file reproduced here is necessary.

open3d() # Opens a graphic device for the plot.
## wgl 
##   8
par3d(windowRect = c(20, 40, 1000, 1200)) # left, top, right, bottom in pixels
my.color <- c("red", "blue")
ordirgl(ord, size=4, col = my.color[as.numeric(sam$Type)]) # plot points
with(sam, orglspider(ord, Type, col = my.color[1:2], scaling = "sites")) # add spiders
with(sam, orglellipse(ord, Type, col = my.color[1:2], kind = "se", conf = 0.95, scaling = "sites")) # add ellipses
rglwidget() # You cannot run this line in RStudio.

 

The open3d command opens an rgl graphics window. If you are using RStudio, this window will open outside of RStudio. The next 5 lines build the plot in the graphics window. At any step after the line beginning with ordirgl you can zoom and rotate the plot in the graphics window. The last line, rglwidget(), is necessary to knit the html file. If you try to run it in RStudio, it will cause RStudio to freeze. If you try to use the RStudio menu to knit the html document, RStudio will freeze. So DO NOT do either of these things!

To knit the Rmarkdown file into an html file, you must enter the command in the console window. In this case the command is:

rmarkdown::render(input = "vegan3d_example.Rmd", output_format = "html_document", output_file = "vegan3d_example.html")