Blog

Import QIIME2 Results into R

After processing amplicon sequencing data with QIIME2, I like to import QIIME2 results into R as an experiment level phyloseq object. Such phyloseq objects have slots for an OTU or ASV table, a taxonomy table, representative sequences, a tree file of the representative sequences and a sample data table. Thus, all of the components can be saved together in a single file. The QIIME2 scripts on this website include outputting QIIME2 artifacts into files that are easily imported into phyloseq. Below is an R script for importing all of the components into an experiment level phyloseq object.

If necessary, install the phyloseq and Biostrings packages from Bioconductor. The speedytax package is installed from CRAN.

library(phyloseq)
library(Biostrings)
library(speedytax)

# Read in the otu/asv table
asv <- read.table("otu_table.tsv", header = TRUE, row.names = 1, sep = "\t")
# Convert the otu/asv table into a phyloseq object.
asv <- phyloseq::otu_table(asv, taxa_are_rows = TRUE)
# Read in the sample data table; row names must match otu/asv table sample names.
sam <- read.table(file = "sample_data.tsv", header = TRUE, row.names = 1, sep = "\t")
# Convert the sample data into a phyloseq object
sam <- phyloseq::sample_data(sam)
# Read in the fasta file of representative sequences.
rep_seqs <- Biostrings::readDNAStringSet("dna-sequences.fasta", format = "fasta")
# Read in the classification table. In this example the representative sequences
# were classified with the RDP Classifier. speedytax includes other functions for
# importing QIIME2 and USEARCH sintax classification files.
rdp_taxa <- speedytax::import_rdp_tax_table("rdp_214_classified.txt", confidence = 0.5)
# Read in the tree file.
my_tree <- phyloseq::read_tree("rooted_tree.nwk")
# Package all into a single experiment level phyloseq object.
expt <- phyloseq::phyloseq(asv, sam, rep_seqs, rdp_taxa, my_tree)
expt

An often performed  subsequent step is to filter the results taxonomically, for example  removing taxa/representative sequences that are not classified or that may be chloroplast sequences. In the example above, the representative sequences were classified with the RDP Classifier, in which case I filter the results with the following R script.

# Copy expt to expt_filt
expt_filt <- expt
# Filter taxonomically.
# Get the rank names.
rank_names(expt_filt)
# Check classification at the domain level.
get_taxa_unique(expt_filt, taxonomic.rank = "Domain")
# Keep only taxa/sequences classified as Bacteria or Archaea.
expt_filt <- subset_taxa(expt_filt, Domain %in% c("Bacteria", "Archaea"))
# Re-check classification at the Domain level.
get_taxa_unique(expt_filt, taxonomic.rank = "Domain")
# Check classification at the phylum level
get_taxa_unique(expt_filt, taxonomic.rank = "Phylum")
# Remove taxa/sequences not classified to the phylum level
expt_filt <- subset_taxa(expt_filt, Phylum != "uncl_Bacteria")
expt_filt <- subset_taxa(expt_filt, Phylum != "uncl_Archaea")
# Re-check classification at the phylum level
get_taxa_unique(expt_filt, taxonomic.rank = "Phylum")
# Subset to taxa/sequences classified to Cyanobacteria.
cyan <- subset_taxa(expt_filt, Phylum == "Cyanobacteriota")
# Check classification of Cyanobacteriota at the class level
get_taxa_unique(cyan, taxonomic.rank = "Class")
# Remove taxa/sequences classified to chloroplasts 
# or potentially to chloroplasts.
expt_filt <- subset_taxa(expt_filt, Class != "Chloroplast")
expt_filt <- subset_taxa(expt_filt, Class != "uncl_Cyanobacteriota")
# Check tht potential chloroplast taxa/sequences have been removed.
cyan <- subset_taxa(expt_filt, Phylum == "Cyanobacteriota")
get_taxa_unique(cyan, taxonomic.rank = "Class")
# Print summary of the experiment level phyloseq object
expt_filt
# Save the taxonomically filtered experiment level object.
saveRDS(expt_filt, file = "expt_filt.rds")

A similar procedure can be applied when another classifier is used to classify the representative sequences. For example, potential chloroplast sequences will not be found if chloroplast sequences are not part of the database used to build the classifier.

USEARCH 64-bit is now FREE!

In June 2024 Robert Edgar donated binaries for versions 5 through 11 of USEARCH to the public domain. You can now download them for free, including the 64-bit versions that used to cost $1,500, from this GitHub repository. At the same time USEARCH version 12 was launched as an open-source project at https://github.com/rcedgar/usearch12. See this paper announcing the open-source project.

Reference

Zhou Y, Liu YX, Li X. 2024. USEARCH 12: Open-source software for sequencing analysis in bioinformatics and microbiome. Imeta 3:e236.

RDP Website Discontinued

After more than 30 years in one form or another, the RDP website is no more. The server has died, the code is old, funding is exhausted, staff have retired or moved on. Developed in the pyrosequencing era, the pipeline tools are overwhelmed by today’s larger Illumina data sets.

The website is survived by the stand-alone version of the RDP Classifier available on Sourceforge and the command line version of RDP Tools. The later can be installed following the instructions on this website and here. Knowing there are still users of FunGene, we were hoping to find someone to host it, but so far with no takers. We have a nearly functional Amazon machine image of FunGene that could be used to build new gene models for Xander. If you are interested in using it or taking over FunGene, please leave a comment or email me (see the contact page).

SeqKit

SeqKit is an excellent program with 32 sub-commands for manipulating fasta and fastq files. Its abilities include converting fastq to fasta format, extracting amplicon regions, dereplication, filtering by length, removing gaps and reverse complementing sequences, to name just a few. I find it particularly useful in sampling large sequence files for the purpose of creating smaller datasets to be used in developing processing pipelines. It can even rematch forward and reverse reads should they somehow become unmatched in your workflow.

SeqKit is most easily installed using one of the package managers conda, mamba or micromamba. For example:

micromamba create --name seqkit -c bioconda seqkit

After installation and activation, get a list of available commands with:

seqkit --help

and help on any of the individual commands with:

seqkit <command> --help

For example,

seqkit seq --help

RDP Classifier Updated (Again!)

In August 2023 the RDP Classifier was updated to version 2.14 using bacterial and archaeal training set No. 19. This training set includes 600 new genera and 2,500 new species over the previous version released in 2020, as well as numerous taxonomic rearrangements and adoption of new phyla names as proposed in Oren and Garrity (2021).  For further details, see the release notes.

You may download the updated classifier from Sourceforge. In addition, there are species rank and QIIME2 formatted versions of the training data.

The web version of the RDP Classifier is no longer available.

Reference

Oren, A. and G.M. Garrity. 2021. Valid publication of the names of forty-two phyla of prokaryotes. Int J Syst Evol Microbiol. 2021 Oct;71(10). doi: 10.1099/ijsem.0.005056. PMID: 34694987.

Scanning FastQC Results for Many Files

The first step in working with sequencing data should always be evaluating its quality. This can be done with the FastQC program which generates html formatted reports you can read in your browser, but if you have hundreds of paired-read files, do you really want to examine them all one by one? I know I don’t, and so up until now I have been examining a subset, say 10 or 12 of the files. But I recently discovered an R package capable of summarizing the results, and using tidyverse functions you can filter the summary to show only samples with potential problems.

The name of the package is, unsurprisingly perhaps, fastqcr and it can be installed from the CRAN repository using the “Packages” tab in RStudio or  by entering the following command in the R terminal:

install.packages("fastqcr")

See the GitHub page https://github.com/kassambara/fastqcr for full instructions on the use of the package.

Not all of the reports are pertinent for processing 16S rRNA gene amplicon data, so I have been using something like the following to examine such data:

library(fastqcr)
library(dplyr)
rpt <- qc_aggregate(qc.dir = "fastqc/")
rpt |> print(n=11)

Where my FastQC results are in the directory fastqc/ under my working directory. This produces something like:

From this we can see the names of the individual reports (modules) per file and that the sequences should all be 301 bp in length. With this information we can filter rpt to show which samples have potential problems regarding overall sequence quality and the presence of uncalled bases or adapter sequences.

rpt |>
  dplyr::filter(module %in% c('Per base sequence quality',
                              'Per base N content',
                              'Adapter Content'),
               status != 'PASS' | `seq.length` != 301) |>
  print(n=Inf)

For this example data, we get the following:

If there were no potential problems for the items, you would get an empty result.

speedyseq – Enhanced phyloseq Functions

I recently discovered the R package speedyseq by Michael McLaren. It includes much faster implementations of some phyloseq functions : tax_glom(), tip_glom(), and the plotting functions that make use of psmelt()plot_bar(), plot_heatmap(), and plot_tree(). There are also lots of convenience functions including ones for extracting the otu, sample data and taxonomy tables, and even the entire phyloseq object, as tibbles.

For further information, check out the Readme file on GitHub at https://github.com/mikemc/speedyseq.

speedyseq can be installed in R using either of the following methods:

if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
remotes::install_github("mikemc/speedyseq")

Or:

if (!requireNamespace("devtools", quietly =TRUE))
    install.packages("devtools")
devtools::install_github("mikemc/speedyseq"

 

Using Named Colors with ggplot2

Consider these two plots:

The colors assigned to the phyla are not the same in the two plots. We cannot make them so simply by faceting because there are unique phyla at each site and the sites were sampled at different depths.  In fact, they were plotted using different tibbles. Still, it would be much easier to interpret the plots if the same colors were assigned to the phyla that are common between the plots. We can do this by using a palette of named colors.

First, get a vector of the union of unique phyla for each plot and determine its length. The plots were made from two tibbles, df1 and df2.

unique.phyla <- union(unique(df1$Phylum), unique(df2$Phylum))
length(unique.phyla)
[1] 9

So while there are only 7 phyla in each plot, there are a total of 9 in the two plots. Next, get a vector of 9 colors and name the colors with the names of the phyla.

my.colors <- RColorBrewer::brewer.pal(n = length(unique.phyla),name = "Paired")
names(my.colors) <- unique.phyla
my.colors
Cyanobacteria  Campilobacterota  Nitrospirae Proteobacteria
"#A6CEE3"      "#1F78B4"         "#B2DF8A"       "#33A02C"
Abditibacteriota Saccharibacteria Verrucomicrobia Acidobacteria
"#FB9A99"        "#E31A1C"        "#FDBF6F"       "#FF7F00"
Bacteroidetes
"#CAB2D6"

Replot, using the named palette for the fill colors. The plots above were made using the same code as below except that the scale_fill_manual lines were omitted.

plt1 <- ggplot(df1, aes(x=Depth, y=value, fill = Phylum)) +
    geom_col() +
    ylab("Relative Abundance") +
   scale_fill_manual(values = my.colors, aesthetics = "fill") +
    xlab("Depth (m)") +
    ggtitle("Site 1")
plt2 <- ggplot(df2, aes(x=Depth, y=value, fill = Phylum)) +
    geom_col() +
    scale_fill_manual(values = my.colors, aesthetics = "fill") +
    ylab("Relative Abundance") + xlab("Depth (m)") +
    gtitle("Site 2")

cowplot::plot_grid(plt1, plt2)

Voila! Now the color assignments are consistent between phyla. This approach can be generalized to solve the same problem with other types of plots.