Introduction
Analysis of 16S rRNA gene sequences is the most common method of profiling microbial communities. Typically, DNA is extracted from samples of soil, sediment or water, amplified by PCR using primers targeting a portion of the gene sequence, and submitted for high-throughput sequencing. With some methods, the entire gene sequence can be obtained. The sequence data must then be processed and analyzed through the steps outlined below.
The processing and analysis steps described in this guide are more up to date than those included in the workshop tutorials on this site. Those are taken from workshops I taught during the 454 (pyrosequencing) era using the RDP Tools programs developed by the RDP at Michigan State University. Since then, sequencing technology has evolved to include Illumina MiSeq, Illumina HiSeq, PacBio and LoopSeq methods, giving less error-prone results and in some cases longer sequences. Also, sequence processing tools from several sources have been collected into QIIME and more recently QIIME 2 for easier implementation. I especially emphasize how to prepare raw sequencing data for analysis with QIIME 2; these important steps are omitted from the tutorials on the QIIME 2 website. I then cover several different QIIME 2 workflows.
Sampling Data for Workflow Development
In developing a workflow for a specific dataset you will undoubtedly want to try different program parameters and possibly even different methods. Results are usually compared based on overall read retention and merged lengths, and the best method chosen. This is time consuming and may not even be feasible on a personal computer for the full dataset. My approach is to choose six or so samples at random, subsample them to 10,000 sequences each, and use this sampled data to develop my workflow. For each stage in the workflow, I may try several approaches or vary parameters for a particular approach, and then choose the best workflow based primarily on read retention. I then run the analysis of the full dataset on a cluster or in the cloud, e.g. on AWS. The point of this section is to get you to adopt a similar procedure.
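As one way of doing the subsampling, here is a minimal sketch using the Bioconductor package ShortRead. The file names are hypothetical, and it assumes the paired files list their reads in the same order, so that setting the same random seed before each yield() keeps R1 and R2 in register.

    library(ShortRead)

    # Subsample a pair of fastq files to 10,000 reads each. FastqSampler
    # draws its sample when yield() is called, so re-seeding before each
    # yield() gives the same records from R1 and R2, provided the two
    # files list their reads in the same order.
    subsample_pair <- function(r1, r2, n = 10000, seed = 123) {
      set.seed(seed)
      writeFastq(yield(FastqSampler(r1, n)), sub("\\.fastq", ".sub.fastq", r1))
      set.seed(seed)
      writeFastq(yield(FastqSampler(r2, n)), sub("\\.fastq", ".sub.fastq", r2))
    }

    subsample_pair("sample1_R1.fastq.gz", "sample1_R2.fastq.gz")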
Know Your Sequences
Before you begin, it is imperative to understand the characteristics of your raw sequence data. How long are the reads (sequences), and in what format are they (paired or not, interleaved or not, compressed)? Do they include adapters, barcodes, primers? What primers were used? What region do the primers amplify, and how long is that region? If the reads contain primer sequences, where are those sequences located? Have the sequences been binned by sample? Most of this information can be obtained from the sequencing facility, but it does not hurt to check. The Know Your Sequences page accessed via the menu on the right covers some methods for checking, especially checking those characteristics important to deciding how to trim and merge paired sequences.
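Some of these checks are quick to do yourself in R. For instance, a sketch with ShortRead for read counts and lengths (the file name is just an example):

    library(ShortRead)

    reads <- readFastq("sample1_R1.fastq.gz")  # hypothetical file name
    length(reads)        # how many reads?
    table(width(reads))  # are they all the same length, and how long?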
Assessing Quality
The first step in your workflow should be checking the quality of your sequence data using FastQC. The individual reports for each sample can be aggregated using MultiQC. The R packages fastqcr and dplyr can also be used to aggregate and filter FastQC results to quickly identify problem samples. The Assessing Quality page accessed via the menu on the right explains how to do these things.
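As a minimal sketch of the fastqcr approach (the directory name is hypothetical; it should contain the FastQC .zip reports):

    library(fastqcr)
    library(dplyr)

    # Aggregate all FastQC reports into one table, one row per
    # sample x module, then pull out the failures.
    qc <- qc_aggregate("fastqc_results/")
    qc %>%
      filter(status == "FAIL") %>%
      select(sample, module, status)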
Trimming Primers
I consider primer trimming to be a quality control step. For that reason, if the sequencing facility returns both trimmed and untrimmed sequences, I prefer to trim the untrimmed sequences myself. The Trimming Primers page accessed via the menu on the right explains how to do this. It covers using cutadapt to trim sequences and the choice of parameters for different cases. It also covers how to use an R script to check primer trimming, along the lines of the sketch below. Checking primer trimming is especially important for ITS sequences.
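Such a check can be as simple as counting how often a primer still occurs in the trimmed reads. A minimal sketch in the spirit of the DADA2 ITS tutorial; the primer (515F) and file names are just examples:

    library(Biostrings)
    library(ShortRead)

    # Count reads in which a primer still occurs; after trimming the
    # counts should be at or near zero. fixed = FALSE lets IUPAC
    # ambiguity codes in the primer match the corresponding bases.
    primer_hits <- function(primer, fastq) {
      reads <- sread(readFastq(fastq))
      sum(vcountPattern(primer, reads, fixed = FALSE) > 0)
    }

    fwd <- DNAString("GTGYCAGCMGCCGCGGTAA")  # 515F, as an example
    primer_hits(fwd, "sample1_R1.trimmed.fastq.gz")
    primer_hits(reverseComplement(fwd), "sample1_R2.trimmed.fastq.gz")  # read-through check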
Merging Sequences
If you have paired reads, should you merge them? Can you do so reliably? If so, what program should you use, and with what parameters? To answer these questions you need to know the characteristics of both the reads and the merging program you use. If, when and how to merge also depend on the workflow you adopt. All of these topics, with examples, are covered on the Merging Sequences page accessible via the menu on the right.
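As one example of the parameters involved, DADA2 merges after denoising with mergePairs(); minOverlap and maxMismatch (shown below at their defaults) are the main knobs. A sketch, assuming the reads have already been filtered (file names hypothetical):

    library(dada2)

    # Filtered fastq files; dada() and mergePairs() accept file names directly.
    fF <- "sample1_R1.filtered.fastq.gz"
    fR <- "sample1_R2.filtered.fastq.gz"

    errF <- learnErrors(fF, multithread = TRUE)
    errR <- learnErrors(fR, multithread = TRUE)
    ddF  <- dada(fF, err = errF, multithread = TRUE)
    ddR  <- dada(fR, err = errR, multithread = TRUE)

    merged <- mergePairs(ddF, fF, ddR, fR,
                         minOverlap = 12, maxMismatch = 0, verbose = TRUE)
    head(merged)  # one row per unique merged sequence, with overlap details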
Treeing Representative Sequences
If your analysis is to include some phylogenetic approach, using UniFrac distances or calculating Faith's phylogenetic diversity for example, you need to tree the representative sequences for each cluster or ASV. This is appropriate only if the sequences carry phylogenetic information: 16S rRNA gene sequences do; ITS sequences do not.
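One way to build such a tree in R, following the workflow published by Callahan et al. (F1000Research 5:1492, 2016), is to align the sequences with DECIPHER and fit a maximum likelihood tree with phangorn. A sketch, assuming seqs is a character vector of representative sequences:

    library(DECIPHER)
    library(phangorn)

    # Align the representative sequences, then fit a GTR maximum
    # likelihood tree starting from a neighbor-joining tree.
    alignment <- AlignSeqs(DNAStringSet(seqs), anchor = NA)
    phang     <- phyDat(as(alignment, "matrix"), type = "DNA")
    dm        <- dist.ml(phang)
    tree_nj   <- NJ(dm)
    fit       <- pml(tree_nj, data = phang)
    fit_gtr   <- update(fit, k = 4, inv = 0.2)
    fit_gtr   <- optim.pml(fit_gtr, model = "GTR", optInv = TRUE,
                           optGamma = TRUE, rearrangement = "stochastic",
                           control = pml.control(trace = 0))
    # fit_gtr$tree can then be handed to phyloseq or written out with ape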
Classifying Representative Sequences
There are a variety of classification programs. Perhaps the gold standard is the RDP naive Bayesian classifier. QIIME 2's feature-classifier plug-in supports classification by naive Bayes, VSEARCH and BLAST+. USEARCH and VSEARCH have their own versions of the SINTAX classifier. The R version of DADA2 includes a classifier. Mothur implements a naive Bayesian classifier.
The databases used with these programs are more important to the results than the programs themselves. The default 16S rRNA gene database included with the RDP Classifier contains only validly named species classified to the genus level. Validly named essentially means isolates (think mainly medically important organisms), so that database is not appropriate for classifying sequences from environmental samples. The SILVA, Greengenes2 and GTDB databases all include 16S rRNA gene sequences from additional taxa and are more relevant for classifying sequences from environmental samples. The UNITE database is used for classifying fungal ITS sequences.
For classifying bacteria in environmental samples I prefer the Greengenes2 and GTDB databases. SILVA is often used but seems to contain more errors. I wish now I had written down the details, but I once found a bacterial genus placed in the phylum Chordata. And many of its “species” are in fact the names of the organisms from which the bacteria were isolated.
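As a minimal sketch of the DADA2 classifier mentioned above: the reference file name is a placeholder, as DADA2-formatted training sets for SILVA, GTDB and Greengenes2 are distributed separately.

    library(dada2)

    # seqs: a character vector of representative sequences (or a dada2
    # sequence table). minBoot raises the bootstrap confidence cutoff
    # from the default of 50.
    taxa <- assignTaxonomy(seqs, "path/to/dada2_formatted_trainset.fa.gz",
                           minBoot = 80, multithread = TRUE)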
QIIME 2 Plug-ins and Workflows
QIIME 2 is actually a collection of programs written in different computer languages and presented as plug-ins so that they have a common interface. There are plug-ins for every step required to process amplicon sequences, from importing the data to creating plots, and these are strung together in a script to create a workflow. The script includes parameters for each underlying program. The tutorials on the QIIME 2 website are examples of such workflow scripts.
In some cases, the plug-in method does not give you access to all of the parameters that may be set for the underlying program. In such cases you may wish to call the underlying program directly while in the QIIME 2 environment. For example, processing PacBio and LoopSeq data with DADA2 requires setting parameters not available through the plug-in interface, so in those cases I run DADA2 directly in R. Another example is that the cutadapt plug-in does not allow setting the -l parameter to fix the maximum length of the trimmed sequences. When I need to do that, I run cutadapt directly.
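For the PacBio case, the sketch below follows Callahan's published full-length 16S DADA2 workflow; the primers (27F/1492R) and file names are examples, and PacBioErrfun and BAND_SIZE are the PacBio-specific settings at issue.

    library(dada2)

    # Full-length 16S primers, as used in the published PacBio workflow
    F27   <- "AGRGTTYGATYMTGGCTCAG"
    R1492 <- "RGYTACCTTGTTACGACTT"

    # Remove primers and orient the CCS reads
    nop <- removePrimers("sample1.ccs.fastq.gz", "sample1.noprimers.fastq.gz",
                         primer.fwd = F27, primer.rev = dada2:::rc(R1492),
                         orient = TRUE)

    # Filter, then learn errors with the PacBio-specific error function
    flt <- filterAndTrim("sample1.noprimers.fastq.gz", "sample1.filtered.fastq.gz",
                         minQ = 3, minLen = 1000, maxLen = 1600,
                         maxN = 0, maxEE = 2)
    err <- learnErrors("sample1.filtered.fastq.gz",
                       errorEstimationFunction = PacBioErrfun,
                       BAND_SIZE = 32, multithread = TRUE)
    dd  <- dada("sample1.filtered.fastq.gz", err = err,
                BAND_SIZE = 32, multithread = TRUE)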
Analysis
I consider sequence processing to end with the creation of an OTU or ASV table, a set of representative sequences, a classification table for the representative sequences, a sample data table associating metadata with the samples, and possibly a tree file for the representative sequences. Analysis is what you do with these components.
QIIME 2 includes some plug-ins for quick and easy standard analyses, but I think these tend to be rigid. My usual approach is to use R to package the components into a phyloseq object and use R for all subsequent data analysis. All standard techniques (diversity indices, ordinations, etc.) are easily done in R. In addition, R is extremely flexible, and I tell students, “If you can think it, you can do it.” For example, I have plotted trajectories of group centroids through ordination space with time and in response to rainfall events. Show me a QIIME 2 plug-in with those capabilities!
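A minimal sketch of that packaging step, using the qiime2R package (not part of QIIME 2 itself) to read the artifacts directly into a phyloseq object; file and column names are examples:

    library(qiime2R)   # install from github.com/jbisanz/qiime2R
    library(phyloseq)

    # Read QIIME 2 artifacts straight into a phyloseq object
    ps <- qza_to_phyloseq(features = "table.qza",
                          tree     = "rooted-tree.qza",
                          taxonomy = "taxonomy.qza",
                          metadata = "sample-metadata.tsv")

    # Standard analyses are then one-liners:
    estimate_richness(ps, measures = c("Observed", "Shannon"))
    ord <- ordinate(ps, method = "PCoA", distance = "unifrac")
    plot_ordination(ps, ord, color = "Treatment")  # "Treatment": hypothetical metadata column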