Processing ITS sequences with QIIME2 and DADA2


Internal Transcribed Spacer (ITS) sequences have been adopted as bar codes for fungal species. Primers may be designed to either ITS1, between the 18S and 5S rRNA gene sequences, or ITS2, between the 5S and 28S rRNA gene sequences. Both of these regions vary greatly in length, so that with most primer sets it is not possible to merge paired reads without biasing against some fungal groups.  For that reason, in this tutorial we will use the forward reads only.

Processing ITS sequences differs from processing 16S sequences in another aspect, too. Because the sequences do not reflect phylogeny, the representative sequences cannot be aligned in a meaningful manner and no phylogenetic tree can be constructed. Thus there is no need to include these steps when processing ITS sequences.

QIIME2 Installation

QIIME2 is readily installed using a conda environment. See my tutorial for how to create virtual environments and the QIIME2 installation page for how to install the latest QIIME2 version in its own environment.

Importing Sample Sequences

This tutorial begins with ITS forward sequence files that have already been demultiplexed and trimmed of artifacts and primers. A manifest file is used to associate sample names with the sequence files. The header line should be exactly as in the following example. Subsequent lines are tab-delimited, with the sample names in the first column and the full path to the forward sequence files in the second column. The sample names should not include periods or underscores, and should not begin with a digit.

sample-id absolute-filepath
sample-1 $PWD/some/filepath/sample1_R1.fastq
sample-2 $PWD/some/filepath/sample2_R1.fastq

The QIIME2 command for importing single end sequence files is:

qiime tools import \
--type 'SampleData[SequencesWithQuality]' \
--input-path plate_1_manifest_file.tsv \
--output-path single-end-demux.qza \
--input-format SingleEndFastqManifestPhred33V2

All of the sequence data is stored compressed in the file single-end-demux.qza. If you wish, you may create a visualization file from it with the following command:

qiime demux summarize \
--i-data single-end-demux.qza \
--o-visualization demux.qzv

To view demux.qzv, open with your browser and drag the file into the window at the top of the page.

Denoise the Sequences

The next step is to run the DADA2 plugin. In addition to correcting sequencing errors, this plugin removes chimeras, clusters the the sequences at 100% similarity, and outputs an ASV table and the representative sequences. Rather than filtering on quality using FIGARO selected truncation parameters as for 16S sequences, I filter using quality scores and expected number of errors. I do not hard trim regions expected to be conserved portions of 18S, 5S, or 28S rRNA gene regions. Depending on the primers used, they can vary significantly in length, and so the length to hard trim may not be predictable. Also, I do not truncate the sequences to a fixed length. More recent versions of DADA2 can handle sequences of varying length. The following command executes DADA2.

qiime dada2 denoise-single \
--i-demultiplexed-seqs single-end-demux.qza \
--p-trunc-len 0 \
--p-max-ee 2 \
--p-trunc-q 2 \
--p-n-threads 20 \
--o-table table.qza \
--o-representative-sequences rep-seqs.qza \
--o-denoising-stats denoising-stats.qza

Export DADA2 Results

The output of the DADA2 plugin includes the ASV table, the representative sequences, and some statistics on the procedure, all in compressed format. Export the results in formats that are easily read into R and phyloseq.

# Export OTU table
mkdir phyloseq
qiime tools export \
--input-path table.qza \
--output-path phyloseq

# Convert biom format to tsv format
biom convert \
-i phyloseq/feature-table.biom \
-o phyloseq/otu_table.txt \
cd phyloseq
sed -i '1d' otu_table.tsv
sed -i 's/#OTU ID//' otu_table.tsv
cd ../

# Export representative sequences
qiime tools export \
--input-path rep-seqs.qza \
--output-path phyloseq

Classify the Representative Sequences

The representative sequences can be classified by any of several means. Here I use the RDP classifier with the database created in my tutorial Training the RDP Classifier.

cd phyloseq
java -Xmx10g -jar /usr/local/RDPTools/classifier.jar classify -c 0.8 -f allrank -t training_files/ -o classified_repseqs.txt dna-sequences.fasta

Alternatively, the representative sequences can be classified in QIIME2 and the results exported in a file format that can be read into R. See my tutorial on training the QIIME2 classifier with ITS references sequences from UNITE.

qiime feature-classifier classify-sklearn \
--i-classifier unite-ver8-99-classifier-04.02.2020.qza \
--i-reads dada2-single-end-rep-seqs.qza \
--o-classification taxonomy-single-end.qza

Export the QIIME2 classification results:

qiime tools export \
--input-path taxonomy-single-end.qza\
--output-path phyloseq