Processing Kozich Method V4 Sequences

This tutorial makes use of sequences returned by Michigan State University’s Research Technology Support Faciltiy (RTSF). RTSF follows the procedure of Kozich et. al (2013) which targets the V4 region of 16S rRNA gene sequences. The faciltiy returns paired 250 bp reads that have already been binned by sample and trimmed of primers. Thus a minimum of preprocessing needs to be done before feeding the sequences into QIIME 2 for processing.

Download Data

Open your terminal, create a directory for this tutorial and move into it. Then download the sequences to use with the code below. The sequences have already been sampled to approximately 10,000 sequences per sample.

wget https://github.com/jfq3/data_sets/raw/refs/heads/master/kozich_sampled/kozich_sampled.tar.gz
tar xzf kozich_sampled/kozich_sampled.tar.gz

Check Quality

Run fastqc and multiqc to check the quality of the sequences.

Are there any samples that should be excluded on the basis of poor quality or too few counts?

Do Illumina adapters need to be removed? If so, remove them by following these instructions.

If you filter or subset the downloaded sequences in any way, run all subsequent steps on those altered sequences.

Run FIGARO

Run Figaro to determine truncation parameters to use with the DADA2 plugin.

What are the recommended truncation parameters to use? What is the expected retention percent for this filtration step?

Besides the file of trunction parameters, FIGARO generates plots, one for forward reads and one for reverse reads, showing the observed and expected error rate for each read position. If the curves match, DADA2 should do a good job of correcting errors. Are these predictions good or bad?

Process with QIIME 2 and DADA2

Create a manifest file using the instructions on the page Creating a QIIME 2 Manifest File.

Activate your QIIME 2 environment and run the following script to process the sequences. If necessary, edit the truncation parameters for the DADA2 plugin to match your results from running Figaro, or your own choices from observing the quality plots.

#Import sequences
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path manifest_file.tsv \
--output-path paired-end-demux.qza \
--input-format PairedEndFastqManifestPhred33V2

# Visualize imported data
qiime demux summarize \
--i-data paired-end-demux.qza \
--o-visualization demux.qzv

# Run dada2
qiime dada2 denoise-paired \
--i-demultiplexed-seqs paired-end-demux.qza \
--p-trunc-len-f 89 \
--p-trunc-len-r 186 \
--p-pooling-method 'pseudo' \
--p-n-threads 4 \
--o-table table.qza \
--o-representative-sequences rep-seqs.qza \
--o-denoising-stats denoising-stats.qza

# Export denoising stats
qiime metadata tabulate \
--m-input-file denoising-stats.qza \
--o-visualization denoising-stats.qzv

# Tree the representative sequences
qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences rep-seqs.qza \
--p-n-threads 4 \
--o-alignment aligned-rep-seqs.qza \
--o-masked-alignment masked-aligned-rep-seqs.qza \
--o-tree unrooted-tree.qza \
--o-rooted-tree rooted-tree.qza

# Export OTU table
mkdir phyloseq
qiime tools export \
--input-path table.qza \
--output-path phyloseq

# Convert biom format to tsv format
biom convert \
-i phyloseq/feature-table.biom \
-o phyloseq/otu_table.tsv \
--to-tsv
cd phyloseq
sed -i '1d' otu_table.tsv
sed -i 's/#OTU ID//' otu_table.tsv
cd ../

# Export representative sequences
qiime tools export \
--input-path rep-seqs.qza \
--output-path phyloseq

# Export tree files
qiime tools export \
--input-path unrooted-tree.qza \
--output-path phyloseq
cd phyloseq
mv tree.nwk unrooted_tree.nwk
cd ../

qiime tools export \
--input-path rooted-tree.qza \
--output-path phyloseq
cd phyloseq
mv tree.nwk rooted_tree.nwk

Classify

Classify the representative sequences using one of the methods given on the page Classifying Representative Sequences.

Questions

The above script generates two qzv files: demux.qzv and denoising-stats.qzv. The first includes the number of sequences and quality plots for the raw sequences imported in the first step (qiime tools import). The second contains a report for each sample of the sequences passing each stage of processing by DADA2: filtered, denoised, merged and filtered of chimera.

Load demux.qza into QIIME2 View. From the quality plots, would you have chosen different truncation parameters? If so, after examining denoising-stats.qzv (below), re-run the processing with your own truncation parameters and compare overall retention results.

Load denoising-stats.qzv into QIIME2 View. How does the retention for the filtration step compare with FIGARO’s prediction?

References

Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. 2013. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl Environ Microbiol 79:5112-20.