Introduction
The RDP Classifier is a naive Bayesian classifier that provides rapid and accurate taxonomic placement of rRNA gene sequences. Training sets are currently provided for bacterial and archaeal 16s rRNA sequences and fungal LSU and ITS sequences. The classifier provides taxonomic assignments from domain to genus, with confidence estimates for each assignment. Output is in two forms. The classification table format is appropriate to the unsupervised method workflow, and the hierarchical classification format is appropriate to the supervised method.
If using the supervised approach, you will classify several to many fasta
or fastq
sample files, usually at the same time. The hierarchical classification output is in a format that may be imported directly into a phyloseq
object containing both an OTU table and taxonomy table with RDPutil
’s function hier2phyloseq
. Classifications made at different times using the same training set may be combined. For 16S rRNA gene sequences, the output includes both actual sequence counts and copy number adjusted counts.
If using the unsupervised (clustering) method, you will classify a single fasta
or fastq
file containing only the representative sequences for the distance of interest. The classification table results may be incorporated directly into a biom
file or read into R as a phyloseq
class tax_table
object using RDPutils
’s function make_tax_table
.
The exercises in this tutorial include only the classification of multiple samples – i.e. they are for the supervised approach. Classification of representative sequences directly into a biom
file will be covered in a later pipeline tutorial and is also included in the tutorial posted on RDP’s web site as Command-line RDPTools with R/Bioconductor package Phyloseq. The vignette in RDPutils
covers importation of a representative sequence classification table into a phyloseq tax_table
object.
Command-line Classifier – 16S
Create the sub-directory class_16S
in your home directory. Download the file command_line_classification_16s.zip and put it in class_16S
. Unzip the file and list the directory:
mkdir ~/class_16s
cd class_16s
unzip command_line_classifier_16s.zip
ls -l
When you list the files, you should get something like the following:
total 59 -rw-r--r-- 1 quensenj anr 50653 Nov 13 07:44 command_line_classifier_16s.zip -rw-r--r-- 1 quensenj anr 330 Nov 13 07:39 command_line_classification_script.sh drwxr-xr-x 2 quensenj anr 6 Nov 13 07:05 trimmed_seqs
Trimmed sequences are in the sub-directory trimmed_seqs
. If the script file command_line_classify_script.sh
is not executable, make it so:
chmod u+x command_line_classify_script.sh
ls -l
Which should give:
total 59 -rw-r--r-- 1 quensenj anr 50653 Nov 13 07:44 command_line_classification_16s.zip -rwxr--r-- 1 quensenj anr 330 Nov 13 07:39 command_line_classification_script.sh drwxr-xr-x 2 quensenj anr 6 Nov 13 07:05 trimmed_seqs
Let’s examine the script file. Type:
less command_line_classifification_script.sh
Which prints the following to the screen:
#! /bin/bash
# command_line_classification_script
# Configure paths
# RDPToolsDir=/mnt/research/rdp/public/RDPTools # Path to RDPTools on MSU's HPCC
RDPToolsDir=/usr/local/RDPTools # Path to RDPTools on my local installation
# Classifiy trimmed sequences. Trimmed sequences are in sub-folder trimmed_seqs
java -Xmx2g -jar $RDPToolsDir/classifier.jar classify --gene 16srrna -c 0.5 --format fixrank --hier_outfile hier_file.txt --outputFile classified.txt trimmed_seqs/*.fastq
The first line, #!/bin/bash
identifies the file as a shell script. Under the line # Configure paths
there are two options for the variable RDPToolsDir
. This makes the script more portable. RDPToolsDir=/mnt/research/rdp/public/RDPTools
assigns the path to the directory RDPTools
on MSU’s HPCC to the variable RDPToolsDir
. RDPToolsDir=/usr/local/RDPTools
assigns the path to RDPTools
on my computer to the variable RDPToolsDir
. If you followed the directions in the tutorial for installing RDPTools, the path on your computer should be the same. Otherwise, edit this line to give the path to RDPTools
on your computer. Comment out whichever path you are not using. The long line under # Classifiy trimmed sequences
performs the classification. It is actually all on one line but displayed on multiple lines here so that it will not run off of the page. The first part of the line assigns 2 gigs of memory to java (-Xmx2g
), loads the java program classifier.jar
and runs the command classify
. The rest of the line gives arguments or options to the command as explained in the table below:
Argument | Explanation |
-g or -- 16srrna |
Sets database to bacterial rRNA gene sequences. This is the default. Other options are fungallsu, fungalits_warcup, fungalits_unite. |
-c 0.5 | Sets confidence level to 0.5, appropriate for short reads. |
-f or -- format fixrank |
Prevents a ragged classification table when imported into phyloseq using RDPutils by limiting ranks to domain, phylum, class, order, family and genus. |
-h or -- hier_outfile |
Assigns the name of the output hierarchical classification file. |
-o or -- outputFile |
Assigns the name of the output classification table: there is a row for each sequence. |
If you have not already, exit less
by typing q
. Run the script by typing:
./command_line_classification_script.sh
Examine your results by listing the directory and examining the new files with less.
For example:
ls -l
less classified.text
Page down with the space bar. Page up by typing b
. Exit by typing q
. Note the structures of the output files hier_file.txt
and classified.txt
. Each row in the file classified.txt
gives the sequence name and then a series of columns in groups of three. The first group of three consists of “Bacteria, domain, 1.0.” This is interpreted as the first sequence has been assigned to the domain Bacteria with a confidence of 1.0. The last group on the same line is “Mesorhizobium, genus, 1.0,” meaning the sequence has been assigned to the genus Mesorhizobium with a confidence of 1.0. The hier
file has a different structure. The first column gives taxid numbers, followed by the lineage, the name (in the third column) of the rank in the fourth column. The entries under the sample names are the counts for the row name and rank. The files’ structures are easier to see if the files are opened in a spreadsheet program.
Command-line Classifying with Dereplication – 16S
If you have a large number of sequences per sample and there are many duplicate reads, you may be able to speed up classifying them by first dereplicating the sequences. You might compare execution times for one or a few samples with and without dereplication to help you decide between the two methods (dereplication takes time, too). The script for classifying 16S fastq
files with dereplication is:
#!/bin/bash
# Command line classifier with dereplication 16s
# Configure paths
# RDPToolsDir=/mnt/research/rdp/public/RDPTools # Path to RDPTools on MSU's HPCC
RDPToolsDir=/usr/local/RDPTools # Path to RDPTools on my local installation
for f in $(ls trimmed_seqs/*.fastq)
do
# Dereplicate
java -Xmx2g -jar $RDPToolsDir/Clustering.jar derep -u -o ${f/.fastq/_derep.fastq} {f/fastq/ids} ${f/fastq/sample} $f
# Classify and expand
java -Xmx2g -jar $RDPToolsDir/classifier.jar classify -c 0.5 -f fixrank -g 16srrna -o ${f/.fastq/_classified.txt} -h ${f/.fastq/_hier.txt} ${f/.fastq/_derep.fastq},${f/fastq/ids} # No spaces allowed between the last two arguments on this line.
done
# Merge individual sample hier files
mkdir cnadjusted
mv trimmed_seqs/cn*hier.txt cnadjusted
java -Xmx2g -jar $RDPToolsDir/classifier.jar merge-count cn_merged_hier.txt cnadjusted/*hier.txt
java -Xmx2g -jar $RDPToolsDir/classifier.jar merge-count merged_hier.txt trimmed_seqs/*hier.txt
As written, the script is run from one directory above directory trimmed_seqs
holding the fastq
files to be classified. IMPORTANT: In the second java line, the last two items (fastq file and corresponding ids file) must be separated by a comma only – no spaces. Otherwise the script will not run. Separate classification (both classification table and hierarchical formats) files are written for each sample. The counts for the hierarchy files are exploded back to the original number of counts in the samples. The last two lines combine the separate hierarchical output files. Counts in cn_merged_hier.txt
are adjusted for copy number.
Try classifying 16S rRNA gene sequences using the script and sequences in the file command_line_classifier_with_dereplication_16s.zip. Create a sub-directory in your home directory, put the downloaded file in it, and then unzip the file. Make the script file executable, and edit the path to RDPTools
if necessary. Then run the script. Watch your terminal screen as the files are processed, noting the ratio between total sequences and unique sequences. For this data set, dereplication reduces the number of sequences that have to be classified roughly in half.
Command-line Classifying with Dereplication – Non-16S
There is no copy number adjustment for gene sequences other than 16S. Thus an appropriate script for classification of 28S (LSU) fungal sequences with dereplication is:
#!/bin/bash
# Command line classifier with dereplication 28s
# Configure paths
# RDPToolsDir=/mnt/research/rdp/public/RDPTools # Path to RDPTools on MSU's HPCC
RDPToolsDir=/usr/local/RDPTools # Path to RDPTools on my local installation
for f in $(ls trimmed_seqs/*.fastq)
do
# Dereplicate
java -Xmx2g -jar $RDPToolsDir/Clustering.jar derep -u -o ${f/.fastq/_derep.fastq} ${f/fastq/ids} ${f/fastq/sample} $f
# Classify and expand
java -Xmx2g -jar $RDPToolsDir/classifier.jar classify -c 0.5 -f fixrank -g fungallsu -o ${f/.fastq/_classified.txt} -h ${f/.fastq/_hier.txt} ${f/.fastq/_derep.fastq},${f/fastq/ids} # No spaces allowed between the last two arguments on this line.
done
# Merge hier files
java -Xmx2g -jar $RDPToolsDir/classifier.jar merge-count merged_hier.txt trimmed_seqs/*hier.txt
The script is the same except for the gene specified and the lines having to do with copy number adjustment have been removed. Try running this script with the example data provided in the file command_line_classifier_with_dereplication_28s.zip. By now you know the drill of creating a directory, downloading and unzipping the example files, editing the path, making the script executable if it is not already, and running the script. Again, watch your terminal as the files are processed, paying attention to the total and unique sequences for each sample. Is dereplication worthwhile for this data set?