Command Line Supervised Approach (RDP Classifier)

Introduction

The RDP Classifier is a naive Bayesian classifier that provides rapid and accurate taxonomic placement of rRNA gene sequences. Training sets are currently provided for bacterial and archaeal 16s rRNA sequences and fungal LSU and ITS sequences. The classifier provides taxonomic assignments from domain to genus, with confidence estimates for each assignment. Output is in two forms. The classification table format is appropriate to the unsupervised method workflow, and the hierarchical classification format is appropriate to the supervised method.

If using the supervised approach, you will classify several to many fasta or fastq sample files, usually at the same time. The hierarchical classification output is in a format that may be imported directly into a phyloseq object containing both an OTU table and taxonomy table with RDPutil’s function hier2phyloseq. Classifications made at different times using the same training set may be combined. For 16S rRNA gene sequences, the output includes both actual sequence counts and copy number adjusted counts.

If using the unsupervised (clustering) method, you will classify a single fasta or fastq file containing only the representative sequences for the distance of interest. The classification table results may be incorporated directly into a biom file or read into R as a phyloseq class tax_table object using RDPutils’s function make_tax_table.

The exercises in this tutorial include only the classification of multiple samples – i.e. they are for the supervised approach. Classification of representative sequences directly into a biom file will be covered in a later pipeline tutorial and is also included in the tutorial posted on RDP’s web site as Command-line RDPTools with R/Bioconductor package Phyloseq. The vignette in RDPutils covers importation of a representative sequence classification table into a phyloseq tax_table object.

Command-line Classifier – 16S

Create the sub-directory class_16S in your home directory. Download the file command_line_classification_16s.zip and put it in class_16S. Unzip the file and list the directory:

mkdir ~/class_16s
cd class_16s
unzip command_line_classifier_16s.zip
ls -l

When you list the files, you should get something like the following:

total 59
-rw-r--r-- 1 quensenj anr 50653 Nov 13 07:44 command_line_classifier_16s.zip 
-rw-r--r-- 1 quensenj anr   330 Nov 13 07:39  command_line_classification_script.sh 
drwxr-xr-x 2 quensenj anr     6 Nov 13 07:05 trimmed_seqs

Trimmed sequences are in the sub-directory trimmed_seqs. If the script file command_line_classify_script.sh is not executable, make it so:

chmod u+x command_line_classify_script.sh
ls -l

Which should give:

total 59
-rw-r--r-- 1 quensenj anr 50653 Nov 13 07:44 command_line_classification_16s.zip
-rwxr--r-- 1 quensenj anr   330 Nov 13 07:39 command_line_classification_script.sh
drwxr-xr-x 2 quensenj anr     6 Nov 13 07:05 trimmed_seqs

Let’s examine the script file. Type:

less command_line_classifification_script.sh

Which prints the following to the screen:

#! /bin/bash
# command_line_classification_script
# Configure paths
# RDPToolsDir=/mnt/research/rdp/public/RDPTools # Path to RDPTools on MSU's HPCC
RDPToolsDir=/usr/local/RDPTools # Path to RDPTools on my local installation

# Classifiy trimmed sequences. Trimmed sequences are in sub-folder trimmed_seqs
java -Xmx2g -jar $RDPToolsDir/classifier.jar classify --gene 16srrna -c 0.5 --format fixrank --hier_outfile hier_file.txt --outputFile classified.txt trimmed_seqs/*.fastq

The first line, #!/bin/bash identifies the file as a shell script. Under the line # Configure paths there are two options for the variable RDPToolsDir. This makes the script more portable. RDPToolsDir=/mnt/research/rdp/public/RDPTools assigns the path to the directory RDPTools on MSU’s HPCC to the variable RDPToolsDir. RDPToolsDir=/usr/local/RDPTools assigns the path to RDPTools on my computer to the variable RDPToolsDir. If you followed the directions in the tutorial for installing RDPTools, the path on your computer should be the same. Otherwise, edit this line to give the path to RDPTools on your computer. Comment out whichever path you are not using. The long line under # Classifiy trimmed sequences performs the classification. It is actually all on one line but displayed on multiple lines here so that it will not run off of the page. The first part of the line assigns 2 gigs of memory to java (-Xmx2g), loads the java program classifier.jar and runs the command classify. The rest of the line gives arguments or options to the command as explained in the table below:

Argument Explanation
-g or --16srrna Sets database to bacterial rRNA gene sequences. This is the default. Other options are fungallsu, fungalits_warcup, fungalits_unite.
-c 0.5 Sets confidence level to 0.5, appropriate for short reads.
-f or --format fixrank Prevents a ragged classification table when imported into phyloseq using RDPutils by limiting ranks to domain, phylum, class, order, family and genus.
-h or --hier_outfile Assigns the name of the output hierarchical classification file.
-o or --outputFile Assigns the name of the output classification table: there is a row for each sequence.

If you have not already, exit less by typing q. Run the script by typing:

./command_line_classification_script.sh

Examine your results by listing the directory and examining the new files with less. For example:

ls -l
less classified.text

Page down with the space bar. Page up by typing b. Exit by typing q. Note the structures of the output files hier_file.txt and classified.txt. Each row in the file classified.txt gives the sequence name and then a series of columns in groups of three. The first group of three consists of “Bacteria, domain, 1.0.” This is interpreted as the first sequence has been assigned to the domain Bacteria with a confidence of 1.0. The last group on the same line is “Mesorhizobium, genus, 1.0,” meaning the sequence has been assigned to the genus Mesorhizobium with a confidence of 1.0. The hier file has a different structure. The first column gives taxid numbers, followed by the lineage, the name (in the third column) of the rank in the fourth column. The entries under the sample names are the counts for the row name and rank. The files’ structures are easier to see if the files are opened in a spreadsheet program.

Command-line Classifying with Dereplication – 16S

If you have a large number of sequences per sample and there are many duplicate reads, you may be able to speed up classifying them by first dereplicating the sequences. You might compare execution times for one or a few samples with and without dereplication to help you decide between the two methods (dereplication takes time, too). The script for classifying 16S fastq files with dereplication is:

#!/bin/bash
# Command line classifier with dereplication 16s
# Configure paths
# RDPToolsDir=/mnt/research/rdp/public/RDPTools # Path to RDPTools on MSU's HPCC
RDPToolsDir=/usr/local/RDPTools # Path to RDPTools on my local installation

for f in $(ls trimmed_seqs/*.fastq)
do
# Dereplicate
java -Xmx2g -jar $RDPToolsDir/Clustering.jar derep -u -o ${f/.fastq/_derep.fastq} {f/fastq/ids} ${f/fastq/sample} $f
# Classify and expand
java -Xmx2g -jar $RDPToolsDir/classifier.jar classify -c 0.5 -f fixrank -g 16srrna -o ${f/.fastq/_classified.txt} -h ${f/.fastq/_hier.txt} ${f/.fastq/_derep.fastq},${f/fastq/ids} # No spaces allowed between the last two arguments on this line.
done

# Merge individual sample hier files
mkdir cnadjusted
mv trimmed_seqs/cn*hier.txt cnadjusted
java -Xmx2g -jar $RDPToolsDir/classifier.jar merge-count cn_merged_hier.txt cnadjusted/*hier.txt
java -Xmx2g -jar $RDPToolsDir/classifier.jar merge-count merged_hier.txt trimmed_seqs/*hier.txt

As written, the script is run from one directory above directory trimmed_seqs holding the fastq files to be classified. IMPORTANT: In the second java line, the last two items (fastq file and corresponding ids file) must be separated by a comma only – no spaces. Otherwise the script will not run. Separate classification (both classification table and hierarchical formats) files are written for each sample. The counts for the hierarchy files are exploded back to the original number of counts in the samples. The last two lines combine the separate hierarchical output files. Counts in cn_merged_hier.txt are adjusted for copy number.

Try classifying 16S rRNA gene sequences using the script and sequences in the file command_line_classifier_with_dereplication_16s.zip. Create a sub-directory in your home directory, put the downloaded file in it, and then unzip the file. Make the script file executable, and edit the path to RDPTools if necessary. Then run the script. Watch your terminal screen as the files are processed, noting the ratio between total sequences and unique sequences. For this data set, dereplication reduces the number of sequences that have to be classified roughly in half.

Command-line Classifying with Dereplication – Non-16S

There is no copy number adjustment for gene sequences other than 16S. Thus an appropriate script for classification of 28S (LSU) fungal sequences with dereplication is:

#!/bin/bash
# Command line classifier with dereplication 28s
# Configure paths
# RDPToolsDir=/mnt/research/rdp/public/RDPTools # Path to RDPTools on MSU's HPCC
RDPToolsDir=/usr/local/RDPTools # Path to RDPTools on my local installation

for f in $(ls trimmed_seqs/*.fastq)
do
# Dereplicate
java -Xmx2g -jar $RDPToolsDir/Clustering.jar derep -u -o ${f/.fastq/_derep.fastq} ${f/fastq/ids} ${f/fastq/sample} $f
# Classify and expand
java -Xmx2g -jar $RDPToolsDir/classifier.jar classify -c 0.5 -f fixrank -g fungallsu -o ${f/.fastq/_classified.txt} -h ${f/.fastq/_hier.txt} ${f/.fastq/_derep.fastq},${f/fastq/ids} # No spaces allowed between the last two arguments on this line.
done

# Merge hier files
java -Xmx2g -jar $RDPToolsDir/classifier.jar merge-count merged_hier.txt trimmed_seqs/*hier.txt

The script is the same except for the gene specified and the lines having to do with copy number adjustment have been removed. Try running this script with the example data provided in the file command_line_classifier_with_dereplication_28s.zip. By now you know the drill of creating a directory, downloading and unzipping the example files, editing the path, making the script executable if it is not already, and running the script. Again, watch your terminal as the files are processed, paying attention to the total and unique sequences for each sample. Is dereplication worthwhile for this data set?