Stand-alone RDP Classifier Tutorial

Introduction

The RDP Classifier is also a multi-classifier. That is, it can classify a single fasta or fastq sequence, all of the sequences in a multiple fasta or fastq file (e.g. a file listing all of the sequences for a single sample), and all of the sequences in more than one multiple fasta or fastq file (e.g. files for several samples). The tutorial on the Command Line Supervised Approach (RDP Classifier) page provides instructions for the latter case in a Linux environment. But Linux/Mac OS and Windows environments differ in several respects: the line endings, the slashes used in directory paths, and the file extensions used for script files. For that reason, I provide separate tutorials for the two environments on this page.

Linux and Mac OS Environments

To begin, open the terminal. Create and then move into the directory classify_16S in your home directory using the following commands. You may copy and paste the commands into the terminal.

cd
mkdir classify_16S
cd classify_16S

Download the file classify_16s_linux.zip, place it in the directory classify_16S and unzip it. This will produce the following files:

~/classify_16S
     command_line_classify.sh
     Native_1_2.fastq
     single_16s_seq.fastq
     trimmed_seqs
          Native_1_2.fastq
          Native_1_4.fastq
          USGA_1_7_A.fastq
          USGA_2_7_A.fastq

The file single_16s_seq.fastq is just that: a fastq file containing a single sequence. Native_1_2.fastq is a multiple fastq file containing 118 partial 16S rRNA gene sequences. The directory trimmed_seqs contains four multiple fastq files; these came from four different samples in an experiment.
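
As a quick sanity check after unzipping, the number of records in a fastq file can be counted from the terminal. The sketch below assumes the common fastq layout of exactly four lines per record (header, sequence, "+" separator, qualities) with no wrapped sequence lines; count_fastq_records is a name chosen here for illustration and is not part of the RDP Classifier.

```shell
# Count the records in a fastq file.
# Assumption: every record occupies exactly four lines (no line wrapping).
count_fastq_records() {
    awk 'END {
        # A line count that is not a multiple of 4 means the assumption fails.
        if (NR % 4 != 0) exit 1
        print NR / 4
    }' "$1"
}
```

If the file matches the description above, running count_fastq_records Native_1_2.fastq should report 118. A non-zero exit status with no output suggests the file does not follow the four-lines-per-record layout.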

Classify a Single Sequence

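Assuming you have installed the RDP Classifier in the directory C:/rdp_classifier_2.14 (which appears as /mnt/c/rdp_classifier_2.14 in a Linux terminal running under the Windows Subsystem for Linux) and are in the directory ~/classify_16S, classify the sequence with the command: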

java -Xmx2g -jar /mnt/c/rdp_classifier_2.14/dist/classifier.jar classify --gene 16srrna --conf 0.5 --format fixrank --outputFile single_seq_classified.txt single_16s_seq.fastq

If you installed the classifier in a different directory, edit the path to the classifier.jar file as appropriate. The results will be written to the file single_seq_classified.txt. You may open it in a text editor, spreadsheet program and/or inspect it with the less command:

less -S single_seq_classified.txt

The above command gives arguments for the parameters --gene, --conf, --format and --outputFile. The sequence to be classified is at the end of the command. The --conf parameter specifies the confidence level to be used if the results are to be filtered by confidence. A value of 0.5 is suggested for partial 16S rRNA gene sequences and 0.8 for full length sequences.
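
If you want to apply a confidence threshold to the fixrank output yourself, an awk filter is one option. This is a minimal sketch, not the classifier's own filtering. It assumes each tab-delimited line holds the sequence name, an orientation field (often empty), and then repeating name/rank/confidence triplets; check this against your own output file before relying on it. The function name genus_at_conf is hypothetical.

```shell
# Print "sequence name <TAB> genus <TAB> confidence" for every sequence whose
# genus-level confidence is at least the threshold (second argument, default 0.5).
# Assumed layout: field 1 = sequence name, field 2 = orientation,
# fields 3.. = repeating (name, rank, confidence) triplets.
genus_at_conf() {
    awk -F '\t' -v min="${2:-0.5}" '
        {
            for (i = 3; i + 2 <= NF; i += 3)
                if ($(i + 1) == "genus" && $(i + 2) + 0 >= min + 0)
                    print $1 "\t" $i "\t" $(i + 2)
        }' "$1"
}
```

For example, genus_at_conf single_seq_classified.txt 0.5 would list the sequence only if its genus assignment reaches the 0.5 level suggested above for partial sequences; use 0.8 as the second argument for full length sequences.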

The RDP Classifier contains three databases besides the one for 16S rRNA gene sequences. These are fungallsu for fungal 28S rRNA gene sequences and fungalits_warcup and fungalits_unite for fungal ITS sequences. To use one of these databases, just substitute its name for 16srrna in the command above.

To see a list of all of the command parameters with explanations of their possible arguments, enter the following into the terminal:

java -Xmx2g -jar /mnt/c/rdp_classifier_2.14/dist/classifier.jar classify

Classify Sequences in a Single Multiple Fastq File

Classify all of the sequences in the file Native_1_2_A.fastq (listed above as Native_1_2.fastq; use whichever name appears in your directory) by entering the following command into the terminal:

java -Xmx2g -jar /mnt/c/rdp_classifier_2.14/dist/classifier.jar classify --gene 16srrna --conf 0.5 --format fixrank --outputFile multi_seq_classified.txt Native_1_2_A.fastq

If you installed the classifier in a different directory, edit the path to the classifier.jar file as appropriate. The results will be written to the file multi_seq_classified.txt. You may open it in a text editor, spreadsheet program and/or inspect it with the less command:

less -S multi_seq_classified.txt

Classify Sequences in Several Multiple Fastq Files

Classify all of the sequences in all four of the fastq files in the trimmed_seqs directory with the command:

java -Xmx2g -jar /mnt/c/rdp_classifier_2.14/dist/classifier.jar classify --gene 16srrna --conf 0.5 --format fixrank --outputFile classified.txt --hier_outfile hier.txt trimmed_seqs/*.fastq

The outputFile is required. It gives the classification of all of the sequences in all of the files, but because it does not indicate which sequences came from which files it is of little use.
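
If per-sequence results are needed for each sample, one way around this limitation is to run the classifier once per input file so that each sample gets its own output file. The sketch below only prints the commands it would run (a dry run). The helper name per_sample_cmds and the _classified.txt naming are choices made here for illustration, and the jar path is the one assumed throughout this tutorial.

```shell
# Path to classifier.jar as assumed in this tutorial; edit if installed elsewhere.
CLASSIFIER_JAR=/mnt/c/rdp_classifier_2.14/dist/classifier.jar

# Print one classify command per fastq file in the given directory.
# Each command writes to <sample>_classified.txt in the current directory.
per_sample_cmds() {
    for f in "$1"/*.fastq; do
        [ -e "$f" ] || continue              # the glob matched nothing
        sample=$(basename "$f" .fastq)       # e.g. USGA_1_7_A
        echo java -Xmx2g -jar "$CLASSIFIER_JAR" classify --gene 16srrna \
            --conf 0.5 --format fixrank \
            --outputFile "${sample}_classified.txt" "$f"
    done
}
```

Running per_sample_cmds trimmed_seqs lists the four commands for inspection. To execute them, remove the word echo from the function; this keeps the quoting intact for paths that contain spaces.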

The files hier.txt and cnadjusted_hier.txt are of greater use because they contain separate columns for each sample. The file hier.txt gives the raw counts per OTU for each sample. These counts are adjusted for copy number (the number of times the 16S rRNA gene occurs in the genome of a taxon) in the cnadjusted_hier.txt file. The best way to view the contents of these files is to open them in a spreadsheet program and filter the rank column to include only the lowest rank and empty cells. The taxonomy will then be given in the second column and the counts or adjusted counts under headings for each file/sample beginning in the fifth column. Results in these files can also be imported into a phyloseq object containing OTU and taxonomy tables with the RDPutils::hier2phyloseq function. See the GitHub page for how to install the R package RDPutils.
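
The same filtering can be done in the terminal instead of a spreadsheet. The following is a rough sketch under two assumptions that you should verify against your own file: that the rank is in the fourth tab-delimited column, and that the lowest rank is genus. The second column (taxonomy) and the columns from the fifth onward (counts per sample) follow the description above. The function name lowest_rank_rows is hypothetical.

```shell
# Keep the header row plus rows whose rank is the lowest rank or empty,
# printing the taxonomy (column 2) followed by the per-sample counts (column 5 on).
# Assumptions: rank is in column 4; the lowest rank defaults to "genus".
lowest_rank_rows() {
    awk -F '\t' -v rank="${2:-genus}" '
        NR == 1 || $4 == rank || $4 == "" {
            line = $2
            for (i = 5; i <= NF; i++) line = line "\t" $i
            print line
        }' "$1"
}
```

For example, lowest_rank_rows hier.txt > hier_lowest.txt writes a reduced table with one taxonomy column and one count column per sample; the same call works on cnadjusted_hier.txt.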

Windows Environments

To begin, open a terminal by searching for either cmd or PowerShell. Create and then move into the directory C:\classify_16S using the following commands. You may copy and paste the commands into the terminal.

cd C:\
mkdir classify_16S
cd classify_16S

Download the file classify_16s_windows.zip, place it in the directory classify_16S and unzip it. This will produce the following files:

C:\classify_16S
     command_line_classify.bat
     Native_1_2.fastq
     single_16s_seq.fastq
     trimmed_seqs
          Native_1_2.fastq
          Native_1_4.fastq
          USGA_1_7_A.fastq
          USGA_2_7_A.fastq

The file single_16s_seq.fastq is just that: a fastq file containing a single sequence. Native_1_2.fastq is a multiple fastq file containing 118 partial 16S rRNA gene sequences. The directory trimmed_seqs contains four multiple fastq files; these came from four different samples in an experiment.

Classify a Single Sequence

Assuming you have installed the RDP Classifier in the directory C:\rdp_classifier_2.14 and are still in the directory C:\classify_16S, classify the sequence single_16s_seq.fastq with the command:

java -Xmx2g -jar C:\rdp_classifier_2.14\dist\classifier.jar classify --gene 16srrna --conf 0.5 --format fixrank --outputFile single_seq_classified.txt single_16s_seq.fastq

The results are written to the file single_seq_classified.txt. To view the results, open the directory C:\classify_16S in the file manager and double click on the file single_seq_classified.txt. For easier viewing, make sure that Word Wrap in the view menu is unchecked. The file is tab-delimited, so you can also import it into a spreadsheet if you wish.

The above command gives arguments for the parameters --gene, --conf, --format and --outputFile. The sequence to be classified is at the end of the command. The --conf parameter specifies the confidence level to be used if the results are to be filtered by confidence. A value of 0.5 is suggested for partial 16S rRNA gene sequences and 0.8 for full length sequences.

The RDP Classifier contains three databases besides the one for 16S rRNA gene sequences. These are fungallsu for fungal 28S rRNA gene sequences and fungalits_warcup and fungalits_unite for fungal ITS sequences. To use one of these databases, just substitute its name for 16srrna in the command above.

To see a list of all of the command parameters with explanations of their possible arguments, enter the following into the terminal:

java -Xmx2g -jar C:\rdp_classifier_2.14\dist\classifier.jar classify

Classify Sequences in a Single Multiple Fastq File

Classify all of the sequences in the file Native_1_2_A.fastq (listed above as Native_1_2.fastq; use whichever name appears in your directory) by entering the following command into the terminal:

java -Xmx2g -jar C:\rdp_classifier_2.14\dist\classifier.jar classify --gene 16srrna --conf 0.5 --format fixrank --outputFile multi_seq_classified.txt Native_1_2_A.fastq

If you installed the classifier in a different directory, edit the path to the classifier.jar file as appropriate. The results will be written to the file multi_seq_classified.txt. You may open it in a text editor, spreadsheet program and/or inspect it with the more command:

more multi_seq_classified.txt

Classify Sequences in Several Multiple Fastq Files

Classify all of the sequences in all of the multiple fastq files in the trimmed_seqs directory with the command:

java -Xmx2g -jar C:\rdp_classifier_2.14\dist\classifier.jar classify --gene 16srrna --conf 0.5 --format fixrank --outputFile classified.txt --hier_outfile hier.txt trimmed_seqs\*.fastq

The outputFile is required. It gives the classification of all of the sequences in all of the files, but because it does not indicate which sequences came from which files it is of little use.

The files hier.txt and cnadjusted_hier.txt are much more useful because they contain separate columns for each sample. The file hier.txt gives the raw counts per OTU for each sample. In the cnadjusted_hier.txt file these counts are adjusted for copy number (the number of times the 16S rRNA gene occurs in the genome of a taxon). The best way to view the contents of these files is to open them in a spreadsheet program and filter the rank column to include only the lowest rank and empty cells. The taxonomy will then be given in the second column and the counts or adjusted counts under headings for each file/sample beginning in the fifth column. Results in these files can also be imported into a phyloseq object containing OTU and taxonomy tables with the RDPutils::hier2phyloseq function. See the GitHub page for how to install the R package RDPutils.