Introduction
The RDP Classifier is also a multi-classifier. That is, it can classify a single fasta or fastq sequence, all of the sequences in a multiple fasta or fastq file (e.g. a file listing all of the sequences for a single sample), and all of the sequences in more than one multiple fasta or fastq file (e.g. files for several samples). The tutorial on the Command Line Supervised Approach (RDP Classifier) page provides instructions for the latter case in a Linux environment. But Linux/Mac OS and Windows environments differ in several respects: the line endings, the slashes used in directory paths, and the file extensions used for script files. For that reason, I provide separate tutorials for the two environments on this page.
Linux and Mac OS Environments
To begin, open the terminal. Create and then move into the directory classify_16S in your home directory using the following commands. You may copy and paste the commands into the terminal.
cd
mkdir classify_16S
cd classify_16S
Download the file classify_16s_linux.zip, place it in the directory classify_16S and unzip it. This will produce the following files:
~/classify_16S
    command_line_classify.sh
    Native_1_2.fastq
    single_16s_seq.fastq
    trimmed_seqs
        Native_1_2.fastq
        Native_1_4.fastq
        USGA_1_7_A.fastq
        USGA_2_7_A.fastq
The file single_16s_seq.fastq is just that: a fastq file containing a single sequence. Native_1_2.fastq is a multiple fastq file containing 118 partial 16S rRNA gene sequences. The directory trimmed_seqs contains four multiple fastq files; these came from four different samples in an experiment.
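If you want to confirm how many sequences a fastq file holds before classifying it, a quick record count can help. The sketch below is my own helper (the name count_fastq_records is hypothetical, not part of the RDP Classifier), and it assumes the common fastq layout of exactly four lines per record with no wrapped sequence lines.

```shell
# Count records in a fastq file.
# ASSUMPTION: four lines per record (header, sequence, "+", qualities),
# with no line-wrapped sequences. Files that break this rule give wrong counts.
count_fastq_records() {
    awk 'END { print int(NR / 4) }' "$1"
}
```

Called as count_fastq_records Native_1_2.fastq, it should report the 118 sequences mentioned above if the file follows that layout.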
Classify a Single Sequence
Assuming you have installed the RDP Classifier in the directory C:/rdp_classifier_2.14 and are in the directory ~/classify_16S, classify the sequence with the command:
java -Xmx2g -jar /mnt/c/rdp_classifier_2.14/dist/classifier.jar classify --gene 16srrna --conf 0.5 --format fixrank --outputFile single_seq_classified.txt single_16s_seq.fastq
If you installed the classifier in a different directory, edit the path to the classifier.jar file as appropriate. The results will be written to the file single_seq_classified.txt. You may open it in a text editor or spreadsheet program, or inspect it with the less command:
less -S single_seq_classified.txt
The above command gives arguments for the parameters --gene, --conf, --format and --outputFile. The sequence to be classified is at the end of the command. The --conf parameter specifies the confidence level to be used if the results are to be filtered by confidence. A value of 0.5 is suggested for partial 16S rRNA gene sequences and 0.8 for full length sequences.
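To make the idea of filtering by confidence concrete, here is a minimal sketch that reads a fixrank output file and reports, for each sequence, the deepest rank whose confidence meets a cutoff. The function name deepest_confident is hypothetical, and the column layout is my assumption: tab-delimited, with the sequence ID in column 1, a strand column, and then repeating name/rank/confidence triples. Check a line of your own output before relying on it.

```shell
# For each sequence, print: id, deepest rank meeting the cutoff, taxon name.
# ASSUMED layout (tab-delimited): id, strand, then name/rank/confidence triples.
deepest_confident() {
    cutoff="$1"; file="$2"
    awk -F'\t' -v c="$cutoff" '{
        name = "unclassified"; rank = "NA"
        # Walk the triples from the highest rank down; stop at the first
        # rank whose confidence falls below the cutoff.
        for (i = 3; i + 2 <= NF; i += 3) {
            if ($(i + 2) + 0 < c + 0) break
            name = $i; rank = $(i + 1)
        }
        print $1 "\t" rank "\t" name
    }' "$file"
}
```

For example, deepest_confident 0.5 single_seq_classified.txt would apply the cutoff suggested for partial sequences.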
The RDP Classifier contains three databases besides the one for 16S rRNA gene sequences. These are fungallsu for fungal 28S rRNA gene sequences and fungalits_warcup and fungalits_unite for fungal ITS sequences. To use one of these databases, just substitute its name for 16srrna in the command above.
To see a list of all of the command parameters with explanations of their possible arguments, enter the following into the terminal:
java -Xmx2g -jar /mnt/c/rdp_classifier_2.14/dist/classifier.jar classify
Classify Sequences in a Single Multiple Fastq File
Classify all of the sequences in the file Native_1_2_A.fastq by entering the following command into the terminal:
java -Xmx2g -jar /mnt/c/rdp_classifier_2.14/dist/classifier.jar classify --gene 16srrna --conf 0.5 --format fixrank --outputFile multi_seq_classified.txt Native_1_2_A.fastq
If you installed the classifier in a different directory, edit the path to the classifier.jar file as appropriate. The results will be written to the file multi_seq_classified.txt. You may open it in a text editor or spreadsheet program, or inspect it with the less command:
less -S multi_seq_classified.txt
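With many sequences in one file, a tally per genus is often easier to read than the raw output. The following is a rough sketch (genus_tally is a hypothetical helper of mine) under the same assumed fixrank layout: tab-delimited, with the ID, a strand column, and then name/rank/confidence triples. It counts every genus assignment regardless of confidence.

```shell
# Tally sequences per genus in a fixrank file, most frequent first.
# ASSUMED layout (tab-delimited): id, strand, then name/rank/confidence triples.
# NOTE: confidence values are ignored here; every assignment is counted.
genus_tally() {
    awk -F'\t' '{
        for (i = 3; i + 2 <= NF; i += 3)
            if ($(i + 1) == "genus") print $i
    }' "$1" | sort | uniq -c | sort -rn
}
```

It could be run as genus_tally multi_seq_classified.txt.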
Classify Sequences in Several Multiple Fastq Files
Classify all of the sequences in all four of the fastq files in the trimmed_seqs directory with the command:
java -Xmx2g -jar /mnt/c/rdp_classifier_2.14/dist/classifier.jar classify --gene 16srrna --conf 0.5 --format fixrank --outputFile classified.txt --hier_outfile hier.txt trimmed_seqs/*.fastq
The outputFile is required. It gives the classification of all of the sequences in all of the files, but because it does not indicate which sequences came from which files it is of little use.
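If you do need per-sequence classifications kept apart by sample, one possible workaround is to run the classifier once per file. The sketch below only prints the commands it would run, so you can inspect them first; classify_each is a hypothetical helper, the output names ending in _classified.txt are my own choice, and this is not necessarily what the supplied command_line_classify.sh does. Paths containing spaces would need extra quoting.

```shell
# Print (do not run) one classify command per fastq file in a directory,
# so that each sample gets its own output file.
# Arguments: path to classifier.jar, directory holding the fastq files.
classify_each() {
    jar="$1"; dir="$2"
    for f in "$dir"/*.fastq; do
        [ -e "$f" ] || continue            # skip if the glob matched nothing
        sample=$(basename "$f" .fastq)     # e.g. USGA_1_7_A
        echo java -Xmx2g -jar "$jar" classify --gene 16srrna --conf 0.5 \
            --format fixrank --outputFile "${sample}_classified.txt" "$f"
    done
}
```

For example, classify_each /mnt/c/rdp_classifier_2.14/dist/classifier.jar trimmed_seqs lists the commands; piping that output to sh would execute them.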
The files hier.txt and cnadjusted_hier.txt are of greater use because they contain separate columns for each sample. The file hier.txt gives the raw counts per OTU for each sample. These counts are adjusted for copy number (the number of times the 16S rRNA gene occurs in a taxon) in the cnadjusted_hier.txt file. The best way to visualize the contents of these files is to open them in a spreadsheet program and filter the rank column to include only the lowest rank and empty cells. Then the taxonomy will be given in the second column and the counts or adjusted counts under headings for each file/sample beginning in the fifth column. Results in these files can also be imported into a phyloseq object containing OTU and taxonomy tables with the RDPutils::hier2phyloseq function. See the GitHub page for how to install the R package RDPutils.
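The spreadsheet filter described above can also be approximated at the command line. This is a sketch only: hier_rank_rows is a hypothetical helper, and I am assuming the rank sits in the fourth tab-delimited column (after taxid, lineage and name), which fits the description of taxonomy in the second column and counts from the fifth. Check the header line of your own file first.

```shell
# Keep the header line plus rows whose rank matches, or whose rank is empty.
# ASSUMED columns (tab-delimited): taxid, lineage, name, rank, then one
# count column per file/sample.
hier_rank_rows() {
    rank="$1"; file="$2"
    awk -F'\t' -v r="$rank" 'NR == 1 || $4 == r || $4 == ""' "$file"
}
```

For example, hier_rank_rows genus hier.txt > hier_genus.txt writes the filtered table to a new file (hier_genus.txt is just an example name).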
Windows Environments
To begin, open a terminal by searching for either cmd or PowerShell. Create and then move into the directory classify_16S in your home directory using the following commands. You may copy and paste the commands into the terminal.
cd
mkdir classify_16S
cd classify_16S
Download the file classify_16s_windows.zip, place it in the directory classify_16S and unzip it. This will produce the following files:
C:\classify_16S
    command_line_classify.bat
    Native_1_2.fastq
    single_16s_seq.fastq
    trimmed_seqs
        Native_1_2.fastq
        Native_1_4.fastq
        USGA_1_7_A.fastq
        USGA_2_7_A.fastq
The file single_16s_seq.fastq is just that: a fastq file containing a single sequence. Native_1_2.fastq is a multiple fastq file containing 118 partial 16S rRNA gene sequences. The directory trimmed_seqs contains four multiple fastq files; these came from four different samples in an experiment.
Classify a Single Sequence
Assuming you have installed the RDP Classifier in the directory C:\rdp_classifier_2.14, and are still in the directory C:\classify_16S, classify the sequence single_16s_seq.fastq with the command:
java -Xmx2g -jar C:\rdp_classifier_2.14\dist\classifier.jar classify --gene 16srrna -c 0.5 --format fixrank --outputFile single_seq_classified.txt single_16s_seq.fastq
The results are written to the file single_seq_classified.txt. To view the results, open the directory C:\classify_16S in the file manager and double click on the file single_seq_classified.txt. For easier viewing, make sure that Word Wrap in the view menu is unchecked. The file is tab-delimited, so you can also import it into a spreadsheet if you wish.
The above command gives arguments for the parameters --gene, --conf (given here in its short form, -c), --format and --outputFile. The sequence to be classified is at the end of the command. The --conf parameter specifies the confidence level to be used if the results are to be filtered by confidence. A value of 0.5 is suggested for partial 16S rRNA gene sequences and 0.8 for full length sequences.
The RDP Classifier contains three databases besides the one for 16S rRNA gene sequences. These are fungallsu for fungal 28S rRNA gene sequences and fungalits_warcup and fungalits_unite for fungal ITS sequences. To use one of these databases, just substitute its name for 16srrna in the command above.
To see a list of all of the command parameters with explanations of their possible arguments, enter the following into the terminal:
java -Xmx2g -jar C:\rdp_classifier_2.14\dist\classifier.jar classify
Classify Sequences in a Single Multiple Fastq File
Classify all of the sequences in the file Native_1_2_A.fastq by entering the following command into the terminal:
java -Xmx2g -jar C:\rdp_classifier_2.14\dist\classifier.jar classify --gene 16srrna --conf 0.5 --format fixrank --outputFile multi_seq_classified.txt Native_1_2_A.fastq
If you installed the classifier in a different directory, edit the path to the classifier.jar file as appropriate. The results will be written to the file multi_seq_classified.txt. You may open it in a text editor or spreadsheet program, or inspect it with the more command:
more multi_seq_classified.txt
Classify Sequences in Several Multiple Fastq Files
Classify all of the sequences in all of the multiple fastq files in the trimmed_seqs directory with the command:
java -Xmx2g -jar C:\rdp_classifier_2.14\dist\classifier.jar classify --gene 16srrna --conf 0.5 --format fixrank --outputFile classified.txt --hier_outfile hier.txt trimmed_seqs\*.fastq
The outputFile is required. It gives the classification of all of the sequences in all of the files, but because it does not indicate which sequences came from which files it is of little use.
The files hier.txt and cnadjusted_hier.txt are much more useful because they contain separate columns for each sample. The file hier.txt gives the raw counts per OTU for each sample. In the cnadjusted_hier.txt file these counts are adjusted for copy number (the number of times the 16S rRNA gene occurs in a taxon). The best way to visualize the contents of these files is to open them in a spreadsheet program and filter the rank column to include only the lowest rank and empty cells. Then the taxonomy will be given in the second column and the counts or adjusted counts under headings for each file/sample beginning in the fifth column. Results in these files can also be imported into a phyloseq object containing OTU and taxonomy tables with the RDPutils::hier2phyloseq function. See the GitHub page for how to install the R package RDPutils.