Training the RDP Classifier

To train the RDP classifier, you need a sequence file and a taxonomy file, each with special formatting requirements. Additionally, you need two “secret” python scripts: lineage2taxTrain.py  and addFullLineage.py. I euphemistically refer to them as secret because it seems that until recently they were available only by request from RDP staff. Now you can get them from here, and from https://github.com/GLBRC-TeamMicrobiome/python_scripts.

Sequence file

The sequence file must be in fasta format and contain a unique identifier without any white space. The accession number makes a good identifier. Anything after the first white space is ignored. The following are acceptable:

>DQ248313
ACGATTTTGACCCTTCGGGGTCGATCTCCAACCCTTTGTCTACCTTCCTTGTTGCTTTGGCGGGCCGATGTTCGTTCTCGCGAACGACACCGCTGGCCTGACGGCTGGTGCGCGCCCGCCAGAGTCCACCAAA
>JF735302 k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Nectriaceae;g__Dactylonectria;s__Dactylonectria_anthuriicol
CCGAGTTTTCAACTCCCAAACCCCTGTGAACATACCATTTTGTTGCCTCGGCGGTGCCTGTTCCGACAGCCCGCCAGAGGACCCCAAACCCAAATTTCCTTGAGTGAGTCTTCTGAGTAACCGATTAAATAAAT

Taxonomy file

The taxonomy file is a tab-delimited text file beginning with a header giving the Sequence ID and names of ranks to be included, for example Domain, Phylum, Class, Order, Family, Genus, Species. The subsequent lines begin with the unique identifiers from the fasta file followed by the taxonomy for each sequence. There are two requirements for this file:

  1. There must be an entry for every rank in every line. Hyphen placeholders are allowed but are not recommended.
  2.  “Convergent evolution” is not allowed. For example, the same genus cannot appear in two different families.

The following is acceptable:

Seq_ID Kingdom Phylum Class Order Family Genus Species
DQ248313 Fungi Ascomycota Xylonomycetes Symbiotaphrinales Symbiotaphrinaceae Symbiotaphrinaceae Symbiotaphrina_buchneri
JF735302 Fungi Ascomycota Sordariomycetes Hypocreales Nectriaceae Dactylonectria Dactylonectria_anthuriicola
JF735264 Fungi Ascomycota Sordariomycetes Hypocreales Nectriaceae Ilyonectria Ilyonectria_robusta
AY677273 Fungi Ascomycota Sordariomycetes Hypocreales Nectriaceae Ilyonectria Ilyonectria_radicicola

If a rank does not exist, you can fill in the missing entries with hyphens as in the table below. This is what the script FormatRefDB.py  in CONSTAX does, but I do not recommend doing this. Using a hyphen as a place holder leads to a “ragged” classification that cannot be properly sorted by rank.

Seq_ID Kingdom Phylum Class Order Family Genus Species
MG190602 Fungi Ascomycota Sordariomycetes Hypocreales Hypocreales_sp
DQ655721 Fungi Ascomycota Sordariomycetes Hypocreales Nectriaceae Nectriaceae_sp
KM225681 Fungi Ascomycota Sordariomycetes Hypocreales Nectriaceae Thyronectria Thyronectria_lamyi
MF120484 Fungi Ascomycota Sordariomycetes Hypocreales Nectriaceae Fusarium Fusarium_redolens

My approach is to fill in the empty spaces with made-up but meaningful ranks as in the table below. The prefixes indicate the highest rank available. The absence of hyphen placeholders means that classification will not be ragged but include all ranks. Thus it will be possible to select, sort, and merge ranks when analyzing your data later.

Seq_ID Kingdom Phylum Class Order Family Genus Species
MG190602 Fungi Ascomycota Sordariomycetes Hypocreales o_Hypocreales o_f_Hypocreales Hypocreales_sp
DQ655721 Fungi Ascomycota Sordariomycetes Hypocreales Nectriaceae f_Nectriaceae Nectriaceae_sp
KM225681 Fungi Ascomycota Sordariomycetes Hypocreales Nectriaceae Thyronectria Thyronectria_lamyi
MF120484 Fungi Ascomycota Sordariomycetes Hypocreales Nectriaceae Fusarium Fusarium_redolens

In some cases the above approach can lead to problems because the UNITE classification is not consistent. For example, for the sequences in the table below there were different empty ranks in what is really the same lineage. Thus my approach creates two different lineages for the same species. In this case GS18 could be placed in all ranks for class through genus, but a more general approach is to append unique numbers to the species name, i.e.  GS18_sp_1 and GS18_s_2.

Seq_ID Kingdom Phylum Class Order Family Genus Species
KY687548 Fungi Olpidiomycota GS18 c_GS18 o_c_GS18 GS18 GS18_sp_1
UDB014615 Fungi Olpidiomycota GS18 GS18 c_GS18 o_c_GS18 GS18_sp_2

The following is an example of unallowed “convergent” evolution:

Seq_ID Kingdom Phylum Class Order Family Genus Species
KM246227 Fungi Basidiomycota Tremellomycetes Tremellales Tremellaceae Cryptococcus Cryptococcus_sp
AF444372 Fungi Basidiomycota Tremellomycetes Tremellales Cryptococcaceae Cryptococcus Cryptococcus_sp

The solution for this example is to make the family names the same according to some authority in the field. For example, according to MycoBank the correct family here is Tremellaceae, so change Cryptococcaceae to Tremellaceae.

So where to get the taxonomy? Having the accession IDs, you can download the taxonomy from NCBI, or in some cases, as with the UNITE reference files, you can parse it from the sequence headers. In either case, some scripting is required to get everything into proper form. We definitely do not want to try typing in tens of thousands of lines without making any mistakes!

To continue this tutorial, I will show how to train the RDP classifier using the UNITE general release database file. UNITE provides pre-formatted files for QIIME, mothur and USEARCH, but not specifically for the RDP classifier. The main problem is that, because fungal taxonomy is such a mess, there are many missing intermediate ranks, indicated in the sequence headers as “unidentified.” These need to be handled some other way. I make up meaningful intermediate ranks, as explained above. Also, some sequences are actually duplicates, which need to be removed. To begin, create a working directory, enter it, download the references sequences and unzip them. You can also delete files you will not need.

cd
mkdir unite_db
cd unite_db
wget https://files.plutof.ut.ee/public/orig/EB/0C/EB0CCB3A871B77EA75E472D13926271076904A588D2E1C1EA5AFCF7397D48378.zip
unzip EB0CCB3A871B77EA75E472D13926271076904A588D2E1C1EA5AFCF7397D48378.zip
rm -rf developer
rm DB_readme.txt 
rm EB0CCB3A871B77EA75E472D13926271076904A588D2E1C1EA5AFCF7397D48378.zip

Download the three python scripts you will need and unzip them to the same directory, or to a scripts directory if you prefer. Make them executable.

wget https://john-quensen.com/wp-content/uploads/2019/05/classifier_training_scripts.zip
unzip classifier_training_scripts.zip
chmod 755 *.py

Remove duplicate sequences. You may need to edit the path to RDPTools in the command below to match your installation.

java -Xmx4g -jar /usr/local/RDPTools/classifier.jar rm-dupseq -d -i sh_general_release_dynamic_02.02.2019.fasta -o db_file.fasta

The script unite2rdp_train_files.py reads the db_file.fasta and creates a suitable taxonomy file named rawTaxonomy.txt and a matching fasta file named rawSeqs.fasta.  It solves some problems particular to this release of the UNITE database and likely will not work with other reference fasta files. See the detailed explanation of the script to learn how it is specific to this data set and how it may be modified to deal with other reference sets. Run the script from the directory containing db_file.fasta:

 python unite2rdp_train_files.py db_file.fasta

The files rawTaxonomy.txt and rawSeqs.fasta are written to the same directory. Now that we have the necessary files, we can train the classifier with the following code:

python lineage2taxTrain.py rawTaxonomy.txt > ready4train_taxonomy.txt
python addFullLineage.py rawTaxonomy.txt rawSeqs.fasta > ready4train_seqs.fasta
java -Xmx10g -jar /usr/local/RDPTools/classifier.jar train -o training_files -s ready4train_seqs.fasta -t ready4train_taxonomy.txt

Rinse and repeat

Ooops! We got errors:

edu.msu.cme.rdp.classifier.train.NameRankDupException: Error: duplicate taxon name and rank in the taxonomy file.
cryptococcus genus 2
aleurina genus 2

Find the source of the problems using grep:

grep "Cryptococcus" db_file.fasta
grep "Aleurina" db_file.fasta

Among the output I found:

>Cryptococcus_sp|HQ623619|SH1528221.08FU|reps|k__Fungi;p__Basidiomycota;c__Tremellomycetes;o__Tremellales;f__Tremellaceae;g__Cryptococcus;s__Cryptococcus_sp 
>Cryptococcus_consortionis|KJ707235|SH1631625.08FU|reps|k__Fungi;p__Basidiomycota;c__Tremellomycetes;o__Tremellales;f__Cryptococcaceae;g__Cryptococcus;s__Cryptococcus_consortionis 

>Aleurina_sp|JX316298|SH1558168.08FU|reps|k__Fungi;p__Ascomycota;c__Pezizomycetes;o__Pezizales;f__Pyronemataceae;g__Aleurina;s__Aleurina_sp	
>Aleurina_argentina|KC905032|SH1558167.08FU|refs|k__Fungi;p__Ascomycota;c__Pezizomycetes;o__Pezizales;f__Pezizaceae;g__Aleurina;s__Aleurina_argentina

So, there are two families for the genus Cryptococcus: Tremellaceae and Cryptococcaceae. And there are two families for the genus Aleurina: Pyronemataceae and Pezizaceae.  I used sed to substitute family names with the following commands:

sed -i 's/f__Cryptococcaceae;g__Cryptococcus/f__Tremellaceae;g__Cryptococcus/' db_file.fasta
sed -i 's/f__Pezizaceae;g__Aleurina/f__Pyronemataceae;g__Aleurina/' db_file.fasta

And then try again:

python unite2rdp_train_files.py db_file.fasta 
python lineage2taxTrain.py rawTaxonomy.txt > ready4train_taxonomy.txt
python addFullLineage.py rawTaxonomy.txt rawSeqs.fasta > ready4train_seqs.fasta
java -Xmx10g -jar /usr/local/RDPTools/classifier.jar train -o training_files -s ready4train_seqs.fasta -t ready4train_taxonomy.txt

Success! Output files are written to the sub-directory training_files. The last step is to copy another file to that directory.

cp /usr/local/RDPTools/classifier/samplefiles/rRNAClassifier.properties ./training_files/

An example command to use the new training files to classify sequences is given below. It is assumed that the input file zotus.fasta is in the working directory and training_files is a sub-directory. Otherwise give the full paths.

java -Xmx10g -jar /usr/local/RDPTools/classifier.jar classify -c 0.8 -f allrank -t training_files/rRNAClassifier.properties -o classified_zotus.txt zotus.fasta

A suggestion: Move training_files to a resource directory and rename it in a meaningful manner, for example as UNITE_v8.