To train the RDP classifier, you need a sequence file and a taxonomy file, each with special formatting requirements. Additionally, you need two “secret” python scripts: lineage2taxTrain.py and addFullLineage.py. I euphemistically refer to them as secret because it seems that until recently they were available only by request from RDP staff. Now you can get them from here, and from https://github.com/GLBRC-TeamMicrobiome/python_scripts.
Sequence file
The sequence file must be in fasta format and contain a unique identifier without any white space. The accession number makes a good identifier. Anything after the first white space is ignored. The following are acceptable:
>DQ248313
ACGATTTTGACCCTTCGGGGTCGATCTCCAACCCTTTGTCTACCTTCCTTGTTGCTTTGGCGGGCCGATGTTCGTTCTCGCGAACGACACCGCTGGCCTGACGGCTGGTGCGCGCCCGCCAGAGTCCACCAAA
>JF735302 k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Nectriaceae;g__Dactylonectria;s__Dactylonectria_anthuriicol
CCGAGTTTTCAACTCCCAAACCCCTGTGAACATACCATTTTGTTGCCTCGGCGGTGCCTGTTCCGACAGCCCGCCAGAGGACCCCAAACCCAAATTTCCTTGAGTGAGTCTTCTGAGTAACCGATTAAATAAAT
Taxonomy file
The taxonomy file is a tab-delimited text file beginning with a header giving the Sequence ID and names of ranks to be included, for example Domain, Phylum, Class, Order, Family, Genus, Species. The subsequent lines begin with the unique identifiers from the fasta file followed by the taxonomy for each sequence. There are two requirements for this file:
- There must be an entry for every rank in every line. Hyphen placeholders are allowed but are not recommended.
- “Convergent evolution” is not allowed. For example, the same genus cannot appear in two different families.
The following is acceptable:
Seq_ID | Kingdom | Phylum | Class | Order | Family | Genus | Species |
DQ248313 | Fungi | Ascomycota | Xylonomycetes | Symbiotaphrinales | Symbiotaphrinaceae | Symbiotaphrinaceae | Symbiotaphrina_buchneri |
JF735302 | Fungi | Ascomycota | Sordariomycetes | Hypocreales | Nectriaceae | Dactylonectria | Dactylonectria_anthuriicola |
JF735264 | Fungi | Ascomycota | Sordariomycetes | Hypocreales | Nectriaceae | Ilyonectria | Ilyonectria_robusta |
AY677273 | Fungi | Ascomycota | Sordariomycetes | Hypocreales | Nectriaceae | Ilyonectria | Ilyonectria_radicicola |
If a rank does not exist, you can fill in the missing entries with hyphens as in the table below. This is what the script FormatRefDB.py in CONSTAX does, but I do not recommend doing this. Using a hyphen as a place holder leads to a “ragged” classification that cannot be properly sorted by rank.
Seq_ID | Kingdom | Phylum | Class | Order | Family | Genus | Species |
MG190602 | Fungi | Ascomycota | Sordariomycetes | Hypocreales | – | – | Hypocreales_sp |
DQ655721 | Fungi | Ascomycota | Sordariomycetes | Hypocreales | Nectriaceae | – | Nectriaceae_sp |
KM225681 | Fungi | Ascomycota | Sordariomycetes | Hypocreales | Nectriaceae | Thyronectria | Thyronectria_lamyi |
MF120484 | Fungi | Ascomycota | Sordariomycetes | Hypocreales | Nectriaceae | Fusarium | Fusarium_redolens |
My approach is to fill in the empty spaces with made-up but meaningful ranks as in the table below. The prefixes indicate the highest rank available. The absence of hyphen placeholders means that classification will not be ragged but include all ranks. Thus it will be possible to select, sort, and merge ranks when analyzing your data later.
Seq_ID | Kingdom | Phylum | Class | Order | Family | Genus | Species |
MG190602 | Fungi | Ascomycota | Sordariomycetes | Hypocreales | o_Hypocreales | o_f_Hypocreales | Hypocreales_sp |
DQ655721 | Fungi | Ascomycota | Sordariomycetes | Hypocreales | Nectriaceae | f_Nectriaceae | Nectriaceae_sp |
KM225681 | Fungi | Ascomycota | Sordariomycetes | Hypocreales | Nectriaceae | Thyronectria | Thyronectria_lamyi |
MF120484 | Fungi | Ascomycota | Sordariomycetes | Hypocreales | Nectriaceae | Fusarium | Fusarium_redolens |
In some cases the above approach can lead to problems because the UNITE classification is not consistent. For example, for the sequences in the table below there were different empty ranks in what is really the same lineage. Thus my approach creates two different lineages for the same species. In this case GS18 could be placed in all ranks for class through genus, but a more general approach is to append unique numbers to the species name, i.e. GS18_sp_1 and GS18_s_2.
Seq_ID | Kingdom | Phylum | Class | Order | Family | Genus | Species |
KY687548 | Fungi | Olpidiomycota | GS18 | c_GS18 | o_c_GS18 | GS18 | GS18_sp_1 |
UDB014615 | Fungi | Olpidiomycota | GS18 | GS18 | c_GS18 | o_c_GS18 | GS18_sp_2 |
The following is an example of unallowed “convergent” evolution:
Seq_ID | Kingdom | Phylum | Class | Order | Family | Genus | Species |
KM246227 | Fungi | Basidiomycota | Tremellomycetes | Tremellales | Tremellaceae | Cryptococcus | Cryptococcus_sp |
AF444372 | Fungi | Basidiomycota | Tremellomycetes | Tremellales | Cryptococcaceae | Cryptococcus | Cryptococcus_sp |
The solution for this example is to make the family names the same according to some authority in the field. For example, according to MycoBank the correct family here is Tremellaceae, so change Cryptococcaceae to Tremellaceae.
So where to get the taxonomy? Having the accession IDs, you can download the taxonomy from NCBI, or in some cases, as with the UNITE reference files, you can parse it from the sequence headers. In either case, some scripting is required to get everything into proper form. We definitely do not want to try typing in tens of thousands of lines without making any mistakes!
To continue this tutorial, I will show how to train the RDP classifier using the UNITE general release database file. UNITE provides pre-formatted files for QIIME, mothur and USEARCH, but not specifically for the RDP classifier. The main problem is that, because fungal taxonomy is such a mess, there are many missing intermediate ranks, indicated in the sequence headers as “unidentified.” These need to be handled some other way. I make up meaningful intermediate ranks, as explained above. Also, some sequences are actually duplicates, which need to be removed. To begin, create a working directory, enter it, download the references sequences and unzip them. You can also delete files you will not need.
cd
mkdir unite_db
cd unite_db
wget https://files.plutof.ut.ee/public/orig/EB/0C/EB0CCB3A871B77EA75E472D13926271076904A588D2E1C1EA5AFCF7397D48378.zip
unzip EB0CCB3A871B77EA75E472D13926271076904A588D2E1C1EA5AFCF7397D48378.zip
rm -rf developer
rm DB_readme.txt
rm EB0CCB3A871B77EA75E472D13926271076904A588D2E1C1EA5AFCF7397D48378.zip
Download the three python scripts you will need and unzip them to the same directory, or to a scripts directory if you prefer. Make them executable.
wget https://john-quensen.com/wp-content/uploads/2019/05/classifier_training_scripts.zip
unzip classifier_training_scripts.zip
chmod 755 *.py
Remove duplicate sequences. You may need to edit the path to RDPTools in the command below to match your installation.
java -Xmx4g -jar /usr/local/RDPTools/classifier.jar rm-dupseq -d -i sh_general_release_dynamic_02.02.2019.fasta -o db_file.fasta
The script unite2rdp_train_files.py reads the db_file.fasta and creates a suitable taxonomy file named rawTaxonomy.txt and a matching fasta file named rawSeqs.fasta. It solves some problems particular to this release of the UNITE database and likely will not work with other reference fasta files. See the detailed explanation of the script to learn how it is specific to this data set and how it may be modified to deal with other reference sets. Run the script from the directory containing db_file.fasta:
python unite2rdp_train_files.py db_file.fasta
The files rawTaxonomy.txt and rawSeqs.fasta are written to the same directory. Now that we have the necessary files, we can train the classifier with the following code:
python lineage2taxTrain.py rawTaxonomy.txt > ready4train_taxonomy.txt
python addFullLineage.py rawTaxonomy.txt rawSeqs.fasta > ready4train_seqs.fasta
java -Xmx10g -jar /usr/local/RDPTools/classifier.jar train -o training_files -s ready4train_seqs.fasta -t ready4train_taxonomy.txt
Rinse and repeat
Ooops! We got errors:
edu.msu.cme.rdp.classifier.train.NameRankDupException: Error: duplicate taxon name and rank in the taxonomy file. cryptococcus genus 2 aleurina genus 2
Find the source of the problems using grep:
grep "Cryptococcus" db_file.fasta
grep "Aleurina" db_file.fasta
Among the output I found:
>Cryptococcus_sp|HQ623619|SH1528221.08FU|reps|k__Fungi;p__Basidiomycota;c__Tremellomycetes;o__Tremellales;f__Tremellaceae;g__Cryptococcus;s__Cryptococcus_sp >Cryptococcus_consortionis|KJ707235|SH1631625.08FU|reps|k__Fungi;p__Basidiomycota;c__Tremellomycetes;o__Tremellales;f__Cryptococcaceae;g__Cryptococcus;s__Cryptococcus_consortionis >Aleurina_sp|JX316298|SH1558168.08FU|reps|k__Fungi;p__Ascomycota;c__Pezizomycetes;o__Pezizales;f__Pyronemataceae;g__Aleurina;s__Aleurina_sp >Aleurina_argentina|KC905032|SH1558167.08FU|refs|k__Fungi;p__Ascomycota;c__Pezizomycetes;o__Pezizales;f__Pezizaceae;g__Aleurina;s__Aleurina_argentina
So, there are two families for the genus Cryptococcus: Tremellaceae and Cryptococcaceae. And there are two families for the genus Aleurina: Pyronemataceae and Pezizaceae. I used sed to substitute family names with the following commands:
sed -i 's/f__Cryptococcaceae;g__Cryptococcus/f__Tremellaceae;g__Cryptococcus/' db_file.fasta
sed -i 's/f__Pezizaceae;g__Aleurina/f__Pyronemataceae;g__Aleurina/' db_file.fasta
And then try again:
python unite2rdp_train_files.py db_file.fasta
python lineage2taxTrain.py rawTaxonomy.txt > ready4train_taxonomy.txt
python addFullLineage.py rawTaxonomy.txt rawSeqs.fasta > ready4train_seqs.fasta
java -Xmx10g -jar /usr/local/RDPTools/classifier.jar train -o training_files -s ready4train_seqs.fasta -t ready4train_taxonomy.txt
Success! Output files are written to the sub-directory training_files. The last step is to copy another file to that directory.
cp /usr/local/RDPTools/classifier/samplefiles/rRNAClassifier.properties ./training_files/
An example command to use the new training files to classify sequences is given below. It is assumed that the input file zotus.fasta is in the working directory and training_files is a sub-directory. Otherwise give the full paths.
java -Xmx10g -jar /usr/local/RDPTools/classifier.jar classify -c 0.8 -f allrank -t training_files/rRNAClassifier.properties -o classified_zotus.txt zotus.fasta
A suggestion: Move training_files to a resource directory and rename it in a meaningful manner, for example as UNITE_v8.