unite2rdp_train_files.py

Two files are necessary to train the RDP classifier, a sequence file in fasta format, and a tab delimited taxonomy file giving the lineage for each sequence in the fasta file. The script unite2rdp_train_files.py reads a UNITE general reference file and parses the information in the fasta headers to create these two files. I wrote the script specifically for the UNITE general FASTA release version 8. Understanding how the script works will enable you to modify the two functions it contains so that it can work with other reference sequences.

The UNITE general release fungi data base sequences have headers like:

>Glomeraceae|AM076560|SH146432.05FU|refs|k__Fungi;p__Glomeromycota;c__Glomeromycetes;o__Glomerales;f__Glomeraceae;g__Glomus;s__uncultured_Glomus

The accession number for the sequence is between the first two vertical bars. The taxonomy string comes after the last (fourth) vertical bar. The function rdp_format re-formats the UNITE sequence header to include only the accession number. The function rdp_taxa re-formats the Unite sequence header into a tab-delimited string including the accession number followed by the lineage. It also takes care of unidentified ranks and problems I found specific to GS01 and GS18 lineages. If you need to use a fasta reference file with differently formatted headers, you will need to modify these two functions accordingly.

rdp_format

The code for this function is:

def rdp_format(my_string):
    my_list = my_string.split('|')
    first_part = '>' + my_list[1] + '\n'
    return first_part

The UNITE reference sequence header is passed to the function as my_string. The first line in the function splits my_string on the vertical bars and assigns the result to the variable my_list. If you were to print out my_list, it would look like this:

['>Glomeraceae', 'AM076560', 'SH146432.05FU', 'refs', 'k__Fungi;p__Glomeromycota;c__Glomeromycetes;o__Glomerales;f__Glomeraceae;g__;s__uncultured_Glomus']

Python counts items in the list beginning with 0, so my_list[1] is the second item in the list, the accession number. The second line in the function joins the greater than sign and the accession number together to form the string first_part. In this example, first _part equals >AM076560, which is returned as a sequence header for the output file rawSeqs.fasta.

rdp_taxa

The code for this function is:

def rdp_taxa(my_string):
    global num
    my_list = my_string.split('|')
    seq_id = my_list[1]
    taxa = ''.join(my_list[4:]).split(';')
    for i in range(len(taxa)):
        if 'unidentified' in taxa[i]:
            pre = taxa[i][:3]
            taxa[i] = pre + taxa[i-1]
    for i in range(len(taxa)):
        taxa[i] = taxa[i][3:]
    if 'GS01_sp' in taxa[6] or 'GS18_sp' in taxa[6]:
        num = num + 1
        taxa[6] = taxa[6].replace('\t\n', '')
        taxa[6] = taxa[6] + '_' + str(num) + '\n'
    taxa_string = seq_id + '\t' + '\t'.join(taxa)
    return taxa_string

The accession number is extracted and assigned to the variable seq_id in the same way as in the function rdp_format.  The fourth line splits the fifth item in my_list on the semi-colon character to create the new list taxa. The first for loop renames unidentified taxa with my preferred method. For example, if a lineage included o__Glomerales and f__unidentified, this loop would change the family to f__o__Glomerales. The second for loop removes the first three characters from all taxa. This would leave this particular unidentified family as o__Glomerales. The if statement adds numbers to the ends of species GS01_sp. and GS18_sp. I did not find any other instances in the reference file where species needed to be distinguished from each other. I did it for these species because their lineages could be different depending on how many taxa were unidentified in their lineages.

The next to last line joins seq_id and the six taxa into a tab-delimited string that is returned to the main program where it is written to the output file rawTaxonomy.txt