Chimera Filtering

Introduction

Chimeras are sequences made up of portions of two or more biological sequences. They can form during PCR, most often as a result of incomplete extension. Their presence in our data artificially inflates community diversity measurements because they are counted as novel species. A first line of defense is to adopt PCR conditions that minimize their formation. Reducing the number of amplification cycles reduces the chance of chimera formation (Smyth et al., 2010), as does slowing thermocycler ramp speed (Stevens et al., 2013). A second line of defense is to use a program to filter them out of our data. No such program is perfect, but RDP recommends DECIPHER and uchime as two of the better ones.

DECIPHER is a reference-based program. It compares sequences to a chimera-free database, first classifying the sequence and then determining whether short segments of the sequence are uncommon in the phylogenetic group to which it was classified but common in other phylogenetic groups. DECIPHER is available as a web-based tool for 16S sequences (http://decipher.cee.wisc.edu/FindChimeras.html) and as a Bioconductor package (http://decipher.cee.wisc.edu/Download.html). It makes use of RDP’s database of 16S reference sequences and the SILVA 117 database for LSU rRNA. It can also be used to create a reference database if there is a sufficiently large repository of sequences available.

Uchime is a module in the USEARCH program. In reference mode, it detects chimeras by determining whether the ends of a sequence match different reference sequences. It also has a de novo mode that does not depend on a database. In this mode it uses the sample itself as the reference database, testing whether less abundant sequences can be explained as chimeras of more abundant sequences. It requires that the sample be dereplicated with a size annotation included in the unique fasta IDs, and that the sequences be sorted in decreasing order of this size annotation. Besides having these two modes of operation, uchime has the advantage that it can be easily included in pipeline scripts.
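The input preparation that de novo mode requires can be sketched in a few lines of Python. This is only an illustration, not the code USEARCH or RDP uses: the helper names read_fasta and dereplicate are hypothetical, and the label format ";size=N;" is assumed to follow the USEARCH 8 size-annotation convention.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dereplicate(records):
    """Collapse identical sequences and return (label, sequence) pairs
    sorted by decreasing abundance, with the abundance in the label."""
    counts = Counter()
    first_id = {}
    for header, seq in records:
        seq = seq.upper()
        counts[seq] += 1
        # Keep the ID of the first read seen for each unique sequence.
        first_id.setdefault(seq, header.split()[0])
    # Sort by decreasing count; break ties by ID so the order is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], first_id[kv[0]]))
    # Assumed label format: ">id;size=N;" (hypothetical helper, not RDP code).
    return [(f"{first_id[seq]};size={n};", seq) for seq, n in ordered]


# Tiny made-up sample: three unique sequences with abundances 3, 2 and 1.
example = """>r1
ACGT
>r2
ACGT
>r3
TTGA
>r4
ACGT
>r5
TTGA
>r6
GGCC
""".splitlines()

for label, seq in dereplicate(read_fasta(example)):
    print(f">{label}\n{seq}")
```

The most abundant unique sequence comes first, so uchime can treat it as a candidate parent when it evaluates the rarer sequences that follow.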

References:

Edgar, R. C., B. J. Haas, J. C. Clemente, C. Quince, and R. Knight. 2011. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 27:2194-2200.

Smyth, R. P., T. E. Schlub, A. Grimm, V. Venturi, A. Chopra, S. Mallal, M. P. Davenport, and J. Mak. 2010. Reducing chimera formation during PCR amplification to ensure accurate genotyping. Gene 469:45-51.

Stevens, J. L., R. L. Jackson, and J. B. Olson. 2013. Slowing PCR ramp speed reduces chimera formation from environmental samples. Journal of Microbiological Methods 93:203-205.

Wright, E. S., L. S. Yilmaz, and D. R. Noguera. 2012. DECIPHER, a Search-Based Approach to Chimera Identification for 16S rRNA Sequences. Applied and Environmental Microbiology 78:717-725.

Web-based Tools

Web-based tools limit the number of sequences that can be processed. In reference mode they also limit which reference databases can be used. Thus, while they may be convenient to use, they are not as flexible as command-line tools.

DECIPHER

The web-based DECIPHER tool is limited to processing 16S sequences. Files to be processed must be in fasta format and are limited to 50 Mb rather than to a fixed number of sequences. Fastq files may be converted with RDPTools ReadSeq. An example command is:

java -Xmx1g -jar /path/to/RDPTools/ReadSeq.jar to-fasta in_file.fastq > out_file.fasta
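
If ReadSeq is not at hand, the same conversion can be sketched in Python. This is a minimal illustration that assumes simple four-line FASTQ records (no wrapped sequence or quality lines); the function name fastq_to_fasta is hypothetical and is not part of RDPTools.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert four-line FASTQ records to FASTA; return the number written."""
    n = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"Unexpected FASTQ header: {header!r}")
        seq = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n" + seq + "\n")
        n += 1
    return n


# In-memory demonstration with two made-up reads.
fastq = io.StringIO("@read1\nACGT\n+\nIIII\n@read2 extra\nTTGA\n+\nIIII\n")
fasta = io.StringIO()
count = fastq_to_fasta(fastq, fasta)
print(fasta.getvalue(), end="")
```

With real files, open the input and output with open() instead of io.StringIO. Quality scores are discarded, as they are in any fastq to fasta conversion.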

Use of web-based DECIPHER is self-explanatory. It is easier to keep things straight by submitting samples separately, but it is requested that no more than four jobs be submitted at a time. A link to the results is emailed to the user when the job finishes.

FunGene Pipeline’s Chimera Check

Uchime in de novo mode may be run from RDP’s FunGene pipeline page. Access the page and fill in the form. Multiple sample files in fasta or fastq format may be zipped together for upload. The sample files are processed separately. A download link will be emailed to the user after the job finishes. Download the results file (in tgz format) and decompress it. Chimera-free sequences by sample will be in the sub-folder chimera_filtered_sequences.

Command-line tools

Instructions for installing the Bioconductor version of DECIPHER and reference databases to use with it can be found at http://decipher.cee.wisc.edu/Download.html. A vignette on using the DECIPHER package to find chimeras is at http://decipher.cee.wisc.edu/FindChimeras.pdf.

USEARCH

The last version of USEARCH to include uchime was version 8.1. It is available as a module on MSU’s HPCC and in the RDP’s directory of public programs. The version of uchime in USEARCH is faster than the earlier stand-alone version and is easily used to process trimmed sequences in reference mode. It can be incorporated into a pipeline after the initial processing step and before the classification (supervised method) or alignment (unsupervised method), as in unsupervised_pipeline_16s.sh used in the 16S pipeline tutorial. You can build your own indexed databases from RDP’s training sets posted at http://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/. Versions are available for bacterial and archaeal 16S rRNA data, fungal 28S rRNA data, and Warcup and Unite ITS data. Download them, decompress them, concatenate them with their reverse complements, and convert the result into an indexed database. For example, for the latest RDP 16S training set:

#! /bin/bash
# Make directory ~/resources if it does not already exist.
# Run these commands from directory containing the zipped training data file.
unzip RDPClassifier_16S_trainsetNo16_rawtrainingdata.zip
cd RDPClassifier_16S_trainsetNo16_rawtrainingdata
java -Xmx2g -jar <your_path_to>/RDPTools/ReadSeq.jar reverse-comp \
    -i trainset16_022016.fa -o rc.fa
cat trainset16_022016.fa rc.fa > rdp_trainset_16.fa
<your_path_to>/usearch8.1 -makeudb_usearch rdp_trainset_16.fa -output rdp_trainset_16.udb
mv rdp_trainset_16.fa ~/resources
mv rdp_trainset_16.udb ~/resources

Note: Indexed USEARCH databases are specific to the version of USEARCH and possibly to the command with which they are being used. They should therefore be named in a way that allows them to be identified accordingly. I did not do so in this example.
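
To make the reverse-complement step concrete, here is a minimal Python sketch of what concatenating a reference set with its reverse complements produces. It is an illustration only and not how ReadSeq is implemented; the function names and the "_rc" label suffix are hypothetical choices made for this example.

```python
# Complement table for DNA, including the IUPAC ambiguity codes.
_COMPLEMENT = str.maketrans(
    "ACGTUMRWSYKVHDBNacgtumrwsykvhdbn",
    "TGCAAKYWSRMBDHVNtgcaakywsrmbdhvn",
)


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def with_reverse_complements(records):
    """Return all forward records followed by all reverse complements,
    mirroring 'cat forward.fa rc.fa'. The '_rc' suffix is hypothetical."""
    records = list(records)
    reverse = [(name + "_rc", reverse_complement(seq)) for name, seq in records]
    return records + reverse


# Two made-up reference sequences.
refs = [("ref1", "AACGTN"), ("ref2", "GGGA")]
for name, seq in with_reverse_complements(refs):
    print(f">{name}\n{seq}")
```

Including both orientations in the database lets a search restricted to the plus strand still match reads that were sequenced in the reverse direction.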

Exercise

The database you just created is used in this exercise. Download additional files for this exercise from here. Create the directory test_uchime in your home directory. Put the downloaded file in that directory and unzip it. The script file should look like:

#!/bin/bash
# remove_chimeras.sh
# This script is run from a directory above the directory trimmed_seqs containing the
# trimmed sequences to be processed.
# The reference sequences or udb database are assumed to be in directory ~/resources.

# Configure paths
# Next 3 lines for MSU's HPCC
# RDPToolsDir=/mnt/research/rdp/public/RDPTools
# usearch81=/mnt/research/rdp/public/thirdParty/usearch8.1.1831_i86linux64
# refSeqDir=~/resources

# Next 3 lines for my local installation.
RDPToolsDir=/usr/local/RDPTools 
usearch81=/usr/local/bin/usearch8.1
refSeqDir=~/resources

# Remove chimeras with uchime in reference mode. Put non-chimeric sequences in
# sub-directory non-chimeric_seqs.
mkdir -p non-chimeric_seqs
cd trimmed_seqs
# Loop over the glob directly (safer than parsing ls) and quote all paths.
for f in *.fastq
do
    "$usearch81" -uchime_ref "$f" -db "$refSeqDir/rdp_trainset_16.udb" -nonchimeras "../non-chimeric_seqs/${f%.fastq}.fasta" -strand plus
done

Edit the paths in the script and the name of the USEARCH file if necessary. Check that the script file is executable and change the permissions if necessary. When all is ready, run the script with the command:

./remove_chimeras.sh

The non-chimeric sequences will be in ~/test_uchime/non-chimeric_seqs.
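
If you prefer to drive uchime from Python rather than bash, the command line used in the script can be assembled as a list, as in the sketch below. The function name uchime_ref_command and the example paths are hypothetical; only the flags (-uchime_ref, -db, -nonchimeras, -strand) are taken from the script above. Paths are handled in POSIX style, which is assumed for the HPCC and for a local Linux installation.

```python
from pathlib import PurePosixPath


def uchime_ref_command(usearch, fastq_file, database, out_dir):
    """Build the argument list for one uchime reference-mode run.

    The output file keeps the input's base name with a .fasta extension.
    """
    fastq_file = PurePosixPath(fastq_file)
    out_file = PurePosixPath(out_dir) / (fastq_file.stem + ".fasta")
    return [
        str(usearch),
        "-uchime_ref", str(fastq_file),
        "-db", str(database),
        "-nonchimeras", str(out_file),
        "-strand", "plus",
    ]


# Hypothetical paths, for illustration only; nothing is executed here.
cmd = uchime_ref_command(
    "/usr/local/bin/usearch8.1",
    "trimmed_seqs/sampleA.fastq",
    "resources/rdp_trainset_16.udb",
    "non-chimeric_seqs",
)
print(" ".join(cmd))
```

To actually run the command, pass the list to subprocess.run(cmd, check=True). Passing a list rather than a single string avoids shell quoting problems with unusual file names.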

USEARCH from a Python Script

I believe that it is better to use the reference mode with uchime when a good database is available. But when one does not exist, as with most functional genes, the de novo mode comes to the rescue.

A Python script is available to run uchime in de novo mode on the trimmed sequences from RDP’s initial processing step. The script prepares the sequences for processing by dereplicating, adding the required size annotation to the fasta IDs, and sorting the sequences in decreasing order of this size annotation. It then calls uchime for the chimera checking. The script can process samples individually, as does the web-based tool on the FunGene Pipeline page, or with the -c option it will pool all samples together before dereplication and then expand the results back to individual sample files. Pooling is the procedure that Robert Edgar (USEARCH author) recommended in the USEARCH 8.1 manual, as he believed it reduced the number of false negatives (i.e. it catches more true chimeras). The command to use the script on MSU’s HPCC has the form:

python /mnt/research/rdp/public/pythonscripts/rundenovo_uchime.py -c -o output_dir input_files.extension

If the -c option is omitted, samples are processed individually. Input files must have the extension fasta, fastq, or fa. Wildcards are permitted. For example, input_files.extension could be trimmed_seqs/*.fastq. I suggest that you keep rundenovo_uchime.py and other scripts in a separate scripts folder so they can be easily found and referenced.
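
The idea behind the -c option, pooling before dereplication and then expanding the results back to samples, can be sketched as follows. This is not the RDP script's actual implementation: the function names and data layout are hypothetical, and the set of chimeric sequences is supplied by hand here, where in practice it would come from uchime's output.

```python
from collections import defaultdict


def pool_samples(samples):
    """Pool reads from all samples.

    samples maps sample name -> list of (read_id, sequence).
    Returns a dict mapping each unique sequence to the list of
    (sample, read_id) pairs that carry it; the list length is the
    pooled abundance used for the size annotation.
    """
    members = defaultdict(list)
    for sample, reads in samples.items():
        for read_id, seq in reads:
            members[seq.upper()].append((sample, read_id))
    return members


def expand_to_samples(members, chimeric_seqs):
    """Drop unique sequences flagged as chimeric and return the
    remaining reads regrouped by their original sample."""
    kept = defaultdict(list)
    for seq, owners in members.items():
        if seq in chimeric_seqs:
            continue
        for sample, read_id in owners:
            kept[sample].append((read_id, seq))
    return dict(kept)


# Made-up reads from two samples.
samples = {
    "A": [("a1", "ACGT"), ("a2", "ACGT"), ("a3", "ACTT")],
    "B": [("b1", "ACGT"), ("b2", "ACTT"), ("b3", "GGCC")],
}
members = pool_samples(samples)
# Pretend uchime flagged ACTT as chimeric (hand-picked for illustration).
filtered = expand_to_samples(members, {"ACTT"})
for sample, reads in filtered.items():
    print(sample, reads)
```

Pooling raises the abundance of sequences shared between samples, which gives uchime more evidence about which sequences are plausible parents. A chimera flagged in the pool is then removed from every sample in which it occurs.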

Exercise

Download the file run_denovo_uchime.zip from here. Start FileZilla and open a connection to the HPCC. Create the directory test_uchime_denovo in your home directory. Put the downloaded file in that directory and run the following commands:

unzip run_denovo_uchime.zip
ls -l
chmod u+x run_uchime_denovo.py
python run_uchime_denovo.py -c -o out_dir trimmed_seqs/*.fastq

Filtered sequences will be in the directory out_dir/chimera_filtered_sequences.