Command Line Unsupervised Approach (RDP Pipeline)


This exercise is for a command line version of the RDP Pipeline, the unsupervised approach for clustering  rRNA gene amplicons into Operational Taxonomic Units (OTUs) based on their similarities. An OTU table is generated, giving the abundance of each OTU in each sample. To aid interpretation, representative sequences for each OTU are classified with the RDP Classifier. Additionally, since the 16S rRNA gene sequences used in the example can be aligned, a tree of the OTUs is produced. Finally, the OTU table, classification table, sample data table, tree, and representative sequences are combined into a phyloseq object for downstream analysis.


The overall objective of this workshop exercise is to process bacterial 16S sequences using the command line programs  USEARCH, RDPTools, FastTree, and R. The end result is a phyloseq object containing  an OTU table, taxonomy table, sample data, representative sequences, and tree file.  Beginning with pre-processed sequences, the steps are:

  1. Remove chimeric sequences
  2. Process sequence data into an OTU table
  3. Obtain representative sequences for each OTU
  4. Classify the representative sequences
  5. Tree the representative sequences with FastTree
  6. Package these items into a phyloseq object for analysis with R
  7. Create a sample data file in a spreadsheet program
  8. Create the  phyloseq object

In the next section, commands are given and explained for each of these steps separately. Notes are also included on how the commands may be edited to run on MSU’s HPCC, on another machine, and for other types of data – e.g. 28S fungal sequences.

RDP Pipeline Script

The first section of the script, below, configures paths. Two configurations are provided: one for a local installation of RDPTools, FunGene pipelineUSEARCH version 8 and FastTree according to the tutorials on, and a second for MSU’s HPCC. USEARCH version 8 is used because it is the last version to include uchime with the option to output non-chimeric sequences. As supplied in the exercise download, the code block for the local installation has been commented out; thus the script is ready to run on MSU’s HPCC. Both configurations assume that a uchime database for reference based chimera removal is in the sub-directory resources under the user’s home directory. The JAVA_HEAP_SIZE requested is 2g; this is more than sufficient for the small data set in the example. You will likely have to increase it for “real” data.

# rRNA pipeline beginning with trimmed sequences
# This script is run from a directory above the directory trimmed_seqs containing the trimmed sequences to be processed.

# Configure paths:

# This configuration assumes that programs are installed as directed in the tutorials on
# The paths may differ for other installations.
# RDPToolsDir=/usr/local/RDPTools
# infernalDir=/usr/lib/infernal
# cmModelDir=/usr/local/fungene_pipeline/resources/RRNA_16S_BACTERIA
# usearch=/usr/local/bin/usearch8.1
# FastTree=/usr/bin/fasttree

# This configuration is correct for MSU's HPCC.
module load usearch
module load fasttree

# For both configurations:

In this case the cmModelDir points to the HMM model infernal uses to align bacterial 16S rRNA gene sequences. To align archaeal 16S rRNA gene sequences or fungal 28S rRNA gene sequences, substitute one of the following lines:

For archaea:
For fungi:

The next portion of the script removes chimeras using uchime in reference mode. The file rdp_gold.fa (available here) in the user’s resources directory is used as the uchime database. You may, of course, use another database if you prefer.

## Remove chimeras with uchime in reference mode. Non-chimeric sequences are written to the sub-directory non-chimeric_seqs.
mkdir non-chimeric_seqs
cd trimmed_seqs
for f in $(ls ./*.fastq)
     $usearch -uchime_ref $f -db $refSeqDir/rdp_gold.fa -nonchimeras ../non-chimeric_seqs/${f/fastq/}fasta -strand plus

If you choose not to remove chimeras automatically this way, the script must be modified for subsequent steps to work. The easiest way to do this is to replace the for loop in the code block above with a command to write the files to non-chimeric_seqs in fasta format. The script includes the following code block for this purpose, but it is commented out. So, to skip removing chmeras, comment out the code block above and uncomment the next code block below.

## Skip removing chimeras. Instead copy sequences to the sub-directory non-chimeric_seqs.
#mkdir non-chimeric_seqs
#cd trimmed_seqs
#for f in $(ls ./*.fastq)
#     java -Xmx$JAVA_HEAP_SIZE -jar /usr/local/RDPTools/ReadSeq.jar to-fasta $f > ../non-chimeric_seqs/${f/fastq/}fasta

Align the sequences for each sample.

# Align the chimera-free sequences for each sample. Put the results in sub-directory aligned_nc.
cd ../
mkdir aligned_nc
cd non-chimeric_seqs
for f in $(ls *.fasta)
    $infernalDir/cmalign -g --noprob --outformat Stockholm --dnaout -o ../aligned_nc/${f/fasta/}stk $cmModelDir/ $f

Merge the alignments into a single aligned fasta file.

# Merge the alignments.
cd ../
mkdir merged_alignment
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/AlignmentTools.jar alignment-merger aligned_nc merged_alignment/merged_aligned.fasta

Next, the aligned sequences are clustered. There are actually three steps to this process.

# Cluster, 3 steps.
mkdir cluster
cd cluster
# 1) Dereplicate
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar derep -m '#=GC RF' -o derep.fa all_seqs.ids all_seqs.samples ../aligned_nc/*.stk

# 2) Calculate the distance matrix.
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar dmatrix --id-mapping all_seqs.ids --in derep.fa --outfile derep_matrix.bin -l 200 --dist-cutoff 0.1

# 3) Cluster
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar cluster --dist-file derep_matrix.bin --id-mapping all_seqs.ids --sample-mapping all_seqs.samples --method complete --outfile all_seq_complete.clust

The next step is to create an OTU file from the cluster results. Rather than writing the result to a tab-delimited file with samples as rows and taxa as columns, results are written to a biom file. I prefer this method because of the way OTUs are named, and the ease with which the biom file is imported into phyloseq later.

# Convert cluster file to biom format.
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar cluster-to-biom all_seq_complete.clust 0.03 > all_seq_complete.clust.biom

Get a single representative sequence for each OTU:

# Get representative sequences.
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar rep-seqs -c --id-mapping all_seqs.ids --one-rep-per-otu all_seq_complete.clust 0.03 ../merged_alignment/merged_aligned.fasta

Next, classify the representative sequences. Use the classify command option -f biom to add the classification results to the biom file created above.

# Add classification to the biom file.
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/classifier.jar classify -c 0.5 -f biom -m all_seq_complete.clust.biom -o all_seq_complete_classified.biom all_seq_complete.clust_rep_seqs.fasta

In order to import the representative sequences into phyloseq, unalign them and remove their descriptions, leaving only their IDs corresponding to their OTU names.

# Unalign representative sequences and remove descriptions.
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar to-unaligned-fasta all_seq_complete.clust_rep_seqs.fasta | cut -f1 -d ' ' > unaligned_short_names.fasta

There are two steps to making a phylogenetic tree of the sequences:

# Tree the representative sequences
# First, extract the comparable (model) positions for tree building.
java --Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar derep -f -o all_seq_complete.clust_rep_seqs_model_only.fasta rep_seqs.ids rep_seqs.sample all_seq_complete.clust_rep_seqs.fasta

# Build the tree.
$FastTree -nt -gtr < all_seq_complete.clust_rep_seqs_model_only.fasta > my_expt_tree.nwk

Finally, results that will be imported into phyloseq are packaged into a compressed file for download and moved into the working directory.

# Package results for phyloseq.
tar -czf results4phyloseq.tar all_seq_complete_classified.biom unaligned_short_names.fasta my_expt_tree.nwk
mv results4phyloseq.tar ../


This exercise may be run locally if you have installed RDPTools, FunGene pipelineUSEARCH, and FastTree, or on MSU’s HPCC if you have an account.

If you do not already have a directory names resources, create one:

mkdir ~/resources

If your resources directory does not include rdp_gold.fa, download it to your resources directory:

cd ~/resources

From you home directory, run the following commands to download and unzip the exercise files:


Alternatively, you can download the exercise files from here. In either case, unzipping command_line_RDP_pipeline_exercise.zipcreates the directory test_script containing the script and a sub-directory trimmed_seqs containing four sample fastq files. Move into the directory test_script. If you are running the exercise on MSU’s HPCC, make executable and run the script. Otherwise, edit the configuration portion of the script so that the HPCC configuration block is commented out and the local configuration block is not. Then make sure the script is executable and run it.

cd test_script
chmod 755

If all goes well, the file will be created in test_script.

Import Results into phyloseq

You will most likely do further analysis in R on your own computer. So to continue this exercise, download and/or move to a directory on your own computer and unzip it. Open RStudio and make the directory your working R directory. Then run the following script:

expt <- import_biom("all_seq_complete_classified.biom")
my_seqs <- readDNAStringSet("unaligned_short_names.fasta")
my_tree <- read_tree("my_expt_tree.nwk")
expt <- merge_phyloseq(expt, my_seqs, my_tree)

R should respond by listing the contents of expt:

phyloseq-class experiment-level object
otu_table() OTU Table: [ 215 taxa and 4 samples ]
tax_table() Taxonomy Table: [ 215 taxa by 1 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 215 tips and 213 internal nodes ]
refseq() DNAStringSet: [ 215 reference sequences ]

I find it easiest to create the sample data table in a spreadsheet program and then read it into R and merge it with the experiment level  phyloseq object. To make sure the sample names agree, you can write them to a comma-delimited file that can be read into your spreadsheet program:

write.csv(sample_names(expt), file = "sample_names.csv")

To continue this exercise, open sample_names.csv in your spreadsheet program, add a few columns of made up meta data, and save it as sample_data.csv. The sample names should be in the first column beginning in the second row, and the first row should include the variable names. Then, back in R and from your working directory:

sam <- read.csv("sample_data.csv", header=TRUE, row.names=1)
sam <- sample_data(sam, errorIfNULL = TRUE)
expt <- merge_phyloseq(expt, sam)

phyloseq-class experiment-level object
otu_table() OTU Table: [ 216 taxa and 4 samples ]
sample_data() Sample Data: [ 4 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 216 taxa by 1 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 216 tips and 214 internal nodes ]
refseq() DNAStringSet: [ 216 reference sequences ]