Introduction
This exercise is for a command line version of the RDP Pipeline, the unsupervised approach for clustering rRNA gene amplicons into Operational Taxonomic Units (OTUs) based on their similarities. An OTU table is generated, giving the abundance of each OTU in each sample. To aid interpretation, representative sequences for each OTU are classified with the RDP Classifier. Additionally, since the 16S rRNA gene sequences used in the example can be aligned, a tree of the OTUs is produced. Finally, the OTU table, classification table, sample data table, tree, and representative sequences are combined into a phyloseq
object for downstream analysis.
Objectives
The overall objective of this workshop exercise is to process bacterial 16S sequences using the command line programs USEARCH
, RDPTools
, FastTree
, and R
. The end result is a phyloseq
object containing an OTU table, taxonomy table, sample data, representative sequences, and tree file. Beginning with pre-processed sequences, the steps are:
- Remove chimeric sequences
- Process sequence data into an OTU table
- Obtain representative sequences for each OTU
- Classify the representative sequences
- Tree the representative sequences with
FastTree
- Package these items into a
phyloseq
object for analysis withR
- Create a sample data file in a spreadsheet program
- Create the
phyloseq
object
In the next section, commands are given and explained for each of these steps separately. Notes are also included on how the commands may be edited to run on MSU’s HPCC, on another machine, and for other types of data – e.g. 28S fungal sequences.
RDP Pipeline Script
The first section of the script, below, configures paths. Two configurations are provided: one for a local installation of RDPTools
, FunGene pipeline
, USEARCH
version 8 and FastTree
according to the tutorials on john-quensen.com, and a second for MSU’s HPCC. USEARCH
version 8 is used because it is the last version to include uchime
with the option to output non-chimeric sequences. As supplied in the exercise download, the code block for the local installation has been commented out; thus the script is ready to run on MSU’s HPCC. Both configurations assume that a uchime
database for reference based chimera removal is in the sub-directory resources
under the user’s home directory. The JAVA_HEAP_SIZE
requested is 2g; this is more than sufficient for the small data set in the example. You will likely have to increase it for “real” data.
#!/bin/bash
# rRNA pipeline beginning with trimmed sequences
# This script is run from a directory above the directory trimmed_seqs containing the trimmed sequences to be processed.
# Configure paths:
# This configuration assumes that programs are installed as directed in the tutorials on john-quensen.com.
# The paths may differ for other installations.
# RDPToolsDir=/usr/local/RDPTools
# infernalDir=/usr/lib/infernal
# cmModelDir=/usr/local/fungene_pipeline/resources/RRNA_16S_BACTERIA
# usearch=/usr/local/bin/usearch8.1
# FastTree=/usr/bin/fasttree
# This configuration is correct for MSU's HPCC.
RDPToolsDir=/mnt/research/rdp/public/RDPTools
infernalDir=/mnt/research/rdp/public/thirdParty/infernal-1.1/src
cmModelDir=/mnt/research/rdp/public/fungene_pipeline/resources/RRNA_16S_BACTERIA
module load usearch
module load fasttree
usearch=usearch
FastTree=FastTree
# For both configurations:
refSeqDir=$HOME/resources
JAVA_HEAP_SIZE=2g
In this case the cmModelDir
points to the HMM model infernal
uses to align bacterial 16S rRNA gene sequences. To align archaeal 16S rRNA gene sequences or fungal 28S rRNA gene sequences, substitute one of the following lines:
For archaea: cmModelDir=/mnt/research/rdp/public/fungene_pipeline/resources/RRNA_16S_ARCHAEA
For fungi: cmModelDir=/mnt/research/rdp/public/fungene_pipeline/resources/RRNA_28S
The next portion of the script removes chimeras using uchime
in reference mode. The file rdp_gold.fa
(available here) in the user’s resources directory is used as the uchime
database. You may, of course, use another database if you prefer.
## Remove chimeras with uchime in reference mode. Non-chimeric sequences are written to the sub-directory non-chimeric_seqs.
mkdir non-chimeric_seqs
cd trimmed_seqs
for f in $(ls ./*.fastq)
do
$usearch -uchime_ref $f -db $refSeqDir/rdp_gold.fa -nonchimeras ../non-chimeric_seqs/${f/fastq/}fasta -strand plus
done
If you choose not to remove chimeras automatically this way, the script must be modified for subsequent steps to work. The easiest way to do this is to replace the for loop in the code block above with a command to write the files to non-chimeric_seqs
in fasta format. The script includes the following code block for this purpose, but it is commented out. So, to skip removing chmeras, comment out the code block above and uncomment the next code block below.
## Skip removing chimeras. Instead copy sequences to the sub-directory non-chimeric_seqs.
#mkdir non-chimeric_seqs
#cd trimmed_seqs
#for f in $(ls ./*.fastq)
#do
# java -Xmx$JAVA_HEAP_SIZE -jar /usr/local/RDPTools/ReadSeq.jar to-fasta $f > ../non-chimeric_seqs/${f/fastq/}fasta
#done
Align the sequences for each sample.
# Align the chimera-free sequences for each sample. Put the results in sub-directory aligned_nc.
cd ../
mkdir aligned_nc
cd non-chimeric_seqs
for f in $(ls *.fasta)
do
$infernalDir/cmalign -g --noprob --outformat Stockholm --dnaout -o ../aligned_nc/${f/fasta/}stk $cmModelDir/model.cm $f
done
Merge the alignments into a single aligned fasta file.
# Merge the alignments.
cd ../
mkdir merged_alignment
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/AlignmentTools.jar alignment-merger aligned_nc merged_alignment/merged_aligned.fasta
Next, the aligned sequences are clustered. There are actually three steps to this process.
# Cluster, 3 steps.
mkdir cluster
cd cluster
# 1) Dereplicate
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar derep -m '#=GC RF' -o derep.fa all_seqs.ids all_seqs.samples ../aligned_nc/*.stk
# 2) Calculate the distance matrix.
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar dmatrix --id-mapping all_seqs.ids --in derep.fa --outfile derep_matrix.bin -l 200 --dist-cutoff 0.1
# 3) Cluster
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar cluster --dist-file derep_matrix.bin --id-mapping all_seqs.ids --sample-mapping all_seqs.samples --method complete --outfile all_seq_complete.clust
The next step is to create an OTU file from the cluster results. Rather than writing the result to a tab-delimited file with samples as rows and taxa as columns, results are written to a biom file. I prefer this method because of the way OTUs are named, and the ease with which the biom file is imported into phyloseq
later.
# Convert cluster file to biom format.
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar cluster-to-biom all_seq_complete.clust 0.03 > all_seq_complete.clust.biom
Get a single representative sequence for each OTU:
# Get representative sequences.
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar rep-seqs -c --id-mapping all_seqs.ids --one-rep-per-otu all_seq_complete.clust 0.03 ../merged_alignment/merged_aligned.fasta
Next, classify the representative sequences. Use the classify command option -f biom
to add the classification results to the biom file created above.
# Add classification to the biom file.
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/classifier.jar classify -c 0.5 -f biom -m all_seq_complete.clust.biom -o all_seq_complete_classified.biom all_seq_complete.clust_rep_seqs.fasta
In order to import the representative sequences into phyloseq
, unalign them and remove their descriptions, leaving only their IDs corresponding to their OTU names.
# Unalign representative sequences and remove descriptions.
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar to-unaligned-fasta all_seq_complete.clust_rep_seqs.fasta | cut -f1 -d ' ' > unaligned_short_names.fasta
There are two steps to making a phylogenetic tree of the sequences:
# Tree the representative sequences
# First, extract the comparable (model) positions for tree building.
java --Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar derep -f -o all_seq_complete.clust_rep_seqs_model_only.fasta rep_seqs.ids rep_seqs.sample all_seq_complete.clust_rep_seqs.fasta
# Build the tree.
export OMP_NUM_THREADS=1
$FastTree -nt -gtr < all_seq_complete.clust_rep_seqs_model_only.fasta > my_expt_tree.nwk
Finally, results that will be imported into phyloseq
are packaged into a compressed file for download and moved into the working directory.
# Package results for phyloseq.
tar -czf results4phyloseq.tar all_seq_complete_classified.biom unaligned_short_names.fasta my_expt_tree.nwk
mv results4phyloseq.tar ../
Exercise
This exercise may be run locally if you have installed RDPTools
, FunGene pipeline
, USEARCH
, and FastTree
, or on MSU’s HPCC if you have an account.
If you do not already have a directory names resources
, create one:
mkdir ~/resources
If your resources
directory does not include rdp_gold.fa
, download it to your resources
directory:
cd ~/resources
wget https://drive5.com/uchime/rdp_gold.fa
From you home directory, run the following commands to download and unzip the exercise files:
wget https://john-quensen.com/wp-content/uploads/2017/11/command_line_RDP_pipeline_exercise.zip
unzip command_line_RDP_pipeline_exercise.zip
Alternatively, you can download the exercise files from here. In either case, unzipping command_line_RDP_pipeline_exercise.zip
creates the directory test_script
containing the script RDP_pipeline_script.sh
and a sub-directory trimmed_seqs
containing four sample fastq files. Move into the directory test_script
. If you are running the exercise on MSU’s HPCC, make RDP_pipeline_script.sh
executable and run the script. Otherwise, edit the configuration portion of the script so that the HPCC configuration block is commented out and the local configuration block is not. Then make sure the script is executable and run it.
cd test_script
chmod 755 RDP_pipeline_script.sh
./RDP_pipeline_script.sh
If all goes well, the file results4phyloseq.zip
will be created in test_script
.
Import Results into phyloseq
You will most likely do further analysis in R
on your own computer. So to continue this exercise, download and/or move results4phyloseq.zip
to a directory on your own computer and unzip it. Open RStudio
and make the directory your working R
directory. Then run the following script:
library("phyloseq")
library("Biostrings")
expt <- import_biom("all_seq_complete_classified.biom")
my_seqs <- readDNAStringSet("unaligned_short_names.fasta")
my_tree <- read_tree("my_expt_tree.nwk")
expt <- merge_phyloseq(expt, my_seqs, my_tree)
expt
R
should respond by listing the contents of expt
:
phyloseq-class experiment-level object
otu_table() OTU Table: [ 215 taxa and 4 samples ]
tax_table() Taxonomy Table: [ 215 taxa by 1 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 215 tips and 213 internal nodes ]
refseq() DNAStringSet: [ 215 reference sequences ]
I find it easiest to create the sample data table in a spreadsheet program and then read it into R
and merge it with the experiment level phyloseq object. To make sure the sample names agree, you can write them to a comma-delimited file that can be read into your spreadsheet program:
write.csv(sample_names(expt), file = "sample_names.csv")
To continue this exercise, open sample_names.csv
in your spreadsheet program, add a few columns of made up meta data, and save it as sample_data.csv
. The sample names should be in the first column beginning in the second row, and the first row should include the variable names. Then, back in R
and from your working directory:
library("phyloseq")
sam <- read.csv("sample_data.csv", header=TRUE, row.names=1)
sam <- sample_data(sam, errorIfNULL = TRUE)
expt <- merge_phyloseq(expt, sam)
expt
phyloseq-class experiment-level object
otu_table() OTU Table: [ 216 taxa and 4 samples ]
sample_data() Sample Data: [ 4 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 216 taxa by 1 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 216 tips and 214 internal nodes ]
refseq() DNAStringSet: [ 216 reference sequences ]