Web-Based Unsupervised Approach

Introduction

The unsupervised approach consists of clustering aligned sequences based on their similarities. These clusters are termed Operational Taxonomic Units, or OTUs. Results are summarized in OTU tables which give the number of occurrences of each OTU in each sample. Taxonomy may be assigned by classifying the representative sequence for each OTU.

Objectives

The overall objective of this workshop exercise is to process aligned 16S sequences using RDP’s web-based tools and to package the results into a phyloseq object containing an OTU table, taxonomy table, sample data, representative sequences, and tree file.  In part, you will follow the steps given in the workflow “Processing 16S rRNA data using an unsupervised method” in RDP tutorials.  Other steps will be done in R using functions in the RDPutils and phyloseq packages. You will learn how to use these tools to:

  1. process sequence data into an OTU table
  2. obtain representative sequences for each OTU
  3. classify the representative sequences
  4. tree the representative sequences
  5. create a sample data file
  6. package these items into a phyloseq object for analysis with R

Exercise

Create a directory, e.g. C:/tutorials, for the exercise files. Download example sequences in zip format from here to C:/tutorials. Later in the exercise, this directory will be your R working directory. You do not need to unzip this example file.

Align Sequences

Once you have trimmed and sorted your sequences, the next step is to align the sequences. Go to RDP’s Aligner web page at https://pyro.cme.msu.edu/aligner/form.spr. Log into your account if necessary. You will then be taken to the Aligner page. Fill in the “job name field” with a brief descriptive name identifying your project. For the “Select a gene field,” select your gene type from the pull down menu. Three gene types are available: bacterial 16S, archael 16S, and fungal 28S. For this exercise, choose bacterial 16S. Next to “Select your file for upload” click on “Choose file” and browse to your files to upload. You can upload separate files for each sample, or several files at a time.  Compressed files containing sequences for several samples, as downloaded above for this example, are also accepted and save upload time. The names of the submitted files will appear in the lower left-hand corner of the screen. After you have chosen all of the files for upload, click on “Submit for alignment.”

Your job will be submitted to the queue. You should receive an email when the job is complete. You also may check your job’s status by clicking on “my jobs” near the top of the screen. When the job is complete you may download the results in a compressed file in either zip or tgz format. (You may download the example file from here.) Save the result to C:/tutorials (or the directory of your choice).

Cluster Sequences

Unzip the result file obtained from the Aligner. The results will be written to the sub-directory alignment. It will contain, for each sample, some informational files and a combined fasta file beginning with “aligned_.”  Combine all of these fasta files into a  new compressed file named cluster_input.zip and put in the directory C:/tutorials. You may download an example of such a file from here.

Submit this new compressed file to the Cluster tool at http://pyro.cme.msu.edu/spring/cluster.spr.  This clustering tool is for nucleotide sequences only.  DO NOT follow the more general instructions for the RDP mcClust tool given in the RDP tutorials.   Fill in the “job name” with a descriptive name for your project.  Fill in values for “Max Distance” and “Step Size.” If you leave these as the default values (okay to do so), you will get clustering results from 0% distance to 15% distance in steps of 1%. Be sure to leave the box “create one cluster file for each input alignment file” unchecked.  Click on “Submit.”

Your job will be submitted to the queue. You should receive an email when the job is complete. You may also may check its status by clicking on “my jobs” near the top of the screen. When the job is complete you may download the results in a compressed file of either a zip or tgz format. (You may download the example result file from here.) Save the result to C:/tutorials (or the directory of your choice).

Make OTU Table

There are now three options for creating OTU tables from the cluster results. When you extract the cluster results file, the results will be written to the sub-directory clustering. You will get one clust file, some informational files, and a series of biom files, one for each distance from 0 to Max Distance by Step Size. If all you want is an OTU table, then it is convenient to import the biom files directly into phyloseq using phyloseq’s import_biom function:

library(phyloseq)
expt <- import_biom("all_samples.clust_0.03.biom")

When this is done the OTUs have names of the format cluster_1, cluster_2, …, cluster_999, etc.

If you want to add classification, representative sequences, and/or a tree to your experiment level phyloseq object, you can avoid downstream problems by using the web-based R-formatter tool to extract OTU tables from the cluster file.  In this case the OTU names have the form “OTUxxxx” where xxxx represents a padded number, for example, from 0000 to 9999.  The padding allows OTU names to be sorted in numerical order.  At 0 distance, OTUs are named consecutively.  At greater distances, some numbers will be missing as clusters will have been merged. The third possibility is to use RDPutils’ clstr2otu function to parse the clust file into an OTU table for a given distance. This is the most flexible option as you can choose either OTU name format and any distance in the clust file, but it is also time consuming.

For the remainder of this exercise we will  use the R-formatter tool and its results. Prepare the cluster file for uploading and get R-formatted OTU tables by doing the following:

  1. Unzip the cluster result file.  (You may download an example file from here.) The cluster file itself has the extension “clust.”
  2. Compress the cluster file (as a zip file) separately.  The zip file name must be in all lower case and cannot include spaces.  It may include underscore characters.  An acceptable name is “my_cluster_file.zip.” Put this file in C:/tutorials.
  3. Go to the Cluster to R Formatter page at https://pyro.cme.msu.edu/clust_format/form.spr.
  4. Fill in the box next to “Select a complete linkage cluster file” by browsing to the zipped version of the cluster file you just created and click on “Open.” (You may download the example my_cluster_file.zip file from here.)
  5. Fill in the box next to “End (0.00 to 0.50 inclusive)” with the maximum distance between sequences used in generating the cluster file.  For this exercise the value is 0.15.
  6. Click on “Download R Input File” and wait for the result (this job is not submitted to a queue).
  7. Save the results file (you may download the example for this exercise from here) to the directory C:/tutorials and unzip it. Results will be written to the sub-directory clust_format.
  8. When you unzip the compressed results file, you will get a series of tab-delimited text files, one for each distance, with samples in rows and OTUs in columns. Copy the one named rformat_dist_0.03.txt to C:/tutorials.

Get Representative Sequences

The first step to getting representative sequences for the OTUs is to merge the alignments for each sample, the ones you submitted to the clustering tool.

  1. Go to the Alignment Merger tool at https://pyro.cme.msu.edu/merger/form.spr.
  2. Fill in the job name field with a descriptive name for your experiment.
  3. Click on “choose files” and browse to the aligned file(s) you uploaded to the clustering tool (cluster_input.zip for this exercise, which you downloaded above). The name(s) of the selected files will appear in the lower left corner of the page.
  4. Click on submit.

Results are returned directly, and the download may even start automatically. Results are also stored in “myjobs.” Save the result to C:/tutorials. You do not need to unzip the file. For this example, the file is named alignment_merger_output.zip and you may download the example file from here.

The second step is to go to the Representative Sequence page at https://pyro.cme.msu.edu/derep/form.spr.

  1. Fill in the job name field with a descriptive name for your experiment.
  2. In the box next to “Maximum Distance,” enter the distance for which you want results as a percentage. For this exercise we want the representative sequence names for clusters at a distance of 0.03, so we enter 3.
  3. Next to “Select cluster file” click on “Choose File” and browse to the compressed cluster file (my_cluster_file.zip in this example). Click on Open to upload the file.
  4. Next to “Select aligned fasta file” click on “Choose File” and browse to the result of the alignment merger tool (alignment_merger_output.zip in this example). Click on Open to upload the file. The selected files will appear in the lower left of the page.
  5. Click on Submit.

Your job will be submitted to the queue. You should receive an email when the job is complete. You may also may check its status by clicking on “my jobs” near the top of the screen. When the job is complete you may download the results in a compressed file of either a zip or tgz format. (You may download the example for this exercise from here.) Save the result to C:/tutorials and unzip it. The result will be written to the same directory, C:/tutorials..

The representative sequences have machine names, i.e. the names assigned by the sequencing machine. The field for “use cluster ID” was meant to substitute the machine name with the cluster or OTU name, but it is currently missing on the Representative Sequence web page. Until this is fixed, the next step is to rename the representative sequences with the names of the OTUs to which they belong. This will be done in R.

Rename Representative Sequences

Open RStudio or the R console and set the working directory to the directory containing the representative sequences, i.e. C:/tutorials. Then follow the instructions and commands in the rest of this section:

Load RDPutils version 1.3.0 or later. Version 1.3.1 includes a bug fix to correctly output OTU names to match either the R-formatter or biom format. For this exercise we are using the R-formatter format, and version 1.3.0 works okay. Then execute the commands in the code boxes below. The “##” symbols denote R output in response to a command just above.

library(RDPutils)
## Loading required package: phyloseq
## Loading required package: reshape2
packageDescription("RDPutils")$Version
## [1] "1.3.1"

Make a data frame with columns associating the representative sequences machine names with the OTU names using the function assoc_repseq_IDs_with_otu_by_fasta. This function parses the cluster number from the fasta descriptions and creates a corresponding OTU name.

assoc.table <- assoc_repseq_IDs_with_otus_by_fasta("tutorial_rep_seqs.fasta", 
    otu_format = "R")
head(assoc.table)
##        RepSeq.ID      OTU Cluster # Cluster size MaxDist   minSS
## 1 HC9DO0P01A4MK6 OTU_0001         1           28 0.01553 0.00039
## 2 HC9DO0P01A88IL OTU_0002         2          435 0.02795 0.00397
## 3 HC9DO0P01AWEWV OTU_0003         3           11 0.02484 0.00076
## 4 HC9DO0P01BEZ78 OTU_0004         4          144 0.01863 0.00128
## 5 HC9DO0P01A9UY4 OTU_4405      4405            2 0.00311 0.00001
## 6 HC9DO0P01AY0OD OTU_0005         5            3 0.02492 0.00070

Before renaming the representative sequences, you need to remove the descriptions from the aligned representative sequences with the function trim_fasta_names.

trim_fasta_names(repseq_file = "tutorial_rep_seqs.fasta", trimmed_names = "trimmed_rep_seqs.fasta")
## [1] "Fasta file with trimmed names written to file trimmed_rep_seqs.fasta."

Now you can rename the aligned representative sequences with the function rename_fasta. These renamed sequences are still aligned and may be treed.

rename_fasta(in_file = "trimmed_rep_seqs.fasta", out_file = "renamed_rep_seqs.fasta", rename_table = assoc.table)
## [1] "Fasta sequences renamed with OTU names; written to file renamed_rep_seqs.fasta."

While in R, also unalign the renamed representative sequences. These renamed unaligned representative sequences will be submitted to the classifier.

unalign_fasta(in_file = "renamed_rep_seqs.fasta", out_file = "unaligned_renamed_rep_seqs.fasta")
## [1] "Unaligned fasta file written to file unaligned_renamed_rep_seqs.fasta."

Classify Representative Sequences

Go to the RDP Classifier web page at https://pyro.cme.msu.edu/derep/form.spr. Fill in the job name field with a descriptive name for your project. For the “select a gene” filed, select “Bacterial 16S” from the pull-down menu.  For the”Select a format” field, select “fixrank.” DO NOT select the “Treat all input files as one sample” button. The value for “Confidence cutoff” does not matter; it will be chosen on import to R. Choose the file to upload by clicking on “Choose file,” browsing to the file unaligned_renamed_rep_seqs.fasta and clicking on “Open.” The name of the selected file will appear in the lower left-hand corner of the Classifier web page. Click on “Submit for Classification.”

Your job will be submitted to the queue. You should receive an email when the job is complete. You may also may check its status by clicking on “my jobs” near the top of the screen. When the job is complete you may download the results in a compressed file of either a zip or tgz format. Save the result to the directory C:/tutorials (you may download the example result file for this exercise from here) and unzip it. The results will be written to the sub-directory classifier. Copy the file classifications.txt from the sub-directory to C:/tutorials.

Make Tree File

If you have FastTree, you may construct a tree file from the aligned representative sequences with the command:

cd C:/tutorials
FastTree -nt renamed_rep_seqs.fasta > my_tree.nwk

FastTree may be run under Linux, Windows, and MacOS. See http://www.microbesonline.org/fasttree/#Install for installation instructions.

Assemble phyloseq Object

Except for a sample data file, we now have all of the files necessary to construct our experiment level phyloseq object.  Return to R and follow the instructions and commands in the rest of this section.

Set the working directory to the folder containing your files, C:/tutorials in this example. Then load the required R packages. I suppressed warnings and messages to save space here, but be aware that all of these packages have dependencies which you must have installed.

setwd("C:/tutorials")
suppressWarnings(suppressMessages(library(RDPutils)))
suppressWarnings(suppressMessages(library(QsRutils)))
suppressWarnings(suppressMessages(library(Biostrings)))
suppressWarnings(suppressMessages(library(ape)))

Convert the classification result into a phyloseq class tax_table using the recommended confidence level for short reads (0.5) with the make_tax_table function.

tax.table <- make_tax_table(in_file = "classifications.txt", confidence = 0.5)

Import the R-formatted OTU file.

otu.table <- read.table(file = "rformat_dist_0.03.txt")
dim(otu.table)
## [1]    4 2240
rownames(otu.table)
## [1] "USGA_1_7_A.fasta"   "Native_1_4_A.fasta" "Native_1_2_A.fasta"
## [4] "USGA_2_7_A.fasta"

Trim the “.fasta” from the sample names, transpose otu.table, convert it to a matrix, and import it as a phyloseq otu-table. Check that the sample names have no extraneous characters.

rownames(otu.table) <- gsub(".fasta", "", rownames(otu.table))
rownames(otu.table)
## [1] "USGA_1_7_A"   "Native_1_4_A" "Native_1_2_A" "USGA_2_7_A"
otu.table <- as.matrix(t(otu.table))
otu.table <- otu_table(otu.table, taxa_are_rows = TRUE, errorIfNULL = TRUE)
class(otu.table)
## [1] "otu_table"
## attr(,"package")
## [1] "phyloseq"
sample_names(otu.table)
## [1] "USGA_1_7_A"   "Native_1_4_A" "Native_1_2_A" "USGA_2_7_A"

Import the reference sequences as a phyloseq object.

ref.seqs <- readDNAStringSet("renamed_rep_seqs.fasta")

Import the tree file as a phyloseq object.

tree.file <- read.tree("my_tree.nwk")

Assemble your experiment-level phyloseq object expt.

expt <- phyloseq(otu.table, tax.table, ref.seqs, tree.file)

Now all slots are filled except for sample_data. A flat text version of the sample data is best created in a spreadsheet program and saved as a tab- or comma-delimited text file. The sample names must match exactly the sample names in the phyloseq object expt that we just created. The easiest way to do this is to export the sample names to a text file and use it to create the sample data file.

write.csv(sample_names(expt), file = "sample_names.csv")

I opened sample_names.csv in Excel, made up a sample data file, and saved it as sample_data.csv. You may download my example from here and place it in C:/tutorials. Read it into R with read.csv and convert it to a phyloseq sample_data object.

sam <- read.csv("sample_data.csv", header = TRUE, row.names = 1)
sam
##              Factor_A Factor_B Variable_A Variable_B
## USGA_1_7_A       USGA      Red        1.2       10.5
## Native_1_4_A   Native     Blue        2.3       12.6
## Native_1_2_A   Native    Green        5.6       14.5
## USGA_2_7_A       USGA   Yellow        1.1        9.6
sam.data <- sample_data(sam, errorIfNULL = TRUE)
class(sam.data)
## [1] "sample_data"
## attr(,"package")
## [1] "phyloseq"
sam.data
##              Factor_A Factor_B Variable_A Variable_B
## USGA_1_7_A       USGA      Red        1.2       10.5
## Native_1_4_A   Native     Blue        2.3       12.6
## Native_1_2_A   Native    Green        5.6       14.5
## USGA_2_7_A       USGA   Yellow        1.1        9.6
sample_names(sam.data)
## [1] "USGA_1_7_A"   "Native_1_4_A" "Native_1_2_A" "USGA_2_7_A"

Merge expt with the sample_data object. The result is an experiment level phyloseq object with all slots filled.

expt <- merge_phyloseq(expt, sam.data)
expt
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 2240 taxa and 4 samples ]
## sample_data() Sample Data:       [ 4 samples by 4 sample variables ]
## tax_table()   Taxonomy Table:    [ 2240 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 2240 tips and 2238 internal nodes ]
## refseq()      DNAStringSet:      [ 2240 reference sequences ]

You may save expt as a compressed R data file for later analysis.

save(expt, file = "expt.rda")