Command Line FunGene Pipeline


The FunGene Pipeline is for processing functional gene sequences. It differs from the processing of rRNA gene sequences in several respects:

* Chimeras are usually removed using uchime in de novo mode.
* FrameBot is used to correct sequencing errors (insertions & deletions) and translate DNA sequences.
* FrameBot is also used to output an OTU table based on closest match in a reference database. (Analogous to a supervised approach.)
* HMMER3 is used to align the translated sequences.

Here I provide my own script for a command line version of the FunGene pipeline. I should stress that this is not the FunGene pipeline version available on GitHub. That is a collection of python scripts on which the web-based version is based. My script is simpler to use and has several advantages. For example, the script pools samples for chimera checking (now the preferred method) by uchime‘s de novo method, provides a single representative sequence for each cluster (rather than separate ones for each sample), provides a tree of the representative sequences, and allows assigning nomenclature to OTUs based on the closest match in FrambeBot‘s database. Explanations for the various steps are given below followed by an exercise with a small data set.

FunGene Pipeline Script

The first part of the script configures paths. Two configurations are provided: one for a local installation of RDPTools, FunGene pipeline, USEARCH version 8, HMMER3 and FastTree according to the tutorials on, and a second for MSU’s HPCC. USEARCH version 8 is used because it is the last version to include the uchime option to output non-chimeric sequences. As supplied in the exercise download, the code blocks for the local installation have been commented out; thus the script is ready to run on MSU’s HPCC.


# ## Begin configuration secton.

# ## Configure paths
# ## These paths are correct for programs installed according to tutorials.
# ## Required programs are RDPTools, FunGene_pipeline, usearch version 8, FastTree, and HMMER3.
# RDPToolsDir=/usr/local/RDPTools
# usearch=/usr/local/bin/usearch8.1
# hmmerDir=/usr/bin/	
# FastTree=/usr/bin/fasttree

## For HPCC
## These paths are correct for MSU's HPCC
module load USEARCH
module load FastTree

The second part of the script configures some gene specific parameters. The variable hmmfile points to the HMM model, for  nifH in this case, that HMMER3 will use to align the sequences . The path for this file depends on the installation; here again, the path for the local installation is commented out and the path for MSU’s HPCC is active. The other gene specific parameter, gene_ref_file, is the reference file FrameBot will use to process the sequences. The reference file can be either a fasta file of protein sequences having good coverage of the diversity of the gene of interest or an indexed file built from the corresponding nucleotide sequences with the index command, but there is no longer a reason to use an index file. If the FrameBot README section “Reference set preparation” on GitHub  still says that processing is faster with an index file, that information is out of date.   A note further down the same page explains that a kmer pre-filtering heuristic for no-metric-search has been added to FrameBot that greatly speeds up processing when a reference file in fasta format is used. Also, using a reference set in fasta format allows running FrameBot in “Add de novo References” mode. This may help with genes with high diversity or lack of closely related references sequences in the database. In de novo reference mode, if a query sequences does not match a reference sequence with more than 70% identity, it is added to the reference set if certain conditions are met. These conditions include length, identity, abundance cut-offs, and absence of stop codon(s).

# ## Gene specific parameters:
# ## HMMER3 model for local installation:
# hmmfile=/usr/local/fungene_pipeline/resources/nifh/model.hmm

# ## HMMER3 model for MSU's HPCC:

# Framebot reference file

The last part of the configuration sets the JAVA_HEAP_SIZE. This depends on the amount of data being processed. Two gigabytes is more than enough for this example (nine samples of 500 sequences each). You will likely have to increase it for “real” data.

# Set java heap size

# ## End configuration secton.

After configuration, the next step is to remove chimeric sequences. Samples are pooled using Clustering‘s derep command. At the same time, with options -u and -s, size annotation is added to the fasta IDs and the dereplicated sequences are sorted by decreasing size, as required to run uchime in de novo mode. RDP’s dereplicator tool puts two spaces between the sequences’ names and the size annotations. These spaces prevent uchime from running correctly, so I removed them using sed. After chimeras are removed, the ID and sample mappings need to be refreshed before the unique sequences can be exploded back to individual samples with the proper number of sequences in each. I had to put the spaces back in the fasta IDs for refresh-mappings to work, again using sed. After this step, the chimera-free fasta files for each sample are in the folder explode-mappings.

# Run uchime on pooled samples
# First, dereplicate, add size info & sort
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar derep -u -s -o derep.fa all_seqs.ids all_seqs.samples trimmed_seqs/*.fastq
## Remove spaces from fasta IDs; else uchime will not run
cat derep.fa | sed 's/ //g' > derep_no_space.fa 
$usearch -uchime_denovo derep_no_space.fa -nonchimeras nc.fasta # Run uchime in denovo mode
# Put spaces back into fasta IDs; else refresh-mappings will not run
cat nc.fasta | sed 's/length/  length/' > nc_with_space.fasta 
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar refresh-mappings nc_with_space.fasta all_seqs.ids all_seqs.samples filtered_all_seq.ids filtered_all_seq.samples
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar explode-mappings -o explode-mappings filtered_all_seq.ids filtered_all_seq.samples nc_with_space.fasta

Next, the chimera-filtered sequences are processed using FrameBot. This tool corrects sequences for insertions and deletions by comparing them to reference sequences. It then translates the corrected nucleotide sequences into protein sequences and provides a nearest neighbor assignment from the reference database.

# Run FrameBot on sequences in directory explode-mappings.
# Move results into directory fb_out
mkdir fb_out
for f in $(ls explode-mappings/*.fasta)
	java -Xmx2g -jar $RDPToolsDir/FrameBot.jar framebot -o $d $RDPToolsDir/Framebot/refset/$gene_ref_file $f
	mv $d*.* fb_out

# Copy corrected protein sequences to directory fb_corr_prot.
mkdir fb_corr_prot
cp fb_out/*corr_prot.fasta fb_corr_prot

Next, the corrected protein sequences are aligned using HMMER3. This aligner uses a hidden Markov model to guide the alignment. The path and model file are given above in the configuration part of the script and are specific to the gene being analyzed.

# Align the framebot corrected proteins sequences.
mkdir hmm_aligned
for f in $(ls fb_corr_prot/*.fasta)
	$hmmerDir/hmmalign -o hmm_aligned/$d.stk $hmmfile $f

The next steps are the same as for processing rRNA gene sequences using the unsupervised method: merge alignments, cluster, convert to biom format, get and tree representative sequences.

# Merge the alignments.
mkdir merged_alignment
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/AlignmentTools.jar alignment-merger hmm_aligned merged_alignment/merged_aligned.fasta

## Cluster, 3 steps.
mkdir cluster
cd cluster
# 1) Dereplicate
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar derep -m '#=GC RF' -o derep.fa all_seqs.ids all_seqs.samples ../hmm_aligned/*.stk

# 2) Calculate the distance matrix.
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar dmatrix --id-mapping all_seqs.ids --in derep.fa --outfile derep_matrix.bin -l 50 --dist-cutoff 0.1

# 3) Cluster
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar cluster --dist-file derep_matrix.bin --id-mapping all_seqs.ids --sample-mapping all_seqs.samples --method complete --outfile all_seq_complete.clust

# Convert cluster file to biom format.
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar cluster-to-biom all_seq_complete.clust 0.03 > all_seq_complete.clust.biom

# Get representative sequences.
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar rep-seqs -c --id-mapping all_seqs.ids --one-rep-per-otu all_seq_complete.clust 0.03 ../merged_alignment/merged_aligned.fasta

# Unalign representative sequences and remove descriptions.
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar to-unaligned-fasta all_seq_complete.clust_rep_seqs.fasta | cut -f1 -d ' ' > unaligned_short_names.fasta

# Tree the representative sequences.
# First, extract the comparable (model) positions for tree building.
java -Xmx$JAVA_HEAP_SIZE -jar $RDPToolsDir/Clustering.jar derep -f -o all_seq_complete.clust_rep_seqs_model_only.fasta rep_seqs.ids rep_seqs.sample all_seq_complete.clust_rep_seqs.fasta

$FastTree < all_seq_complete.clust_rep_seqs_model_only.fasta > my_expt_tree.nwk

Assigning taxonomy to the references sequences is different from doing so for rRNA gene sequences because we do not use the RDP classifier. Here I parse information from the _framebot.txt files for each sample and the fasta IDs for the representative sequences to give two files. The first file allows matching representative sequence machine names with the closest match in the database, and the second allows matching representative sequence machine names with cluster ID. In R, these files will be used to create a phyloseq tax_table of the representative sequences.

# Match FrameBot matches to machine names of representative sequences.
cd ..
mkdir taxa
cp fb_out/*.txt taxa
rm taxa/*failed*.txt
cat taxa/*.txt > matches.txt
grep "STATS" matches.txt | cut -f2,3,4,5,6 | cut -f5,6 -d "|" | sed 's/|/\t/' > match_taxa_machine_names.txt

# Match cluster numbers to machine names of representative sequences.
grep ">" cluster/all_seq_complete.clust_rep_seqs.fasta | cut -f1 -d "," | sed 's/>//' | sed 's/seq_id=//' | sed 's/  /\t/'  >  match_cluster_machine_name.txt

Files necessary for populating a phyloseq object are collected together in a tar file for easy download.

# Package results for phyloseq.
tar -czf results4phyloseq.tar cluster/all_seq_complete.clust cluster/all_seq_complete.clust.biom cluster/my_expt_tree.nwk match_taxa_machine_names.txt match_cluster_machine_name.txt

Several reports are then generated and packaged into a tar file for easy download. These include a tab-delimited OTU table for import into R (here for sequences with at least 80% similarity to their reference sequence), and hist, stat and summary reports. The stat report gives, for each sample, the number of sequences passing FrameBot and the number of frame shifts. The hist report gives the nearest match reference sequence, description and number of sequences close to each reference sequence for different % identity ranges. The summary report gives a list of subject sequences, description, and number of sequences close to the subject. To generate these reports, the _framebot.txt files for all samples need to be concatenated together, here in the file matches.txt made earlier in the script.

# Make stat reports
java -Xmx2g -jar $RDPToolsDir/FrameBot.jar stat -t matrix -e 80 -i filtered_all_seq.ids -s filtered_all_seq.samples matches.txt R_table_80.txt

java -Xmx2g -jar $RDPToolsDir/FrameBot.jar stat -t hist -i filtered_all_seq.ids -s filtered_all_seq.samples matches.txt hist_report.txt

java -Xmx2g -jar $RDPToolsDir/FrameBot.jar stat -t stat -i filtered_all_seq.ids -s filtered_all_seq.samples matches.txt stat_report.txt

java -Xmx2g -jar $RDPToolsDir/FrameBot.jar stat -t summary -i filtered_all_seq.ids -s filtered_all_seq.samples matches.txt summary_report.txt

# Package stat reports
tar -czf stat_reports.tar R_table_80.txt hist_report.txt stat_report.txt summary_report.txt


The script is run from a directory one level above the sub-directory trimmed_seqs containing sequences that have already been trimmed of primers and bar codes and sorted by sample. For this example, the sequences are amplicons of nifH.

This exercise may be run locally or as a job submitted to MSU’s HPCC. Create a directory, e.g. test_fungene, download the compressed file for the exercise from here,  place it in the directory you just created, and unzip it. Alternatively enter the following in your terminal:

cd ~/
mkdir test_fungene
cd test_fungene

This puts three files (, FunGene_pipeline.qsub, and sample_data.csv) in the directory and creates a sub-directory named trimmed_seqs containing nine sample fastq files of 500 sequences each. As supplied, the configuration portion of is correct for MSU’s HPCC, and you may submit the job by simply entering:

chmod 755
chmod 755 FunGen_pipeline.qsub
qsub FunGene_pipeline.qsub

To run the job locally, change the configuration by commenting lines pertinent to the HPCC and uncommenting lines pertinent to your local installation. Edit lines for your local installation if necessary.  Then run the exercise by entering in your terminal:

chmod 755

Expect the exercise to take between 30 and 40 minutes to run.

Create Experiment-level phyloseq Object

The above script packages files necessary for creating a phyloseq object in Put this file in your R working directory and unzip it. Some of the files will be in a sub-directory cluster. For convenience, move these up one level to your working directory. Also, put the file sample_data.csv from the exercise download in your R working directory. Then in R and from your R working directory enter:

expt <- import_biom("all_seq_complete.clust.biom")
my_tree <- read_tree("my_expt_tree.nwk")
my_tax <- make_framebot_tax_table(clstr_machine="match_cluster_machine_name.txt", taxa_machine="match_taxa_machine_names.txt")
my_seqs <- readAAStringSet("unaligned_short_names.fasta")
sam <- read.csv("sample_data.csv", header = TRUE, row.names= 1)
sam <- sample_data(sam, errorIfNULL = TRUE)
expt <- merge_phyloseq(expt, sam, my_tree, my_seqs, my_tax)

The above requires RDPutils version 1.3.2 or above. Install it following the instructions on on the GitHub page of this site. After running the above commands, R should respond with the following:

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 413 taxa and 9 samples ]
sample_data() Sample Data:       [ 9 samples by 2 sample variables ]
tax_table()   Taxonomy Table:    [ 413 taxa by 3 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 413 tips and 411 internal nodes ]
refseq()      AAStringSet:      [ 413 reference sequences ]