Command Line Initial Processing

Initial processing usually consists of sorting your sequences by sample and trimming them of primers and bar codes. They may also be filtered by quality and length. These tasks are accomplished by the SeqFilters module of RDPTools. The exercises here cover three situations:

16S Single Reads with Primers & Barcodes

Begin by making a sub-directory in your home directory and then moving into that sub-directory.

mkdir ~/init_proc
cd ~/init_proc

I put the example files for this exercise in a zip file. Download it from here and put it in the directory ~/init_proc. Then unzip the file.

unzip command_line_16S_initial_processing.zip 
ls -l

You should get a listing of the files like:

total 1712
-rw-rw-r-- 1 john john 1471670 Nov  9  2015 16S_example_454Reads.fastq
-rw-rw-r-- 1 john john     703 Nov  9  2015 16S_initial_processing.sh
-rw-rw-r-- 1 john john      56 Sep  3 11:12 16S_primer.txt
-rw-rw-r-- 1 john john      91 Sep  3 11:13 16S_tag_file.txt
-rw-rw-r-- 1 john john  263411 Sep  3 11:15 command_line_16S_initial_processing.zip

Raw sequences are in the file 16S_example_454Reads.fastq. The file 16S_tag_file.txt is a tab-delimited file associating bar codes with sample names, and 16S_primer.txt gives the forward and reverse primers used. 16S_initial_processing.sh is a shell script we will use to perform the initial processing of the sequences in example_454Reads.fastq. All of these are text files. You can examine them with the less command. For example:

less 16S_tag_file.txt

Which should give:

tgtcacacga  USGA_1_7_A
tgtcgtcgca  Native_1_4_A
acacatacgc  Native_1_2_A
acagtcgtgc  USGA_2_7_A

Exit less by typing q. Let’s examine the script file. Type:

less 16S_initial_processing.sh

Which prints the following to the screen:

#!/bin/bash

# Configure paths
# RDPToolsDir=/mnt/research/rdp/public/RDPTools # Path on MSU's HPCC
RDPToolsDir=/usr/local/RDPTools # Path for my local installation

# Initial processing (sorting & trimming)
java -Xmx2g -jar $RDPToolsDir/SeqFilters.jar --forward-primers AYTGGGYDTAAAGNG --max-forward 2 --reverse-primers CCGTCAATTCMTTTRAGT --max-reverse 0 --seq-file 16S_example_454Reads.fastq --min-length 300 --max-length 350 --skip-notag --tag-file 16S_tag_file.txt --outdir initial_process

# Create directory for trimmed sequences and copy them to it.
mkdir trimmed_seqs
cd initial_process
find . -name *trimmed.fastq  -type f -exec cp {} ../trimmed_seqs/ \;

# Remove the trimmed portion of the name.
cd ../trimmed_seqs
for f in $(ls *trimmed.fastq)
do
  mv $f ${f/_trimmed/}
done

The first line, #!/bin/bash identifies the file as a shell script.
RDPToolsDir=/mnt/research/rdp/public/RDPTools assigns the path to the directory RDPTools on MSU’s HPCC to the variable RDPToolsDir. This line has been commented out – it begins with a pound (#) sign. The next line RDPToolsDir=/usr/local/RDPTools has not been commented out and assigns the path to the directory RDPTools on my local installation of RDPTools to the variable RDPToolsDir. If you installed RDPTools in a different directory you will need to edit this path. The long line under # Initial processing (sorting & trimming) performs the initial processing. It is actually all on one line but displayed on multiple lines here so that it will not run off of the page. The first part of the line assigns 2 gigs of memory to java (-Xmx2g), loads the java program SeqFilter.jar and runs the command SeqFilters. The rest of the line gives arguments or options to the command. In this example the arguments are the same that we used with the web-based initial processing tool. You will likely need to edit these to process your own data.

Argument Explanation
--forward-primers The forward primer sequence. If more than one, separate with commas.
--max-forward 2 Allow 2 mismatches to the forward primer.
--reverse-primer The forward primer sequence. If mote than one, separate with commas.
--max-reverse 0 Allow 0 mismatches to the reverse primer.
--seq-file Gives the name of the sequence file to be processed.
--min-length 300 Filter out sequences less than 300 bp.
--max_length 350 Filter out sequences greater than 350 bp.
--skip-notag Do not process sequences without a bar code.
--tag-file Gives the name of the tag file.
--out-dir Gives the name of the directory to which results will be written.

When the script is run, the output directory initial_process is created in the working directory, init_proc in this example. It in turn contains two sub-directories, tag_sort and result_dir. Directory tag_sort contains untrimmed sequences sorted by bar code into separate sequence files for each sample. Directory result_dir contains sub-directories for each sample, and within those sub-directories are the trimmed fastq files that we want for further processing. These files end with _trimmed.fastq. The remainder of the script collects all of the trimmed fastq files into the directory trimmed_seqs and then removes the _trimmed portion of the file names. I do this so that the sample names in further steps will be the same as in the tag file used for initial processing.

If you have not already, exit less by typing q. To perform the preliminary processing for this tutorial, from the working directory init_proc list the files again:

ls -l

If the beginning of the line with the script file 16S_initial_processing.sh does not contain any x’s, but something like -rw-r--r--, it is not executable. In that case, make it executable by you by typing the following:

chmod u+x 16S_initial_processing.sh
ls -l

The line for the script file should now begin with something like -rwxr--r--. The x in this position means that the file is executable by the user, you. Once the script file is executable, run it by typing:

./16S_initial_processing.sh

Examine your results by listing the various directories. You may examine the contents of a file with the less command, for example:

cd trimmed_seqs
less Native_1_4_A.fastq

Page down with the space bar. Page up by typing b. Exit by typing q.

Note:

The above script will not overwrite the directory trimmed_seqs. If you want to run the script over again, perhaps with different parameters, you need to first remove the directory trimmed_seqs. You can do this from the working directory (init_proc) with the command:

rm -rf trimmed_seqs

28S Single Reads with Primers & Barcodes

The initial processing of 28S data is the same as for 16S data with one exception. Because the primers were too far apart to get complete reads from primer to primer, the same bar code was put on both forward and reverse primers. Thus we expect to get a mixture of reads, some from the 3′-end and some from the 5′-end of the amplicons. When we processed this data using the RDP web-based tool, we put both primers in the forward primer box. For processing the same data from the command line, we will list both primers after the option for the forward primer(s).

Begin this exercise by making a sub-directory in you home directory and then moving into that sub-directory.

mkdir ~/init_proc_28s
cd ~/init_proc_28s

I put the example files for this exercise in a zip file. Download it from here and put it in the directory init_proc_28s. Then unzip the file.

unzip command_line_28S_initial_processing.zup
ls -l

You should get a listing of the files like:

total 76432
-rw-rw-r-- 1 john john 62341144 Nov 8 2015 28S_example_454Reads.fastq
-rw-rw-r-- 1 john john 761 Sep 6 09:38 28S_initial_processing.sh
-rw-rw-r-- 1 john john 54 Nov 8 2015 28S_primer.txt
-rw-rw-r-- 1 john john 255 Nov 8 2015 28S_tag_file.txt
-rwxrwx--- 1 john john 15908192 Sep 6 09:40 command_line_28S_initial_processing.zip

Exit less by typing q. If you examine the script file, you will find it almost identical to the one for processing 16S reads (above). Enter the following command:

less 28S_initial_processing.sh

which prints the following code to the screen:

mkdir ~/init_proc_28s
#!/bin/bash

# Configure paths
# RDPToolsDir=/mnt/research/rdp/public/RDPTools # Path on MSU's HPCC
RDPToolsDir=/usr/local/RDPTools # Path for my local installation

# Initial processing (sorting & trimming)
java -Xmx2g -jar $RDPToolsDir/SeqFilters.jar --forward-primers ACCCGCTGAACTTAAGC,CCGTGTTTCAAGACGGG --max-forward  1 --seq-file 28S_example_454Reads.fastq --min-length 200 --max-length 400 --skip-notag --tag-file 28S_tag_file.txt --outdir initial_process

# Create directory for trimmed sequences and copy them to it.
mkdir trimmed_seqs
cd initial_process
find . -name *trimmed.fastq  -type f -exec cp {} ../trimmed_seqs/ \;

# Remove the trimmed portion of the name.
cd ../trimmed_seqs
for f in $(ls *trimmed.fastq)
do
        mv $f ${f/_trimmed/}
        
done

The difference is that both primers are put after the forward-primers argument, separated by only a comma. IMPORTANT: There is no space after the comma separating the two primer sequences. If you include a space, then the second primer sequence is ignored, and you get only sequences matching the first primer.

Edit the value for RDPToolsDir as appropriate to your installation of RDPTools using nano. Then make the script executable with the chmod command and run it.

chmod u+x 28S_initial_processing.sh
./28S_initial_processing.sh

Processed sequences are collected together in the directory trimmed_seqs. You can examine the results by listing the various directories with ls and viewing file contents with less, as in the example for 16S reads above.

Assembling Paired Reads

PANDAseq (PAired-eND Assembler for DNA sequences) is a program for assembling paired sequences from Illumina runs. The original program is available on GitHub (https://github.com/neufeld/pandaseq) and the manual for the same is at https://storage.googleapis.com/pandaseq/pandaseq.html. The original PANDAseq is also available on MSU’s HPCC as a module.

RDP modified (Cole et al., 2014) the original PANDAseq to improve accuracy of assembly by performing a modified statistical analysis using the sequencer supplied Q scores to find the most likely region of overlap. These improvements can now be largely incorporated into the original PANDAseq by using the options -A rdp_mle and -C min_readqscore:25.

The modified PANDAseq can be downloaded from http://rdp.cme.msu.edu/download/RDP_Assembler.tgz. Instructions for installing it locally are given in the tutorial Installing Modified PANDAseq.

This exercise uses the script below to assemble paired reads with RDP’s modified version of PANDAseq using the recommended parameters. The script automatically assembles sequences for all paired files provided the file names have the form “Mock_A_TATAGCGA-GACACCGT_L001_R1_001.fastq” and “Mock_A_TATAGCGA-GACACCGT_L001_R2_001.fastq.” It takes all characters before the sequence “_TAT” as the sample name. It assumes the “_R1_” in a fastq file name indicates the forward read and “_R2_” the reverse read. The way that the script recognizes these features can be easily modified to deal with other ways the files might be named.

The script is run from a directory containing only the paired fastq reads (and the script). It writes log files and merged reads as fastq files to the sub-directory merged_reads. It will not run if the sub-directory merged_reads already exists. This is a safety feature to prevent the accidental deletion of previous results.

Exercise

Begin by creating the directory test_pandaseq in your home directory. Download the file test_pandaseq.zip from here, put it in test_pandaseq and unzip it. Then list the directory contents.

mkdir ~/test_pandaseq
cd ~/test_pandaseq
unzip test_pandaseq.zip
ls -l

You should get something like:

total 564
-rwxrwx--- 1 john john    401 Sep  7 11:25 auto_pandaseq.sh
-rw-r--r-- 1 john john 141481 Sep 18  2013 Mock_A_TATAGCGA-GACACCGT_L001_R1_001.fastq
-rw-r--r-- 1 john john 141481 Sep 18  2013 Mock_A_TATAGCGA-GACACCGT_L001_R2_001.fastq
-rw-r--r-- 1 john john  84872 Sep 18  2013 Mock_B_TATAGCGA-GTCAGATA_L001_R1_001.fastq
-rw-r--r-- 1 john john  84872 Sep 18  2013 Mock_B_TATAGCGA-GTCAGATA_L001_R2_001.fastq
-rw-rw-r-- 1 john john 112970 Sep  7 12:07 test_pandaseq.zip

In the listing above, auto_pandaseq.sh is executable. If it is not in your case, you need to make it executable with chmod as in the exercises above for 16S and 28S single read sequences. Print auto_pandaseq.sh to the screen with cat.

cat auto_pandaseq.sh

You should get the following:

#!/bin/bash

pandaseq=~/RDP_Assembler/pandaseq/pandaseq # My local installation
# pandaseq=/mnt/research/rdp/public/RDP_misc_tools/pandaseq/.libs/lt-pandaseq # MSU's HPCC installation

mkdir merged_reads

for f in $(ls *_R1_*.fastq)
do
    $pandaseq -N -o 10 -e 25 -F -d rbfkms -l 220 -L 280 -f $f -r ${f/_R1_/_R2_} 1> merged_reads/${f/_TAT*/}_merged.fastq 2> merged_reads/${f/_TAT*/}_merged.log
done

Notice that this script gives two choices for the value of the executable pandaseq. The first, pandaseq=~/RDP_Assembler/pandaseq/pandaseq is valid for my local installation of modified PANDAseq. This might have to be edited for your local installation of PANDAseq, but it should be the same if you follow the tutorial to install it in Ubuntu. The second choice, # pandaseq=/mnt/research/rdp/public/RDP_misc_tools/pandaseq/.libs/lt-pandaseq # MSU's HPCC installation, is valid for running the exercise interactively on MSU’s HPCC; here it has been commented out.

If necessary for your installation, use an editor such as nano or gedit to make the proper choice for the pandaseq variable and/or to edit the path for your local installation. Be sure to save your result.

Explanation of Options

You will notice that there are several options and flags for the pandaseq command. The table below explains what they are.

Option Explanation
-N Eliminate sequences with ambiguous bases.
-o 10 Sets minimum accepted overlap to 10 bases.
-e 25 Sets minimum accepted Q score to 25. Suggested range is 25 to 27.
-F Output assembled reads in fastq format.
-d rbfkmsa Eliminates some items from log file – see manual for details.
-l 220 Filters out assembled reads less than 220 bases.
-L 280 Filters out assembled reads greater than 280 bases.
-f Input forward read fastq file.
-r Input reverse read fastq file.

To process your own data, you may need to edit some of the options, particularly for l and L. But to complete this exercise, simply run the shell script from the directory containing the shell script and the paired sequence files.

./auto_pandaseq.sh   

Results are written to the sub-directory merged_reads. Examine results by changing into the merged_reads directory and using ls, cd, less, tail.

If you want to re-run the script file, first remove the merged_reads sub-directory.

cd ~/test_pandaseq
rm -rf merged_reads

Reference

Cole, J. R., Q. Wang, J. A. Fish, B. Chai, D. M. McGarrell, Y. Sun, C. T. Brown, A. Porras-Alfaro, C. R. Kuske, and J. M. Tiedje. 2014. Ribosomal Database Project: data and tools for high throughput rRNA analysis Nucl. Acids Res. 41(Database issue):D633-D642; doi: 10.1093/nar/gkt1244 [PMID: 24288368]