Processing Psomagen Sequences

This amplicon processing exercise makes use of data from the sequencing company Psomagen. Sequencing was performed on the Illumina MiSeq platform and targeted the V3/V4 region of the 16S rRNA gene.

Create a directory for the exercise and a sub-directory for the raw sequences.

cd
mkdir psomagen_exercise
mkdir psomagen_exercise/raw_sequences

Download two supporting files into the exercise directory. The file Psomagen_primers contains, you guessed it, the primer sequences, which target the V3/V4 region of the 16S rRNA gene.

The sample_data.txt file includes some metadata for the samples in the experiment. We will use it later to analyze the data in R.

cd ~/psomagen_exercise
wget https://github.com/jfq3/data_sets/raw/refs/heads/master/Psomagen_sampled/Psomagen_primers
wget https://github.com/jfq3/data_sets/raw/refs/heads/master/Psomagen_sampled/sample_data.txt

Download the sequences to the raw_sequences directory and extract them.

cd raw_sequences
wget https://github.com/jfq3/data_sets/raw/refs/heads/master/Psomagen_sampled/sampled_reads.tar.gz
tar xzf sampled_reads.tar.gz

Determine Read Characteristics

Are the reads paired? How can you tell?

Do they require demultiplexing, or are they binned by sample?
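
One way to approach both questions is to look at the file names: paired, already-binned data usually arrives as two files per sample. The sketch below assumes the mates are named <sample>_1.fastq.gz and <sample>_2.fastq.gz. That naming is an assumption, not something stated above, so run ls first and adjust the suffixes. The function name check_pairs is made up for this illustration.

```shell
# Report, for each forward-read file, whether its reverse mate exists.
# ASSUMPTION: mates are named <sample>_1.fastq.gz and <sample>_2.fastq.gz;
# edit the two suffixes to match what `ls` shows in raw_sequences.
check_pairs() {
    local dir="${1:-.}"
    local fwd rev sample
    for fwd in "$dir"/*_1.fastq.gz; do
        [ -e "$fwd" ] || continue      # glob matched nothing
        sample="$(basename "$fwd" _1.fastq.gz)"
        rev="$dir/${sample}_2.fastq.gz"
        if [ -e "$rev" ]; then
            printf '%s\tpaired\n' "$sample"
        else
            printf '%s\tmissing_reverse\n' "$sample"
        fi
    done
}

# Example call (not run here):
#   check_pairs ~/psomagen_exercise/raw_sequences
```

If every sample is reported as paired, the reads are most likely paired-end and already binned by sample, so no demultiplexing step would be needed.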

What are the lengths of these reads? (Use usearch -fastx_info)
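
If usearch is not at hand, a rough length table can also be built with standard tools. This is a minimal sketch, not a replacement for usearch -fastx_info. It assumes ordinary four-line FASTQ records (no wrapped sequence lines), and the function name read_lengths is invented for the example.

```shell
# Tabulate read lengths in one FASTQ file (gzipped or plain text).
# Output columns: length <TAB> number of reads, sorted by length.
# ASSUMPTION: each record is exactly four lines.
read_lengths() {
    local fq="$1"
    # gzip -dcf decompresses .gz input and passes plain text through as-is.
    gzip -dcf "$fq" |
        awk 'NR % 4 == 2 { n[length($0)]++ }
             END { for (len in n) printf "%d\t%d\n", len, n[len] }' |
        sort -n
}

# Example call (file name is a placeholder):
#   read_lengths raw_sequences/SAMPLE_1.fastq.gz
```

A single length for all reads suggests untrimmed output straight from the sequencer; a spread of lengths suggests that some trimming has already been applied.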

Are primers present? If so, what are their positions in the reads?
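
A quick way to check is to search each read for the primer and record where the match starts. The sketch below expands IUPAC ambiguity codes into a regular expression and tabulates the start position of the first exact match. It does not allow mismatches, and the function names iupac_to_regex and primer_positions are hypothetical. Copy the primer sequences from the Psomagen_primers file; none are hard-coded here.

```shell
# Expand IUPAC ambiguity codes in a primer into a regular expression.
iupac_to_regex() {
    printf '%s' "$1" | sed \
        -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
        -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
        -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Tabulate the 1-based position at which the primer first matches each read.
# Output columns: position (or "none") <TAB> number of reads.
# ASSUMPTION: four-line FASTQ records; exact matches only.
primer_positions() {
    local fq="$1"
    local re
    re="$(iupac_to_regex "$2")"
    gzip -dcf "$fq" |
        awk -v re="$re" '
            NR % 4 == 2 { if (match($0, re)) pos[RSTART]++; else pos["none"]++ }
            END { for (p in pos) printf "%s\t%d\n", p, pos[p] }' |
        sort -n
}

# Example call (file name and variable are placeholders):
#   primer_positions raw_sequences/SAMPLE_1.fastq.gz "$FWD_PRIMER"
```

If nearly all reads report position 1, the primer sits at the very start of the read. For paired data, it is usually the forward primer that appears in the forward file and the reverse primer in the reverse file, so check each file against its own primer.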

Run FastQC and MultiQC

Are there any samples that need to be removed because of low quality or too few reads?
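
MultiQC reports read counts, but a plain per-file tally is easy to produce as a cross-check. The following is a minimal sketch under the assumption of four-line FASTQ records. The function name count_reads is invented, and the threshold passed to it is yours to choose; no particular cutoff is implied by the exercise.

```shell
# Print the number of reads in each FASTQ file and flag files that fall
# below a minimum read count.
# Usage: count_reads MIN_READS FILE...
# ASSUMPTION: four-line FASTQ records, gzipped or plain text.
count_reads() {
    local min="$1"
    local fq n flag
    shift
    for fq in "$@"; do
        n="$(gzip -dcf "$fq" | awk 'END { print int(NR / 4) }')"
        if [ "$n" -lt "$min" ]; then flag=LOW; else flag=ok; fi
        printf '%s\t%d\t%s\n' "$(basename "$fq")" "$n" "$flag"
    done
}

# Example call (the threshold 1000 is an arbitrary placeholder):
#   count_reads 1000 raw_sequences/*.fastq.gz
```

Files flagged LOW are candidates for removal, though the decision should also take the FastQC quality plots into account.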

Are there Illumina adapters present? If so, remove them.

Decide on a Processing Strategy

Can the reads be successfully merged? Can merging be done with DADA2? Or do you need more control to prevent overhangs?

To answer these questions, I suggest that you merge the primer-trimmed reads for one sample with usearch -fastq_mergepairs. The command automatically trims overhangs, leaving only the target region. Then use usearch -fastx_info to find the size distribution of the merged reads.

Process with QIIME 2

Should you denoise the sequences to produce ASVs? With DADA2 or with Deblur? Or should you cluster the sequences into OTUs?

You will need to create a manifest file to import the sequences into QIIME 2.
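
For paired-end reads, the manifest is a tab-separated file with the header sample-id, forward-absolute-filepath, reverse-absolute-filepath (the PairedEndFastqManifestPhred33V2 format). The sketch below writes one from a directory of reads. It assumes, as a guess to be verified with ls, that mates are named <sample>_1.fastq.gz and <sample>_2.fastq.gz, and it uses the file-name prefix as the sample ID. The function name make_manifest is hypothetical.

```shell
# Write a QIIME 2 paired-end manifest (V2, tab-separated) to stdout.
# Usage: make_manifest READ_DIR > manifest.tsv
# ASSUMPTION: mates are named <sample>_1.fastq.gz and <sample>_2.fastq.gz.
make_manifest() {
    local dir fwd rev sample
    dir="$(cd "$1" && pwd)" || return 1     # manifest paths must be absolute
    printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n'
    for fwd in "$dir"/*_1.fastq.gz; do
        [ -e "$fwd" ] || continue           # glob matched nothing
        sample="$(basename "$fwd" _1.fastq.gz)"
        rev="$dir/${sample}_2.fastq.gz"
        if [ ! -e "$rev" ]; then
            echo "skipping $sample: no reverse file" >&2
            continue
        fi
        printf '%s\t%s\t%s\n' "$sample" "$fwd" "$rev"
    done
}

# Example call, followed by the import (not run here):
#   make_manifest ~/psomagen_exercise/raw_sequences > ~/psomagen_exercise/manifest.tsv
#   qiime tools import \
#     --type 'SampleData[PairedEndSequencesWithQuality]' \
#     --input-path manifest.tsv \
#     --input-format PairedEndFastqManifestPhred33V2 \
#     --output-path demux.qza
```

Check that the sample IDs in the manifest agree with those in sample_data.txt before importing, since the metadata will be joined on them later. If you trimmed primers or adapters first, point the function at the directory of trimmed reads instead.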