The sequences for this exercise are from the Chinese sequencing company Novogene. The sequencing was performed on the Illumina NovaSeq platform.
Download Data
Create a directory for the exercise and a sub-directory named raw_seqs for the sequences. Download the sequences, primer sequences and sample data with the following code:
# Make directories:
cd
mkdir novogene_exercise
cd novogene_exercise
mkdir raw_seqs
# Download raw sequences:
cd raw_seqs
wget https://github.com/jfq3/data_sets/raw/refs/heads/master/novogene_sampled/sampled_reads.tar.gz
tar xzf sampled_reads.tar.gz
# Download primers:
cd ../
wget https://github.com/jfq3/data_sets/raw/refs/heads/master/novogene_sampled/Novogene%20primers
# Download sample data:
wget https://github.com/jfq3/data_sets/raw/refs/heads/master/novogene_sampled/sample_data.txt
Examine the Reads
Are these paired reads? If paired, are they interleaved or are forward and reverse reads in separate files?
Do they require binning by sample?
How long are the reads?
Align the primers with the reads. Where do the primers hit? At the very ends of the reads, or is there some padding prior to the primer sequence? How long are the primers?
How long is the target region?
Assuming an overlap length of 12 bp for merging, what is the maximum length of merged reads?
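The questions above can be approached from the command line. The following is a minimal sketch, not a prescribed method: the function names are made up for illustration, the file name in the usage note is a placeholder for whatever sampled_reads.tar.gz actually contains, and the 250 bp read length is only an assumed example value. Note that primer_offsets matches the primer as an exact string, so a primer containing degenerate bases (such as N, R or Y) would need a different approach.

```shell
# Tabulate read lengths in a FASTQ file (plain or gzipped).
# Prints one "length count" pair per line, sorted by length.
read_length_table() {
    gzip -cdf "$1" |
        awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' |
        sort -n
}

# Report how many bases precede an exact-match primer in each read.
# $1 = FASTQ file, $2 = primer sequence without degenerate bases.
# Prints "offset count"; an offset of 0 means the primer starts the read.
primer_offsets() {
    gzip -cdf "$1" |
        awk -v p="$2" 'NR % 4 == 2 { i = index($0, p); if (i > 0) n[i - 1]++ }
                       END { for (o in n) print o, n[o] }' |
        sort -n
}

# Longest possible merged read: both reads end to end, minus the shared overlap.
# $1 = read length, $2 = overlap length.
max_merged_length() {
    echo $(( 2 * $1 - $2 ))
}
```

For example, read_length_table raw_seqs/SAMPLE_1.fq.gz (hypothetical file name) lists each read length with its count, and max_merged_length 250 12 prints 488, because the 12 overlapping bases are counted only once.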
Check Read Quality
Run fastqc and multiqc. Are there any samples that should be removed for poor quality or too few reads?
Are there warnings about the presence of Illumina adapters? If so, the adapters should be removed before proceeding.
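One way to run the quality checks is sketched below. It assumes fastqc and multiqc are installed and on your PATH; the function names and the output directory names fastqc_out and multiqc_out are arbitrary choices, not part of the exercise. count_reads is a quick way to spot samples with too few reads before looking at the reports.

```shell
# Count reads in each FASTQ file (plain or gzipped).
# Prints "file<TAB>number_of_reads" for every file given.
count_reads() {
    for f in "$@"; do
        n=$(gzip -cdf "$f" | awk 'END { print NR / 4 }')
        printf '%s\t%s\n' "$f" "$n"
    done
}

# Run FastQC on every file given, then summarize the reports with MultiQC.
run_qc() {
    mkdir -p fastqc_out
    fastqc -o fastqc_out "$@"
    multiqc -o multiqc_out fastqc_out
}
```

From the exercise directory these could be called as count_reads raw_seqs/* and run_qc raw_seqs/*, adjusting the glob so that it matches only the read files.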
Process the Reads
Decide on a processing strategy and process with QIIME 2.
What is the overall read retention?
What is the length distribution of the representative sequences?
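As a starting point, here is a sketch of one possible route through QIIME 2, not the expected answer. It assumes an activated QIIME 2 environment with the dada2 plugin, and it assumes the forward and reverse files are named SAMPLE_1.fq.gz and SAMPLE_2.fq.gz, which you should check against the files you actually downloaded. The function names are hypothetical. Primer removal is deliberately left out because it depends on what you found when aligning the primers; the truncation lengths are arguments you must choose from the quality profiles.

```shell
# Write a QIIME 2 paired-end manifest (V2 format) to standard output.
# $1 = directory holding files named <sample>_1.fq.gz and <sample>_2.fq.gz
# (an assumed naming scheme).
make_manifest() {
    dir=$(cd "$1" && pwd)
    printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n'
    for fwd in "$dir"/*_1.fq.gz; do
        [ -e "$fwd" ] || continue
        sample=$(basename "$fwd" _1.fq.gz)
        printf '%s\t%s\t%s\n' "$sample" "$fwd" "$dir/${sample}_2.fq.gz"
    done
}

# Import, denoise with DADA2, and build the two summaries asked about above.
# $1 = manifest file, $2 / $3 = forward / reverse truncation lengths.
qiime_sketch() {
    qiime tools import \
        --type 'SampleData[PairedEndSequencesWithQuality]' \
        --input-path "$1" \
        --input-format PairedEndFastqManifestPhred33V2 \
        --output-path demux.qza
    qiime dada2 denoise-paired \
        --i-demultiplexed-seqs demux.qza \
        --p-trunc-len-f "$2" \
        --p-trunc-len-r "$3" \
        --output-dir dada2_out
    # Per-sample read counts at each DADA2 step, for judging read retention.
    qiime metadata tabulate \
        --m-input-file dada2_out/denoising_stats.qza \
        --o-visualization denoising_stats.qzv
    # Representative sequences with summary length statistics.
    qiime feature-table tabulate-seqs \
        --i-data dada2_out/representative_sequences.qza \
        --o-visualization rep_seqs.qzv
}
```

A typical call would be make_manifest raw_seqs > manifest.tsv followed by qiime_sketch manifest.tsv with your chosen truncation lengths. The resulting .qzv files can be opened with qiime tools view.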