Downloading Sequences from NCBI’s SRA

The SRA (Sequence Read Archive, formerly the Short Read Archive) is a repository for data from sequencing projects. Today most journals require data to be deposited before publication, but researchers may also upload data without publishing. The data is public, at least after a set release date, and the idea is that other researchers can download it to replicate the analyses in a publication or to analyze it in a different way. Here I present a method for automating the download of all sequence files from a project and renaming them according to their sample names.

Requirements

This tutorial assumes that you are working in a Linux-like environment with python3 installed. You also need two or three additional programs, which can be installed easily using conda if you do not already have them:

  • SRA Toolkit (provides fastq-dump)
  • E-utils, i.e. NCBI's Entrez Direct (provides esearch and efetch)
  • GNU parallel (optional, but greatly speeds up downloads)

Downloading Project Information

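The following command uses esearch to look up the project (here PRJNA524467) in the SRA database and efetch to save the information for all of its runs as a CSV file:

esearch -db sra -query PRJNA524467 | efetch -format runinfo > runinfo.csv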

Examine the file by opening it in a spreadsheet program. The first column contains the IDs for the run (i.e. sequence) files. These IDs are unique but otherwise meaningless. Search the other columns for sample names or something else that can be used to construct meaningful sample names. Ideally all metadata would be included, but submitters are not required to provide complete metadata. This is frustrating, because without complete metadata it is often impossible to replicate a published result.

In this example, sample names appear in column 30, and we can parse the information we need with the following:

cat runinfo.csv | cut -f 1 -d , | grep SRR > runids.txt

cat runinfo.csv | cut -f 1,30 -d , > run_sample_name.csv
cat run_sample_name.csv | sed 's/,/\t/' | grep SRR > run_sample_names.tsv

The first command extracts the run IDs to the file runids.txt, which will be used to download the sequences. The other two commands create a tab-delimited file, run_sample_names.tsv, associating run IDs with sample names. This file will be used to rename the downloaded sequences.
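Selecting columns by position with cut can go wrong if a metadata field contains a quoted comma or if the column order differs for another project. As a hedged alternative, here is a minimal Python sketch that selects the columns by header name instead. The function name split_runinfo is hypothetical, and the header names "Run" and "SampleName" are assumptions about the runinfo.csv header; check them against your own file.

```python
import csv


def split_runinfo(runinfo_path, ids_path, names_path,
                  run_col="Run", sample_col="SampleName"):
    """Write a run ID list and a run-to-sample TSV; return the number of runs.

    run_col and sample_col are assumed header names, not guaranteed ones.
    """
    rows = []
    with open(runinfo_path, newline="") as handle:
        for record in csv.DictReader(handle):
            run_id = (record.get(run_col) or "").strip()
            sample = (record.get(sample_col) or "").strip()
            # Keep only real run records, similar to the grep SRR filter;
            # this also drops repeated header lines.
            if run_id.startswith("SRR"):
                rows.append((run_id, sample))
    with open(ids_path, "w") as ids_out, open(names_path, "w") as names_out:
        for run_id, sample in rows:
            ids_out.write(run_id + "\n")
            names_out.write(run_id + "\t" + sample + "\n")
    return len(rows)
```

Calling split_runinfo("runinfo.csv", "runids.txt", "run_sample_names.tsv") should then produce the same two files as the shell commands, provided the assumed header names match.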

Download Sequences

We will use fastq-dump to download the sequences. There are two ways of doing this, depending on whether or not you have parallel. In either case, it is prudent to test on a small subset before downloading all of the data.

fastq-dump does not read run IDs from standard input, so you cannot simply pipe the IDs to it with cat. Instead, you have to use a loop that submits each run in turn. First make a shorter version of runids.txt with only 5 lines:

cat runids.txt | head -5 > sample_runids.txt

Then run the following script to download the first 1,000 sequences (spots) for each run ID in that file:

#! /bin/bash
# Download only the first 1,000 spots of each run listed in sample_runids.txt
while read -r f; do
     fastq-dump -X 1000 --split-files "$f"
done <sample_runids.txt

Or instead of writing the above script and running it, you could enter on one line:

while read f; do fastq-dump -X 1000 --split-files "$f"; done <sample_runids.txt

If you have parallel, you can do the same thing with one line:

cat runids.txt | head -5 | parallel fastq-dump --split-files -X 1000 {}

Either way, you should get something like this when you list the fastq files:

-rw-r--r-- 1 john john 634516 Jul 17 15:58 SRR8648699_1.fastq
-rw-r--r-- 1 john john 634516 Jul 17 15:58 SRR8648699_2.fastq
-rw-r--r-- 1 john john 634536 Jul 17 15:58 SRR8648700_1.fastq
-rw-r--r-- 1 john john 634536 Jul 17 15:58 SRR8648700_2.fastq
-rw-r--r-- 1 john john 634494 Jul 17 15:58 SRR8648701_1.fastq
-rw-r--r-- 1 john john 634494 Jul 17 15:58 SRR8648701_2.fastq
-rw-r--r-- 1 john john 634628 Jul 17 15:58 SRR8648702_1.fastq
-rw-r--r-- 1 john john 634628 Jul 17 15:58 SRR8648702_2.fastq
-rw-r--r-- 1 john john 634592 Jul 17 15:58 SRR8648706_1.fastq
-rw-r--r-- 1 john john 634592 Jul 17 15:58 SRR8648706_2.fastq

The files with the suffix “_1” are the forward reads and those with the suffix “_2” are the reverse reads.

If all looks good, you can download all of the sequences by removing the limits on the number of runs and on the number of sequences per run. If you are not using parallel, modify the bash script as follows:

#! /bin/bash
# Download every run listed in runids.txt in full
while read -r f; do
     fastq-dump --split-files "$f"
done <runids.txt

Or use the one-line version of the same:

while read f; do fastq-dump --split-files "$f"; done <runids.txt

Or if using parallel, enter the following:

cat runids.txt | parallel fastq-dump --split-files {}
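A full download can take a long time and may be interrupted, so it can help to check which runs still lack their files. The following is a minimal Python sketch, not part of the SRA Toolkit. The function name incomplete_runs is hypothetical, and the sketch assumes paired-end data, so that every run should yield both a _1.fastq and a _2.fastq file. It only detects missing or empty files. It cannot tell whether a non-empty file was truncated.

```python
import os


def incomplete_runs(ids_path, fastq_dir=".",
                    suffixes=("_1.fastq", "_2.fastq")):
    """Return the run IDs that lack any expected, non-empty FASTQ file.

    Assumes paired-end runs downloaded with --split-files.
    """
    missing = []
    with open(ids_path) as handle:
        for line in handle:
            run_id = line.strip()
            if not run_id:
                continue  # ignore blank lines
            for suffix in suffixes:
                path = os.path.join(fastq_dir, run_id + suffix)
                if not os.path.isfile(path) or os.path.getsize(path) == 0:
                    missing.append(run_id)
                    break  # report each run only once
    return missing
```

The returned IDs could be written to a new file and passed to the same loop or parallel command to download only those runs again.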

Rename the Sequences

Because the sequence file names are not meaningful to us, we would like to rename them by sample name. Above we made a file, run_sample_names.tsv, associating the run IDs with the sample names for this data set. We took the sample names from column 30 of the downloaded file runinfo.csv. This may not work in all cases. You may find something serving as a sample name in a different column, or you might make up sample names by pasting together entries from several metadata columns. In any event, the idea is to create a tab-delimited file associating run IDs with sample names. It should look like this, without a header line:

SRR8648702 112-20-1
SRR8648701 112-20-2
SRR8648700 112-30-1
SRR8648699 112-30-2
SRR8648706 112-40-1
SRR8648705 112-40-2
SRR8648704 113-10-1
SRR8648703 113-10-2
SRR8648708 113-20-1
SRR8648707 113-20-2
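When no single column serves as a sample name, names can be pasted together from several metadata columns. The following Python sketch shows one way this might be done. All function names are hypothetical, the header name "Run" is an assumption about runinfo.csv, and the columns to combine are whatever you choose for your project. Because pasted names are only useful if they are unique, the sketch also reports any name assigned to more than one run.

```python
import csv
import re


def make_sample_names(runinfo_path, columns, sep="-", run_col="Run"):
    """Return {run_id: name}, joining the chosen metadata columns per run."""
    names = {}
    with open(runinfo_path, newline="") as handle:
        for record in csv.DictReader(handle):
            run_id = (record.get(run_col) or "").strip()
            if not run_id.startswith("SRR"):
                continue  # skip blank or repeated header lines
            parts = [(record.get(col) or "").strip() for col in columns]
            name = sep.join(part for part in parts if part)
            # Replace characters that are awkward in file names.
            names[run_id] = re.sub(r"[^A-Za-z0-9._-]+", "-", name)
    return names


def duplicated_names(names):
    """Return the sorted sample names that are assigned to more than one run."""
    seen, repeated = set(), set()
    for name in names.values():
        if name in seen:
            repeated.add(name)
        seen.add(name)
    return sorted(repeated)


def write_tsv(names, out_path):
    """Write one run_id<TAB>sample_name line per run, without a header."""
    with open(out_path, "w") as out:
        for run_id, name in names.items():
            out.write(run_id + "\t" + name + "\n")
```

If duplicated_names returns anything, add another column to the list until every run gets its own name, since two runs with the same name would otherwise collide when the files are renamed.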

I have written the python3 script rename_sra_files.py (download it from GitHub) that takes such a file as input and renames all of the sequence files to the sample names. The script can be stored anywhere, but you must run it from the directory containing the fastq files, for example:

python3 ~/scripts/rename_sra_files.py run_sample_names.tsv

The above files are renamed as:

-rw-r--r-- 1 john john 634628 Jul 17 16:37 112-20-1_1.fastq
-rw-r--r-- 1 john john 634628 Jul 17 16:37 112-20-1_2.fastq
-rw-r--r-- 1 john john 634494 Jul 17 16:37 112-20-2_1.fastq
-rw-r--r-- 1 john john 634494 Jul 17 16:37 112-20-2_2.fastq
-rw-r--r-- 1 john john 634536 Jul 17 16:37 112-30-1_1.fastq
-rw-r--r-- 1 john john 634536 Jul 17 16:37 112-30-1_2.fastq
-rw-r--r-- 1 john john 634516 Jul 17 16:37 112-30-2_1.fastq
-rw-r--r-- 1 john john 634516 Jul 17 16:37 112-30-2_2.fastq
-rw-r--r-- 1 john john 634592 Jul 17 16:37 112-40-1_1.fastq
-rw-r--r-- 1 john john 634592 Jul 17 16:37 112-40-1_2.fastq
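For readers who want to see the idea behind such a renaming step, here is a minimal Python sketch. It is an independent illustration and not the rename_sra_files.py script itself, whose contents are not shown here. The function names are hypothetical, and the sketch assumes files are named as a run ID followed by an underscore and a suffix, as in the listings above.

```python
import os


def plan_renames(tsv_path, fastq_dir="."):
    """Return (old_path, new_path) pairs for files named <run_id>_<rest>."""
    mapping = {}
    with open(tsv_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2 and fields[0] and fields[1]:
                mapping[fields[0]] = fields[1]
    plan = []
    for filename in sorted(os.listdir(fastq_dir)):
        # Split at the first underscore: "SRR8648702_1.fastq" gives
        # run_id "SRR8648702" and rest "1.fastq".
        run_id, underscore, rest = filename.partition("_")
        if underscore and run_id in mapping:
            new_name = mapping[run_id] + "_" + rest
            plan.append((os.path.join(fastq_dir, filename),
                         os.path.join(fastq_dir, new_name)))
    return plan


def apply_renames(plan):
    """Rename each file, refusing to overwrite an existing target."""
    for old_path, new_path in plan:
        if os.path.exists(new_path):
            raise FileExistsError(new_path)
        os.rename(old_path, new_path)
```

Separating the plan from its execution makes it possible to print and inspect the planned renames before any file is touched.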