Use sratoolkit/2.10.7 to download NCBI SRA data

Log on to O2

If you need help connecting to O2, please review the How to login to O2 wiki page.

From Windows, use MobaXterm (preferred) or PuTTY to connect to o2.hms.harvard.edu and make sure the port is set to the default value of 22.

From a Mac Terminal, use the ssh command, inserting your HMS ID instead of user123:

ssh user123@o2.hms.harvard.edu


Start an interactive job and create a working folder



# Request a 12-hour interactive session with 2 GB of memory
srun --pty -p interactive -t 0-12:0:0 --mem 2000MB -n 1 /bin/bash
 
# Create a working folder in scratch (${USER:0:1} expands to the first letter of your username)
mkdir /n/scratch/users/${USER:0:1}/$USER/testSratoolKit
 
cd /n/scratch/users/${USER:0:1}/$USER/testSratoolKit


Set the default cache path. You only need to do this once:


# Load the sratoolkit module
module load sratoolkit/2.10.7

# Set sratoolkit to use /n/scratch/users/${USER:0:1}/$USER/ncbi as cache space
# By default, sratoolkit uses /home/$USER/ncbi as its cache. If you download multiple data sets, your 100G home space will fill up quickly.
mkdir -p ~/.ncbi
# Configure sratoolkit
vdb-config --interactive
 
# Press the x key to quit the interactive configuration screen, then append the cache path to the settings file:
echo /repository/user/main/public/root = \"/n/scratch/users/${USER:0:1}/$USER/ncbi\" >> ~/.ncbi/user-settings.mkfg
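 
To double-check that the cache location took effect, you can print the settings file; it should show the scratch path set above:
 
cat ~/.ncbi/user-settings.mkfg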

Use sratoolkit prefetch to download SRA data, then convert the data from .sra to .fastq format


# Load the sratoolkit module
module load sratoolkit/2.10.7
 
# Use prefetch to download the SRA file.
prefetch SRR5138775
 
# Convert SRA file to FASTQ with fastq-dump.
fastq-dump --split-files SRR5138775
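 
# For paired-end data, --split-files typically writes SRR5138775_1.fastq and
# SRR5138775_2.fastq to the current directory; a quick way to check:
ls -lh SRR5138775_*.fastq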
 
# Note: The default maximum download size is 20G. When downloading a file larger than 20G, prefetch gives an error:
prefetch SRR7890863
...
2020-05-27T17:58:24 prefetch.2.9.0 warn: Maximum file size download limit is 20GB
2020-05-27T17:58:24 prefetch.2.9.0: 1) 'SRR7890863' (29GB) is larger than maximum allowed: skipped
2020-05-27T17:58:25 prefetch.2.9.0: 'SRR7890863' has no remote vdbcache
...
Download of some files was skipped because they are too large
You can change size download limit by setting
--min-size and --max-size command line arguments
 
 
# To download it, raise the limit with --max-size, e.g. --max-size 35G:
prefetch --max-size 35G SRR7890863
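 
# Optional: if scratch space is tight, fastq-dump can write gzip-compressed FASTQ directly
fastq-dump --split-files --gzip SRR7890863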


Additional tips: 

  1. If you need to download a lot of data, run the screen command before starting the interactive job to keep the session alive: 
    screen: Keep Linux Sessions Alive (so you can go back to the same terminal window from anywhere, anytime)
  2. If you have a lot of samples to download, running the prefetch command one by one is a lot of work. To automate the process, you can find the accession IDs on the NCBI website and put them in a loop. For example, to download SRR6519510 to SRR6519519:
    for i in {6519510..6519519}; do
         prefetch SRR$i;
    done
  3. If you have more than a dozen samples to download, running them one by one takes a lot of time. You can instead run them in parallel: for example, submit 5 batch jobs and let each job work on 100 accession IDs (see the sketch after this list). Because these 5 jobs share the same network connection from O2 to the NCBI cloud, the parallel prefetch commands will run slower than in serial mode. Please share your experience.
  4. Let us know if you have any questions.
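 
A rough sketch of tip 3, assuming you have split your accession IDs into files such as accessions_1.txt through accessions_5.txt (placeholder names, one accession per line). Save the script below as download_sra.sh, adjusting the partition, time, and memory to your needs:
 
#!/bin/bash
# download_sra.sh -- prefetch every accession listed in the file given as the first argument
#SBATCH -p short
#SBATCH -t 0-12:0:0
#SBATCH --mem 2000MB
#SBATCH -n 1
 
module load sratoolkit/2.10.7
 
while read -r acc; do
    prefetch --max-size 35G "$acc"
done < "$1"
 
Then submit one job per accession list:
 
for n in {1..5}; do
    sbatch download_sra.sh accessions_$n.txt
done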