How do I download sequence records from the web in the NCBI Nucleotide and Protein databases?

After you have accessed the set of records in the Nucleotide or Protein database that you want to download (example), use the Send to link. The link is located on the right side of the screen, above the records, and opens a menu with several options. In the Nucleotide database, the menu provides three record-downloading paths. This approach works best for sets containing up to ~1000 sequence records. See below for better options for downloading larger sets.

Path 1: Downloading complete records

Select Complete Record -> File (as your Destination) -> Format
From the Format pull-down menu, select one of the twelve available display formats (including GenBank, FASTA, various flavors of XML, and GFF3).
Use the Sort by menu to specify how the records will be sorted in your download.
Click the Create File button and choose a location on your local computer to store the file.

Path 2: Downloading coding sequences
Use this path when sequences (records) are annotated with coding regions (the CDS feature) and non-coding regions, but you want the coding regions only. Examples are records for larger genomic sequences that encompass more than a single CDS feature, and also mRNA records that contain the untranslated regions (5' UTR and 3' UTR) in addition to the coding region.

Select Coding Sequences -> Format
From the Format pull-down menu select one of the two formats that are available for this path: FASTA Nucleotide or FASTA Protein
Click the Create File button and choose a location on your local computer to store the file.

Path 3: Downloading Gene Features
Use this path for larger genomic sequences (records) that are annotated with several gene features when you want to exclude the sequence of intergenic regions.

Select Gene Features -> Format
The Format pull-down menu will offer the single available format for this path: FASTA Nucleotide.
Click the Create File button and choose a location on your local computer to store the file.

The Protein database offers a single path, with steps akin to Path 1 in the Nucleotide database.

Use Batch Entrez for larger sets (up to ~10,000 records):

If you experience a server time-out when trying to download your set, use Path 1 and choose Accession List as the download format. This format produces the smallest possible file for a given set.
Split the list into several smaller files (batches). You will need to determine the size of each batch empirically.
Proceed to the Batch Entrez tool and follow the instructions on the page to display the records from a batch on the web.
Once you have retrieved the records, use the Send to menu, choose File as the destination, and select the path and format that you ultimately need.
Repeat the whole process for your next batch.
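
The list-splitting step above can be scripted. The following is a minimal Python sketch, not an NCBI tool: it assumes the downloaded accession list is a plain text file with one accession per line, and the function names, the file prefix, and the batch size are hypothetical choices made for illustration only.

```python
from pathlib import Path


def read_accessions(path):
    """Return the non-empty, whitespace-stripped lines of an accession list file."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def split_into_batches(accessions, batch_size):
    """Split a list of accessions into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        accessions[start:start + batch_size]
        for start in range(0, len(accessions), batch_size)
    ]


def write_batches(batches, out_dir, prefix="batch"):
    """Write each batch to its own text file (one accession per line).

    The file naming scheme (prefix_001.txt, prefix_002.txt, ...) is an
    assumption of this sketch, not something Batch Entrez requires.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, batch in enumerate(batches, start=1):
        path = out_dir / f"{prefix}_{number:03d}.txt"
        path.write_text("\n".join(batch) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

For example, `write_batches(split_into_batches(read_accessions("sequence.seq"), 500), "batches")` would write files of up to 500 accessions each; the input file name and the batch size of 500 are placeholders. Each resulting file can then be uploaded to Batch Entrez in turn, and the batch size adjusted if time-outs persist.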

For large data downloads, consider these alternatives to the sequence downloads from the Nucleotide and Protein databases:

Entrez Programming Utilities (E-utilities)/Entrez Direct (EDirect); this service works for all Entrez databases.
Assembly download service for data associated with genome assemblies
RefSeq FTP site for Reference Sequences (curated NCBI records)
GenBank FTP site for primary (submitted) sequence records
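
As an illustration of the E-utilities option, the sketch below builds an EFetch request for a list of accessions and saves the response to a file. It is a hedged example rather than an official client: the helper names are hypothetical, `nuccore` and `protein` are the E-utilities names for the Nucleotide and Protein databases, and `rettype` values such as `fasta` or `gb` select the record format. Very long accession lists should still be sent in batches, and NCBI's usage guidelines for request frequency apply.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accessions, db="nuccore", rettype="fasta"):
    """Build an EFetch URL for the given accessions.

    db      -- "nuccore" (Nucleotide) or "protein" (Protein)
    rettype -- record format, e.g. "fasta" or "gb" (GenBank flat file)
    """
    params = {
        "db": db,
        "id": ",".join(accessions),  # EFetch accepts a comma-separated ID list
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def download_records(accessions, out_path, db="nuccore", rettype="fasta"):
    """Fetch the records over the network and write the raw response to out_path."""
    url = build_efetch_url(accessions, db=db, rettype=rettype)
    with urlopen(url) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

A call such as `download_records(batch, "batch_001.fasta")`, where `batch` is one list of accessions, would write that batch as FASTA; the output file name is a placeholder.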

Keywords: NCBI Nucleotide database, NCBI Protein database, downloading records, web download, API, Batch Entrez
