How can I download a list of IDs for all sequences from a specific organism or taxonomic group at NCBI?

Use one of these three approaches:

(1) Directly from the web; suitable only for organisms or taxonomic groups that have a relatively small number of sequence records in the Nucleotide or Protein database:

Access the sequence database that you want on the web, for example Nucleotide.
Search for your organism by entering its name restricted to the organism field, for example:

Salarchaeum japonicum[organism]

Click the Send to link (at the top right, above the results on the search results page) and select File.
Select either Accession List or GI List as your Format and use the Create File button to download the list.

(2) E-utilities; use the NCBI E-utilities API for organisms or taxonomic groups that have a large number of sequence records in the Nucleotide or Protein database:

Use esearch to run the search, for example for all Archaea sequence records in the Nucleotide database. Example search URL:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term=Archaea[organism]&usehistory=y

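Setting usehistory=y stores the search results on the Entrez history server. The esearch output then includes a Web environment string (WebEnv) and a query key (QueryKey), which you pass on as the &WebEnv and &query_key parameters to specify the location of the retrieved UIDs (GIs) on that server.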
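
The esearch step above can be sketched in Python as follows. This is a minimal illustration, not an official client: the function names build_esearch_url and parse_history are hypothetical, and SAMPLE_XML is a made-up, trimmed response with a placeholder WebEnv value (real responses contain additional elements and a much longer WebEnv string).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URL of the E-utilities server, as used in the example above.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term):
    """Build an esearch URL that stores the results on the history server."""
    params = {"db": db, "term": term, "usehistory": "y"}
    # urlencode percent-encodes the square brackets in the search term.
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_history(esearch_xml):
    """Extract the record count, query key, and WebEnv from esearch XML."""
    root = ET.fromstring(esearch_xml)
    return {
        "count": int(root.findtext("Count")),
        "query_key": root.findtext("QueryKey"),
        "webenv": root.findtext("WebEnv"),
    }


# Illustrative response only (assumption): values are placeholders.
SAMPLE_XML = """<eSearchResult>
  <Count>1234</Count>
  <RetMax>20</RetMax>
  <RetStart>0</RetStart>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV_STRING</WebEnv>
</eSearchResult>"""

url = build_esearch_url("nucleotide", "Archaea[organism]")
history = parse_history(SAMPLE_XML)
print(url)
print(history)
```

To run this against the live server, the XML could be fetched with urllib.request.urlopen(url).read() and passed to parse_history in place of SAMPLE_XML.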

Follow with efetch. Your URL should include the query key number and the web environment (WebEnv) string generated by esearch. Specify the rettype as uilist and retmode as text. Example:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&query_key=<key>&WebEnv=<webenv string>&rettype=uilist&retmode=text
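
For large result sets, the efetch call is usually issued in batches using the retstart and retmax parameters. The sketch below only builds the batch URLs; the function name build_efetch_urls, the default batch size of 10000, and the placeholder WebEnv value are assumptions for illustration, so check the current E-utilities documentation for the retmax limit that applies to your request.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities server.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_urls(db, query_key, webenv, count,
                      batch_size=10000, rettype="uilist"):
    """Return one efetch URL per batch of records stored on the history server.

    count is the total number of records reported by esearch;
    batch_size is an assumed per-request limit.
    """
    urls = []
    for start in range(0, count, batch_size):
        params = {
            "db": db,
            "query_key": query_key,
            "WebEnv": webenv,
            "rettype": rettype,
            "retmode": "text",
            "retstart": start,    # index of the first record in this batch
            "retmax": batch_size  # maximum number of records in this batch
        }
        urls.append(f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}")
    return urls


# Placeholder history values (assumption): take real ones from esearch output.
urls = build_efetch_urls("nucleotide", "1", "EXAMPLE_WEBENV", count=25000)
for u in urls:
    print(u)
```

Each URL returns a plain-text list of IDs, one per line; appending the responses in order gives the complete list.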

(3) EDirect; use Entrez Direct (EDirect) as the UNIX command line alternative to E-utilities:

EDirect provides command-line access for searching and retrieving records in NCBI databases. It is run from a UNIX command line, so you need access to a UNIX/Linux terminal. EDirect runs on UNIX and Macintosh computers that have the Perl language installed, and under the Cygwin UNIX-emulation environment on Windows PCs.

The following command lines generate the GI list and the accession list, respectively, for all Archaea records in the Protein database:
esearch -db protein -query "Archaea[organism]" | efetch -db protein -format uid > archaea.gis
esearch -db protein -query "Archaea[organism]" | efetch -db protein -format acc > archaea.acc

Keywords: NCBI API, EDirect, E-utilities, organism search, ID download, NCBI Nucleotide database, NCBI Protein database
