
How many records are in the NCBI Protein database, and does each record represent a unique sequence?

To find the current number of records in the Protein database, search the database with the following term*:

all[filter]
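The same count can also be requested programmatically. The following is a minimal sketch, assuming the public NCBI E-utilities ESearch endpoint with `rettype=count`, which returns a small XML document containing a `Count` element. The function names (`build_count_url`, `parse_count`, `fetch_record_count`) are illustrative and are not part of any NCBI library.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Assumption: the public E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(db="protein", term="all[filter]"):
    """Build an ESearch URL that asks only for the number of matching records."""
    params = {"db": db, "term": term, "rettype": "count"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


def parse_count(xml_data):
    """Extract the integer in the <Count> child of the ESearch XML response."""
    root = ET.fromstring(xml_data)
    return int(root.findtext("Count"))


def fetch_record_count(db="protein"):
    """Query ESearch over the network and return the current record count."""
    with urlopen(build_count_url(db)) as response:
        return parse_count(response.read())
```

Calling `fetch_record_count("protein")` needs network access and returns whatever the count is at the time of the request. Passing a different database name (for example `"nucleotide"`) reuses the same search term for that database.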

While each of the records that you retrieve is unique (designated with a unique accession number), the sequence in a record may be identical to that in another record. The Protein database is redundant and contains a profusion of identical protein sequences. The records in the Protein database originate from several sources. Many protein sequences originate from computationally translated coding regions (CDS) that are annotated on the GenBank (INSDC) sequence records in the Nucleotide database. GenBank (INSDC) is a primary repository of nucleotide sequences and it accepts new sequences even if they are identical to those submitted previously.
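To make the distinction between unique records and unique sequences concrete, here is a small sketch that groups records by sequence. The accessions and sequences are made-up placeholders rather than real database entries, and `group_by_sequence` is a hypothetical helper written for this illustration.

```python
from collections import defaultdict


def group_by_sequence(records):
    """Map each distinct sequence to the list of accessions that carry it."""
    groups = defaultdict(list)
    for accession, sequence in records:
        groups[sequence.upper()].append(accession)
    return dict(groups)


# Placeholder data: three records with unique accessions,
# two of which carry the same protein sequence.
records = [
    ("ACC_0001", "MKTAYIAKQR"),
    ("ACC_0002", "MKTAYIAKQR"),
    ("ACC_0003", "MSLNEQWRTG"),
]

groups = group_by_sequence(records)

# Sequences shared by more than one record.
redundant = {seq: accs for seq, accs in groups.items() if len(accs) > 1}

print(len(records), "records,", len(groups), "distinct sequences")
# prints: 3 records, 2 distinct sequences
```

Every accession appears exactly once, yet the number of distinct sequences is smaller than the number of records. That is the sense in which the database is redundant.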

Even though the sequences of two records may be identical, the records themselves may differ in how much other information they provide. The curated records, those from the RefSeq source database at NCBI and those from the Universal Protein Resource Knowledgebase (UniProtKB), are generally more informative than the GenBank-based records. By comparison, the curated records:

  • are updated more regularly
  • provide more accurate names for the proteins
  • provide more extensive designations of individual functional regions and sites on the protein sequence (such as conserved domains, signal and mature peptides, and various binding sites)
  • link to more related information at NCBI and elsewhere

Also important: RefSeq and Swiss-Prot are both intrinsically non-redundant. 

If you are focusing on studying protein function, consider:

*You can use the same search term to determine the current number of records in any of the Entrez databases.