Print Article: KA-03437

What are NCBI Reference Sequence (RefSeq) accession numbers and what information is embedded in their format?

NCBI Reference Sequence accession numbers (or RefSeq accessions) uniquely identify sequence records that NCBI derives from selected GenBank records. Because GenBank is a highly redundant database, NCBI creates RefSeq records to provide a less redundant representation of naturally occurring nucleic acid and protein molecules. RefSeq records also allow NCBI to update annotation and perform other maintenance independently of the primary data.
 

The generic format of a RefSeq accession is as follows:

[two-letter alphabetical prefix][ _ ][series of digits or alphanumeric characters][.][version number]
 

Some examples of Nucleotide RefSeq accession numbers are NM_001744.6, NC_003619.1, NG_009904.1, NR_135858.1, and NZ_CASIGT010000001.1, while NP_001735.1 and WP_228380365.1 represent RefSeq records in the Protein database.
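
The generic format above can be checked mechanically. The following Python sketch splits an accession into its prefix, body, and version. The regular expression and the names `parse_refseq` and `RefSeqAccession` are illustrative assumptions based only on the format described in this article; they are not an official NCBI validator.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: two uppercase letters, an underscore, an alphanumeric body,
# and an optional ".version" suffix. This mirrors the generic format described
# in the article and is not an official NCBI specification.
_REFSEQ_RE = re.compile(r"^([A-Z]{2})_([A-Z0-9]+)(?:\.(\d+))?$")


class RefSeqAccession(NamedTuple):
    prefix: str             # e.g. "NM"
    body: str               # e.g. "001744"
    version: Optional[int]  # None when the version was omitted


def parse_refseq(text: str) -> Optional[RefSeqAccession]:
    """Return the parts of a RefSeq-style accession, or None if it does not match."""
    match = _REFSEQ_RE.match(text.strip())
    if match is None:
        return None
    prefix, body, version = match.groups()
    return RefSeqAccession(prefix, body, int(version) if version else None)


for example in ["NM_001744.6", "NZ_CASIGT010000001.1", "WP_228380365.1"]:
    print(example, parse_refseq(example))
```

A string without the underscore, such as a GenBank-style accession, returns `None`, and an accession written without its version is parsed with `version` set to `None`, which makes a missing version easy to flag.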


How can the accession format help you recognize RefSeq data?

You can quickly recognize a RefSeq accession by the underscore ( _ ) placed between the alphabetical prefix and the remaining alphanumeric characters. To ensure accuracy, you should keep the underscore and the version number when you communicate about a RefSeq record.
RefSeq alphabetic prefixes embed two types of information: (1) the molecule type and (2) the curation status of the record. For example, the “NC_” prefix represents chromosome records, while the “XM_” prefix represents predicted messenger RNA (mRNA). For more details see:
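
As a rough illustration of how a prefix encodes both molecule type and status, the Python sketch below maps a few prefixes to short descriptions. The table is deliberately partial: the "NC" and "XM" entries restate the examples in this article, while the other entries are commonly documented RefSeq prefixes included here as assumptions. The names `PREFIX_INFO` and `describe_prefix` are hypothetical.

```python
from typing import Optional, Tuple

# Partial, illustrative lookup: prefix -> (molecule type, status note).
# "NC" and "XM" come from the article; the rest are assumptions based on
# commonly documented RefSeq prefixes and should be checked against NCBI's
# own prefix table before being relied on.
PREFIX_INFO = {
    "NC": ("genomic", "chromosome record"),
    "NG": ("genomic", "genomic region"),
    "NM": ("mRNA", "known transcript"),
    "NR": ("RNA", "known non-coding transcript"),
    "NP": ("protein", "known protein"),
    "XM": ("mRNA", "predicted transcript"),
    "XR": ("RNA", "predicted non-coding transcript"),
    "XP": ("protein", "predicted protein"),
}


def describe_prefix(accession: str) -> Optional[Tuple[str, str]]:
    """Look up the text before the underscore; return None if it is not in the table."""
    prefix, underscore, _rest = accession.partition("_")
    if not underscore:
        return None  # no underscore, so not a RefSeq-style accession
    return PREFIX_INFO.get(prefix)


print(describe_prefix("NC_003619.1"))
print(describe_prefix("XM_000000.1"))  # made-up accession, used only for its prefix
```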

 

Where can you learn more?

Knowledge articles:

GenBank (INSDC) and RefSeq: