
What are NCBI Reference Sequence (RefSeq) accession numbers and what information is embedded in their format?

Reference Sequence (RefSeq) accession numbers are distinctly formatted sequence accession numbers assigned to the sequence records that NCBI Reference Sequence staff derive from primary sequence records (GenBank records or records deposited through other collaborating databases). NCBI creates RefSeq records (known as RefSeqs) to provide a less redundant representation of naturally occurring nucleic acid and protein molecules than GenBank, which is a highly redundant database. RefSeqs also allow annotation updates and other maintenance independently of the primary data.

The format of a RefSeq sequence accession number is: [two-letter alphabetical prefix][ _ ][series of digits or alphanumerical characters][.][version number]

Some examples of nucleotide sequence RefSeq accession numbers: NM_001744.6, NC_003619.1, NG_009904.1, NR_135858.1, and NZ_CASIGT010000001.1

You can quickly recognize a RefSeq sequence accession by the underscore ( _ ) placed between the prefix and the remaining characters. The remaining characters can be digits only, as in the first four examples. The last accession, NZ_CASIGT010000001.1, represents a RefSeq whole genome shotgun (WGS) record, with "NZ_" prepended to the accession number of the underlying GenBank record.

Only RefSeq accessions have underscores, so do not omit the underscore when recording or reporting a RefSeq accession number. Always include the version number as well, so that the exact record can be tracked.
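
The format described above can be checked programmatically. The following Python code is a minimal sketch, not an official NCBI tool: the names REFSEQ_PATTERN, RefSeqAccession, and parse_refseq_accession are hypothetical, and the pattern assumes an uppercase two-letter prefix, an uppercase alphanumerical body, and a mandatory numeric version, which is all that the format and the examples in this article show.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, derived only from the format given in this article:
# [two-letter prefix][_][digits or alphanumerical characters][.][version]
REFSEQ_PATTERN = re.compile(r"([A-Z]{2})_([A-Z0-9]+)\.(\d+)")


class RefSeqAccession(NamedTuple):
    prefix: str   # two-letter prefix, e.g. "NM"
    body: str     # digits or alphanumerical characters after the underscore
    version: int  # version number after the dot


def parse_refseq_accession(text: str) -> Optional[RefSeqAccession]:
    """Split a RefSeq accession into its parts, or return None if the
    text does not match the assumed format (for example, when the
    underscore or the version number is missing)."""
    match = REFSEQ_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    prefix, body, version = match.groups()
    return RefSeqAccession(prefix, body, int(version))


if __name__ == "__main__":
    for accession in ["NM_001744.6", "NZ_CASIGT010000001.1", "NM_001744"]:
        print(accession, "->", parse_refseq_accession(accession))
```

With this sketch, an accession recorded without its underscore or without its version number is rejected rather than silently accepted, which mirrors the recording advice above.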

While the alphabetical prefix of a GenBank accession number carries little meaning, the two-letter prefix of a RefSeq accession number encodes information on the molecule type and curation status.