Print Article: KA-03578

What are reference genomes and how can you find these at NCBI?

NCBI archives over two million prokaryotic genomes and thousands of eukaryotic and viral/viroid genomes. You can encounter hundreds of genome assemblies just for a single species! Among all these assemblies, NCBI selects the best genome of a species to serve the research community as the reference genome assembly (or reference genome or reference assembly). With some exceptions, you will find one reference genome per species. Examples of species with two reference genomes are Escherichia coli with references for pathogenic and non-pathogenic strains and Canis lupus with references for the two subspecies, dog and dingo.

NCBI uses several methods and rules when selecting and classifying a genome as the reference. In addition to computational approaches, NCBI also receives input from the research community. You can see the details of these methods and rules in the following document:

 

How can you find and recognize reference genomes?

The NCBI Datasets service provides access to all genome assemblies. For example, searching for Homo sapiens returns a table listing over a thousand human genomes. The human reference assembly, GRCh38.p14, appears at the very top of the table. It is the only assembly with a prominent checkmark by its name in the first column. NCBI added these checkmarks to all reference genomes for easy recognition. Within the GRCh38.p14 record, you will also see the assembly marked with the term “reference”.
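The same lookup can also be scripted. The sketch below is a minimal illustration, not official NCBI code. It assumes the NCBI Datasets v2 REST API exposes a genome dataset report at the path shown, accepts a `filters.reference_only` query parameter, and returns JSON with a `reports` list whose entries carry `accession`, `assembly_info.assembly_name`, and `assembly_info.refseq_category`. Check those names against the current NCBI Datasets documentation before relying on them. The function names and the sample response are hypothetical and written by hand for illustration.

```python
from urllib.parse import quote, urlencode

# Assumption: base URL of the NCBI Datasets v2 REST API.
BASE_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2"

# Assumption: the category string used to mark a reference genome in a report.
REFERENCE_CATEGORY = "reference genome"


def reference_genome_url(taxon):
    """Build a dataset report URL restricted to reference genomes of a taxon."""
    query = urlencode({"filters.reference_only": "true"})
    return f"{BASE_URL}/genome/taxon/{quote(taxon)}/dataset_report?{query}"


def reference_assemblies(response):
    """Return (accession, assembly name) pairs for reports marked as reference."""
    pairs = []
    for report in response.get("reports", []):
        info = report.get("assembly_info", {})
        if info.get("refseq_category") == REFERENCE_CATEGORY:
            pairs.append((report.get("accession", ""), info.get("assembly_name", "")))
    return pairs


# Hand-written stand-in for a parsed JSON response (not real API output).
sample_response = {
    "reports": [
        {
            "accession": "GCF_000001405.40",
            "assembly_info": {
                "assembly_name": "GRCh38.p14",
                "refseq_category": REFERENCE_CATEGORY,
            },
        },
        {
            "accession": "GCF_009914755.1",
            "assembly_info": {"assembly_name": "T2T-CHM13v2.0"},
        },
    ]
}

print(reference_genome_url("Homo sapiens"))
for accession, name in reference_assemblies(sample_response):
    print(accession, name)
```

To query the live service, fetch the generated URL with any HTTP client and pass the parsed JSON to `reference_assemblies`. The filtering step is kept separate from the network call so it can be checked offline.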

 

Are reference genomes the same as RefSeq genomes?

No. Reference genomes and RefSeq genomes are two separate concepts. A reference genome is a single representative assembly chosen for a species, while RefSeq includes a broader collection of high-quality annotated genomes. RefSeq encompasses all sequences that NCBI derives from primary GenBank (INSDC) data, and that includes RefSeq genomes. However, a RefSeq genome may not be a reference genome. For example, NCBI established two RefSeq assemblies for human: the aforementioned GRCh38.p14 and the T2T-CHM13v2.0 assembly. Of these, only GRCh38.p14 serves as the reference genome. Moreover, NCBI derives RefSeq assemblies for any prokaryote genome that meets the RefSeq quality standards. Hence, there can be numerous RefSeq genomes for a bacterial species, but only one serves as the reference genome. In rare cases, a reference genome may not even be a part of the RefSeq collection!
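One way to picture the distinction is to treat "is a RefSeq assembly" and "is the reference genome" as two independent properties of an assembly record. The sketch below is illustrative only. The `Assembly` class and its `is_reference` field are hypothetical names, and the RefSeq check relies on the convention that RefSeq assembly accessions begin with `GCF_` while GenBank assembly accessions begin with `GCA_`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assembly:
    """Hypothetical record for one genome assembly."""
    name: str
    accession: str
    is_reference: bool  # hypothetical flag for the "reference" designation

    @property
    def is_refseq(self):
        # RefSeq assembly accessions start with GCF_; GenBank ones start with GCA_.
        return self.accession.startswith("GCF_")


# The two human RefSeq assemblies named in the text above.
human = [
    Assembly("GRCh38.p14", "GCF_000001405.40", is_reference=True),
    Assembly("T2T-CHM13v2.0", "GCF_009914755.1", is_reference=False),
]

refseq_names = [a.name for a in human if a.is_refseq]
reference_names = [a.name for a in human if a.is_reference]

print("RefSeq assemblies:", refseq_names)
print("Reference genome:", reference_names)
```

Both human assemblies pass the RefSeq check, but only one carries the reference designation, which mirrors the human example in the paragraph above.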


Are reference genomes “forever”?

No. Researchers keep submitting genomes of new species, which leads to new reference genomes being established. Moreover, NCBI may receive a higher-quality assembly for a species that already has a reference genome. In such a case, the current reference genome may be replaced.
 

Where can you learn more?

Knowledge articles:

Blogs relating to reference genomes, including a blog on:

NCBI Datasets documentation:

GenBank and RefSeq: