The term genome assembly can have two different meanings. It can (1) refer to a process in which researchers assemble genome sequences from smaller components, or it can (2) refer to the entire collection of sequences that represent a genome. In this article, we address the latter concept (see the link to the related article on the assembly process below).
But first, what is a genome? The term “genome” can also have different meanings, again depending on the context. When referring to organisms, a genome is the complete set of DNA (genetic material) contained within a cell.
How many sequences are there for a genome assembly?
The number of sequences in an assembly depends on:
For example, the biological number of nucleotide molecules for human is 24: the 22 autosomes (numbered 1 to 22) plus the X and Y chromosomes. The T2T-CHM13v2.0 human genome assembly is complete, so it contains 24 sequences (24 sequence records) in its collection.
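One way to check how many sequences an assembly contains is to count the records in its FASTA file, since each record starts with a ">" header line. The snippet below is a minimal sketch under that assumption. The function name `count_fasta_records` and the toy record names are hypothetical, and the three-record input is made-up data standing in for a real assembly file.

```python
import io


def count_fasta_records(handle):
    """Count sequence records in a FASTA stream (one per '>' header line)."""
    return sum(1 for line in handle if line.startswith(">"))


# Expected record count for a complete human assembly, per the text:
# 22 autosomes plus the X and Y chromosomes.
expected_human_records = 22 + 2

# Toy input: three made-up records, NOT real assembly data.
toy_fasta = io.StringIO(
    ">example_chr_1\nACGTACGT\nACGT\n"
    ">example_chr_2\nGGGCCC\n"
    ">example_chr_3\nTTTT\n"
)
record_count = count_fasta_records(toy_fasta)
print(record_count, expected_human_records)
```

For a real assembly you would pass an open file handle instead of the in-memory string. Sequence lines that wrap across several lines do not affect the count, because only header lines are tallied.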
Scientists were only recently able to use sequencing technology that enabled the complete, gapless assembly of the human genome. Throughout more than two decades of pioneering the human reference genome model (currently GRCh38.p14), the many researchers involved did not have the luxury of the current technology.
While you will find 24 assembled chromosomes for GRCh38.p14, note that there are still sequencing gaps in these chromosomes. These gaps exist because some regions of the genome are highly repetitive, structurally complex, or otherwise challenging to sequence and assemble. Sequencing and assembly challenges also resulted in separate sequence scaffolds that could not be confidently placed or ordered within a chromosome.
In addition to the primary assembly, the GRCh38.p14 assembly contains separate sequences for alternate representations of certain genomic regions, and for patches containing sequence corrections or novel sequences. Finally, a mitochondrial genome completes the GRCh38.p14 assembly, for a total of 705 separate sequence records in the Nucleotide database.
If you examine the Datasets record of GRCh38.p14, you can also see that the assembly is marked as haploid. That means it represents only one set of chromosomes, whereas most human cells contain a diploid genome of 46 chromosomes (23 pairs). Another human genome assembly may have a diploid representation, which alone would double the number of sequences in the assembly.
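To see how the different kinds of sequences add up to an assembly's total record count, you can tally records by their role. The sketch below assumes a simplified two-column, tab-separated table (sequence name, sequence role) loosely modeled on an assembly report. The function name `tally_sequence_roles`, the role labels, and every row are illustrative assumptions, not the actual GRCh38.p14 breakdown.

```python
from collections import Counter


def tally_sequence_roles(report_text):
    """Count records per role in a tab-separated (name, role) table.

    Blank lines and lines starting with '#' are treated as comments.
    """
    roles = Counter()
    for line in report_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        roles[fields[1]] += 1
    return roles


# Toy table: made-up rows, NOT the real GRCh38.p14 contents.
toy_report = "\n".join([
    "# Sequence-Name\tSequence-Role",
    "example_chr1\tassembled-molecule",
    "example_chr2\tassembled-molecule",
    "example_scaffold_a\tunplaced-scaffold",
    "example_alt_a\talt-scaffold",
    "example_patch_a\tfix-patch",
])
role_counts = tally_sequence_roles(toy_report)
total_records = sum(role_counts.values())

for role, count in sorted(role_counts.items()):
    print(f"{role}\t{count}")
print(f"total\t{total_records}")
```

Summing the per-role counts gives the total number of sequence records, which is how an assembly with 24 chromosomes can still contain hundreds of records.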
NCBI archives sequenced genomes from all domains of life. Genomes vary in size and complexity. Let’s conclude with those that have a single biological molecule for the entire genome. Many viruses and viroids have a genome that consists of a single segment, and many bacteria have a single circular chromosome. In all such instances, the entire assembly collection can end up being a single sequence record! For example, the rubella virus genome consists of a single molecule that is less than 10 kilobases in size. You can access an entire rubella genome sequence through a single record in the Nucleotide database.
Where can you learn more?
Knowledge articles: