Knowledge Article · NLM Customer Support Center

Article: KA-03448

How are GenBank protein sequences determined?

Protein sequences that originate in the GenBank (INSDC) source database are not determined by direct protein sequencing. Rather, they result from computational translation of the coding regions (CDS) that the submitters of the records or, in some cases, NCBI, annotate on the Nucleotide sequence records.

For example, there are 13 CDS regions annotated on the DQ489526.1 Nucleotide sequence. To see them, scroll to the 'FEATURES' section of the record. The first CDS annotated along the sequence is that for NADH dehydrogenase subunit 1 (ND1), spanning bases 2784 to 3740. It is translated into the corresponding protein sequence by using the vertebrate mitochondrial genetic code (transl_table=2). The resulting protein is assigned the accession number ABE99438.1 and is represented as a record in the Protein database. An individual Protein record is generated for each translated CDS. Hence, there are 13 individual Protein records in this example.
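
The translation step described above can be illustrated with a minimal Python sketch. It is not NCBI's actual pipeline. The function names (build_table, extract_cds, translate) and the toy sequence are hypothetical and chosen only for illustration; the two amino acid strings are the compact NCBI representations of genetic codes 1 and 2, with codons ordered over the bases T, C, A, G. The sketch assumes a simple forward-strand CDS with 1-based, inclusive coordinates, as in the 2784..3740 example.

```python
from itertools import product

BASES = "TCAG"  # codon order used by the NCBI genetic code tables

# Amino acids for the 64 codons TTT, TTC, TTA, TTG, TCT, ... GGG.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"   # transl_table=1
VERT_MITO_AAS = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"  # transl_table=2


def build_table(aas):
    """Map each codon to its amino acid letter ('*' marks a stop codon)."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, aas))


def extract_cds(sequence, start, end):
    """Return bases start..end, using 1-based inclusive coordinates."""
    return sequence[start - 1:end]


def translate(cds, table):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    usable = len(cds) - len(cds) % 3  # ignore a trailing incomplete codon
    for i in range(0, usable, 3):
        aa = table.get(cds[i:i + 3].upper(), "X")  # 'X' for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


STANDARD = build_table(STANDARD_AAS)
VERT_MITO = build_table(VERT_MITO_AAS)

# Toy record (hypothetical): two flanking bases, a 12-base CDS, two more bases.
toy_record = "GG" + "ATGATATGAAGA" + "CC"
toy_cds = extract_cds(toy_record, 3, 14)

mito_protein = translate(toy_cds, VERT_MITO)     # ATA -> M, TGA -> W, AGA -> stop
standard_protein = translate(toy_cds, STANDARD)  # ATA -> I, TGA -> stop

# Length of the span quoted in the article: 2784..3740 inclusive.
nd1_span_length = 3740 - 2784 + 1
```

The toy CDS is chosen so that the two tables disagree: under transl_table=2 the codons ATA, TGA and AGA read as M, W and stop, whereas under the standard code they read as I, stop and R. This is why the transl_table qualifier on a CDS feature matters for the resulting protein sequence.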

A Nucleotide sequence can contain one or more CDS regions. Or it may represent only a partial CDS (example), resulting in a partial protein sequence (example). Note the < and/or > symbols in the two example records. These symbols mark a CDS that is partial at its 5' and/or 3' end, meaning that the coding region extends beyond the sequence shown in the record.
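
As an illustration of how the < and > symbols can be read programmatically, here is a small Python sketch. It is a simplified assumption, not an NCBI tool: the names Span and parse_simple_span are hypothetical, the location strings in the usage lines are made-up examples, and the parser handles only simple forward-strand spans of the form start..end. Compound locations such as join(...) or complement(...) are deliberately rejected.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    start: int        # 1-based first base of the annotated span
    end: int          # 1-based last base of the annotated span
    partial_5: bool   # True if '<' precedes the start coordinate
    partial_3: bool   # True if '>' precedes the end coordinate


# Matches simple spans only, e.g. "2784..3740", "<1..206" or "<1..>500".
_SIMPLE_SPAN = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")


def parse_simple_span(location):
    """Parse a simple forward-strand location string into a Span."""
    match = _SIMPLE_SPAN.match(location.strip())
    if match is None:
        raise ValueError(f"not a simple start..end location: {location!r}")
    lt, start, gt, end = match.groups()
    return Span(int(start), int(end), lt == "<", gt == ">")


complete = parse_simple_span("2784..3740")   # span from the ND1 example
partial_both = parse_simple_span("<1..>500")  # hypothetical partial CDS
partial_start = parse_simple_span("<1..206")  # hypothetical 5' partial CDS
```

For a forward-strand feature, partial_5 corresponds to a CDS whose start lies upstream of the record, and partial_3 to one whose end lies downstream of it.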