Print Article: KA-03448

How are GenBank protein sequences determined?

Protein sequences that originate in the GenBank (INSDC) source database are not determined by direct protein sequencing. Rather, they result from computational translation of the coding regions (CDS) that the submitters of the records or, in some cases, NCBI, annotate on the Nucleotide sequence records.

For example, there are 13 CDS regions annotated on the DQ489526.1 Nucleotide sequence. To see them, scroll to the 'FEATURES' section of the record. The first CDS annotated along the sequence is that for NADH dehydrogenase subunit 1 (ND1), spanning bases 2784 to 3740. It is translated into the corresponding protein sequence by using the vertebrate mitochondrial genetic code (transl_table=2). The resulting protein is assigned the accession number ABE99438.1 and is represented as a record in the Protein database. An individual Protein record is generated for each translated CDS, so there are 13 individual Protein records in this example.
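The translation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline NCBI uses: the function names (build_table, extract_cds, translate_cds) and the toy sequence are hypothetical, and the sketch ignores details such as alternative start codons, incomplete stop codons, and CDS features on the complementary strand. The two amino acid strings are the standard code (transl_table=1) and the vertebrate mitochondrial code (transl_table=2), listed in the usual T, C, A, G codon order.

```python
from itertools import product

BASES = "TCAG"  # codon order used in NCBI genetic code listings

# Amino acids for the 64 codons TTT, TTC, TTA, TTG, TCT, ... GGG.
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERT_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"


def build_table(amino_acids):
    """Map each codon to its amino acid ('*' marks a stop codon)."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, amino_acids))


TABLES = {1: build_table(STANDARD), 2: build_table(VERT_MITO)}


def extract_cds(sequence, start, end):
    """Return bases start..end using 1-based, inclusive coordinates,
    as written in a FEATURES location such as 2784..3740."""
    return sequence[start - 1:end]


def translate_cds(cds, transl_table=1):
    """Translate a forward-strand CDS, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    table = TABLES[transl_table]
    protein = []
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        amino_acid = table.get(cds[i:i + 3], "X")  # X for ambiguous codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy sequence (not taken from DQ489526.1): ATG ATA TGA AGA TTT
toy = "ATGATATGAAGATTT"
print(translate_cds(toy, transl_table=2))  # MMW
print(translate_cds(toy, transl_table=1))  # MI
```

The toy sequence shows why the transl_table value matters: under table 2, ATA codes for methionine, TGA codes for tryptophan, and AGA is a stop codon, whereas under the standard code ATA is isoleucine and TGA is a stop. The ND1 location 2784 to 3740 covers 3740 - 2784 + 1 = 957 bases, a whole number of codons (319).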

A Nucleotide sequence can contain one or more CDS regions. Alternatively, it may represent only a partial CDS (example), resulting in a partial protein sequence (example). Note the < and/or > symbols in the feature locations of the two example records. These symbols designate partial 5' and/or 3' sequences.
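As a rough illustration of how the < and > symbols can be read from a feature location, the sketch below parses a simple location string. The function name parse_simple_location and the second example location are hypothetical, and the sketch deliberately handles only plain start..end locations, not join(...) or complement(...) forms.

```python
import re

# Matches locations such as 2784..3740, <1..450, or <1..>450.
SIMPLE_LOCATION = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")


def parse_simple_location(location):
    """Split a simple location into coordinates and partial-end flags."""
    match = SIMPLE_LOCATION.match(location.strip())
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    lt, start, gt, end = match.groups()
    return {
        "start": int(start),
        "end": int(end),
        "partial_start": lt == "<",  # feature begins before the first base shown
        "partial_end": gt == ">",    # feature continues past the last base shown
    }


print(parse_simple_location("2784..3740"))  # complete CDS, no partial flags
print(parse_simple_location("<1..>450"))    # partial at both ends
```

For a feature on the forward strand, a partial start corresponds to a partial 5' end and a partial end to a partial 3' end.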