Nucleotide BLAST (blastn) can help you detect possible poor-quality data at the ends of a sequence. This article provides steps for checking sequences from protein-coding genes, including a step that uses the CDS feature display on the BLAST search results page. For setup details, see the article on blastn and CDS feature setup. If you are checking non-coding sequences, skip the step on displaying the CDS feature.
To check for poor-quality data or other errors at the ends of a sequence:
- Perform a blastn search with your sequence.
- On the BLAST search results page, display pairwise alignments by choosing the Alignments tab.
- Display the CDS feature by checking the CDS feature box.
- Select alignments in which the Subject is longer than the Query on both ends.
- To learn how to make this selection, see the article on interpreting pairwise alignments.
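If you also have the alignment coordinates available outside the web page, the selection in the last step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hit` record and both function names are hypothetical, coordinates are 1-based, and "longer than Query on both ends" is interpreted as the Subject having at least as many bases beyond the alignment as the Query has unaligned bases on that side.

```python
from typing import NamedTuple, Tuple


class Hit(NamedTuple):
    """Hypothetical record of one pairwise alignment (1-based coordinates)."""
    qstart: int  # first aligned Query position
    qend: int    # last aligned Query position
    qlen: int    # full Query length
    sstart: int  # Subject position aligned to qstart
    send: int    # Subject position aligned to qend
    slen: int    # full Subject length


def subject_overhangs(hit: Hit) -> Tuple[int, int]:
    """Subject bases lying beyond the alignment on the Query's 5' and 3' sides."""
    if hit.sstart <= hit.send:
        # Plus-strand Subject: coordinates increase along the Query.
        return hit.sstart - 1, hit.slen - hit.send
    # Minus-strand Subject: coordinates decrease along the Query.
    return hit.slen - hit.sstart, hit.send - 1


def subject_covers_both_ends(hit: Hit) -> bool:
    """True if the Subject reaches at least as far as the Query on both ends."""
    left, right = subject_overhangs(hit)
    unaligned_5prime = hit.qstart - 1
    unaligned_3prime = hit.qlen - hit.qend
    return left >= unaligned_5prime and right >= unaligned_3prime


# Start positions follow Figure 1; end positions and lengths are made up.
example = Hit(qstart=24, qend=500, qlen=500, sstart=4840, send=5316, slen=9000)
print(subject_overhangs(example))         # (4839, 3684)
print(subject_covers_both_ends(example))  # True
```

Alignments that pass this check are the useful ones for judging the Query ends, because any unaligned Query bases cannot be explained by the Subject simply running out.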
The Query ends may contain poor-quality data or other errors if:
- They don’t align to Subject.
- They have more mismatches than the rest of the sequence.
- They contain missing or extra nucleotides (deletions or insertions).
- Extra or missing bases introduce frameshifts in the CDS. (See the article on frameshifts for details.)
- The same mismatches, or the same missing and extra bases, recur in alignments to several different Subject sequences.
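The mismatch and insertion/deletion criteria above can also be checked roughly in code when you have the two aligned strings of a pairwise alignment. The sketch below is an assumption-laden illustration, not a BLAST feature: the function names and the 20-column window are hypothetical choices, and gaps are assumed to be written as `-`.

```python
from typing import Dict, Tuple


def count_differences(aligned_query: str, aligned_subject: str) -> Tuple[int, int]:
    """Return (mismatches, gap columns) for two equal-length aligned strings."""
    mismatches = gaps = 0
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" or s == "-":
            gaps += 1          # missing or extra base in the Query
        elif q.upper() != s.upper():
            mismatches += 1
    return mismatches, gaps


def summarize_ends(aligned_query: str, aligned_subject: str,
                   window: int = 20) -> Dict[str, Tuple[int, int]]:
    """Compare differences in the first and last `window` columns with the middle."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned strings must have the same length")
    n = len(aligned_query)
    window = min(window, n // 2)  # keep the two end windows from overlapping
    regions = {
        "5_prime": slice(0, window),
        "middle": slice(window, n - window),
        "3_prime": slice(n - window, n),
    }
    return {
        name: count_differences(aligned_query[region], aligned_subject[region])
        for name, region in regions.items()
    }


# Made-up alignment with two mismatches near the 5' end and none elsewhere.
query = "ACGTTGCA" + "ACGT" * 10
subject = "AGGTAGCA" + "ACGT" * 10
print(summarize_ends(query, subject, window=8))
```

An end window with clearly more mismatches or gap columns than the middle is a hint to look at the sequencing reads for that end, and the hint is stronger when the same pattern appears against several Subjects.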
To remedy the problem, check your sequencing reads. Trim the sequence ends if you aren’t confident that the reads are correct.
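Once you have decided how many bases to remove, trimming itself is simple. The following is a minimal sketch; the function name `trim_ends` and its parameters are hypothetical, and the number of bases to trim remains your own judgment based on the reads.

```python
def trim_ends(sequence: str, trim_5prime: int = 0, trim_3prime: int = 0) -> str:
    """Remove the given number of bases from the 5' and 3' ends of a sequence."""
    if trim_5prime < 0 or trim_3prime < 0:
        raise ValueError("trim lengths must not be negative")
    if trim_5prime + trim_3prime > len(sequence):
        raise ValueError("cannot trim more bases than the sequence contains")
    return sequence[trim_5prime:len(sequence) - trim_3prime]


# Made-up sequence: drop 3 bases at the 5' end and 2 at the 3' end.
print(trim_ends("NNNACGTACGTNN", trim_5prime=3, trim_3prime=2))  # ACGTACGT
```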
Figure 1 illustrates a sequence with possible sequencing errors at its 5' end.
Figure 1: Query containing poor-quality sequence at its 5' end. The Query aligns starting at position 24 (red rectangle) to a Subject at position 4840 (orange rectangle). The first 23 bases of the Query remain unaligned, even though the Subject extends 4839 bases past the 5' end of the alignment. There are several mismatches at the 5' end of the alignment (blue rectangle), but none among the rest of the aligned bases. Together, the unaligned bases and the mismatches indicate possible sequencing errors at the 5' end of the Query sequence.