GenBank is a flat-file database format for primary nucleotide sequence data and auxiliary information. A record holds its sequence in the ORIGIN field, whatever the nucleotide molecule type, and there is no limit on the length of the sequence. An entire chromosome of an organism of interest can be stored as a single GenBank record, so an individual record can take up a very large amount of disk space.
An example GenBank record
Regular expressions matching parts we care about
- The unique ACCESSION number
- The ORIGIN field
- Any amino acid /translation qualifier in the FEATURES table
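The three parts listed above could be matched along the following lines. This is only a minimal sketch in Python, not the patterns used by this wiki: the names `ACCESSION_RE`, `ORIGIN_RE`, `TRANSLATION_RE` and `parse_record` are hypothetical, and the toy record is made up for illustration. It assumes the whole record fits in memory as one string and that the record ends with the usual `//` terminator line.

```python
import re

# Assumption: each pattern is applied to the full text of ONE record.
# ACCESSION line: capture the first (primary) accession number.
ACCESSION_RE = re.compile(r"^ACCESSION\s+(\S+)", re.MULTILINE)
# ORIGIN field: everything between the ORIGIN line and the closing "//" line.
ORIGIN_RE = re.compile(r"^ORIGIN[^\n]*\n(.*?)^//", re.MULTILINE | re.DOTALL)
# /translation qualifier: quoted text, which may wrap over several lines.
TRANSLATION_RE = re.compile(r'/translation="([^"]+)"')


def parse_record(text):
    """Pull the accession, nucleotide sequence and translations from a record."""
    accession = ACCESSION_RE.search(text)
    origin = ORIGIN_RE.search(text)
    # Strip the position numbers and whitespace from the ORIGIN block.
    sequence = re.sub(r"[\s\d]+", "", origin.group(1)) if origin else ""
    # Join wrapped translation lines by removing all whitespace.
    translations = [re.sub(r"\s+", "", t) for t in TRANSLATION_RE.findall(text)]
    return {
        "accession": accession.group(1) if accession else None,
        "sequence": sequence,
        "translations": translations,
    }


# Made-up toy record, not a real GenBank entry.
SAMPLE = """\
LOCUS       TOY00001                  20 bp    DNA     linear   UNA 01-JAN-2000
ACCESSION   TOY00001
FEATURES             Location/Qualifiers
     CDS             1..18
                     /translation="MKV
                     LA"
ORIGIN
        1 atgaaagtgc tggcttaagg
//
"""

result = parse_record(SAMPLE)
print(result["accession"])     # TOY00001
print(result["sequence"])      # atgaaagtgctggcttaagg
print(result["translations"])  # ['MKVLA']
```

Because chromosome-sized records can be very large, reading a record into a single string as done here may not be practical; in that case the same patterns could be adapted to a line-by-line scan.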