NCBI
NCBI is the National Center for Biotechnology Information. Several sequence file formats have been developed for submitting data to the public databases housed at NCBI, which are maintained in collaboration with EMBL and DDBJ.
GenBank format
GenBank is a publicly accessible flat-file database of primary nucleotide sequences and auxiliary information. The volume of publicly submitted data has been growing exponentially; in August 2005 this DNA sequence database reached 100 gigabases. The record structure is reasonably concise: each field is a key-value pair, and the form of the value depends on the key. For example, the ORIGIN field can hold the sequence of any nucleotide molecule type and places no limit on the length of the sequence displayed. An entire chromosome can be stored as a single GenBank record for an organism of interest, so an individual record can take up a great deal of disk space. One key, the FEATURES table, holds information about genes and gene products and is extremely flexible, because it is made up of optional features, each of which has its own optional /key=value qualifiers nested within it.
An example Genbank sample record
Content of interest includes all the linkout associations in the examples above:
- The unique ACCESSION number
- Any amino acid /translation qualifier in the FEATURES table
- The ORIGIN field
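The three items listed above can be pulled out of a flat-file record with a few regular expressions. The following is a minimal sketch, not a full GenBank parser: the function name `parse_genbank` and the `TOY_RECORD` text are made up for illustration (the record is not a real GenBank entry), and the code assumes one record per string, terminated by `//`.

```python
import re

# Toy record for illustration only; it is NOT a real GenBank entry.
TOY_RECORD = """\
LOCUS       TOY0001                   12 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Made-up record used only to exercise the sketch below.
ACCESSION   TOY0001
FEATURES             Location/Qualifiers
     CDS             1..12
                     /product="hypothetical protein"
                     /translation="MKLV"
ORIGIN
        1 atgaaactgg tt
//
"""


def parse_genbank(text):
    """Return the accession, /translation values and ORIGIN sequence of one record."""
    # ACCESSION is a top-level key at the start of a line.
    match = re.search(r"^ACCESSION\s+(\S+)", text, re.MULTILINE)
    accession = match.group(1) if match else None

    # A /translation value may wrap over several lines, so strip all whitespace.
    translations = [
        re.sub(r"\s+", "", value)
        for value in re.findall(r'/translation="([^"]*)"', text)
    ]

    # The ORIGIN block runs until the '//' terminator; drop line numbers and spaces.
    match = re.search(r"^ORIGIN.*?\n(.*?)^//", text, re.MULTILINE | re.DOTALL)
    origin = re.sub(r"[\s\d]", "", match.group(1)) if match else ""

    return {"accession": accession, "translations": translations, "origin": origin}


record = parse_genbank(TOY_RECORD)
print(record["accession"], record["translations"], record["origin"])
```

For production use, an established library parser is likely a safer choice than regular expressions, since real records contain many more keys and edge cases than this sketch handles.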
See also
FASTA
FASTA is a condensed file format used to manipulate primary sequence information. FASTA files can hold nucleotide or amino acid records. The first line of each record starts with a > and can contain any descriptive information about the record; the sequence follows on the subsequent lines. It is recommended that all lines of text be shorter than 80 characters.
Below is an example of an amino acid FASTA record. In this case the description line uses a concise format with fields separated by pipe characters (|). The first field, gi, stands for GenBank identifier, a unique identification number. The second field is the GenBank identifier number itself, referring to the amino acid GenPept record.
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
Note: This FASTA record should really be rendered in a monospaced font, with lines usually limited to 80 characters before a newline.
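The layout described above (a `>` description line followed by wrapped sequence lines) is simple enough to read and write by hand. The following is a minimal sketch under that assumption; the names `read_fasta` and `write_fasta` are illustrative, and the example text is only the first two sequence lines of the record shown above, not the full record.

```python
def read_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


def write_fasta(description, sequence, width=60):
    """Format one record, wrapping the sequence at `width` characters per line."""
    lines = [">" + description]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


# Truncated excerpt of the amino acid record above, used only as test input.
EXAMPLE = """\
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
"""

for description, sequence in read_fasta(EXAMPLE):
    print(description, len(sequence))
    print(write_fasta(description, sequence, width=70), end="")
```

Keeping `width` below 80 follows the line-length recommendation; only the sequence lines are wrapped, so a long description line can still exceed it.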
In the actual FASTA record for U49845, the third field is gb, which indicates a GenBank accession, a unique accession number. The fourth field is the GenBank accession number itself, referring to the nucleotide GenBank record.
- DNA example
>gi|33650851|gb|C81881.1|C81881 C81881 Citrus unshiu juice sac and pulp segment maturation stage Citrus unshiu cDNA clone pcMFrM02.04-023 5', mRNA sequence
AGGTCAAGATTTGAGGAGAAAATGGGTTCTAGAGTAACACTTAGGACCAAAGGCAAGGGCGTGAAGGGAG
CAAAGGCATCAGAGGAGAAATCAATGGTCGATTCTTTCAAAGAGTGGAGCACTTGGACCATGAAAAAGGC
TAAAGTGGTCACTCACTATGGATTTATTCCTCTTATCATCATTTATCGGCATGAATTCTGATCCCAAGCC
CCAAGTCTATCAGCTCCTCAGCCCCGTTTGATCTCCATACTTGACTCTTCCTTTTCTTTTGATGTCAAAC
AAAATAGTTATTATCATGCTGTGCCTTCCTATTTGTCGAATCTACATGAATTGAATGTTTTAGGAGTTTT
GGTTTCTTGTGATCGTACTTCCTGCCTAGTTGTAAGCTTATGGATTGACGTAGTATAAAATGTCTGGAAT
TTGAATTATATACCGTCTCCATTGAATTGGAGGCNTCTTTTTCTTTTGGTGAATTTGTTTGTATTTTTTT
TCCTTTTAGTTTCTTTGTTTACCATAGATCAATATTATAAGCATATTTTAATATT
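The pipe-separated description lines in the two examples above can be split into named parts. The following is a minimal sketch that assumes exactly the five-field layout seen in those examples (gi, number, database tag, accession, name) followed by a free-text title; the function name `parse_description` and the dictionary keys are illustrative choices, and other description-line layouts are deliberately rejected rather than guessed at.

```python
def parse_description(line):
    """Split a 'gi|number|db|accession|name title' description line into its parts.

    Assumes the five pipe-separated fields seen in the examples above;
    any other layout raises ValueError.
    """
    # Everything up to the first space is the identifier block; the rest is free text.
    identifiers, _, title = line.lstrip(">").partition(" ")
    fields = identifiers.split("|")
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError(f"unexpected description line: {line!r}")
    _, gi_number, database, accession, name = fields
    return {
        "gi": gi_number,
        "database": database,
        "accession": accession,
        "name": name,
        "title": title,
    }


protein = parse_description(">gi|532319|pir|TVFV2E|TVFV2E envelope protein")
print(protein["database"], protein["accession"])  # pir TVFV2E

dna = parse_description(
    ">gi|33650851|gb|C81881.1|C81881 C81881 Citrus unshiu juice sac and pulp "
    "segment maturation stage Citrus unshiu cDNA clone pcMFrM02.04-023 5', "
    "mRNA sequence"
)
print(dna["database"], dna["accession"])  # gb C81881.1
```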
See also
- NCBI GenBank Overview
- Template:extension (template code constructor)
- Extension:Example (working example)
- Wikipedia FASTA
- Wikipedia Genbank