Difference between revisions of "NCBI"
Revision as of 05:01, 5 May 2007
Genbank is a flat file database structure for primary nucleotide sequence information and auxiliary information. These records can display the ORIGIN information for different nucleotide molecular types, and have no limit on the length of sequence displayed. Entire chromosomes can be stored as a Genbank record for an organism of interest, potentially making the record very large on disk.
An example Genbank sample record
Regular expressions matching parts we care about
Genbank
- The unique ACCESSION number
- The ORIGIN field
- Any amino acid /translation field in the FEATURE table
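As a rough sketch of how the three parts listed above might be matched, the Python snippet below uses one regular expression per field. It assumes the usual flat-file layout: ACCESSION at the start of a line, the ORIGIN block running until a line that starts with `//`, and /translation values wrapped in double quotes. The record in `SAMPLE` is made up for illustration, and the function names are hypothetical.

```python
import re

# Made-up, heavily abbreviated record used only for illustration.
SAMPLE = """\
ACCESSION   X00001
FEATURES             Location/Qualifiers
     CDS             1..60
                     /translation="MKTAYIAKQR
                     QISFVKSHFS"
ORIGIN
        1 atgaaaaccg cctatatcgc caaacagcgc cagattagct ttgtgaaaag ccattttagc
//
"""

# ACCESSION keyword at the start of a line, followed by the identifier.
ACCESSION_RE = re.compile(r"^ACCESSION\s+(\S+)", re.MULTILINE)

# Everything between the ORIGIN line and the terminating "//" line.
ORIGIN_RE = re.compile(r"^ORIGIN[^\n]*\n(.*?)^//", re.MULTILINE | re.DOTALL)

# A quoted /translation qualifier, which may wrap over several lines.
TRANSLATION_RE = re.compile(r'/translation="([^"]+)"')


def accession(record):
    """Return the first ACCESSION identifier, or None if absent."""
    match = ACCESSION_RE.search(record)
    return match.group(1) if match else None


def origin_sequence(record):
    """Return the ORIGIN sequence with position numbers and spaces removed."""
    match = ORIGIN_RE.search(record)
    if match is None:
        return None
    return re.sub(r"[\s\d]", "", match.group(1))


def translations(record):
    """Return every /translation value with line-wrap whitespace removed."""
    return [re.sub(r"\s+", "", t) for t in TRANSLATION_RE.findall(record)]


if __name__ == "__main__":
    print(accession(SAMPLE))
    print(origin_sequence(SAMPLE))
    print(translations(SAMPLE))
```

Because an entire chromosome can sit in one record, reading the whole file into a single string may not be practical for large records; a line-by-line scan that uses the same patterns would be one alternative.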
FASTA
FASTA is a condensed file format used to manipulate primary sequence information. FASTA files can contain nucleotide or amino acid records. The first row of a record starts with a '>' and can contain any descriptive information about the record. It is recommended that all lines of text be shorter than 80 characters in length.
An example of an amino acid FASTA record follows. In this case the description information is in a concise format separated by pipe characters '|'. The second field is the GI number, a numeric sequence identifier (distinct from the ACCESSION) referring to the amino acid GENPEPT record.
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
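A FASTA record like the one above can be read with a very small parser. The sketch below is one possible approach, not a reference implementation: it treats any line beginning with '>' as a description line, joins the following lines into one sequence, and splits the description on the pipe character. The names `parse_fasta` and `split_description` are hypothetical, and the example record is shortened to its first and last sequence lines for brevity.

```python
import re

# A description line is any line that starts with '>'.
HEADER_RE = re.compile(r"^>(.*)$")

# The record from the text, shortened to two sequence lines for brevity.
EXAMPLE = """\
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
LAAVEAQQQMLKLTIWGVK
"""


def parse_fasta(text):
    """Return a list of (description, sequence) tuples found in text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = HEADER_RE.match(line)
        if match:
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = match.group(1), []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


def split_description(description):
    """Split a pipe-separated description line into its fields."""
    return description.split("|")


if __name__ == "__main__":
    for description, sequence in parse_fasta(EXAMPLE):
        print(split_description(description))
        print(len(sequence), "residues")
```

Splitting on '|' only makes sense for description lines that follow the pipe-separated convention; since the description may contain any text, other records may need to be treated as a single free-form string.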