Difference between revisions of "NCBI"

From Organic Design wiki

Revision as of 10:40, 6 May 2007

Genbank is a publicly accessible flat-file database of primary nucleotide sequence information and auxiliary information. The amount of publicly submitted information has been growing exponentially; in August 2005 this DNA sequence database reached 100 gigabases. Records can display the ORIGIN information for different nucleotide molecule types, and there is no limit on the length of sequence displayed. Entire chromosomes can be stored as a single Genbank record for an organism of interest, which can make the record very large on disk.

An example Genbank sample record (U49845 actual Genbank record)

Regular expressions matching parts we care about

Genbank

FASTA

A condensed file format called FASTA is used to manipulate primary sequence information. FASTA files can contain nucleotide or amino acid records. The first row of a record starts with a > and can contain any descriptive information about the record. It is recommended that all lines of text be shorter than 80 characters in length.

An example of an amino acid FASTA record. In this case the description information is in a concise format separated by pipe characters |. The first field gi refers to GENBANK IDENTIFIER, a unique identification number. The second field is the GENBANK IDENTIFIER number referring to the amino acid GENPEPT record.

>gi|[http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=532319 532319]|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
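
The record layout described above (a > description row followed by wrapped sequence lines) can be read with a few lines of Python. This is a minimal sketch, not an NCBI tool: the names `parse_fasta`, `long_lines` and `EXAMPLE` are made up for illustration, and the description row is written here as plain text without the wiki link markup.

```python
def parse_fasta(text):
    """Split FASTA text into (description, sequence) pairs."""
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new description row closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:]
            chunks = []
        elif description is not None:
            chunks.append(line)  # sequence rows are joined without newlines
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


def long_lines(text, limit=80):
    """Return 1-based numbers of lines that are not shorter than `limit`."""
    return [n for n, line in enumerate(text.splitlines(), start=1)
            if len(line) >= limit]


# The amino acid record shown above, with a plain-text description row.
EXAMPLE = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
"""

if __name__ == "__main__":
    for description, sequence in parse_fasta(EXAMPLE):
        print(description, len(sequence))
    print("lines at or over 80 characters:", long_lines(EXAMPLE))
```

Joining the sequence rows into one string keeps the line wrapping a display detail only, so the same sequence can be re-wrapped to any width under 80 characters when it is written back out.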

In the U49845 actual FASTA record the third field is gb, which refers to the GENBANK ACCESSION, a unique accession number. The fourth field is the GENBANK ACCESSION number referring to the nucleotide GENBANK record.
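
In the spirit of the regular expressions this page sets out to collect, the pipe-separated description row can be picked apart with one pattern. The sketch below assumes the four-field layout described above (gi, its number, a database key such as gb or pir, and that database's identifier) followed by free text. `HEADER_RE` and `parse_header` are hypothetical names, and the U49845 description row in the usage lines is written out for illustration rather than copied from this page.

```python
import re

# Assumed layout: >gi|<number>|<database key>|<identifier>|<free text>
HEADER_RE = re.compile(
    r"^>gi\|(?P<gi>\d+)"          # fields 1-2: gi key and GENBANK IDENTIFIER
    r"\|(?P<db>[a-z]+)"           # field 3: database key, e.g. gb or pir
    r"\|(?P<accession>[^|]*)"     # field 4: identifier within that database
    r"\|(?P<description>.*)$"     # remainder: free-text description
)


def parse_header(line):
    """Return the named fields of a description row, or None if it does not match."""
    match = HEADER_RE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Illustrative description rows; the second is an assumption about U49845.
    print(parse_header(">gi|532319|pir|TVFV2E|TVFV2E envelope protein"))
    print(parse_header(">gi|1293613|gb|U49845.1|SCU49845 Saccharomyces cerevisiae TCP1-beta gene"))
```

Description rows that do not begin with gi| return None, so free-form rows, which the format also allows, have to be handled separately.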

See also