Difference between revisions of "Introduction to Microarray analysis"
From Organic Design wiki
m |
|||
(27 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
+ | {{#security:edit|Sven}} | ||
+ | {{#security:*|Sven}} | ||
+ | [[Category:Sven/Rosaceae]] | ||
__NOTOC__ | __NOTOC__ | ||
− | + | ||
− | [[Image: | + | = Overview of experimental process = |
+ | [[Image:expt.png]] | ||
+ | :<font color="blue">(''Courtesy Mik Black'')</font> | ||
*Competitive hybridization to spotted oligo/cDNA transcripts | *Competitive hybridization to spotted oligo/cDNA transcripts | ||
− | *Interested in genes that change between | + | *Interested in genes that change between treatment conditions |
:<font color="blue">→ ''differential expression versus equivalent expression''</font> | :<font color="blue">→ ''differential expression versus equivalent expression''</font> | ||
---- | ---- | ||
− | + | ||
− | [[Image: | + | = Statistical analysis process = |
+ | [[Image:process.png]] | ||
* Raw data (''GPR file format'') | * Raw data (''GPR file format'') | ||
:''http://www.moleculardevices.com/pages/software/gn_gpr_format_history.html'' | :''http://www.moleculardevices.com/pages/software/gn_gpr_format_history.html'' | ||
− | * Each GPR intensity file is typically >8 megabytes | + | * Each GPR intensity file is typically >8 megabytes |
* Each TIFF image file is typically >30 megabytes | * Each TIFF image file is typically >30 megabytes | ||
* A microarray experiment consists of several → many slides | * A microarray experiment consists of several → many slides | ||
---- | ---- | ||
− | + | = Statistical issues = | |
*In the past statistics was developed for n >>p | *In the past statistics was developed for n >>p | ||
:<font color="blue">''n observations, p variables''</font> | :<font color="blue">''n observations, p variables''</font> | ||
Line 28: | Line 34: | ||
*Data not normally distributed | *Data not normally distributed | ||
:<font color="blue">''log transform highly skewed intensity data''</font> | :<font color="blue">''log transform highly skewed intensity data''</font> | ||
− | [[Image:Graph channels. | + | [[Image:Graph channels.png]] |
---- | ---- | ||
− | + | = Analysis wish list = | |
* Ideally would like unambiguous interpretation of results | * Ideally would like unambiguous interpretation of results | ||
* Large amounts of data to analyse can be overwhelming and make interpretation subjective | * Large amounts of data to analyse can be overwhelming and make interpretation subjective | ||
* Independent reproducibility of results by another collegue | * Independent reproducibility of results by another collegue | ||
− | :<font color="blue">→''Keep a record (''log'') of what was done'' </font> | + | :<font color="blue">→''Keep a record (''log file'') of what was done'' </font> |
---- | ---- | ||
− | + | = Analysis aim = | |
− | * Easier to rank genes in order of evidence of differential expression than it is to select a specific cutoff | + | * Obtain a list of genes which we think are differentially expressing |
+ | Block Row Column ID Name M A t P.Value B | ||
+ | 10396 20 15 23 [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?submit=y&db=Nucleotide&term=CN882776 171121_390_49] 171121 5.035364 13.25087 49.62425 3.220044e-05 11.27486 | ||
+ | 4517 9 13 9 20264_118_53 20264 4.396719 11.11976 47.06004 3.220044e-05 11.05671 | ||
+ | 16881 32 21 22 165415_634_53 165415 4.645384 12.65872 43.40359 3.220044e-05 10.70650 | ||
+ | 16086 31 10 9 185903_436_49 185903 5.146504 11.36911 42.75724 3.220044e-05 10.63926 | ||
+ | 6508 13 7 22 197386_457_55 197386 4.621024 13.20426 42.09902 3.220044e-05 10.56899 | ||
+ | 5471 11 8 20 142178_355_53 142178 4.795734 12.07427 41.23346 3.220044e-05 10.47374 | ||
+ | 8395 16 20 23 251706_1_53 251706 -5.003475 13.04571 -38.61325 3.220044e-05 10.16421 | ||
+ | 4330 9 5 6 297409_340_47 297409 4.421922 12.27208 38.52215 3.220044e-05 10.15284 | ||
+ | 12479 24 14 13 163360_396_47 163360 4.367943 11.10478 38.21662 3.220044e-05 10.11439 | ||
+ | 15024 29 10 5 149243_674_53 149243 4.372419 11.36572 37.86362 3.220044e-05 10.06935 | ||
+ | * <font color="blue">Easier to rank genes in order of evidence of differential expression than it is to select a specific cutoff</font> | ||
+ | |||
*If we do select a cutoff, False Discovery Rate (FDR) cutoff is usually used | *If we do select a cutoff, False Discovery Rate (FDR) cutoff is usually used | ||
− | + | :<font color="blue">''FDR threhold is the expected proportion of genes in a list that are likely to be incorrect''</font> | |
− | |||
---- | ---- | ||
− | [[Category: | + | |
+ | [[Category:Microarray]] |
Latest revision as of 21:53, 11 November 2007
Overview of experimental process
- (Courtesy Mik Black)
- Competitive hybridization to spotted oligo/cDNA transcripts
- Interested in genes that change between treatment conditions
- → differential expression versus equivalent expression
Statistical analysis process
- Raw data (GPR file format)
- Each GPR intensity file is typically >8 megabytes
- Each TIFF image file is typically >30 megabytes
- A microarray experiment consists of several → many slides
Statistical issues
- Classical statistical methods were developed for n >> p
- n observations, p variables
- Gene expression data have n << p
- Thousands of measured genes (p)
- Small number of biological replicate slides (n)
- Gene expression data can be highly correlated
- groups of genes are regulated in the same way
- Data not normally distributed
- log transform highly skewed intensity data
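The log transform mentioned above can be sketched as follows. This is a minimal illustration, assuming the common two-colour convention in which M is the log2 ratio of the red and green channel intensities and A is their average log2 intensity; the function name `ma_values` and the spot intensities are hypothetical and not taken from the original analysis.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from its red and green channel intensities.

    Assumed convention (not stated in the source):
      M = log2(red) - log2(green)        log-ratio between the two channels
      A = (log2(red) + log2(green)) / 2  average log-intensity of the spot
    """
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive before taking logs")
    log_r = math.log2(red)
    log_g = math.log2(green)
    return log_r - log_g, (log_r + log_g) / 2.0


# Hypothetical spot intensities, not read from a real GPR file.
spots = [(8192.0, 256.0), (1024.0, 1024.0), (128.0, 2048.0)]
for red, green in spots:
    m, a = ma_values(red, green)
    print(f"R={red:7.0f}  G={green:7.0f}  M={m:+.2f}  A={a:.2f}")
```

On the log scale a 32-fold increase and a 32-fold decrease sit symmetrically around zero (M = +5 and M = -5), which is one reason the transformed data are easier to model than the raw, highly skewed intensities.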
Analysis wish list
- Ideally would like unambiguous interpretation of results
- Large amounts of data to analyse can be overwhelming and make interpretation subjective
- Independent reproducibility of results by another colleague
- →Keep a record (log file) of what was done
Analysis aim
- Obtain a list of genes which we think are differentially expressed
       Block  Row  Column  ID             Name     M         A          t         P.Value       B
10396  20     15   23      171121_390_49  171121   5.035364  13.25087   49.62425  3.220044e-05  11.27486
4517   9      13   9       20264_118_53   20264    4.396719  11.11976   47.06004  3.220044e-05  11.05671
16881  32     21   22      165415_634_53  165415   4.645384  12.65872   43.40359  3.220044e-05  10.70650
16086  31     10   9       185903_436_49  185903   5.146504  11.36911   42.75724  3.220044e-05  10.63926
6508   13     7    22      197386_457_55  197386   4.621024  13.20426   42.09902  3.220044e-05  10.56899
5471   11     8    20      142178_355_53  142178   4.795734  12.07427   41.23346  3.220044e-05  10.47374
8395   16     20   23      251706_1_53    251706  -5.003475  13.04571  -38.61325  3.220044e-05  10.16421
4330   9      5    6       297409_340_47  297409   4.421922  12.27208   38.52215  3.220044e-05  10.15284
12479  24     14   13      163360_396_47  163360   4.367943  11.10478   38.21662  3.220044e-05  10.11439
15024  29     10   5       149243_674_53  149243   4.372419  11.36572   37.86362  3.220044e-05  10.06935
- Easier to rank genes in order of evidence of differential expression than it is to select a specific cutoff
- If we do select a cutoff, a False Discovery Rate (FDR) cutoff is usually used
- The FDR threshold is the expected proportion of genes in the list that are false positives
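As an illustration of ranking genes and then applying an FDR cutoff, the sketch below uses the Benjamini-Hochberg adjustment. The source does not say which FDR procedure was used, so Benjamini-Hochberg is an assumption here, and the function names, gene names and p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def select_by_fdr(names, pvalues, fdr=0.05):
    """Rank genes by raw p-value and keep those with adjusted p-value <= fdr."""
    adjusted = bh_adjust(pvalues)
    ranked = sorted(zip(names, pvalues, adjusted), key=lambda row: row[1])
    return [row for row in ranked if row[2] <= fdr]


# Hypothetical genes and p-values, for illustration only.
names = ["gene_a", "gene_b", "gene_c", "gene_d"]
pvalues = [0.01, 0.04, 0.03, 0.20]
for name, p, p_adj in select_by_fdr(names, pvalues, fdr=0.10):
    print(f"{name}  p={p:.3f}  adjusted={p_adj:.4f}")
```

The ranking step does not depend on the cutoff: changing `fdr` only changes how far down the ranked list the selection extends, which is why ranking is the easier of the two tasks.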