Difference between revisions of "Introduction to Microarray analysis"

From Organic Design wiki

Latest revision as of 21:53, 11 November 2007



Overview of experimental process

File:Expt.png

(Courtesy Mik Black)
  • Competitive hybridization to spotted oligo/cDNA transcripts
  • Interested in genes that change between treatment conditions
→ differential expression versus equivalent expression

Statistical analysis process

File:Process.png

  • Raw data (GPR file format)
http://www.moleculardevices.com/pages/software/gn_gpr_format_history.html
  • Each GPR intensity file is typically >8 megabytes
  • Each TIFF image file is typically >30 megabytes
  • A microarray experiment consists of several to many slides
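
As a rough illustration of what reading one raw intensity file involves, here is a minimal sketch of a GPR reader. It assumes the tab-delimited ATF layout (an "ATF" line, a line giving the number of header records and data columns, the header records, then a column-title row followed by data). The function name read_gpr, the column names, and the sample values are hypothetical and only there to make the example runnable.

```python
import csv
import io


def read_gpr(handle):
    """Parse an ATF-style GPR stream into (header_dict, list_of_row_dicts).

    Assumed layout: line 1 starts with "ATF", line 2 holds the number of
    header records and the number of data columns, then the header
    records ("key=value"), then one row of column titles, then data rows.
    """
    reader = csv.reader(handle, delimiter="\t")
    first = next(reader)
    if not first or first[0].strip() != "ATF":
        raise ValueError("not an ATF/GPR file")
    n_header, n_cols = (int(x) for x in next(reader)[:2])
    header = {}
    for _ in range(n_header):
        record = next(reader)[0]
        key, sep, value = record.partition("=")
        header[key.strip()] = value.strip()
    columns = next(reader)
    if len(columns) != n_cols:
        raise ValueError("column count does not match the ATF header")
    rows = [dict(zip(columns, r)) for r in reader if r]
    return header, rows


# Tiny made-up sample (two spots, four columns), not real scanner output.
SAMPLE = (
    "ATF\t1.0\n"
    "2\t4\n"
    '"Type=GenePix Results 3"\n'
    '"Scanner=hypothetical"\n'
    '"Block"\t"ID"\t"F635 Median"\t"F532 Median"\n'
    '1\t"171121_390_49"\t5200\t160\n'
    '1\t"20264_118_53"\t2100\t100\n'
)
header, rows = read_gpr(io.StringIO(SAMPLE))
```

Values are returned as strings; a real analysis would convert the intensity columns to numbers and read one such file per slide.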

Statistical issues

  • Historically, statistical methods were developed for n >> p
n observations, p variables
  • Gene expression data have n << p
Thousands of measured genes (p)
Small number of biological replicate slides (n)
  • Gene expression data can be highly correlated
Groups of genes are regulated in the same way
  • Data are not normally distributed
Log transform the highly skewed intensity data

File:Graph channels.png
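
The log transform mentioned above can be sketched for a single spot as follows. This assumes the common two-colour convention in which M is the log2 ratio of the two channels and A is their average log2 intensity; the M and A columns in the gene list further down are presumably defined this way, but the page does not say so. The function name ma_values and the example intensities are hypothetical.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from two positive channel intensities.

    M = log2(red / green)                   (log ratio)
    A = 0.5 * (log2(red) + log2(green))     (average log intensity)
    """
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive before taking logs")
    m = math.log2(red / green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a


# Hypothetical intensities: a fourfold difference gives M = 2.0, A = 11.0
m, a = ma_values(4096.0, 1024.0)
print(m, a)
```

On the log2 scale a fourfold increase and a fourfold decrease are symmetric (M = 2 and M = -2), which is one reason the transformed data are easier to work with than raw intensities.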


Analysis wish list

  • Ideally would like unambiguous interpretation of results
  • Large amounts of data to analyse can be overwhelming and make interpretation subjective
  • Independent reproducibility of results by another colleague
→ Keep a record (log file) of what was done
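
One possible way to keep such a record is to have the analysis script write every step to a log file. The sketch below uses Python's standard logging module. The logger name, the file name, and the logged messages are hypothetical placeholders, not part of any particular analysis.

```python
import logging
import os
import tempfile


def make_analysis_logger(path):
    """Return (logger, handler) that append timestamped records to `path`."""
    logger = logging.getLogger("microarray_analysis")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path, mode="a", encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger, handler


# Hypothetical analysis steps, written to a temporary directory for the demo.
log_path = os.path.join(tempfile.mkdtemp(), "analysis.log")
logger, handler = make_analysis_logger(log_path)
logger.info("read %d slides", 6)
logger.info("normalisation=%s", "hypothetical-method")

# Detach and close the handler so the file is complete on disk.
logger.removeHandler(handler)
handler.close()

with open(log_path, encoding="utf-8") as fh:
    log_lines = fh.read().splitlines()
```

Because the file is opened in append mode, repeated runs accumulate in one record that a colleague can read to reproduce the analysis.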

Analysis aim

  • Obtain a list of genes that we think are differentially expressed
      Block Row Column            ID   Name         M        A         t      P.Value        B
10396    20  15     23 171121_390_49 171121  5.035364 13.25087  49.62425 3.220044e-05 11.27486
4517      9  13      9  20264_118_53  20264  4.396719 11.11976  47.06004 3.220044e-05 11.05671
16881    32  21     22 165415_634_53 165415  4.645384 12.65872  43.40359 3.220044e-05 10.70650
16086    31  10      9 185903_436_49 185903  5.146504 11.36911  42.75724 3.220044e-05 10.63926
6508     13   7     22 197386_457_55 197386  4.621024 13.20426  42.09902 3.220044e-05 10.56899
5471     11   8     20 142178_355_53 142178  4.795734 12.07427  41.23346 3.220044e-05 10.47374
8395     16  20     23   251706_1_53 251706 -5.003475 13.04571 -38.61325 3.220044e-05 10.16421
4330      9   5      6 297409_340_47 297409  4.421922 12.27208  38.52215 3.220044e-05 10.15284
12479    24  14     13 163360_396_47 163360  4.367943 11.10478  38.21662 3.220044e-05 10.11439
15024    29  10      5 149243_674_53 149243  4.372419 11.36572  37.86362 3.220044e-05 10.06935
  • It is easier to rank genes in order of evidence of differential expression than to select a specific cutoff
  • If we do select a cutoff, a False Discovery Rate (FDR) cutoff is usually used
The FDR threshold is the expected proportion of genes in the list that are false discoveries
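
To make the ranking and the FDR cutoff concrete, here is a minimal sketch using the Benjamini-Hochberg adjustment. This is one widely used way to control the FDR; the page does not say which method produced the list above, so treat the choice as an assumption. The p-values and the 0.05 cutoff are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Gene indices ordered from smallest to largest raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and taking a
    # running minimum so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
p = [0.001, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(p)

# Rank genes by evidence, then keep those below an assumed FDR cutoff of 0.05.
ranked = sorted(range(len(p)), key=lambda i: adj[i])
selected = [i for i in ranked if adj[i] <= 0.05]
```

With these made-up values only the first gene passes the 0.05 cutoff, although two genes have raw p-values below 0.05. The ranking is available whether or not a cutoff is applied, which is the point made in the first bullet above.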