Introduction to Microarray analysis
From Organic Design wiki
Overview of experimental process
- (Courtesy Mik Black)
- Competitive hybridization of labelled samples to spotted oligo/cDNA probes
- Interested in genes whose expression changes between treatment conditions
- → differential expression versus equivalent expression
Statistical analysis process
- Raw data (GPR file format)
- Each GPR intensity file is typically >8 megabytes
- Each TIFF image file is typically >30 megabytes
- A microarray experiment consists of several to many slides
Statistical issues
- Classical statistical methods were developed for n >> p
- n observations, p variables
- Gene expression data have n << p
- Thousands of measured genes (p)
- Small number of biological replicate slides (n)
- Gene expression data can be highly correlated
- groups of genes are regulated in the same way
- Data are not normally distributed
- Log-transform the highly skewed intensity data
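The log transform can be sketched as follows. This is a minimal illustration, not the pipeline used for this experiment. It assumes the usual two-colour convention, in which M is the log2 ratio of the red and green intensities and A is their average log2 intensity. The function name `ma_values` and the intensities are hypothetical.

```python
import numpy as np

def ma_values(red, green):
    """Return (M, A) for positive two-channel spot intensities.

    Assumed convention: M = log2(R/G), A = 0.5 * log2(R * G).
    """
    log_r = np.log2(np.asarray(red, dtype=float))
    log_g = np.log2(np.asarray(green, dtype=float))
    m = log_r - log_g            # log-ratio between the two channels
    a = 0.5 * (log_r + log_g)    # average log-intensity of the spot
    return m, a

# Hypothetical intensities for three spots (not taken from any real slide)
red = [1024.0, 4096.0, 256.0]
green = [1024.0, 1024.0, 1024.0]
m, a = ma_values(red, green)
print(m)
print(a)
```

On the log2 scale a four-fold increase and a four-fold decrease sit symmetrically at M = 2 and M = -2, which is one reason the transformed data are easier to work with than raw ratios.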
Analysis wish list
- Ideally would like unambiguous interpretation of results
- Large amounts of data to analyse can be overwhelming and make interpretation subjective
- Results should be independently reproducible by a colleague
- → Keep a record (log file) of what was done
Analysis aim
- Obtain a list of genes which we think are differentially expressed
      Block Row Column            ID   Name         M        A         t      P.Value        B
10396    20  15     23 171121_390_49 171121  5.035364 13.25087  49.62425 3.220044e-05 11.27486
 4517     9  13      9  20264_118_53  20264  4.396719 11.11976  47.06004 3.220044e-05 11.05671
16881    32  21     22 165415_634_53 165415  4.645384 12.65872  43.40359 3.220044e-05 10.70650
16086    31  10      9 185903_436_49 185903  5.146504 11.36911  42.75724 3.220044e-05 10.63926
 6508    13   7     22 197386_457_55 197386  4.621024 13.20426  42.09902 3.220044e-05 10.56899
 5471    11   8     20 142178_355_53 142178  4.795734 12.07427  41.23346 3.220044e-05 10.47374
 8395    16  20     23   251706_1_53 251706 -5.003475 13.04571 -38.61325 3.220044e-05 10.16421
 4330     9   5      6 297409_340_47 297409  4.421922 12.27208  38.52215 3.220044e-05 10.15284
12479    24  14     13 163360_396_47 163360  4.367943 11.10478  38.21662 3.220044e-05 10.11439
15024    29  10      5 149243_674_53 149243  4.372419 11.36572  37.86362 3.220044e-05 10.06935
- Easier to rank genes in order of evidence of differential expression than it is to select a specific cutoff
- If we do select a cutoff, a False Discovery Rate (FDR) cutoff is usually used
- The FDR threshold is the expected proportion of genes in the list that are false positives, i.e. wrongly declared differentially expressed
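One common way to apply an FDR cutoff is the Benjamini-Hochberg adjustment, sketched below. The page does not say which FDR procedure was used, so the choice of Benjamini-Hochberg is an assumption. The function name `bh_adjust`, the p-values and the 0.05 threshold are illustrative only.

```python
import numpy as np

def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)                          # indices of p, smallest first
    scaled = p[order] * n / np.arange(1, n + 1)    # p_(i) * n / i
    # Make the adjusted values non-decreasing in rank, working from the largest p down
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)      # cap at 1 and restore input order
    return adjusted

# Hypothetical raw p-values for five genes (not from the table above)
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = bh_adjust(p_values)
selected = adjusted <= 0.05                        # illustrative FDR cutoff of 5%
print(adjusted)
print(selected)
```

Genes whose adjusted p-value falls at or below the chosen cutoff form the list. With a 5% cutoff, about 5% of the genes in that list are expected to be false positives.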