Rosaceae

From Organic Design wiki

Microarray analysis workshop

Time schedule: 8:30 - 10:30am, 11-12:30am (3.5 hours)


Workflow

  1. Introduction to Microarray analysis (15 minutes)
  2. Normalization (15 minutes)
  3. BioConductor/R framework (15 minutes)
  4. R tutorial (60 minutes)
  5. Linear models for Microarray analysis (15 minutes)
  6. BioConductor analysis tutorial ("60 minutes")

{{#security:edit|Sven}} {{#security:*|Sven}}


Overview of experimental process

File:Expt.png

(Courtesy Mik Black)
  • Competitive hybridization to spotted oligo/cDNA transcripts
  • Interested in genes that change between treatment conditions
differential expression versus equivalent expression

Statistical analysis process

File:Process.png

  • Raw data (GPR file format)
http://www.moleculardevices.com/pages/software/gn_gpr_format_history.html
  • Each GPR intensity file is typically >8 megabytes
  • Each TIFF image file is typically >30 megabytes
  • A microarray experiment consists of several → many slides

Statistical issues

  • In the past statistics was developed for n >>p
n observations, p variables
  • Gene expression data n<<p
Thousands of measured genes (p)
Small number of biological replicate slides (n)
  • Gene expression data can be highly correlated
groups of genes are regulated in the same way
  • Data not normally distributed
log transform highly skewed intensity data

File:Graph channels.png


Analysis wish list

  • Ideally would like unambiguous interpretation of results
  • Large amounts of data to analyse can be overwhelming and make interpretation subjective
  • Independent reproducibility of results by another collegue
Keep a record (log file) of what was done

Analysis aim

  • Obtain a list of genes which we think are differentially expressing
      Block Row Column            ID   Name         M        A         t      P.Value        B
10396    20  15     23 171121_390_49 171121  5.035364 13.25087  49.62425 3.220044e-05 11.27486
4517      9  13      9  20264_118_53  20264  4.396719 11.11976  47.06004 3.220044e-05 11.05671
16881    32  21     22 165415_634_53 165415  4.645384 12.65872  43.40359 3.220044e-05 10.70650
16086    31  10      9 185903_436_49 185903  5.146504 11.36911  42.75724 3.220044e-05 10.63926
6508     13   7     22 197386_457_55 197386  4.621024 13.20426  42.09902 3.220044e-05 10.56899
5471     11   8     20 142178_355_53 142178  4.795734 12.07427  41.23346 3.220044e-05 10.47374
8395     16  20     23   251706_1_53 251706 -5.003475 13.04571 -38.61325 3.220044e-05 10.16421
4330      9   5      6 297409_340_47 297409  4.421922 12.27208  38.52215 3.220044e-05 10.15284
12479    24  14     13 163360_396_47 163360  4.367943 11.10478  38.21662 3.220044e-05 10.11439
15024    29  10      5 149243_674_53 149243  4.372419 11.36572  37.86362 3.220044e-05 10.06935
  • Easier to rank genes in order of evidence of differential expression than it is to select a specific cutoff
  • If we do select a cutoff, False Discovery Rate (FDR) cutoff is usually used
FDR threhold is the expected proportion of genes in a list that are likely to be incorrect


What is BioConductor? (http://www.bioconductor.org)

  • BioConductor is an open source development software project
  • Provides tools for analysis and comprehension of genomic data
  • Extensively for Affymetrix and cDNA microarray technologies
  • The project started in the autumn of 2001
  • Includes 23 core collaborating developers
  • Project Growth
v1.0: May 2nd, 2002, 15 packages
v1.1: November 18th, 2002, 20 packages.
v1.2: May 28th, 2003, 30 packages.
v1.3 Oct 29th 2003, 49 Packages
v1.4 May 18 2004, 82 Packages
v1.5 Nov 1st, 2004, 98 Packages

BioConductor Goals

The broad goals of the project are:

  • To enable sound and powerful statistical analyses in genomics
  • To provide a computing platform that allows the rapid design and deployment of high-quality software
  • To develop a computing environment for both biologists and statisticians
  • Promote high-quality dynamic documentation and reproducible research
  • Using LATEX, the Sweave system and tcl/tk to deliver interactive step by step pdf tutorials, e.g.
library(tkWidgets)
vExplorer()

File:Sweave2.png


Object oriented class method design (OOP)

  • Organized approach to handling large amounts of experimental data
  • Class structure encapsulates the data required for microarray analysis → object
  • Allows efficient representation and manipulation (including subsetting) of data from many microarray slides in an experiment
  • A method is a function that performs an action on data (objects) throughout analysis

Advantages

  • Newest cutting edge statistical methods available
  • Modern programming language
  • Powerful graphical tools available
  • Its freely available

Disadvantages

  • Steep learning curve
  • Need to have experience programming in the R programming environment (http://www.r-project.org)
  • the dreaded command line
  • Like all software there are bugs

Accessing BioConductor

  • BioConductor tools are accessed using the R programming language
  • R is a programming environment for statistical computing and graphics
  • Initially written by Robert Gentleman and Ross Ihaka (Auckland University)
  • Download R from a comprehensive R archive network (CRAN) mirror (http://cran.stat.auckland.ac.nz)
  • Install R (available for Unix, Windows, and Mac OS X)
  • R version 2.2.1 has been released on 2005-12-20
  • R is the environment used to design and distribute software:
    • Locally downloaded files
    • Via the internet e.g. commands in R:
Sys.putenv("http_proxy"="http://proxy.hort.net.nz:8080")   # Setting proxy variable
source("http://www/bioconductor.org/getBioC.R")            # Downloading installation script
getBioC()                                                  # Running script

Installing vs loading packages

  • Packages only need to be installed once onto a computer
  • Packages must be loaded with each new R session
  • The R function library is used to load packages e.g. command in R:
library(limma)     #Installs the limma package 

Documentation and Help

  • R manuals and tutorials are available from the R website or on-line in an R session
  • R on-line help system, detailed high quality on-line documentation, available in text, HTML, PDF, and LATEX formats.
  • Frequently asked questions BioConductor / R
  • Email news lists for BioConductor / R
Note: Read the posting guides before submitting questions

Resources

Contributed guides for the beginner
More comprehensive contributed guides



Obtaining help in R

help.start()           # Browser based help documentation
help()                 # Help on a topic (note: help pages have a set format)
? ls                   # alternative help method on ls function
apropos(mean)          # Find Objects by (Partial) Name
example(mean)          # Run an Examples Section from the Online Help
demo()                 # Demonstrations of R Functionality
demo(graphics)         # Demonstration or graphics Functionality
RSiteSearch()          # Searches web newslist archives and retrieves results using http
There objects are functions, to run them you must put parentheses '()' after the function name

Useful commands in the R environment

search()              # Give Search Path for R Objects
searchpaths()         # Give Full Search Path for R Objects
ls()                  # List objects
objects()             # alternate function to list objects
data()                # Publically available datasets
rm()                  # Remove Objects from a Specified Environment
save.image()          # Save R Objects
q()                   # Terminate an R Session →  prompted to Save workspace image? [y/n/c]:

Command prompt

  • Type commands after the prompt (>) e.g.
> x <- 1:10        # assignment of 1 to 10 to an object called 'x'
> x                # Returning the x object to the screen
 [1]  1  2  3  4  5  6  7  8  9 10
  • Continuation of commands is expected after the plus symbol (+) e.g.
> x <- 1:          # partial command → parser is expecting more information
+  10
> x
 [1]  1  2  3  4  5  6  7  8  9 10
Text following a '#' is commented out

Basic (atomic) data types

  • Logical
T                 # TRUE
F                 # FALSE
  • Numeric
3.141592654       # Any number [0-9\.]
  • Character
"Putative ATPase" # Any character [A-Za-z] must be single or double quoted 
  • Missing values
NA                # Label for missing information in datasets
See also help("NA"), help("NaN")

Assignment of objects

  • objects must start with a letter [A-Z a-z]
  • "<-" The arrow assigns information to the object on the left
x <- 42                # Assignment to the left
x
x = 42                 # Equivalent assignment (not recommended)
x 
42 -> x                # Assignment to the right
x

Saving objects

getwd()                        # Returns the current directory where R is running
setwd("C:/DATA/Microarray")    # Set the working directory to another location
getwd()                        # Check the directory has changed
x <- 42
save.image()                   # Saves a snapshot of objects to file .RData
y <- x * 2                     # Make a new object called 'y'
y                              # Return value of 'y'
q()                            # quit R

Restart R by double clicking on the file .RData in C:/DATA/Microarray

x              # Returns 'x' as it was saved to .RData in "C:/DATA/Microarray"
y              # 'y' should not exist

Object data types

  • Create a scalar (vector of length 1)
a <- 3.14            # Assign pythagorus to object 'a'
length(a)            # The scalar is actually a vector of length 1 
pi                   # Already have a built in object for pythagorus 
search()             # Print the search path for all objects
find("pi")           # "pi" is located in package:base
  • Create a vector
x <- c(2,3,5,2,7,1)  # Numbers put into a vector using 'c' function concatenate
x
y <- c(10,15,12) 
y
names(y) <- c("first","second","third")    # Elements can be given names
z <- c(y,x)
z
  • Create a matrix
zmat <- cbind(x,y)   # cbind joins vectors together by column       
zmat 
Whats going on in the second column → number recycling
mat <- matrix(1:20, nrow=5, ncol=4)                   # Constructing a matrix
mat
colnames(mat) <- c("Col1","Col2", "Col3", "Col4")     # Adding column names
mat
  • Create a list
mylist <- list(1:4,c("a","b","c"), LETTERS[1:10]) 
mylist
mylist <- list("element 1" = 1:4,"second vector" = c("a","b","c"), "Capitals" = LETTERS[1:10]) 
mylist

Indexing

  • Subsetting a vector
x[c(1,2,3)]                     # Selecting the first three elements of 'x'
x[1:3]                          # Same subset using ':' sequence generation → see help(":")
y[2]                            # Selecting the second element of 'y'
y["second"]                     # Selecting the second element of 'y' (by name)
  • Subsetting a matrix
mat[,1:2]                       # Selecting the first two columns of 'mat'
mat[1:2, 2:4]                   # Selecting a subset matrix of 'mat'
  • Subsetting a list
mylist[[1]]                    # Subsetting list 'mylist' by index
mylist[["element 1"]]          # Subsetting list 'mylist' by name 'element 1'
mylist$"element 1"             # Alternate way of subsetting
mylist$Capitals[1:5]           # Selecting the first five elements of 'Capitals' in 'mylist' (case sensitive)

Plotting data

  • See help pages for basic plot functions
help(plot) 
help(par)
example(plot)
par(ask=TRUE)         # Set the printing device to prompt user before displaying next graph
example(hist)

Reading / writing files

Reading data
help(scan)
help(read.table)

Reading a GPR file header using scan
dataDir <- "C:/DATA/Microarray/GPR") 
mydata <-  scan(file.path(dataDir, "BE34.gpr"), what="", nlines=29)  # Get first 29 rows of data
mydata
Reading a GPR file data section using read.table
colClasses <- rep("NULL", 82)
colClasses[c(1:5, 9,12, 18, 21)] <- NA                                      # Set colClasses to ignore unwanted columns
mydata <- read.table(file.path(dataDir,"BE34.gpr"), header=T,  sep="\t", 
                                  nrows=20, skip=31, colClasses=colClasses) # Get first 20 lines of data after 31st row
mydata
Writing data
help(write)
help(write.table)
See also dump, restore, save, load

User defined functions

  • Writing functions provide a means of adding new functionality to the language. A function has the form:
myfun <- function( arglist ){ body }
Functions2.png 
  • Identity function: returns its input arguement
myfun <- function(x){x}        # Creating identity function
myfun("foo")                   # Running the function
myfun()                        # Fails: no input arguement provided
  • A simple function
square <- function(x){x * x}         # Square the input number
square(10)                  # Returns 10 squared
square(1:4)              # Underlying arithmetic is vectorized
  • Graphical example from user defined function
The following function generates data from sine distributions and examines
bias variance tradeoff of a smoothing function using different smoothing parameters. Paste it into R
"biasVar" <- function(df1=4,  df2=15,  N = 100,  seed=1)
{
 set.seed(seed)
 # 1) Data setup
 ylim <- c(-2,2)
 xlim <- c(-3,3)
 par(mfrow=c(2,2), mar=c(5,4,4-2,2)+0.1,mgp=c(2,.5,0) )
 x <- rnorm(80, 0, 1)
 y <- sin(x) + rnorm(80, 0, 1/9)
 xno   <- 500
 sim <- matrix(NA, nc=N, nr=xno)
 xseq <- seq(min(x),max(x), length=xno)
 plot(x, y, main=paste("df=",df1,sep=""), xlim=xlim, ylim=ylim)    # Using Splines
 truex <- seq(min(x), max(x), length = 80)
 lines(truex, sin(truex), lty = 5)
 splineobj <- smooth.spline(x, y, df = df1)
 lines(splineobj, lty = 1)
 plot(x, y, main=paste("df=",df2,sep=""), xlim=xlim, ylim=ylim)    # Using Splines
 truex <- seq(min(x), max(x), length = 80)
 lines(truex, sin(truex), lty = 5)
 splineobj <- smooth.spline(x, y, df = df2)
 lines(splineobj, lty = 1)
 plot(x, y, main=paste("Bias-Variance tradeoff, df=",df1, sep=""), type="n", xlim=xlim, ylim=ylim)
 for(i in seq(N))
   {
     x <- rnorm(80, 0, 1)
     y <- sin(x) + rnorm(80, 0, 1/9)
     splineobj <- smooth.spline(x, y, df = df1)      
     sim[,i] <- predict(splineobj,xseq)$y
   }
 ci <- qt(0.975, N) * sqrt(apply(sim,1, var))
 bias <- apply(sim,1, mean)
 rect(xseq,bias-ci,xseq,bias+ci, border="grey")
 rect(xseq,sin(xseq),xseq,bias, border="black")
 lines(truex, sin(truex))
 plot(x, y, main=paste("Bias-Variance tradeoff, df=",df2,sep=""), type="n", xlim=xlim, ylim=ylim)  
 for(i in seq(N))
   {
     x <- rnorm(80, 0, 1)
     y <- sin(x) + rnorm(80, 0, 1/9)
     splineobj <- smooth.spline(x, y, df = df2)      
     sim[,i] <- predict(splineobj,xseq)$y
   }
 ci <- qt(0.975,N) * sqrt(apply(sim,1, var))
 bias <- apply(sim,1, mean)
 rect(xseq,bias-ci,xseq,bias+ci, border="grey")
 rect(xseq,sin(xseq),xseq,bias, border="black")
 lines(truex, sin(truex))
}
Running the function
biasVar()                # Generates data from a sine curve looking at bias variance tradeoff 
biasVar(df1=2, df2=30)   # Let's change the smoothing parameters in the function arguements

Quiting R

rm(list=ls())         # Cleaning up: Remove Objects from a Specified Environment
q()

Links


Overview of Limma package for R

  • Fits a linear model for each spot (gene)
  • An open source software package for the R programming environment
  • Focus on normalization and statistical analysis of cDNA microarray gene expression data
  • OOP environment for handling information in a microarray experiment
  • Statistical analysis approach can be used for Affymetrix microarray experiments

Origin

  • Written and maintained by Gordon Smyth with contributions From WEHI, Melbourne, Australia
  • Software made public at the Australian Genstat Conference, Perth, in Dec 2002
  • Became available in the Bioconductor open source bioinformatics project April 2003
  • Limma integrates with other Bioconductor software packages, affy, marray, using convert package
  • Active development cycle

File:Limma versions.tiff


Statistical approach

  • Parallel inference for each gene
  • Computationally fast/robust
  • Handles missing information/use defined flag information
  • Linear models are essentially t-statistics for each spot/gene (signal/noise)
  • Also makes use of between gene information (moderated t-statistics)

Object orientated programming environment

File:OOP.png

  • Uploading data into the R programming language automatically populates elements of RGList
    • R (Red foreground)
    • G (Green foreground)
  • Foreground intensities range ~ 1 → 65535
    • Rb (Red background)
    • Gb (Green background)
  • Background intensities range ~ 1 → 1000
    • genes (Spot annotation list)
    • weights (prior weights weights given to each spot)
  • MAList data transformation
    • M = log2(R) - log2(G) (minus)
    • A = (log2(R) + log2(G))/2) (add - abundance)
  • Backtransforming to Normalized R', G' values
    • log2(R') = A + M/2
    • log2(G') = A - M/2

Advantages using Limma

  • Nice organisational framework for handling cDNA expression data using object orientated programming
  • Flexible methods to handle weighting of poor quality spots
  • Encorporates cDNA normalization routines with a proven track record
  • Robust statistical analysis approach
Can analyze cDNA microarray slides possessing large amounts of missing information
  • Analysis methods able to encorporate duplicate spots from either technical or biological sources

Limitations

  • Experiments with different spotting templates cannot easily be combined for analysis
  • Statistical analysis cannot pool information together when there are variable numbers of the same replicated spots
Must analyze spot information about the same transcript independently
  • Linear models cannot encorporate error model structures from time series designs

Microarray workshop experiment

  • Dye swap experiment
  • Directed graph

File:Dyeswap.png

FileName SlideNumber   Cy3   Cy5 Design
BE34.gpr          34  Leaf Fruit     -1
BE35.gpr          35 Fruit  Leaf      1
BE36.gpr     	  36 Fruit  Leaf      1
BE37.gpr   	  37  Leaf Fruit     -1
  • Fruit versus Leaf comparisons M value multipliers -1, 1, 1, -1
  • Determining design questions of interest is the hardest part


{{#security:edit|Sven}} {{#security:*|Sven,Tam,Andy}}

Analysis of dyeswap experiment

  • Read in GenePix GPR files
  • Normalize using print tip loess smoothing
  • Diagnostics summaries and graphs
  • Statistical analysis → obtain a gene list
      Block Row Column            ID   Name         M        A         t      P.Value        B
10396    20  15     23 171121_390_49 171121  5.035364 13.25087  49.62425 3.220044e-05 11.27486
4517      9  13      9  20264_118_53  20264  4.396719 11.11976  47.06004 3.220044e-05 11.05671
16881    32  21     22 165415_634_53 165415  4.645384 12.65872  43.40359 3.220044e-05 10.70650
16086    31  10      9 185903_436_49 185903  5.146504 11.36911  42.75724 3.220044e-05 10.63926
6508     13   7     22 197386_457_55 197386  4.621024 13.20426  42.09902 3.220044e-05 10.56899
5471     11   8     20 142178_355_53 142178  4.795734 12.07427  41.23346 3.220044e-05 10.47374
8395     16  20     23   251706_1_53 251706 -5.003475 13.04571 -38.61325 3.220044e-05 10.16421
4330      9   5      6 297409_340_47 297409  4.421922 12.27208  38.52215 3.220044e-05 10.15284
12479    24  14     13 163360_396_47 163360  4.367943 11.10478  38.21662 3.220044e-05 10.11439
15024    29  10      5 149243_674_53 149243  4.372419 11.36572  37.86362 3.220044e-05 10.06935

+Workshop.R

See also