Microarray analysis workshop

Time schedule: 8:30 - 10:30am, 11-12:30am (3.5 hours)

Workflow

Introduction to Microarray analysis (15 minutes) Normalization (15 minutes) BioConductor/R framework (15 minutes) R tutorial (60 minutes) Linear models for Microarray analysis (15 minutes) BioConductor analysis tutorial ("60 minutes")

Overview of experimental process

File:Expt.png

(Courtesy Mik Black)

Competitive hybridization to spotted oligo/cDNA transcripts
Interested in genes that change between treatment conditions

→ differential expression versus equivalent expression

Statistical analysis process

File:Process.png

Raw data (GPR file format)

http://www.moleculardevices.com/pages/software/gn_gpr_format_history.html

Each GPR intensity file is typically >8 megabytes
Each TIFF image file is typically >30 megabytes
A microarray experiment consists of several → many slides

Statistical issues

In the past statistics was developed for n >>p

n observations, p variables

Gene expression data n<<p

Thousands of measured genes (p)

Small number of biological replicate slides (n)

Gene expression data can be highly correlated

groups of genes are regulated in the same way

Data not normally distributed

log transform highly skewed intensity data

File:Graph channels.png

Analysis wish list

Ideally would like unambiguous interpretation of results
Large amounts of data to analyse can be overwhelming and make interpretation subjective
Independent reproducibility of results by another collegue

→Keep a record (log file) of what was done

Analysis aim

Obtain a list of genes which we think are differentially expressing

      Block Row Column            ID   Name         M        A         t      P.Value        B
10396    20  15     23 171121_390_49 171121  5.035364 13.25087  49.62425 3.220044e-05 11.27486
4517      9  13      9  20264_118_53  20264  4.396719 11.11976  47.06004 3.220044e-05 11.05671
16881    32  21     22 165415_634_53 165415  4.645384 12.65872  43.40359 3.220044e-05 10.70650
16086    31  10      9 185903_436_49 185903  5.146504 11.36911  42.75724 3.220044e-05 10.63926
6508     13   7     22 197386_457_55 197386  4.621024 13.20426  42.09902 3.220044e-05 10.56899
5471     11   8     20 142178_355_53 142178  4.795734 12.07427  41.23346 3.220044e-05 10.47374
8395     16  20     23   251706_1_53 251706 -5.003475 13.04571 -38.61325 3.220044e-05 10.16421
4330      9   5      6 297409_340_47 297409  4.421922 12.27208  38.52215 3.220044e-05 10.15284
12479    24  14     13 163360_396_47 163360  4.367943 11.10478  38.21662 3.220044e-05 10.11439
15024    29  10      5 149243_674_53 149243  4.372419 11.36572  37.86362 3.220044e-05 10.06935

Easier to rank genes in order of evidence of differential expression than it is to select a specific cutoff

If we do select a cutoff, False Discovery Rate (FDR) cutoff is usually used

FDR threhold is the expected proportion of genes in a list that are likely to be incorrect

What is BioConductor? (http://www.bioconductor.org)

BioConductor is an open source development software project
Provides tools for analysis and comprehension of genomic data
Extensively for Affymetrix and cDNA microarray technologies
The project started in the autumn of 2001
Includes 23 core collaborating developers
Project Growth

v1.0: May 2nd, 2002, 15 packages

v1.1: November 18th, 2002, 20 packages.

v1.2: May 28th, 2003, 30 packages.

v1.3 Oct 29th 2003, 49 Packages

v1.4 May 18 2004, 82 Packages

v1.5 Nov 1st, 2004, 98 Packages

BioConductor Goals

The broad goals of the project are:

To enable sound and powerful statistical analyses in genomics
To provide a computing platform that allows the rapid design and deployment of high-quality software
To develop a computing environment for both biologists and statisticians
Promote high-quality dynamic documentation and reproducible research
Using L^AT_EX, the Sweave system and tcl/tk to deliver interactive step by step pdf tutorials, e.g.

library(tkWidgets)
vExplorer()

File:Sweave2.png

Object oriented class method design (OOP)

Organized approach to handling large amounts of experimental data
Class structure encapsulates the data required for microarray analysis → object
Allows efficient representation and manipulation (including subsetting) of data from many microarray slides in an experiment
A method is a function that performs an action on data (objects) throughout analysis

Advantages

Newest cutting edge statistical methods available
Modern programming language
Powerful graphical tools available
Its freely available

Disadvantages

Steep learning curve
Need to have experience programming in the R programming environment (http://www.r-project.org)
the dreaded command line
Like all software there are bugs

Accessing BioConductor

BioConductor tools are accessed using the R programming language
R is a programming environment for statistical computing and graphics
Initially written by Robert Gentleman and Ross Ihaka (Auckland University)
Download R from a comprehensive R archive network (CRAN) mirror (http://cran.stat.auckland.ac.nz)
Install R (available for Unix, Windows, and Mac OS X)
R version 2.2.1 has been released on 2005-12-20
R is the environment used to design and distribute software:
- Locally downloaded files
- Via the internet e.g. commands in R:

Sys.putenv("http_proxy"="http://proxy.hort.net.nz:8080")   # Setting proxy variable
source("http://www/bioconductor.org/getBioC.R")            # Downloading installation script
getBioC()                                                  # Running script

Installing vs loading packages

Packages only need to be installed once onto a computer
Packages must be loaded with each new R session
The R function library is used to load packages e.g. command in R:

library(limma)     #Installs the limma package

Documentation and Help

R manuals and tutorials are available from the R website or on-line in an R session
R on-line help system, detailed high quality on-line documentation, available in text, HTML, PDF, and L^AT_EX formats.
Frequently asked questions BioConductor / R
Email news lists for BioConductor / R

Note: Read the posting guides before submitting questions

Resources

Quickest way to learn R: use the Contributed Documentation (http://cran.stat.auckland.ac.nz/other-docs.html)
Thousands of pages of documentation including short guides / reference cards

Contributed guides for the beginner

A Guide for the Unwilling S User (http://cran.stat.auckland.ac.nz/doc/contrib/Burns-unwilling_S.pdf)
R for Beginners (http://cran.stat.auckland.ac.nz/doc/contrib/Paradis-rdebuts_en.pdf)
Reference card (http://cran.stat.auckland.ac.nz/doc/contrib/refcard.pdf)

More comprehensive contributed guides

Using R for Data Analysis and Graphics (http://cran.stat.auckland.ac.nz/doc/contrib/usingR-2.pdf)
Simple R (http://cran.stat.auckland.ac.nz/doc/contrib/Verzani-SimpleR.pdf)
S programming techniques (http://www.stat.auckland.ac.nz/S-Workshop/Ihaka/lecture.pdf)

Obtaining help in R

help.start()           # Browser based help documentation
help()                 # Help on a topic (note: help pages have a set format)
? ls                   # alternative help method on ls function
apropos(mean)          # Find Objects by (Partial) Name
example(mean)          # Run an Examples Section from the Online Help
demo()                 # Demonstrations of R Functionality
demo(graphics)         # Demonstration or graphics Functionality
RSiteSearch()          # Searches web newslist archives and retrieves results using http

There objects are functions, to run them you must put parentheses '()' after the function name

Useful commands in the R environment

search()              # Give Search Path for R Objects
searchpaths()         # Give Full Search Path for R Objects
ls()                  # List objects
objects()             # alternate function to list objects
data()                # Publically available datasets
rm()                  # Remove Objects from a Specified Environment
save.image()          # Save R Objects
q()                   # Terminate an R Session →  prompted to Save workspace image? [y/n/c]:

Command prompt

Type commands after the prompt (>) e.g.

> x <- 1:10        # assignment of 1 to 10 to an object called 'x'
> x                # Returning the x object to the screen
 [1]  1  2  3  4  5  6  7  8  9 10

Continuation of commands is expected after the plus symbol (+) e.g.

> x <- 1:          # partial command → parser is expecting more information
+  10
> x
 [1]  1  2  3  4  5  6  7  8  9 10

Text following a '#' is commented out

Basic (atomic) data types

Logical

T                 # TRUE
F                 # FALSE

Numeric

3.141592654       # Any number [0-9\.]

Character

"Putative ATPase" # Any character [A-Za-z] must be single or double quoted

Missing values

NA                # Label for missing information in datasets

See also help("NA"), help("NaN")

Assignment of objects

objects must start with a letter [A-Z a-z]
"<-" The arrow assigns information to the object on the left

x <- 42                # Assignment to the left
x
x = 42                 # Equivalent assignment (not recommended)
x 
42 -> x                # Assignment to the right
x

Saving objects

getwd()                        # Returns the current directory where R is running
setwd("C:/DATA/Microarray")    # Set the working directory to another location
getwd()                        # Check the directory has changed
x <- 42
save.image()                   # Saves a snapshot of objects to file .RData
y <- x * 2                     # Make a new object called 'y'
y                              # Return value of 'y'
q()                            # quit R

Restart R by double clicking on the file .RData in C:/DATA/Microarray

x              # Returns 'x' as it was saved to .RData in "C:/DATA/Microarray"
y              # 'y' should not exist

Object data types

Create a scalar (vector of length 1)

a <- 3.14            # Assign pythagorus to object 'a'
length(a)            # The scalar is actually a vector of length 1 
pi                   # Already have a built in object for pythagorus 
search()             # Print the search path for all objects
find("pi")           # "pi" is located in package:base

Create a vector

x <- c(2,3,5,2,7,1)  # Numbers put into a vector using 'c' function concatenate
x
y <- c(10,15,12) 
y
names(y) <- c("first","second","third")    # Elements can be given names
z <- c(y,x)
z

Create a matrix

zmat <- cbind(x,y)   # cbind joins vectors together by column       
zmat

Whats going on in the second column → number recycling

mat <- matrix(1:20, nrow=5, ncol=4)                   # Constructing a matrix
mat
colnames(mat) <- c("Col1","Col2", "Col3", "Col4")     # Adding column names
mat

Create a list

mylist <- list(1:4,c("a","b","c"), LETTERS[1:10]) 
mylist
mylist <- list("element 1" = 1:4,"second vector" = c("a","b","c"), "Capitals" = LETTERS[1:10]) 
mylist

Indexing

Subsetting a vector

x[c(1,2,3)]                     # Selecting the first three elements of 'x'
x[1:3]                          # Same subset using ':' sequence generation → see help(":")
y[2]                            # Selecting the second element of 'y'
y["second"]                     # Selecting the second element of 'y' (by name)

Subsetting a matrix

mat[,1:2]                       # Selecting the first two columns of 'mat'
mat[1:2, 2:4]                   # Selecting a subset matrix of 'mat'

Subsetting a list

mylist[[1]]                    # Subsetting list 'mylist' by index
mylist[["element 1"]]          # Subsetting list 'mylist' by name 'element 1'
mylist$"element 1"             # Alternate way of subsetting
mylist$Capitals[1:5]           # Selecting the first five elements of 'Capitals' in 'mylist' (case sensitive)

Plotting data

See help pages for basic plot functions

help(plot) 
help(par)
example(plot)
par(ask=TRUE)         # Set the printing device to prompt user before displaying next graph
example(hist)

Reading / writing files

Reading data

help(scan)
help(read.table)

Reading a GPR file header using scan

dataDir <- "C:/DATA/Microarray/GPR") 
mydata <-  scan(file.path(dataDir, "BE34.gpr"), what="", nlines=29)  # Get first 29 rows of data
mydata

Reading a GPR file data section using read.table

colClasses <- rep("NULL", 82)
colClasses[c(1:5, 9,12, 18, 21)] <- NA                                      # Set colClasses to ignore unwanted columns
mydata <- read.table(file.path(dataDir,"BE34.gpr"), header=T,  sep="\t", 
                                  nrows=20, skip=31, colClasses=colClasses) # Get first 20 lines of data after 31st row
mydata

Writing data

help(write)
help(write.table)

See also dump, restore, save, load

User defined functions

Writing functions provide a means of adding new functionality to the language. A function has the form:

myfun <- function( arglist ){ body }

Identity function: returns its input arguement

myfun <- function(x){x}        # Creating identity function
myfun("foo")                   # Running the function
myfun()                        # Fails: no input arguement provided

A simple function

square <- function(x){x * x}         # Square the input number
square(10)                  # Returns 10 squared
square(1:4)              # Underlying arithmetic is vectorized

Graphical example from user defined function

The following function generates data from sine distributions and examines bias variance tradeoff of a smoothing function using different smoothing parameters. Paste it into R

"biasVar" <- function(df1=4,  df2=15,  N = 100,  seed=1)
{
 set.seed(seed)
 # 1) Data setup
 ylim <- c(-2,2)
 xlim <- c(-3,3)
 par(mfrow=c(2,2), mar=c(5,4,4-2,2)+0.1,mgp=c(2,.5,0) )
 x <- rnorm(80, 0, 1)
 y <- sin(x) + rnorm(80, 0, 1/9)
 xno   <- 500
 sim <- matrix(NA, nc=N, nr=xno)
 xseq <- seq(min(x),max(x), length=xno)
 plot(x, y, main=paste("df=",df1,sep=""), xlim=xlim, ylim=ylim)    # Using Splines
 truex <- seq(min(x), max(x), length = 80)
 lines(truex, sin(truex), lty = 5)
 splineobj <- smooth.spline(x, y, df = df1)
 lines(splineobj, lty = 1)
 plot(x, y, main=paste("df=",df2,sep=""), xlim=xlim, ylim=ylim)    # Using Splines
 truex <- seq(min(x), max(x), length = 80)
 lines(truex, sin(truex), lty = 5)
 splineobj <- smooth.spline(x, y, df = df2)
 lines(splineobj, lty = 1)
 plot(x, y, main=paste("Bias-Variance tradeoff, df=",df1, sep=""), type="n", xlim=xlim, ylim=ylim)
 for(i in seq(N))
   {
     x <- rnorm(80, 0, 1)
     y <- sin(x) + rnorm(80, 0, 1/9)
     splineobj <- smooth.spline(x, y, df = df1)      
     sim[,i] <- predict(splineobj,xseq)$y
   }
 ci <- qt(0.975, N) * sqrt(apply(sim,1, var))
 bias <- apply(sim,1, mean)
 rect(xseq,bias-ci,xseq,bias+ci, border="grey")
 rect(xseq,sin(xseq),xseq,bias, border="black")
 lines(truex, sin(truex))
 plot(x, y, main=paste("Bias-Variance tradeoff, df=",df2,sep=""), type="n", xlim=xlim, ylim=ylim)  
 for(i in seq(N))
   {
     x <- rnorm(80, 0, 1)
     y <- sin(x) + rnorm(80, 0, 1/9)
     splineobj <- smooth.spline(x, y, df = df2)      
     sim[,i] <- predict(splineobj,xseq)$y
   }
 ci <- qt(0.975,N) * sqrt(apply(sim,1, var))
 bias <- apply(sim,1, mean)
 rect(xseq,bias-ci,xseq,bias+ci, border="grey")
 rect(xseq,sin(xseq),xseq,bias, border="black")
 lines(truex, sin(truex))
}

Running the function

biasVar()                # Generates data from a sine curve looking at bias variance tradeoff 
biasVar(df1=2, df2=30)   # Let's change the smoothing parameters in the function arguements

Quiting R

rm(list=ls())         # Cleaning up: Remove Objects from a Specified Environment
q()

Links

Advanced tutorial

Overview of Limma package for R

Fits a linear model for each spot (gene)
An open source software package for the R programming environment
Focus on normalization and statistical analysis of cDNA microarray gene expression data
OOP environment for handling information in a microarray experiment
Statistical analysis approach can be used for Affymetrix microarray experiments

Origin

Written and maintained by Gordon Smyth with contributions From WEHI, Melbourne, Australia
Software made public at the Australian Genstat Conference, Perth, in Dec 2002
Became available in the Bioconductor open source bioinformatics project April 2003
Limma integrates with other Bioconductor software packages, affy, marray, using convert package
Active development cycle

File:Limma versions.tiff

Statistical approach

Parallel inference for each gene
Computationally fast/robust
Handles missing information/use defined flag information
Linear models are essentially t-statistics for each spot/gene (signal/noise)
Also makes use of between gene information (moderated t-statistics)

Object orientated programming environment

File:OOP.png

Uploading data into the R programming language automatically populates elements of RGList
- R (Red foreground)
- G (Green foreground)
Foreground intensities range ~ 1 → 65535
- Rb (Red background)
- Gb (Green background)
Background intensities range ~ 1 → 1000
- genes (Spot annotation list)
- weights (prior weights weights given to each spot)

MAList data transformation
- M = log2(R) - log2(G) (minus)
- A = (log2(R) + log2(G))/2) (add - abundance)

Backtransforming to Normalized R', G' values
- log2(R') = A + M/2
- log2(G') = A - M/2

Advantages using Limma

Nice organisational framework for handling cDNA expression data using object orientated programming
Flexible methods to handle weighting of poor quality spots
Encorporates cDNA normalization routines with a proven track record
Robust statistical analysis approach

Can analyze cDNA microarray slides possessing large amounts of missing information

Analysis methods able to encorporate duplicate spots from either technical or biological sources

Limitations

Experiments with different spotting templates cannot easily be combined for analysis
Statistical analysis cannot pool information together when there are variable numbers of the same replicated spots

Must analyze spot information about the same transcript independently

Linear models cannot encorporate error model structures from time series designs

Microarray workshop experiment

Dye swap experiment
Directed graph

File:Dyeswap.png

FileName SlideNumber   Cy3   Cy5 Design
BE34.gpr          34  Leaf Fruit     -1
BE35.gpr          35 Fruit  Leaf      1
BE36.gpr     	  36 Fruit  Leaf      1
BE37.gpr   	  37  Leaf Fruit     -1

Fruit versus Leaf comparisons M value multipliers -1, 1, 1, -1
Determining design questions of interest is the hardest part

Analysis of dyeswap experiment

Read in GenePix GPR files
Normalize using print tip loess smoothing
Diagnostics summaries and graphs
Statistical analysis → obtain a gene list

      Block Row Column            ID   Name         M        A         t      P.Value        B
10396    20  15     23 171121_390_49 171121  5.035364 13.25087  49.62425 3.220044e-05 11.27486
4517      9  13      9  20264_118_53  20264  4.396719 11.11976  47.06004 3.220044e-05 11.05671
16881    32  21     22 165415_634_53 165415  4.645384 12.65872  43.40359 3.220044e-05 10.70650
16086    31  10      9 185903_436_49 185903  5.146504 11.36911  42.75724 3.220044e-05 10.63926
6508     13   7     22 197386_457_55 197386  4.621024 13.20426  42.09902 3.220044e-05 10.56899
5471     11   8     20 142178_355_53 142178  4.795734 12.07427  41.23346 3.220044e-05 10.47374
8395     16  20     23   251706_1_53 251706 -5.003475 13.04571 -38.61325 3.220044e-05 10.16421
4330      9   5      6 297409_340_47 297409  4.421922 12.27208  38.52215 3.220044e-05 10.15284
12479    24  14     13 163360_396_47 163360  4.367943 11.10478  38.21662 3.220044e-05 10.11439
15024    29  10      5 149243_674_53 149243  4.372419 11.36572  37.86362 3.220044e-05 10.06935

+Workshop.R

Rosaceae