Difference between revisions of "Limma analysis"

From Organic Design wiki
(Header changes and new content)
(More on weights)
Line 28: Line 28:
 
  topTable(fit2)  
 
  topTable(fit2)  
  
=Details=
+
==Details==
 
The linear model reduces to effectively estimating average M values using a design matrix of discrete values.  
 
The linear model reduces to effectively estimating average M values using a design matrix of discrete values.  
==Technical correlation==
+
===Technical correlation===
 
If correlation between rows (spots) is estimated, then the function ''duplicateCorrelation'' is called. This fits a reml model on all genes to estimate a ''rho'' correlation matrix. A ''[[Wikipedia:Fisher transformation|fisher transformation]]'' (identical to ''atanh(x)'') is then applied to the ''rho'' matrix calculating a mean correlation (with trim=0.15 by default), which is then backtransformed to give a ''consensus correlation''. This correlation can be utilised in lmFit by calling ''gls.series'' which fits a generalized least squares model.
 
If correlation between rows (spots) is estimated, then the function ''duplicateCorrelation'' is called. This fits a reml model on all genes to estimate a ''rho'' correlation matrix. A ''[[Wikipedia:Fisher transformation|fisher transformation]]'' (identical to ''atanh(x)'') is then applied to the ''rho'' matrix calculating a mean correlation (with trim=0.15 by default), which is then backtransformed to give a ''consensus correlation''. This correlation can be utilised in lmFit by calling ''gls.series'' which fits a generalized least squares model.
  
==Weights==
+
===Weights===
If a weight matrix is present in the ''MAList'' object to be analysed, then a weighted linear model will be fitted using ''mrlm'', ''lm.series'', or ''gls.series'' depending on the chosen ''method'' arguement. A weight of 1 incorporates all the information for that spot, a weight of 0 is equivalent to an NA.  
+
If a weight matrix is present in the ''MAList'' object to be analysed, then a weighted linear model will be fitted using ''mrlm'', ''lm.series'', or ''gls.series'' depending on the chosen ''method'' arguement. A weight of 1 incorporates all the information for that spot, a weight of 0 is equivalent to an NA. Changing the weights for porr quality spots will influence the weighted estimates of ''M'', and within gene variation ''sigma'', which will effect downstream calculations of ''t'', ''P.Value'', adj.P.Val, and ''B''.
 +
 
 
===List elements from lmFit===  
 
===List elements from lmFit===  
 
<table class=document-code><tr><td>
 
<table class=document-code><tr><td>

Revision as of 23:37, 24 August 2006


Linear models for microarray analysis

Linear models for microarray analysis (Limma) is a R and Bioconductor package for organising and analysing cDNA and Affymetrix microarray data. It is written by Gordon Smyth at WEHI.

Overview

For a p * n matrix of expression intensities, Limma is fitting p linear models (one for each row/spot). The lmFit function does this by calling functions such as lm.series which use lm.fit in Package:Stats to fit a linear model (see also ?mrlm ?gls.series);

y = Xβ + ε

For cDNA/oligo two spotted technologies the matrix of expression intensities is usually the matrix of M values with respect to treatments. For Affymetrix single channel arrays the expression intensities for each Affymetrix slide are directly analysed comparing two treatments. Moderated t-statistics are effectively similiar in approach (plus the additional between gene variance component) to paired t-statistics for the analysis cDNA/oligo technologies, and unpaired t-statistics for Affymetrix single channel technologies.

The code to perform an analysis for a simple comparison is;

fit <- lmFit(MA, design) # or an exprSet object
fit2 <- eBayes(fit) 
topTable(fit2) 

For a reference or loop design, there is an extra step via contrasts of interest;

fit <- lmFit(MA, design) # or an exprSet object
fit2 <- contrasts.fit(fit, cont.matrix) 
fit2 <- eBayes(fit2) 
topTable(fit2) 

Details

The linear model reduces to effectively estimating average M values using a design matrix of discrete values.

Technical correlation

If correlation between rows (spots) is estimated, then the function duplicateCorrelation is called. This fits a reml model on all genes to estimate a rho correlation matrix. A fisher transformation (identical to atanh(x)) is then applied to the rho matrix calculating a mean correlation (with trim=0.15 by default), which is then backtransformed to give a consensus correlation. This correlation can be utilised in lmFit by calling gls.series which fits a generalized least squares model.

Weights

If a weight matrix is present in the MAList object to be analysed, then a weighted linear model will be fitted using mrlm, lm.series, or gls.series depending on the chosen method arguement. A weight of 1 incorporates all the information for that spot, a weight of 0 is equivalent to an NA. Changing the weights for porr quality spots will influence the weighted estimates of M, and within gene variation sigma, which will effect downstream calculations of t, P.Value, adj.P.Val, and B.

List elements from lmFit

> names(fit)
[1] "coefficients"     "rank"             "assign"           "qr"              
[5] "df.residual"      "sigma"            "cov.coefficients" "stdev.unscaled"  
[9] "pivot"            "method"           "design"           "Amean"           
[13] "genes" 
  • coefficients= estimated M values
  • qr = qr decomposition
  • df.residual=residual degrees of freedom for each gene
  • Amean = Estimated unweighted A values
  • sigma = row/spot standard deviations (effected by missing values)
  • cov.coefficients = covariance matrix
  • stdev.unscaled = scaling required to caculate se(x_bar)
  • method = model fitting method used
  • design=the design matrix X used
  • genes = list of gene names

Note: The standard errors are given by stdev.unscaled * sigma' see ? lm.series.

The addition of a correlation term, and specification of technical or biological blocking will alter sigma, stdev.unscaled, cov.oefficients, and remove rank and qr from the list. Different correlations combined with spot weighting will effect coefficient estimates.

Complex designs

The design matrix modifies the way the estimated error structure is calculated. Consider small experiment of 10 slides, with two treatments and five slides per treatment. A direct comparison could be used to estimate the within/between gene error structure and differences between treatments. If this experiment was part of a larger experiment containing 20 slides then the information from all the slides would be used to obtain more precise estimates of the within/between gene error structure. The M value differences between treatments would be identical but the error structure slightly different obtining lightly different answers depending on what slides were incorporated in the experiment.

For specific comparisons of interest contrasts.fit modifes the lmFit object, the fit$coef terms are updated by adding or subtracting appropriate columns identified by cont.matrix. Variance terms for fit$stdev.unscaled are added, and rexpressed as standard deviations (see Variance), and fit$cov.coefficients terms are added together. All other elements remain unchanged.

Empirical Bayes using conjugate priors are used to calculate posterior probability measures such as moderated t-statistics and B statistics. Prior information for the moderated t-statistics comes from the data, while B statistics also depend on an assumed value of p, the proportion of genes which change.

The wrapper function eBayes for ebayes calculates these statistics for each row (spot).

List elements from ebayes

> names(ebayes(fit))
[1] "df.prior"  "s2.prior"  "s2.post"   "t"         "p.value"   "var.prior"
[7] "lods"
  • df.prior= dg
  • s2.prior=(so)2
  • s2.post=(sg)2
  • t=moderated t-statistics
  • p.value = moderated t-statistics p values
  • lods = log odds B statistics

Moderated t-statistics

The function ebayes calculates two statistical measures, moderated t-statistics and log odds ratio B statistics. The moderated t-statistic part calls squeezeVar which calls fitFDist. This assumes an inverse gamma distribution prior, and gamma distribution likelihood function for the variances (see ?sqeezeVar & Smyth 2002 p11). The prior parameters are estimated from the data (within gene variances). This estimates the scale (var.prior) and degrees of freedom for the prior distribution. the posterior distribution is empirically defined as:

shat2g = (do s2o + dg s2g) / (do + dg)

B statistics

the B statistics are empirically calculated assuming p (Smyth 2002, page 10), the proportion of genes that change is known:

Ogj  = ( (p) / (1-p) ) * ( vgj/(vgj+voj) )^0.5 * ( (that2gj + to + tg) / (that2 * (vgj/(vgj+voj) ) + do + dg)^(1+do+dg)/2

where

B = log(Ogj)

See also

References

  • Lonnstedt, I., and Speed, T. P. (2002). Replicated microarray data. Statistica Sinica 12, 31–46.
Note: The simulation study for this paper was identical to the underlying parametric assumptions in the B statistic
  • Smyth G.K. (2002). Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments