Limma analysis

From Organic Design wiki
Revision as of 20:56, 21 September 2009 by Sven (talk | contribs) (List elements from eBayes: typo)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

{{#security:edit|Mik Black,Sven,Bjorn}} {{#security:*|Mik Black,Sven,Bjorn}}

Normalization

The limma package offers various standard normalization techniques for cDNA two channel microarray data. The method printtiploess is sensitive to situations where there are many missing observations and spots with zero weights. In this case predicted M values can be greater than the theoretical maximum of ≈ -8 to +8.

This occurs whether you background subtract or not, for the following normalization arguments; <R>

 MA <- normalizeWithinArrays(RG, method="printtiploess", bc.method="none")
 # or
 MA <- normalizeWithinArrays(RG, method="printtiploess", bc.method="none", weights = RG$weights)

apply(MA$M, 2, max, na.rm=TRUE) # M values potentially effected by loess oversmoothing apply(MA$A, 2, max, na.rm=TRUE) </R>

Linear models for microarray analysis

Linear models for microarray analysis (Limma) is a R and Bioconductor package for organising and analysing cDNA and Affymetrix microarray data. It is written by Gordon Smyth at WEHI. The package encorporates a Bayesian analysis approach which take into account between gene information to improve variance estimates over within gene variance methods such as gene by gene t-tests and ANOVA's.

Overview

For a p * n matrix of expression intensities, Limma is fitting p linear models (one for each row/spot). The lmFit function does this by calling functions such as lm.series which use lm.fit in Package:Stats to fit a linear model (see also ?mrlm ?gls.series);

y = Xβ + ε

For cDNA/oligo two spotted technologies the matrix of expression intensities is usually the matrix of M values with respect to treatments. For Affymetrix single channel arrays the expression intensities for each Affymetrix slide are directly analysed comparing two treatments. Moderated t-statistics are effectively similiar in approach (plus the additional between gene variance component) to paired t-statistics for the analysis cDNA/oligo technologies, and unpaired t-statistics for Affymetrix single channel technologies.

The code to perform an analysis for a simple comparison is; <R>

fit <- lmFit(MA, design) # or an exprSet object
fit2 <- eBayes(fit) 
topTable(fit2) 

</R> For a reference or loop design, there is an extra step via contrasts of interest; <R>

fit <- lmFit(MA, design) # or an exprSet object
fit2 <- contrasts.fit(fit, cont.matrix) 
fit2 <- eBayes(fit2) 
topTable(fit2) 

</R> In practise, formulating the linear model design matrix for reference and loop designs is the most difficult process while taking into account biological and technical sources of variation.

List elements from lmFit

<R>

> names(fit)
[1] "coefficients"     "rank"             "assign"           "qr"              
[5] "df.residual"      "sigma"            "cov.coefficients" "stdev.unscaled"  
[9] "pivot"            "method"           "design"           "Amean"           
[13] "genes" 

</R>

  • coefficients= estimated M values
  • qr = qr decomposition
  • df.residual=residual degrees of freedom for each gene
  • Amean = Estimated unweighted A values
  • sigma = row/spot standard deviations (effected by missing values)
  • cov.coefficients = covariance matrix
  • stdev.unscaled = scaling required to calculate se(x_bar)
  • method = model fitting method used
  • design=the design matrix X used
  • genes = list of gene names

Note: The standard errors are given by stdev.unscaled * sigma' see ? lm.series.

List elements from eBayes

<R>

> names(ebayes(fit))
[1] "df.prior"  "s2.prior"  "s2.post"   "t"         "p.value"   "var.prior"
[7] "lods"

</R>

  • df.prior= do
  • s2.prior=(so)2
  • s2.post=(sg)2
  • t=moderated t-statistics
  • p.value = moderated t-statistics p values
  • lods = log odds B statistics

Note: empty or unwanted spots in analysis can be downweighted by setting their sigma component as an NA before eBayes. They will however still have a meaningful t, and lods ratio that may still be significant as the s2.post uses information from other gene variances.

Details

The linear model reduces to effectively estimating average M values using a design matrix of discrete values.

Technical correlation

If correlation between rows (spots) is estimated, then the function duplicateCorrelation is called. This fits a reml model on all genes to estimate a rho correlation matrix. A fisher transformation (identical to atanh(x)) is then applied to the rho matrix calculating a mean correlation (with trim=0.15 by default), which is then backtransformed to give a consensus correlation. This correlation can be utilised in lmFit by calling gls.series which fits a generalized least squares model.

The addition of a correlation term, and specification of technical or biological blocking will alter sigma, stdev.unscaled, cov.oefficients, and remove rank and qr from the list. Different correlations combined with spot weighting will effect coefficient estimates.

Weights

If a weight matrix is present in the MAList object to be analysed, then a weighted linear model will be fitted using mrlm, lm.series, or gls.series depending on the chosen method arguement. A weight of 1 incorporates all the information for that spot, a weight of 0 is equivalent to an NA. Changing the weights for poor quality spots will influence the weighted estimates of M, and within gene variation sigma, which will effect downstream calculations of t, P.Value, adj.P.Val, and B.

See also lmFit-Weights.R.

Complex designs

The design matrix modifies the way the estimated error structure is calculated. Consider a small experiment of 10 slides, with two treatments and five slides per treatment. A direct comparison could be used to estimate the within/between gene error structure and differences between treatments. If this experiment was part of a larger experiment containing 20 slides then the information from all the slides would be used to obtain more precise estimates of the within/between gene error structure. A plot of sigma methods obtained from the full model and the sub model should be highly correlated. The M value differences between treatments would be identical in the sub model, but the error structure slightly different obtaining slightly different answers depending on what slides were incorporated in the experiment.

For specific comparisons of interest contrasts.fit modifes the lmFit object, the fit$coef terms are updated by adding or subtracting appropriate columns identified by cont.matrix. Variance terms for fit$stdev.unscaled are added, and rexpressed as standard deviations (see Variance), and fit$cov.coefficients terms are added together. All other elements remain unchanged.

Empirical Bayes assuming conjugate priors are used to calculate posterior probability measures such as moderated t-statistics and B statistics. Prior information for the moderated t-statistics comes from the data, while B statistics also depend on an assumed value of p, the proportion of genes which change.

The wrapper function eBayes for ebayes calculates these statistics for each row (spot).

Moderated t-statistics

The function ebayes calculates two statistical measures, moderated t-statistics and log odds ratio B statistics. The moderated t-statistic part calls squeezeVar which calls fitFDist. This assumes an inverse gamma distribution prior, and gamma distribution likelihood function for the variances (see ?sqeezeVar & Smyth 2002 p11). The prior parameters are estimated from the data (within gene variances). This estimates the scale (var.prior) and degrees of freedom for the prior distribution. the posterior distribution is empirically defined as:

shat2g = (do s2o + dg s2g) / (do + dg)

B statistics

the B statistics are empirically calculated assuming p (Smyth 2002, page 10), the proportion of genes that change is known:

Ogj  = ( (p) / (1-p) ) * ( vgj/(vgj+voj) )^0.5 * ( (that2gj + to + tg) / (that2 * (vgj/(vgj+voj) ) + do + dg)^(1+do+dg)/2

where

B = log(Ogj)

See also

References

  • Lonnstedt, I., and Speed, T. P. (2002). Replicated microarray data. Statistica Sinica 12, 31–46.
Note: The simulation study for this paper was identical to the underlying parametric assumptions in the B statistic
  • Smyth G.K. (2002). Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments