Difference between revisions of "Limma analysis"

From Organic Design wiki
(Header changes and new content)
m (List elements from eBayes: typo)
 
(26 intermediate revisions by 2 users not shown)
Line 1: Line 1:
[[Category:BioC]]
+
{{#security:edit|Mik Black,Sven,Bjorn}}
[[Category:R]]
+
{{#security:*|Mik Black,Sven,Bjorn}}
[[Category:Limma]]
+
[[Category:BioC]][[Category:Limma]][[Category:Nutrigenomics]][[Category:R]]
[[Category:Nutrigenomics]]
+
= Normalization =
+
The [http://bioinf.wehi.edu.au/limma/ ''limma''] package offers various standard normalization techniques for cDNA two channel microarray data. The method ''printtiploess'' is sensitive to
=Linear models for microarray analysis=
+
situations where there are many missing observations and spots with zero weights. In this case predicted M values can be greater than the theoretical maximum of ≈ -8 to +8.
Linear models for microarray analysis (''Limma'') is a [http://cran.stat.auckland.ac.nz/src/contrib/Descriptions/limma.html R] and [http://www.bioconductor.org/packages/1.8/bioc/html/limma.html Bioconductor] package for organising and analysing cDNA and Affymetrix microarray data. It is written by Gordon Smyth at WEHI.
 
  
==Overview==
+
This occurs whether you background subtract or not, for the following normalization arguments;
 +
<R>
 +
  MA <- normalizeWithinArrays(RG, method="printtiploess", bc.method="none")
 +
  # or
 +
  MA <- normalizeWithinArrays(RG, method="printtiploess", bc.method="none", weights = RG$weights)
 +
 
 +
apply(MA$M, 2, max, na.rm=TRUE)  # M values potentially effected by loess oversmoothing
 +
apply(MA$A, 2, max, na.rm=TRUE)
 +
</R>
 +
 
 +
= Linear models for microarray analysis =
 +
Linear models for microarray analysis (''Limma'') is a [http://cran.stat.auckland.ac.nz/src/contrib/Descriptions/limma.html R] and [http://www.bioconductor.org/packages/1.8/bioc/html/limma.html Bioconductor] package for organising and analysing cDNA and Affymetrix microarray data. It is written by Gordon Smyth at WEHI. The package encorporates a Bayesian analysis approach which take into account between gene information to improve variance estimates over within gene variance methods such as gene by gene t-tests and ANOVA's.
 +
 
 +
== Overview ==
 
For a ''p'' * ''n'' matrix of expression intensities, ''Limma'' is fitting ''p'' linear models (one for each row/spot). The lmFit function does this by calling functions such as ''lm.series''
 
For a ''p'' * ''n'' matrix of expression intensities, ''Limma'' is fitting ''p'' linear models (one for each row/spot). The lmFit function does this by calling functions such as ''lm.series''
 
which use ''lm.fit'' in ''Package:Stats'' to fit a linear model (see also ?mrlm ?gls.series);
 
which use ''lm.fit'' in ''Package:Stats'' to fit a linear model (see also ?mrlm ?gls.series);
Line 16: Line 28:
  
 
The code to perform an analysis for a simple comparison is;
 
The code to perform an analysis for a simple comparison is;
 
+
<R>
 
  fit <- lmFit(MA, design) # or an exprSet object
 
  fit <- lmFit(MA, design) # or an exprSet object
 
  fit2 <- eBayes(fit)  
 
  fit2 <- eBayes(fit)  
 
  topTable(fit2)  
 
  topTable(fit2)  
 
+
</R>
 
For a reference or loop design, there is an extra step via contrasts of interest;
 
For a reference or loop design, there is an extra step via contrasts of interest;
 
+
<R>
 
  fit <- lmFit(MA, design) # or an exprSet object
 
  fit <- lmFit(MA, design) # or an exprSet object
 
  fit2 <- contrasts.fit(fit, cont.matrix)  
 
  fit2 <- contrasts.fit(fit, cont.matrix)  
 
  fit2 <- eBayes(fit2)  
 
  fit2 <- eBayes(fit2)  
 
  topTable(fit2)  
 
  topTable(fit2)  
 +
</R>
 +
In practise, formulating the linear model design matrix for reference and loop designs is the most difficult process while taking into account biological and technical sources of variation.
  
=Details=
 
The linear model reduces to effectively estimating average M values using a design matrix of discrete values.
 
==Technical correlation==
 
If correlation between rows (spots) is estimated, then the function ''duplicateCorrelation'' is called. This fits a reml model on all genes to estimate a ''rho'' correlation matrix. A ''[[Wikipedia:Fisher transformation|fisher transformation]]'' (identical to ''atanh(x)'') is then applied to the ''rho'' matrix calculating a mean correlation (with trim=0.15 by default), which is then backtransformed to give a ''consensus correlation''. This correlation can be utilised in lmFit by calling ''gls.series'' which fits a generalized least squares model.
 
 
==Weights==
 
If a weight matrix is present in the ''MAList'' object to be analysed, then a weighted linear model will be fitted using ''mrlm'', ''lm.series'', or ''gls.series'' depending on the chosen ''method'' arguement. A weight of 1 incorporates all the information for that spot, a weight of 0 is equivalent to an NA.
 
 
===List elements from lmFit===  
 
===List elements from lmFit===  
<table class=document-code><tr><td>
+
<R>
 
  > names(fit)
 
  > names(fit)
 
  [1] "coefficients"    "rank"            "assign"          "qr"               
 
  [1] "coefficients"    "rank"            "assign"          "qr"               
Line 42: Line 49:
 
  [9] "pivot"            "method"          "design"          "Amean"           
 
  [9] "pivot"            "method"          "design"          "Amean"           
 
  [13] "genes"  
 
  [13] "genes"  
 
+
</R>
 
*coefficients= ''estimated M values''
 
*coefficients= ''estimated M values''
 
*qr = ''qr decomposition''
 
*qr = ''qr decomposition''
Line 49: Line 56:
 
*sigma = row/spot standard deviations (effected by missing values)
 
*sigma = row/spot standard deviations (effected by missing values)
 
*cov.coefficients = covariance matrix
 
*cov.coefficients = covariance matrix
*stdev.unscaled = ''scaling required to caculate se(x_bar)''
+
*stdev.unscaled = ''scaling required to calculate se(x_bar)''
 
*method = ''model fitting method used''
 
*method = ''model fitting method used''
 
*design=''the design matrix X used''
 
*design=''the design matrix X used''
Line 55: Line 62:
  
 
Note: ''The standard errors are given by stdev.unscaled * sigma' see ? lm.series''.
 
Note: ''The standard errors are given by stdev.unscaled * sigma' see ? lm.series''.
</table>
 
 
The addition of a correlation term, and specification of technical or biological blocking will alter ''sigma'', ''stdev.unscaled'', cov.oefficients, and remove ''rank'' and ''qr'' from the list. Different correlations combined with spot weighting will effect coefficient estimates.
 
 
===Complex designs===
 
The design matrix modifies the way the estimated error structure is calculated. Consider small experiment of 10 slides, with two treatments and five slides per treatment. A direct comparison could be used to estimate the within/between gene error structure and differences between treatments. If this experiment was part of a larger experiment containing 20 slides then the information from all the slides would be used to obtain more precise estimates of the within/between gene error structure. The M value differences between treatments would be identical but the error structure slightly different obtining lightly different answers depending on what slides were incorporated in the experiment.
 
 
For specific comparisons of interest ''contrasts.fit'' modifes the ''lmFit'' object, the ''fit$coef'' terms are updated by adding or subtracting appropriate columns identified by ''cont.matrix''. Variance terms for fit$stdev.unscaled are  added, and rexpressed as standard deviations (''see [[Wikipedia:Variance|Variance]]''), and fit$cov.coefficients terms are added together. All other elements remain unchanged.
 
 
Empirical Bayes using conjugate priors are used to calculate posterior probability measures such as moderated t-statistics and B statistics. Prior information for the moderated t-statistics comes from the data, while B statistics also depend on an assumed value of ''p'', the proportion of genes which change.
 
 
The wrapper function ''eBayes'' for ''ebayes'' calculates these statistics for each row (''spot'').
 
  
===List elements from ebayes===  
+
===List elements from eBayes===  
<table class=document-code><tr><td>
+
<R>
 
  > names(ebayes(fit))
 
  > names(ebayes(fit))
 
  [1] "df.prior"  "s2.prior"  "s2.post"  "t"        "p.value"  "var.prior"
 
  [1] "df.prior"  "s2.prior"  "s2.post"  "t"        "p.value"  "var.prior"
 
  [7] "lods"
 
  [7] "lods"
 
+
</R>
*df.prior= d<sub>g</sub>  
+
*df.prior= d<sub>o</sub>  
 
*s2.prior=(s<sub>o</sub>)<sup>2</sup>
 
*s2.prior=(s<sub>o</sub>)<sup>2</sup>
 
*s2.post=(s<sub>g</sub>)<sup>2</sup>
 
*s2.post=(s<sub>g</sub>)<sup>2</sup>
Line 80: Line 75:
 
*p.value = ''moderated t-statistics p values''
 
*p.value = ''moderated t-statistics p values''
 
*lods = ''log odds B statistics''
 
*lods = ''log odds B statistics''
</table>
+
''Note: empty or unwanted spots in analysis can be downweighted by setting their ''sigma'' component as an NA before eBayes. They will however still have a meaningful t, and lods ratio that may still be significant as the s2.post uses information from other gene variances.''
  
====Moderated t-statistics====
+
== Details ==
 +
The linear model reduces to effectively estimating average M values using a design matrix of discrete values.
 +
 
 +
=== Technical correlation ===
 +
If correlation between rows (spots) is estimated, then the function ''duplicateCorrelation'' is called. This fits a reml model on all genes to estimate a ''rho'' correlation matrix. A ''[[Wikipedia:Fisher transformation|fisher transformation]]'' (identical to ''atanh(x)'') is then applied to the ''rho'' matrix calculating a mean correlation (with trim=0.15 by default), which is then backtransformed to give a ''consensus correlation''. This correlation can be utilised in lmFit by calling ''gls.series'' which fits a generalized least squares model.
 +
 
 +
The addition of a correlation term, and specification of technical or biological blocking will alter ''sigma'', ''stdev.unscaled'', cov.oefficients, and remove ''rank'' and ''qr'' from the list. Different correlations combined with spot weighting will effect coefficient estimates.
 +
 
 +
=== Weights ===
 +
If a weight matrix is present in the ''MAList'' object to be analysed, then a weighted linear model will be fitted using ''mrlm'', ''lm.series'', or ''gls.series'' depending on the chosen ''method'' arguement. A weight of 1 incorporates all the information for that spot, a weight of 0 is equivalent to an NA. Changing the weights for poor quality spots will influence the weighted estimates of ''M'', and within gene variation ''sigma'', which will effect downstream calculations of ''t'', ''P.Value'', adj.P.Val, and ''B''.
 +
 
 +
See also [[lmFit-Weights.R]].
 +
 
 +
=== Complex designs ===
 +
The design matrix modifies the way the estimated error structure is calculated. Consider a small experiment of 10 slides, with two treatments and five slides per treatment. A direct comparison could be used to estimate the within/between gene error structure and differences between treatments. If this experiment was part of a larger experiment containing 20 slides then the information from all the slides would be used to obtain more precise estimates of the within/between gene error structure.
 +
A plot of sigma methods obtained from the full model and the sub model should be highly correlated.
 +
The M value differences between treatments would be identical in the sub model, but the error structure slightly different obtaining slightly different answers depending on what slides were incorporated in the experiment.
 +
 
 +
For specific comparisons of interest ''contrasts.fit'' modifes the ''lmFit'' object, the ''fit$coef'' terms are updated by adding or subtracting appropriate columns identified by ''cont.matrix''. Variance terms for fit$stdev.unscaled are  added, and rexpressed as standard deviations (''see [[Wikipedia:Variance|Variance]]''), and fit$cov.coefficients terms are added together. All other elements remain unchanged.
 +
 
 +
Empirical Bayes assuming conjugate priors are used to calculate posterior probability measures such as moderated t-statistics and B statistics. Prior information for the moderated t-statistics comes from the data, while B statistics also depend on an assumed value of ''p'', the proportion of genes which change.
 +
 
 +
The wrapper function ''eBayes'' for ''ebayes'' calculates these statistics for each row (''spot'').
 +
 
 +
==== Moderated t-statistics ====
 
The function ''ebayes'' calculates two statistical measures, moderated ''t-statistics'' and log odds ratio ''B'' statistics. The moderated ''t-statistic'' part calls ''squeezeVar''  which calls ''fitFDist''. This assumes an inverse gamma distribution prior, and gamma distribution likelihood function for the variances (see ?sqeezeVar & Smyth 2002 p11). The prior parameters are estimated from the data (within gene variances). This estimates the scale (''var.prior'') and [[Wikipedia:Degrees of freedom|degrees of freedom]] for the prior distribution. the posterior distribution is empirically defined as:
 
The function ''ebayes'' calculates two statistical measures, moderated ''t-statistics'' and log odds ratio ''B'' statistics. The moderated ''t-statistic'' part calls ''squeezeVar''  which calls ''fitFDist''. This assumes an inverse gamma distribution prior, and gamma distribution likelihood function for the variances (see ?sqeezeVar & Smyth 2002 p11). The prior parameters are estimated from the data (within gene variances). This estimates the scale (''var.prior'') and [[Wikipedia:Degrees of freedom|degrees of freedom]] for the prior distribution. the posterior distribution is empirically defined as:
  
 
  shat<sup>2</sup><sub>g</sub> = (d<sub>o</sub> s<sup>2</sup><sub>o</sub> + d<sub>g</sub> s<sup>2</sup><sub>g</sub>) / (d<sub>o</sub> + d<sub>g</sub>)
 
  shat<sup>2</sup><sub>g</sub> = (d<sub>o</sub> s<sup>2</sup><sub>o</sub> + d<sub>g</sub> s<sup>2</sup><sub>g</sub>) / (d<sub>o</sub> + d<sub>g</sub>)
  
====B statistics====
+
==== B statistics ====
 
the B statistics are empirically calculated assuming p (''Smyth 2002, page 10''), the proportion of genes that change is known:
 
the B statistics are empirically calculated assuming p (''Smyth 2002, page 10''), the proportion of genes that change is known:
  
Line 96: Line 115:
 
  B = log(O<sub>gj</sub>)
 
  B = log(O<sub>gj</sub>)
  
====See also====
+
== See also ==
 
*[[firstPrinciples.R]]
 
*[[firstPrinciples.R]]
 
*[[lmFit-Weights.R]]
 
*[[lmFit-Weights.R]]
 
*[[LmFit-design.R]]
 
*[[LmFit-design.R]]
 +
*[http://www.bioconductor.org/packages/1.9/bioc/vignettes/limma/inst/doc/usersguide.pdf limmaUsersGuide.pdf]
 +
*[http://www.bioconductor.org/packages/1.9/bioc/html/limma.html BioConductor version 1.9 limma stable package]
 +
*[http://cran.stat.auckland.ac.nz/src/contrib/Descriptions/limma.html Limma development version]
 +
*[http://www.organicdesign.co.nz/wiki/index.php?title=Limma_analysis&action=edit&section=12 Comprehensive paper detailing limma analysis]
  
====References====
+
==== References ====
 
*Lonnstedt, I., and Speed, T. P. (2002). Replicated microarray data. Statistica Sinica 12, 31–46.  
 
*Lonnstedt, I., and Speed, T. P. (2002). Replicated microarray data. Statistica Sinica 12, 31–46.  
 
: ''Note: The simulation study for this paper was identical to the underlying parametric assumptions in the B statistic''   
 
: ''Note: The simulation study for this paper was identical to the underlying parametric assumptions in the B statistic''   
 
*Smyth G.K. (2002). Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments
 
*Smyth G.K. (2002). Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments
 +
 +
[[Category:Microarray]]

Latest revision as of 20:56, 21 September 2009

{{#security:edit|Mik Black,Sven,Bjorn}} {{#security:*|Mik Black,Sven,Bjorn}}

Normalization

The limma package offers various standard normalization techniques for cDNA two channel microarray data. The method printtiploess is sensitive to situations where there are many missing observations and spots with zero weights. In this case predicted M values can be greater than the theoretical maximum of &approx; -8 to +8.

This occurs whether you background subtract or not, for the following normalization arguments; <R>

 MA <- normalizeWithinArrays(RG, method="printtiploess", bc.method="none")
 # or
 MA <- normalizeWithinArrays(RG, method="printtiploess", bc.method="none", weights = RG$weights)

apply(MA$M, 2, max, na.rm=TRUE) # M values potentially effected by loess oversmoothing apply(MA$A, 2, max, na.rm=TRUE) </R>

Linear models for microarray analysis

Linear models for microarray analysis (Limma) is a R and Bioconductor package for organising and analysing cDNA and Affymetrix microarray data. It is written by Gordon Smyth at WEHI. The package encorporates a Bayesian analysis approach which take into account between gene information to improve variance estimates over within gene variance methods such as gene by gene t-tests and ANOVA's.

Overview

For a p * n matrix of expression intensities, Limma is fitting p linear models (one for each row/spot). The lmFit function does this by calling functions such as lm.series which use lm.fit in Package:Stats to fit a linear model (see also ?mrlm ?gls.series);

y = Xβ + ε

For cDNA/oligo two spotted technologies the matrix of expression intensities is usually the matrix of M values with respect to treatments. For Affymetrix single channel arrays the expression intensities for each Affymetrix slide are directly analysed comparing two treatments. Moderated t-statistics are effectively similiar in approach (plus the additional between gene variance component) to paired t-statistics for the analysis cDNA/oligo technologies, and unpaired t-statistics for Affymetrix single channel technologies.

The code to perform an analysis for a simple comparison is; <R>

fit <- lmFit(MA, design) # or an exprSet object
fit2 <- eBayes(fit) 
topTable(fit2) 

</R> For a reference or loop design, there is an extra step via contrasts of interest; <R>

fit <- lmFit(MA, design) # or an exprSet object
fit2 <- contrasts.fit(fit, cont.matrix) 
fit2 <- eBayes(fit2) 
topTable(fit2) 

</R> In practise, formulating the linear model design matrix for reference and loop designs is the most difficult process while taking into account biological and technical sources of variation.

List elements from lmFit

<R>

> names(fit)
[1] "coefficients"     "rank"             "assign"           "qr"              
[5] "df.residual"      "sigma"            "cov.coefficients" "stdev.unscaled"  
[9] "pivot"            "method"           "design"           "Amean"           
[13] "genes" 

</R>

  • coefficients= estimated M values
  • qr = qr decomposition
  • df.residual=residual degrees of freedom for each gene
  • Amean = Estimated unweighted A values
  • sigma = row/spot standard deviations (effected by missing values)
  • cov.coefficients = covariance matrix
  • stdev.unscaled = scaling required to calculate se(x_bar)
  • method = model fitting method used
  • design=the design matrix X used
  • genes = list of gene names

Note: The standard errors are given by stdev.unscaled * sigma' see ? lm.series.

List elements from eBayes

<R>

> names(ebayes(fit))
[1] "df.prior"  "s2.prior"  "s2.post"   "t"         "p.value"   "var.prior"
[7] "lods"

</R>

  • df.prior= do
  • s2.prior=(so)2
  • s2.post=(sg)2
  • t=moderated t-statistics
  • p.value = moderated t-statistics p values
  • lods = log odds B statistics

Note: empty or unwanted spots in analysis can be downweighted by setting their sigma component as an NA before eBayes. They will however still have a meaningful t, and lods ratio that may still be significant as the s2.post uses information from other gene variances.

Details

The linear model reduces to effectively estimating average M values using a design matrix of discrete values.

Technical correlation

If correlation between rows (spots) is estimated, then the function duplicateCorrelation is called. This fits a reml model on all genes to estimate a rho correlation matrix. A fisher transformation (identical to atanh(x)) is then applied to the rho matrix calculating a mean correlation (with trim=0.15 by default), which is then backtransformed to give a consensus correlation. This correlation can be utilised in lmFit by calling gls.series which fits a generalized least squares model.

The addition of a correlation term, and specification of technical or biological blocking will alter sigma, stdev.unscaled, cov.oefficients, and remove rank and qr from the list. Different correlations combined with spot weighting will effect coefficient estimates.

Weights

If a weight matrix is present in the MAList object to be analysed, then a weighted linear model will be fitted using mrlm, lm.series, or gls.series depending on the chosen method arguement. A weight of 1 incorporates all the information for that spot, a weight of 0 is equivalent to an NA. Changing the weights for poor quality spots will influence the weighted estimates of M, and within gene variation sigma, which will effect downstream calculations of t, P.Value, adj.P.Val, and B.

See also lmFit-Weights.R.

Complex designs

The design matrix modifies the way the estimated error structure is calculated. Consider a small experiment of 10 slides, with two treatments and five slides per treatment. A direct comparison could be used to estimate the within/between gene error structure and differences between treatments. If this experiment was part of a larger experiment containing 20 slides then the information from all the slides would be used to obtain more precise estimates of the within/between gene error structure. A plot of sigma methods obtained from the full model and the sub model should be highly correlated. The M value differences between treatments would be identical in the sub model, but the error structure slightly different obtaining slightly different answers depending on what slides were incorporated in the experiment.

For specific comparisons of interest contrasts.fit modifes the lmFit object, the fit$coef terms are updated by adding or subtracting appropriate columns identified by cont.matrix. Variance terms for fit$stdev.unscaled are added, and rexpressed as standard deviations (see Variance), and fit$cov.coefficients terms are added together. All other elements remain unchanged.

Empirical Bayes assuming conjugate priors are used to calculate posterior probability measures such as moderated t-statistics and B statistics. Prior information for the moderated t-statistics comes from the data, while B statistics also depend on an assumed value of p, the proportion of genes which change.

The wrapper function eBayes for ebayes calculates these statistics for each row (spot).

Moderated t-statistics

The function ebayes calculates two statistical measures, moderated t-statistics and log odds ratio B statistics. The moderated t-statistic part calls squeezeVar which calls fitFDist. This assumes an inverse gamma distribution prior, and gamma distribution likelihood function for the variances (see ?sqeezeVar & Smyth 2002 p11). The prior parameters are estimated from the data (within gene variances). This estimates the scale (var.prior) and degrees of freedom for the prior distribution. the posterior distribution is empirically defined as:

shat2g = (do s2o + dg s2g) / (do + dg)

B statistics

the B statistics are empirically calculated assuming p (Smyth 2002, page 10), the proportion of genes that change is known:

Ogj  = ( (p) / (1-p) ) * ( vgj/(vgj+voj) )^0.5 * ( (that2gj + to + tg) / (that2 * (vgj/(vgj+voj) ) + do + dg)^(1+do+dg)/2

where

B = log(Ogj)

See also

References

  • Lonnstedt, I., and Speed, T. P. (2002). Replicated microarray data. Statistica Sinica 12, 31–46.
Note: The simulation study for this paper was identical to the underlying parametric assumptions in the B statistic
  • Smyth G.K. (2002). Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments