Two-way analysis of variance

{{short description|Statistical test examining influence of two categorical variables on one continuous variable}}

In statistics, the two-way analysis of variance (ANOVA) is an extension of the one-way ANOVA that examines the influence of two different categorical independent variables on one continuous dependent variable. The two-way ANOVA aims to assess not only the main effect of each independent variable but also whether there is any interaction between them.

History

In 1925, Ronald Fisher mentioned the two-way ANOVA in his celebrated book, Statistical Methods for Research Workers (chapters 7 and 8). In 1934, Frank Yates published procedures for the unbalanced case.{{cite journal |last=Yates |first=Frank |date=March 1934 |title=The analysis of multiple classifications with unequal numbers in the different classes |jstor=2278459 |journal=Journal of the American Statistical Association |volume=29 |issue=185 |pages=51–66 |doi=10.1080/01621459.1934.10502686}} Since then, an extensive literature has been produced. The topic was reviewed in 1993 by Yasunori Fujikoshi.{{cite journal |last=Fujikoshi |first=Yasunori |date=1993 |title=Two-way ANOVA models with unbalanced data |journal=Discrete Mathematics |volume=116 |issue=1 |pages=315–334 |doi=10.1016/0012-365X(93)90410-U |doi-access=free }} In 2005, Andrew Gelman proposed a different approach to ANOVA, viewing it as a multilevel model.{{cite journal |last=Gelman |first=Andrew |date=February 2005 |title=Analysis of variance? why it is more important than ever |journal=The Annals of Statistics |volume=33 |issue=1 |pages=1–53 | arxiv=math/0504499|doi=10.1214/009053604000001048 |s2cid=125025956 }}

Data set

Let us imagine a data set for which a dependent variable may be influenced by two factors which are potential sources of variation. The first factor has I levels {{nowrap|(i \in \{1,\ldots,I\})}} and the second has J levels {{nowrap|(j \in \{1,\ldots,J\})}}. Each combination (i,j) defines a treatment, for a total of I \times J treatments. We represent the number of replicates for treatment (i,j) by n_{ij}, and let k be the index of the replicate in this treatment {{nowrap|(k \in \{1,\ldots,n_{ij}\})}}.

From these data, we can build a contingency table, where n_{i+} = \sum_{j=1}^J n_{ij} and n_{+j} = \sum_{i=1}^I n_{ij}, and the total number of replicates is equal to n = \sum_{i,j} n_{ij} = \sum_i n_{i+} = \sum_j n_{+j}.

The experimental design is balanced if each treatment has the same number of replicates, K. In such a case, the design is also said to be orthogonal, which allows the effects of both factors to be fully distinguished. We can hence write \forall i,j \; n_{ij} = K, and \forall i,j \; n_{ij} = \frac{n_{i+} \cdot n_{+j}}{n}.
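The bookkeeping above can be sketched in a few lines of Python. This is only an illustration: the function names `is_balanced` and `is_proportional` are hypothetical, and the replicate counts are those of the Example section further down.

```python
# Replicate counts n_ij: rows are the I = 3 levels of the first factor,
# columns the J = 2 levels of the second (counts taken from the Example section).
counts = [[3, 2],
          [2, 3],
          [3, 2]]

row_totals = [sum(row) for row in counts]        # n_{i+}
col_totals = [sum(col) for col in zip(*counts)]  # n_{+j}
n = sum(row_totals)                              # total number of replicates


def is_balanced(n_ij):
    """True when every treatment has the same number of replicates K."""
    first = n_ij[0][0]
    return all(cell == first for row in n_ij for cell in row)


def is_proportional(n_ij):
    """True when n_ij * n == n_{i+} * n_{+j} in every cell.

    This is the identity n_ij = n_{i+} * n_{+j} / n, kept in integer form.
    """
    rows = [sum(row) for row in n_ij]
    cols = [sum(col) for col in zip(*n_ij)]
    total = sum(rows)
    return all(n_ij[i][j] * total == rows[i] * cols[j]
               for i in range(len(rows)) for j in range(len(cols)))


print(row_totals, col_totals, n)
print(is_balanced(counts), is_proportional(counts))
```

With these counts the margins are [5, 5, 5] and [8, 7], the total is 15, and both checks print False, so that design is neither balanced nor orthogonal.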

Model

Upon observing variation among all n data points, for instance via a histogram, "probability may be used to describe such variation".{{cite journal |last=Kass |first=Robert E |date=1 February 2011 |title=Statistical inference: The big picture |journal=Statistical Science |volume=26 |issue=1 |pages=1–9 |doi=10.1214/10-sts337|pmid=21841892 |pmc=3153074 |arxiv=1106.2895 }} Let us hence denote by Y_{ijk} the random variable whose observed value y_{ijk} is the k-th measurement for treatment (i,j). The two-way ANOVA models all these variables as varying independently and normally around a mean, \mu_{ij}, with a constant variance, \sigma^2 (homoscedasticity):

Y_{ijk} \, | \, \mu_{ij}, \sigma^2 \; \overset{\mathrm{i.i.d.}}{\sim} \; \mathcal{N}(\mu_{ij}, \sigma^2).

Specifically, the mean of the response variable is modeled as a linear combination of the explanatory variables:

\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij},

where \mu is the grand mean, \alpha_i is the additive main effect of level i of the first factor (i-th row in the contingency table), \beta_j is the additive main effect of level j of the second factor (j-th column in the contingency table) and \gamma_{ij} is the non-additive interaction effect of treatment (i,j), shared by all replicates k=1,\ldots,n_{ij} of that treatment (cell at row i and column j in the contingency table).

An equivalent way of describing the two-way ANOVA is to note that, besides the variation explained by the factors, there remains some statistical noise. This unexplained variation is handled by introducing one random variable per data point, \epsilon_{ijk}, called the error. These n random variables are seen as deviations from the means, and are assumed to be independent and normally distributed:

Y_{ijk} = \mu_{ij} + \epsilon_{ijk} \text{ with } \epsilon_{ijk} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2).
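To make the generative reading of the model concrete, the following sketch draws one data set from it using Python's standard library. The function name `simulate_two_way` and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def simulate_two_way(mu, alpha, beta, gamma, sigma, counts, seed=0):
    """Draw y_ijk = mu + alpha_i + beta_j + gamma_ij + eps_ijk, eps_ijk ~ N(0, sigma^2).

    Returns a dict mapping (i, j) to the list of simulated replicates.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    data = {}
    for i, a in enumerate(alpha):
        for j, b in enumerate(beta):
            mean_ij = mu + a + b + gamma[i][j]  # cell mean mu_ij
            data[(i, j)] = [rng.gauss(mean_ij, sigma)
                            for _ in range(counts[i][j])]
    return data


# Illustrative (made-up) parameter values for I = 3 and J = 2.
mu = 10.0
alpha = [-1.0, 0.0, 1.0]
beta = [-0.5, 0.5]
gamma = [[0.2, -0.2], [0.0, 0.0], [-0.2, 0.2]]
counts = [[3, 2], [2, 3], [3, 2]]

sample = simulate_two_way(mu, alpha, beta, gamma, sigma=2.0, counts=counts)
for cell, values in sorted(sample.items()):
    print(cell, [round(v, 2) for v in values])
```

Every replicate in a cell shares the same mean \mu_{ij} and the same standard deviation, which is how the code encodes homoscedasticity.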

Assumptions

Following Gelman and Hill, the assumptions of the ANOVA, and more generally the general linear model, are, in decreasing order of importance:{{cite book |last1=Gelman |first1=Andrew |last2=Hill |first2=Jennifer|author2-link=Jennifer Hill |date=18 December 2006 |title= Data Analysis Using Regression and Multilevel/Hierarchical Models |url=http://www.cambridge.org/us/academic/subjects/statistics-probability/statistical-theory-and-methods/data-analysis-using-regression-and-multilevelhierarchical-models |publisher=Cambridge University Press |pages=45–46 |isbn=978-0521867061 }}

  1. the data points are relevant with respect to the scientific question under investigation;
  2. the mean of the response variable is influenced additively (if there is no interaction term) and linearly by the factors;
  3. the errors are independent;
  4. the errors have the same variance;
  5. the errors are normally distributed.

Parameter estimation

To ensure identifiability of parameters, we can add the following "sum-to-zero" constraints:

\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} =\sum_j \gamma_{ij}= 0
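Under these constraints the parameters are determined by the cell means \mu_{ij}, so one way to estimate them is to plug in the observed cell means. The sketch below does this with unweighted averages of cell means. It assumes every cell contains at least one observation; the function name `estimate_effects` and the small data set are hypothetical.

```python
def estimate_effects(cells):
    """Decompose observed cell means into mu, alpha_i, beta_j and gamma_ij.

    `cells[i][j]` is the list of replicates for treatment (i, j).
    Row and column means are unweighted averages of cell means, so the
    returned effects satisfy the sum-to-zero constraints by construction.
    """
    I, J = len(cells), len(cells[0])
    cell_mean = [[sum(c) / len(c) for c in row] for row in cells]
    row_mean = [sum(cell_mean[i]) / J for i in range(I)]
    col_mean = [sum(cell_mean[i][j] for i in range(I)) / I for j in range(J)]
    mu = sum(row_mean) / I
    alpha = [r - mu for r in row_mean]
    beta = [c - mu for c in col_mean]
    gamma = [[cell_mean[i][j] - row_mean[i] - col_mean[j] + mu
              for j in range(J)] for i in range(I)]
    return mu, alpha, beta, gamma


# Made-up balanced 2 x 2 layout with K = 2 replicates per treatment.
cells = [[[1, 3], [5, 7]],
         [[2, 4], [10, 12]]]
mu_hat, alpha_hat, beta_hat, gamma_hat = estimate_effects(cells)
print(mu_hat, alpha_hat, beta_hat, gamma_hat)
# prints: 5.5 [-1.5, 1.5] [-3.0, 3.0] [[1.0, -1.0], [-1.0, 1.0]]
```

In a balanced design these estimates coincide with the familiar differences between marginal sample means and the grand sample mean.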

Hypothesis testing

In the classical approach, the null hypotheses (that the factors have no effect) are tested for significance, which requires calculating sums of squares.

Testing whether the interaction term is significant can be difficult because of the potentially large number of degrees of freedom.{{cite journal |author=Yi-An Ko|date=September 2013 |title=Novel Likelihood Ratio Tests for Screening Gene-Gene and Gene-Environment Interactions with Unbalanced Repeated-Measures Data |journal=Genetic Epidemiology |volume=37 |issue=6 |pages=581–591 |doi=10.1002/gepi.21744 |pmid=23798480 |display-authors=etal|pmc=4009698}}
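For reference, the sums of squares take a simple closed form in the balanced case (n_{ij} = K for all i,j) with fixed effects. The dot-and-bar notation for sample means is an assumption introduced here: \bar{y}_{ij\cdot} is the mean of cell (i,j), \bar{y}_{i\cdot\cdot} and \bar{y}_{\cdot j\cdot} are the marginal means, and \bar{y}_{\cdot\cdot\cdot} is the overall mean. A and B stand for the first and second factor.

```latex
\begin{aligned}
SS_A    &= JK \sum_{i} \left(\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot}\right)^2,
        & \mathrm{df}_A &= I - 1, \\
SS_B    &= IK \sum_{j} \left(\bar{y}_{\cdot j\cdot} - \bar{y}_{\cdot\cdot\cdot}\right)^2,
        & \mathrm{df}_B &= J - 1, \\
SS_{AB} &= K \sum_{i,j} \left(\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot}
           - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot}\right)^2,
        & \mathrm{df}_{AB} &= (I - 1)(J - 1), \\
SS_E    &= \sum_{i,j,k} \left(y_{ijk} - \bar{y}_{ij\cdot}\right)^2,
        & \mathrm{df}_E &= IJ(K - 1), \\
F_A     &= \frac{SS_A / \mathrm{df}_A}{SS_E / \mathrm{df}_E}.
\end{aligned}
```

Under the null hypothesis that all \alpha_i are zero, and given the model assumptions, F_A follows an F distribution with (I-1, IJ(K-1)) degrees of freedom. The ratios for the second factor and for the interaction are built in the same way from SS_B and SS_{AB}.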

Example

The following hypothetical example gives the yields of 15 plants subject to two different environmental variations and three different fertilisers.

{| class="wikitable"
!
! Extra CO2
! Extra humidity
|-
! No fertiliser
| 7, 2, 1
| 7, 6
|-
! Nitrate
| 11, 6
| 10, 7, 3
|-
! Phosphate
| 5, 3, 4
| 11, 4
|}

Five sums of squares are calculated:

{| class="wikitable"
! Factor
! Calculation
! Sum
! N
|-
! Individual
| 7^2+2^2+1^2 + 7^2+6^2 + 11^2+6^2 + 10^2+7^2+3^2 + 5^2+3^2+4^2 + 11^2+4^2
| 641
| 15
|-
! Fertiliser × Environment
| \frac{(7+2+1)^2}{3} + \frac{(7+6)^2}{2} + \frac{(11+6)^2}{2} + \frac{(10+7+3)^2}{3} + \frac{(5+3+4)^2}{3} + \frac{(11+4)^2}{2}
| 556.1667
| 6
|-
! Fertiliser
| \frac{(7+2+1+7+6)^2}{5} + \frac{(11+6+10+7+3)^2}{5} + \frac{(5+3+4+11+4)^2}{5}
| 525.4
| 3
|-
! Environment
| \frac{(7+2+1+11+6+5+3+4)^2}{8} + \frac{(7+6+10+7+3+11+4)^2}{7}
| 519.2679
| 2
|-
! Composite
| \frac{(7+2+1+11+6+5+3+4+7+6+10+7+3+11+4)^2}{15}
| 504.6
| 1
|}
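The five sums can be reproduced with a short script. The sketch below is an illustration only: the helper `grouped_sum` and the dictionary keys are hypothetical names, while the yields are those of the example.

```python
# Yields from the example, keyed by (fertiliser, environment).
yields = {
    ("none", "co2"): [7, 2, 1],
    ("none", "humidity"): [7, 6],
    ("nitrate", "co2"): [11, 6],
    ("nitrate", "humidity"): [10, 7, 3],
    ("phosphate", "co2"): [5, 3, 4],
    ("phosphate", "humidity"): [11, 4],
}


def grouped_sum(data, key):
    """Sum over groups of (group total)^2 / (group size).

    Observations are pooled into groups according to key(cell).
    """
    groups = {}
    for cell, values in data.items():
        groups.setdefault(key(cell), []).extend(values)
    return sum(sum(v) ** 2 / len(v) for v in groups.values())


individual = sum(y ** 2 for values in yields.values() for y in values)
cells = grouped_sum(yields, key=lambda cell: cell)           # fertiliser x environment
fertiliser = grouped_sum(yields, key=lambda cell: cell[0])   # pooled over environments
environment = grouped_sum(yields, key=lambda cell: cell[1])  # pooled over fertilisers
composite = grouped_sum(yields, key=lambda cell: None)       # all 15 plants pooled

for name, value in [("Individual", individual), ("Fertiliser x Environment", cells),
                    ("Fertiliser", fertiliser), ("Environment", environment),
                    ("Composite", composite)]:
    print(f"{name}: {value:.4f}")
```

Each row of the table corresponds to a different grouping of the same 15 observations, from no pooling at all (Individual) to complete pooling (Composite).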

Finally, the sums of squared deviations required for the analysis of variance can be calculated.{{cite book|last=Mecklin|first=Christopher|title=STA 265 Notes (Methods of Statistics and Data Science)|date=20 October 2020|access-date=6 December 2024|chapter-url=https://bookdown.org/cmecklin/sta265notes/anova-with-interaction.html|chapter=Chapter 7: ANOVA with Interaction|via=bookdown.org}}

{| class="wikitable"
! Factor
! Sum
! N
! Total
! Environment
! Fertiliser
! Fertiliser × Environment
! Residual
|-
! Individual
| 641
| 15
| 1
|
|
|
| 1
|-
! Fertiliser × Environment
| 556.1667
| 6
|
|
|
| 1
| −1
|-
! Fertiliser
| 525.4
| 3
|
|
| 1
| −1
|
|-
! Environment
| 519.2679
| 2
|
| 1
|
| −1
|
|-
! Composite (correction factor{{cite book|chapter-url=https://iastate.pressbooks.pub/quantitativeplantbreeding/chapter/the-analysis-of-variance-anova/|title=Quantitative Methods for Plant Breeding|chapter=Chapter 8: The Analysis of Variance (ANOVA)|last1=Moore|first1=Ken|last2=Mowers|first2=Ron|last3=Harbur|first3=M.L.|last4=Merrick|first4=Laura|last5=Mahama|first5=Anthony Assibi|publisher=Iowa State University Digital Press|editor-last1=Suza|editor-first1=W.P.|editor-last2=Lamkey|editor-first2=K.R.|year=2023|access-date=6 December 2024}})
| 504.6
| 1
| −1
| −1
| −1
| 1
|
|-
|
|
|
|
|
|
|
|
|-
! Squared deviations (\sigma^2)
|
|
| 136.4
| 14.668
| 20.8
| 16.099
| 84.833
|-
! Degrees of freedom
|
|
| 14
| 1
| 2
| 2
| 9
|-
! Mean square variance
|
|
|
| 14.668
| 10.4
| 8.0495
| 9.426
|}
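The ±1 entries in the table act as coefficients: each column combines the five sums, and their N values, with the signs shown. The sketch below applies those coefficients to the tabulated figures. The dictionary names and the `combine` helper are hypothetical, and "cells" stands for the Fertiliser × Environment row.

```python
# The five sums and their N values, copied from the table.
sums = {"individual": 641.0, "cells": 556.1667, "fertiliser": 525.4,
        "environment": 519.2679, "composite": 504.6}
terms = {"individual": 15, "cells": 6, "fertiliser": 3,
         "environment": 2, "composite": 1}

# One entry per column of the table: the +1 / -1 coefficients of each row.
coefficients = {
    "total":       {"individual": 1, "composite": -1},
    "environment": {"environment": 1, "composite": -1},
    "fertiliser":  {"fertiliser": 1, "composite": -1},
    "interaction": {"cells": 1, "fertiliser": -1, "environment": -1, "composite": 1},
    "residual":    {"individual": 1, "cells": -1},
}


def combine(values, weights):
    """Signed sum of the selected rows."""
    return sum(w * values[name] for name, w in weights.items())


squared_deviations = {k: combine(sums, w) for k, w in coefficients.items()}
degrees_of_freedom = {k: combine(terms, w) for k, w in coefficients.items()}
mean_square = {k: squared_deviations[k] / degrees_of_freedom[k]
               for k in coefficients if k != "total"}

for k in coefficients:
    print(k, round(squared_deviations[k], 3), degrees_of_freedom[k],
          round(mean_square[k], 4) if k in mean_square else "")
```

The same coefficients applied to the N column give the degrees of freedom, which is why the two bottom rows of the table can be read off in parallel.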

Notes

{{Reflist}}

References

  • {{cite book |author=George Casella |date=18 April 2008 |title=Statistical design |url=https://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-75964-7 |publisher=Springer |isbn=978-0-387-75965-4 |series=Springer Texts in Statistics |author-link=George Casella }}
