10  ANOVA

library(tidyverse) # our friend the tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

10.1 Analysis of variance

ANOVA is a method for testing the hypothesis that there is no difference in means of subsets of measurements grouped by factors. Essentially, this is a generalization of linear regression to categorical explanatory variables instead of numeric variables, and it is based on very similar assumptions.

ANOVA performs at its best when we have a particular experimental design: a) we divide the population into groups of equal size (balanced design); b) we assign “treatments” to the subjects at random (randomized design); c) in case of multiple treatment combinations, we perform an experiment for each combination (factorial design); in most cases, we also include a “null” treatment (e.g., placebo).
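
As a toy illustration of points (a) and (b), here is a minimal sketch (with a made-up pool of 60 volunteers and made-up object names) of a balanced, randomized assignment of three treatments:

set.seed(1) # for reproducibility
# 60 hypothetical subjects, three treatments of equal size, assigned at random
design <- tibble(subject = 1:60,
                 treatment = sample(rep(c("placebo", "A", "B"), each = 20)))
design %>% count(treatment) # 20 subjects per treatment: balanced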

We speak of one-way ANOVA when there is a single axis of variation to our treatment (e.g., no intervention, option A, option B), two-way ANOVA when we apply two treatments to each group (e.g., the four combinations: neither treatment, A only, B only, both A and B), and so forth. Extensions include ANCOVA (ANalysis of COVAriance) and MANOVA (Multivariate ANalysis Of VAriance).

For example, we want to test whether a COVID vaccine protects against infection. We can assign at random a population of volunteers to two classes (vaccine/placebo) and contrast the number of people who got sick within 3 months from treatment in the two classes. Of course, we can simply perform a t-test. But what if we assign people to different classes (e.g., M/F, under/over 65 y/o), and want to contrast the mean infection rate across all classes?

10.1.1 ANOVA assumptions

ANOVA tests whether samples are taken from distributions with the same mean:

  • Null hypothesis: the means of the different groups are the same.
  • Alternative hypothesis: At least one sample mean is not equal to the others.

Let \(Y\) indicate the response variable, and consider the simplest case of one-way ANOVA. We have divided the samples into \(k\) classes of sizes \(J_1, \ldots, J_k\) such that \(n=\sum_i J_i\). We write an equation for \(Y_{ij}\) (the \(j\)th observation in group \(i\)):

\[ Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \]

where \(\mu\) is the overall mean; \(\mu + \alpha_i\) is the mean of group \(i\)—we can always choose the parameters such that \(\sum_i \alpha_i = 0\); and, finally, \(\epsilon_{ij}\) is the deviation from the group mean. Practically, we are testing whether at least one of the \(\alpha_i \neq 0\).
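
To make the notation concrete, here is a minimal simulated example (all numbers and object names are made up): three groups with effects \(\alpha_i\) that sum to zero, plus normally distributed deviations \(\epsilon_{ij}\).

set.seed(10)
mu <- 5                  # overall mean
alpha <- c(-1, 0, 1)     # group effects, chosen so that sum(alpha) = 0
J <- c(30, 30, 30)       # group sizes
sim <- tibble(group = factor(rep(1:3, times = J)),
              y = mu + rep(alpha, times = J) + rnorm(sum(J)))
sim %>% group_by(group) %>% summarise(mean_y = mean(y))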

Note that we are making the same assumptions as for linear regression (a quick way to check them in R is sketched after this list):

  • The observations are obtained independently (and randomly) from the population defined by the factor levels (groups)
  • The measurements for each factor level are independent and normally distributed
  • These normal distributions have the same variance
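
As a minimal sketch of how one might check these assumptions (here using the built-in PlantGrowth data set, which reappears later in this chapter), we can test normality within each group with shapiro.test() and equality of variances with bartlett.test():

# normality of the response within each group
PlantGrowth %>%
  group_by(group) %>%
  summarise(shapiro_p = shapiro.test(weight)$p.value)
# homogeneity of variances across groups
bartlett.test(weight ~ group, data = PlantGrowth)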

10.1.2 How one-way ANOVA works

We have \(k\) groups, and define the overall mean \(\bar{y} = \frac{1}{n}\sum_i \sum_j Y_{ij}\), where \(J_i\) is the sample size for group \(i\) and \(n = \sum_i J_i\). Then the total sum of squared deviations (SSD) is simply:

\[ SSD = \sum_{i = 1}^k \sum_{j = 1}^{J_{i}} \left(Y_{ij} - \bar{y}\right)^2 \]

and the associated degrees of freedom are \(n-1\). We can rewrite this as:

\[ SSD = \sum_{i = 1}^k J_{i} \left(\bar{y}_i - \bar{y} \right)^2 + \sum_{i = 1}^k \sum_{j = 1}^{J_{i}} \left(Y_{ij} - \bar{y}_i \right)^2 \]

where now \(\bar{y}_i\) is the mean for the samples in treatment (factor, group) \(i\). We call the first term on the r.h.s. the between-treatment sum of squares (SST) and the second term the within-treatment (or residual) sum of squares (SSE). As such, \(SSD = SST + SSE\). Similarly, we can decompose SSD as:

\[ SSD = \sum_{i = 1}^k \sum_{j = 1}^{J_{i}} Y_{ij}^2 - n \bar{y}^2 = TSS - SSA \] where now TSS is the total sum of squares and SSA is the sum of squares due to the average. Combining the two equations, we can rewrite TSS as the sum of three components:

\[ TSS = SSA + SST + SSE \]

Note that the degrees of freedom associated with each term are \(1\), \(k-1\), and \(n-k\), respectively. What remains is to show how to conduct inference.

10.2 Inference in one-way ANOVA

If the null hypothesis were true, then we would expect the between-treatment “variance” SST, divided by its degrees of freedom (\(k-1\)), to be about the same as the within-treatment “variance” SSE divided by \(n-k\).

Let’s look at this hypothesis more closely. If the null hypothesis were true, then \(SST\) would be the sum of the squares of independent, normally distributed random variables with mean zero and variance \(\sigma^2\). If you remember, the distribution of:

\[ Q = \sum_{i=1}^d Z_i^2 \sim \chi^2(d) \]

is called the \(\chi^2\) distribution with \(d\) degrees of freedom. Then, under the null hypothesis, \(SST/\sigma^2 \sim \chi^2(k-1)\); similarly, \(SSE/\sigma^2 \sim \chi^2(n-k)\). Taking the ratio (having normalized each by its degrees of freedom, so that the unknown \(\sigma^2\) cancels), we obtain:

\[ \frac{SST/(k-1)}{SSE/(n-k)} = \frac{MST}{MSE} \sim F(k-1, n-k) \] where \(F\) is the F-distribution (in R, the density is given by df(x, df1, df2) and you can sample from this distribution using rf(n, df1, df2)).
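
As a quick sanity check (a minimal simulation, not part of the original analysis), we can generate many data sets under the null hypothesis, compute MST/MSE for each, and verify that the upper quantile of the simulated ratios matches the theoretical F quantile:

set.seed(42)
k <- 3; J <- 20; n <- k * J          # three equal groups under the null hypothesis
f_stats <- replicate(2000, {
  y <- rnorm(n)                      # all observations drawn from the same distribution
  g <- factor(rep(1:k, each = J))
  summary(aov(y ~ g))[[1]]$`F value`[1]
})
# the simulated 95th percentile should be close to the theoretical F quantile
quantile(f_stats, 0.95)
qf(0.95, df1 = k - 1, df2 = n - k)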

10.2.1 Example of comparing diets

For example, the following data contains measurements of weights of individuals before starting a diet, after 6 weeks of dieting, the type of diet (1, 2, 3), and other variables.

library(tidyverse)
# Original URL: "https://www.sheffield.ac.uk/polopoly_fs/1.570199!/file/stcp-Rdataset-Diet.csv"
diet <- read_csv("data/Diet.csv") 
diet <- diet %>% mutate(weight.loss = pre.weight - weight6weeks) 
glimpse(diet)
Rows: 78
Columns: 8
$ Person       <dbl> 25, 26, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 27…
$ gender       <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Age          <dbl> 41, 32, 22, 46, 55, 33, 50, 50, 37, 28, 28, 45, 60, 48, 4…
$ Height       <dbl> 171, 174, 159, 192, 170, 171, 170, 201, 174, 176, 165, 16…
$ pre.weight   <dbl> 60, 103, 58, 60, 64, 64, 65, 66, 67, 69, 70, 70, 72, 72, …
$ Diet         <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, …
$ weight6weeks <dbl> 60.0, 103.0, 54.2, 54.0, 63.3, 61.1, 62.2, 64.0, 65.0, 60…
$ weight.loss  <dbl> 0.0, 0.0, 3.8, 6.0, 0.7, 2.9, 2.8, 2.0, 2.0, 8.5, 1.9, 3.…
# make diet into factors
diet <- diet %>% mutate(Diet = factor(Diet))

Write a script below using ggplot to generate boxplots of the weight loss for the three different diets.

diet %>% ggplot() + 
  aes(y = weight.loss, 
      x = Diet, 
      fill = Diet) + 
  geom_boxplot()

We can see that the weight loss outcomes vary within each diet, but diet 3 seems to produce a larger effect on average. But is the difference between the means/medians actually due to the diet, or could it have been produced by sampling from the same distribution, given that we see substantial variation within each diet group?

Here is the result of running ANOVA on the given data set:

diet_anova  <-  aov(weight.loss ~ Diet, data=diet) # note that this looks like lm!
summary(diet_anova)
            Df Sum Sq Mean Sq F value  Pr(>F)   
Diet         2   71.1   35.55   6.197 0.00323 **
Residuals   75  430.2    5.74                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(diet_anova)
Call:
   aov(formula = weight.loss ~ Diet, data = diet)

Terms:
                    Diet Residuals
Sum of Squares   71.0937  430.1793
Deg. of Freedom        2        75

Residual standard error: 2.394937
Estimated effects may be unbalanced

10.2.2 Comparison of theory and ANOVA output

Let’s compare this with the calculations from the data set:

# 1) compute the overall mean
bar_y <- diet %>% 
  summarise(bar_y = mean(weight.loss)) %>% 
  as.numeric()
# 2) compute means by diet and sample size by diet
bar_y_i <- diet %>% 
  group_by(Diet) %>% 
  summarise(bar_y_i = mean(weight.loss),
            J_i = n())
#(NB: almost balanced!)

# 3) compute degrees of freedom
n <- nrow(diet)
k <- nrow(diet %>% dplyr::select(Diet) %>% distinct())
deg_freedom <- c(1, k - 1, n - k)
# 4) compute SSA, SST and SSE
SSA <- n * bar_y^2
SST <- bar_y_i %>% 
  mutate(tmp = J_i * (bar_y_i - bar_y)^2) %>% 
  summarise(sst = sum(tmp)) %>% 
  as.numeric()
SSE <- diet %>% 
  inner_join(bar_y_i, by = "Diet") %>% 
  mutate(tmp = (weight.loss - bar_y_i)^2) %>% 
  summarise(sse = sum(tmp)) %>% 
  as.numeric()
# 5) show that TSS = SSA + SST + SSE
TSS <- diet %>% 
  summarise(tss = sum(weight.loss^2)) %>% 
  as.numeric()
print(c(TSS, SSA + SST + SSE))
[1] 1654.35 1654.35

Now that we have all the numbers in place, we can compute our F-statistic and the associated p-value:

Fs <- (SST / (deg_freedom[2])) / (SSE / (deg_freedom[3]))
pval <- 1 - pf(Fs, deg_freedom[2], deg_freedom[3])

Contrast these with the output of aov:

print(deg_freedom[-1])
[1]  2 75
print(c(SST, SSE))
[1]  71.09369 430.17926
print(c(Fs, pval))
[1] 6.197447453 0.003229014
print(summary(aov(weight.loss ~ Diet, data = diet)))
            Df Sum Sq Mean Sq F value  Pr(>F)   
Diet         2   71.1   35.55   6.197 0.00323 **
Residuals   75  430.2    5.74                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

At first glance, this process is not the same as fitting parameters for linear regression, but it is based on exactly the same assumptions: additive noise and additive effects of the factors, the only difference being that the factors are not numeric, so the effect of each level is added separately. One can run a linear regression and obtain coefficients that are identical to the reference-group mean and the differences between the group means computed by ANOVA (and note the p-values too!).

diet.lm <- lm(weight.loss ~ Diet, data=diet)
summary(diet.lm)

Call:
lm(formula = weight.loss ~ Diet, data = diet)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1259 -1.3815  0.1759  1.6519  5.7000 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.3000     0.4889   6.750 2.72e-09 ***
Diet2        -0.2741     0.6719  -0.408  0.68449    
Diet3         1.8481     0.6719   2.751  0.00745 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.395 on 75 degrees of freedom
Multiple R-squared:  0.1418,    Adjusted R-squared:  0.1189 
F-statistic: 6.197 on 2 and 75 DF,  p-value: 0.003229
print(diet.lm$coefficients)
(Intercept)       Diet2       Diet3 
  3.3000000  -0.2740741   1.8481481 

10.3 Further steps

10.3.1 Post-hoc analysis

The ANOVA F-test tells us whether there is any difference in the mean of the response variable between the groups, but does not specify which group(s) differ. For this, a post-hoc test is used (Tukey’s “Honest Significant Difference”):

tuk <- TukeyHSD(diet_anova)
tuk
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = weight.loss ~ Diet, data = diet)

$Diet
          diff        lwr      upr     p adj
2-1 -0.2740741 -1.8806155 1.332467 0.9124737
3-1  1.8481481  0.2416067 3.454690 0.0201413
3-2  2.1222222  0.5636481 3.680796 0.0047819

This compares the three pairs of groups and, for each pair, reports the adjusted p-value for the hypothesis that the pair has no difference in the response variable.
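
The object returned by TukeyHSD() can also be plotted directly; the base-R plot method draws the family-wise confidence interval for each pairwise difference, and intervals that do not cross zero correspond to significant pairs:

# 95% family-wise confidence intervals for each pairwise difference;
# intervals that do not include zero indicate a significant difference
plot(tuk, las = 1)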

10.3.2 Example of plant growth data

Example taken from: One-Way ANOVA Test in R

my_data <- PlantGrowth # import built-in data
head(my_data)
  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
4   6.11  ctrl
5   4.50  ctrl
6   4.61  ctrl
# Show the levels
my_data %>% dplyr::select(group) %>% distinct()
  group
1  ctrl
2  trt1
3  trt2
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
# A tibble: 3 × 4
  group count  mean    sd
  <fct> <int> <dbl> <dbl>
1 ctrl     10  5.03 0.583
2 trt1     10  4.66 0.794
3 trt2     10  5.53 0.443
my_data %>% 
  ggplot() + 
  aes(y = weight, x = group, 
      fill = group) + 
  geom_boxplot()

Exercise: perform ANOVA and Tukey’s HSD and interpret the results.

10.3.3 Two-way ANOVA

One can compare the effects of two different factors simultaneously and see whether considering both explains more of the variance than either one alone. This is equivalent to multiple linear regression with two interacting variables. How would you interpret these results?

diet.fisher <- aov(weight.loss ~ Diet * gender, data = diet)
summary(diet.fisher)
            Df Sum Sq Mean Sq F value  Pr(>F)   
Diet         2   60.5  30.264   5.629 0.00541 **
gender       1    0.2   0.169   0.031 0.85991   
Diet:gender  2   33.9  16.952   3.153 0.04884 * 
Residuals   70  376.3   5.376                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2 observations deleted due to missingness
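
As with the one-way case, the same two-way model can be fit as a regression. This is a minimal sketch (the object name diet_lm2 is ours), with gender converted to a factor so that it is treated as a grouping variable rather than a number:

# the regression view of the same two-way model; coefficients describe group
# means relative to the reference level (Diet 1, gender 0)
diet_lm2 <- lm(weight.loss ~ Diet * factor(gender), data = diet)
summary(diet_lm2)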

10.4 Investigate the UC salaries dataset

# read the data
# Original URL
dt <- read_csv("https://raw.githubusercontent.com/dailybruin/uc-salaries/master/data/uc_salaries.csv", 
col_names = c("first_name", "last_name", "title", "a", "pay", "loc", "year", "b", "c", "d")) %>%  dplyr::select(first_name, last_name, title, loc, pay)
Rows: 175000 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): first_name, last_name, title, loc
dbl (6): a, pay, year, b, c, d

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# get only profs
dt <- dt %>% filter(title %in% c("PROF-AY", "ASSOC PROF-AY", "ASST PROF-AY", 
                                 "PROF-AY-B/E/E", "PROF-HCOMP", "ASST PROF-AY-B/E/E", 
                                 "ASSOC PROF-AY-B/E/E", "ASSOC PROF-HCOMP", "ASST PROF-HCOMP"))
# simplify titles
dt <- dt %>% mutate(title = ifelse(grepl("ASST", title), "Assistant", title))
dt <- dt %>% mutate(title = ifelse(grepl("ASSOC", title), "Associate", title))
dt <- dt %>% mutate(title = ifelse(grepl("PROF", title), "Full", title))
# remove those making less than 50k (probably there only for a period)
dt <- dt %>% filter(pay > 50000)
glimpse(dt)
Rows: 4,795
Columns: 5
$ first_name <chr> "CHRISTOPHER U", "HENRY DON ISAAC", "ADAM R", "KEVORK N.", …
$ last_name  <chr> "ABANI", "ABARBANEL", "ABATE", "ABAZAJIAN", "ABBAS", "ABBAS…
$ title      <chr> "Full", "Full", "Assistant", "Assistant", "Full", "Full", "…
$ loc        <chr> "Riverside", "San Diego", "San Francisco", "Irvine", "Irvin…
$ pay        <dbl> 151200.00, 160450.08, 85305.01, 82400.04, 168699.96, 286823…
  1. Plot the distributions of pay by location and title. Are they approximately normal? If not, transform the data.
  2. Run ANOVA for pay as dependent on each of the two factors separately; report the variance between group means, the variance within groups, and the p-value for the null hypothesis.
  3. Run Tukey’s test for multiple comparison of means to report which group(s), if any, are substantially different from the rest.
  4. Run a two-way ANOVA for both location and title and interpret the results.

10.4.1 A word of caution about unbalanced designs

When we have a different number of samples in each category, we might encounter some problems, as the order in which we enter the terms might matter: for example, run aov(pay ~ title + loc) vs. aov(pay ~ loc + title), and see that the sums of squares for the two models differ. In some cases this might lead to puzzling results: depending on how we enter the terms into the model, we might conclude that a treatment has an effect or not. It turns out that there are three different ways to account for the sums of squares in ANOVA, all testing slightly different hypotheses. For balanced designs they all return the same answer, but if you have groups of different sizes, please read here.
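
A minimal sketch of the comparison suggested above, assuming the dt data frame from the previous section is still in memory:

# with an unbalanced design, the sequential (Type I) sums of squares depend on
# the order in which the factors enter the model
summary(aov(pay ~ title + loc, data = dt))
summary(aov(pay ~ loc + title, data = dt))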