10  ANOVA

library(tidyverse) # our friend the tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

10.1 Analysis of variance

ANOVA is a method for testing the hypothesis that there is no difference in means of subsets of measurements grouped by factors. Essentially, this is a generalization of linear regression to categorical explanatory variables instead of numeric variables, and it is based on very similar assumptions.

ANOVA performs at its best when we have a particular experimental design: a) we divide the population into groups of equal size (balanced design); b) we assign “treatments” to the subjects at random (randomized design); c) in case of multiple treatment combinations, we perform an experiment for each combination (factorial design); and, in most cases, we have a “null” treatment (e.g., a placebo).

We speak of one-way ANOVA when there is a single axis of variation to our treatment (e.g., no intervention, option A, option B), two-way ANOVA when we apply two treatments to each group (e.g., neither, A only, B only, both A and B), and so forth. Extensions include ANCOVA (ANalysis of COVAriance) and MANOVA (Multivariate ANalysis Of VAriance).

For example, we want to test whether a COVID vaccine protects against infection. We can assign at random a population of volunteers to two classes (vaccine/placebo) and contrast the number of people who got sick within 3 months from treatment in the two classes. Of course, we can simply perform a t-test. But what if we assign people to different classes (e.g., M/F, under/over 65 y/o), and want to contrast the mean infection rate across all classes?

10.1.1 ANOVA assumptions

ANOVA tests whether samples are taken from distributions with the same mean:

  • Null hypothesis: the means of the different groups are the same.
  • Alternative hypothesis: at least one group mean differs from the others.

Let \(Y\) denote the response variable, and consider the simplest case of one-way ANOVA. We have divided the samples into \(k\) classes of sizes \(J_1, \ldots, J_k\) such that \(n=\sum_i J_i\). We write an equation for \(Y_{ij}\) (the \(j\)th observation in group \(i\)):

\[ Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \]

where \(\mu\) is the overall mean; \(\mu + \alpha_i\) is the mean of group \(i\)—we can always choose the parameters such that \(\sum_i \alpha_i = 0\); and, finally, \(\epsilon_{ij}\) is the deviation from the group mean. Practically, we are testing whether at least one of the \(\alpha_i \neq 0\).
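
To make the notation concrete, here is a minimal sketch simulating data from this model (the names mu, alpha, J, and sim, as well as the group effects and sample sizes, are made up for illustration):

set.seed(1)
mu <- 10                      # overall mean
alpha <- c(-1, 0, 1)          # group effects, chosen so that sum(alpha) = 0
J <- c(20, 20, 20)            # group sizes (balanced design)
sim <- tibble(group = factor(rep(seq_along(alpha), times = J)),
              y     = mu + rep(alpha, times = J) + rnorm(sum(J)))
# the group means should be close to mu + alpha_i
sim %>% group_by(group) %>% summarise(mean_y = mean(y))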

Note that we are making the same assumptions as for linear regression:

  • The observations are obtained independently (and randomly) from the population defined by the factor levels (groups)
  • The measurements for each factor level are independent and normally distributed
  • These normal distributions have the same variance
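
In practice, one can eyeball these assumptions before running ANOVA. A minimal sketch using base R’s bartlett.test() (equal variances) and shapiro.test() (normality of the group-centered residuals), applied to the built-in PlantGrowth data used later in this chapter:

# H0 of Bartlett's test: the groups have equal variances
bartlett.test(weight ~ group, data = PlantGrowth)
# H0 of the Shapiro-Wilk test: the group-centered residuals are normal
PlantGrowth %>% 
  group_by(group) %>% 
  mutate(resid = weight - mean(weight)) %>% 
  pull(resid) %>% 
  shapiro.test()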

10.1.2 How one-way ANOVA works

We have \(k\) groups, and define the overall mean \(\bar{y} = \frac{1}{n}\sum_i \sum_j Y_{ij}\), where \(J_i\) is the sample size for group \(i\) and \(n = \sum_i J_i\). Then the total sum of squared deviations (SSD) is simply:

\[ SSD = \sum_{i = 1}^k \sum_{j = 1}^{J_{i}} \left(Y_{ij} - \bar{y}\right)^2 \]

and the associated degrees of freedom are \(n-1\). We can rewrite this as:

\[ SSD = \sum_{i = 1}^k J_{i} \left(\bar{y}_i - \bar{y} \right)^2 + \sum_{i = 1}^k \sum_{j = 1}^{J_{i}} \left(Y_{ij} - \bar{y}_i \right)^2 \]

where now \(\bar{y}_i\) is the mean for the samples in treatment (factor, group) \(i\). We call the first term on the r.h.s. the between-treatment sum of squares (SST) and the second term the within-treatment (or residual) sum of squares (SSE). As such, \(SSD = SST + SSE\). Similarly, we can decompose SSD as:

\[ SSD = \sum_{i = 1}^k \sum_{j = 1}^{J_{i}} Y_{ij}^2 - n \bar{y}^2 = TSS - SSA \] where now TSS is the total sum of squares and SSA is the sum of squares due to the average. Combining the two equations, we can rewrite TSS as the sum of three components:

\[ TSS = SSA + SST + SSE \]

Note that the degrees of freedom associated with each term are \(1\), \(k-1\) and \(n-k\), respectively. What remains to be shown is how to conduct inference.

10.2 Inference in one-way ANOVA

If the null hypothesis were true, then we would expect the between-treatment “variance” SST, divided by its degrees of freedom (\(k-1\)), to be about the same as the within-treatment “variance” SSE, divided by \(n-k\).

Let’s look at this hypothesis more closely. If the null hypothesis were true, then \(SST\) would be the sum of the squares of independent, normally distributed random variables with mean zero and variance \(\sigma^2\). If you remember, the distribution of:

\[ Q = \sum_{i=1}^d Z_i^2 \sim \chi^2(d) \]

is called the \(\chi^2\) distribution with \(d\) degrees of freedom. Then, under the null hypothesis, \(SST/\sigma^2 \sim \chi^2(k-1)\); similarly, \(SSE/\sigma^2 \sim \chi^2(n-k)\). Taking the ratio (having normalized each term by its degrees of freedom, so that the unknown \(\sigma^2\) cancels), we obtain:

\[ \frac{SST}{k-1} \frac{n-k}{SSE} = \frac{MST}{MSE} \sim F(k-1, n-k) \] where \(F\) is the F-distribution (in R, df(x, df1, df2) gives its density and rf(n, df1, df2) draws samples from it).
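
As a quick sanity check of this claim (a simulation sketch, not part of the analysis below, with assumed group sizes), we can repeatedly generate data under the null hypothesis, compute MST/MSE by hand, and compare the empirical quantiles with those of \(F(k-1, n-k)\):

set.seed(42)
k <- 3; J <- 10; n <- k * J                   # assumed: 3 groups of 10
g <- factor(rep(1:k, each = J))
sim_F <- replicate(5000, {
  y <- rnorm(n)                               # under H0, all groups share one mean
  mst <- sum(J * (tapply(y, g, mean) - mean(y))^2) / (k - 1)
  mse <- sum((y - ave(y, g))^2) / (n - k)
  mst / mse
})
round(quantile(sim_F, c(0.5, 0.9, 0.95)), 2)  # empirical quantiles
round(qf(c(0.5, 0.9, 0.95), k - 1, n - k), 2) # theoretical F(k-1, n-k) quantiles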

10.2.1 Example of comparing diets

For example, the following data set contains measurements of the weights of individuals before starting a diet and after 6 weeks of dieting, the type of diet (1, 2, or 3), and other variables.

library(tidyverse)
# Original URL: "https://www.sheffield.ac.uk/polopoly_fs/1.570199!/file/stcp-Rdataset-Diet.csv"
diet <- read_csv("data/Diet.csv") 
diet <- diet %>% mutate(weight.loss = pre.weight - weight6weeks) 
glimpse(diet)
Rows: 78
Columns: 8
$ Person       <dbl> 25, 26, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 27…
$ gender       <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Age          <dbl> 41, 32, 22, 46, 55, 33, 50, 50, 37, 28, 28, 45, 60, 48, 4…
$ Height       <dbl> 171, 174, 159, 192, 170, 171, 170, 201, 174, 176, 165, 16…
$ pre.weight   <dbl> 60, 103, 58, 60, 64, 64, 65, 66, 67, 69, 70, 70, 72, 72, …
$ Diet         <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, …
$ weight6weeks <dbl> 60.0, 103.0, 54.2, 54.0, 63.3, 61.1, 62.2, 64.0, 65.0, 60…
$ weight.loss  <dbl> 0.0, 0.0, 3.8, 6.0, 0.7, 2.9, 2.8, 2.0, 2.0, 8.5, 1.9, 3.…
# make diet into factors
diet <- diet %>% mutate(Diet = factor(Diet))

Write a script using ggplot to generate boxplots of the weight loss for the three different diets:

diet %>% ggplot() + 
  aes(y = weight.loss, 
      x = Diet, 
      fill = Diet) + 
  geom_boxplot()

We can see that the weight loss outcomes vary within each diet, but diet 3 seems to produce a larger effect on average. But is the difference between the means/medians actually due to the diet, or could it have been produced by sampling from the same distribution, given that we see substantial variation within each diet group?
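
Before running ANOVA, here is one informal way to pose that question (a permutation sketch, not the method used below; spread_of_means is an illustrative helper): shuffle the Diet labels many times and check how often the spread between the shuffled group means is at least as large as the observed one.

set.seed(123)
spread_of_means <- function(d) {
  d %>% 
    group_by(Diet) %>% 
    summarise(m = mean(weight.loss), .groups = "drop") %>% 
    summarise(spread = max(m) - min(m)) %>% 
    pull(spread)
}
obs_spread  <- spread_of_means(diet)
# spread of group means after randomly shuffling the Diet labels
perm_spread <- replicate(1000, spread_of_means(diet %>% mutate(Diet = sample(Diet))))
mean(perm_spread >= obs_spread)   # approximate p-value under H0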

Here is the result of running ANOVA on the given data set:

diet_anova  <-  aov(weight.loss ~ Diet, data=diet) # note that this looks like lm!
summary(diet_anova)
            Df Sum Sq Mean Sq F value  Pr(>F)   
Diet         2   71.1   35.55   6.197 0.00323 **
Residuals   75  430.2    5.74                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(diet_anova)
Call:
   aov(formula = weight.loss ~ Diet, data = diet)

Terms:
                    Diet Residuals
Sum of Squares   71.0937  430.1793
Deg. of Freedom        2        75

Residual standard error: 2.394937
Estimated effects may be unbalanced

10.2.2 Comparison of theory and ANOVA output

Let’s compare this with the calculations from the data set:

# 1) compute the overall mean
bar_y <- diet %>% 
  summarise(bar_y = mean(weight.loss)) %>% 
  as.numeric()
# 2) compute means by diet and sample size by diet
bar_y_i <- diet %>% 
  group_by(Diet) %>% 
  summarise(bar_y_i = mean(weight.loss),
            J_i = n())
#(NB: almost balanced!)

# 3) compute degrees of freedom
n <- nrow(diet)
k <- nrow(diet %>% dplyr::select(Diet) %>% distinct())
deg_freedom <- c(1, k - 1, n - k)
# 4) compute SSA, SST and SSE
SSA <- n * bar_y^2
SST <- bar_y_i %>% 
  mutate(tmp = J_i * (bar_y_i - bar_y)^2) %>% 
  summarise(sst = sum(tmp)) %>% 
  as.numeric()
SSE <- diet %>% 
  inner_join(bar_y_i, by = "Diet") %>% 
  mutate(tmp = (weight.loss - bar_y_i)^2) %>% 
  summarise(sse = sum(tmp)) %>% 
  as.numeric()
# 5) show that TSS = SSA + SST + SSE
TSS <- diet %>% 
  summarise(tss = sum(weight.loss^2)) %>% 
  as.numeric()
print(c(TSS, SSA + SST + SSE))
[1] 1654.35 1654.35

Now that we have all the numbers in place, we can compute our F-statistic and the associated p-value:

Fs <- (SST / (deg_freedom[2])) / (SSE / (deg_freedom[3]))
pval <- 1 - pf(Fs, deg_freedom[2], deg_freedom[3])

Contrast these with the output of aov:

print(deg_freedom[-1])
[1]  2 75
print(c(SST, SSE))
[1]  71.09369 430.17926
print(c(Fs, pval))
[1] 6.197447453 0.003229014
print(summary(aov(weight.loss ~ Diet, data = diet)))
            Df Sum Sq Mean Sq F value  Pr(>F)   
Diet         2   71.1   35.55   6.197 0.00323 **
Residuals   75  430.2    5.74                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

At first glance, this process is not the same as fitting parameters for linear regression, but it is based on exactly the same assumptions: additive noise and additive effects of the factors, with the only difference being that factors are not numeric, so the effect of each one is added separately. One can run a linear regression and obtain coefficients that are identical to the mean of the baseline group (the intercept) and the differences between the other group means and that baseline, as computed by ANOVA (and note that the F-statistic and p-value match too!).

diet.lm <- lm(weight.loss ~ Diet, data=diet)
summary(diet.lm)

Call:
lm(formula = weight.loss ~ Diet, data = diet)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1259 -1.3815  0.1759  1.6519  5.7000 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.3000     0.4889   6.750 2.72e-09 ***
Diet2        -0.2741     0.6719  -0.408  0.68449    
Diet3         1.8481     0.6719   2.751  0.00745 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.395 on 75 degrees of freedom
Multiple R-squared:  0.1418,    Adjusted R-squared:  0.1189 
F-statistic: 6.197 on 2 and 75 DF,  p-value: 0.003229
print(diet.lm$coefficients)
(Intercept)       Diet2       Diet3 
  3.3000000  -0.2740741   1.8481481 
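
To see the correspondence explicitly, here is a quick check (not in the original output): the intercept is the mean weight loss for Diet 1, and adding the Diet2 and Diet3 coefficients recovers the other two group means.

# group means computed directly from the data ...
diet %>% group_by(Diet) %>% summarise(group_mean = mean(weight.loss))
# ... and reconstructed from the regression coefficients
coef(diet.lm)[1] + c(Diet1 = 0, coef(diet.lm)[-1])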

10.3 Further steps

10.3.1 Post-hoc analysis

The ANOVA F-test tells us whether there is any difference in the mean of the response variable between the groups, but it does not specify which group(s) differ. For this, a post-hoc test is used (here, Tukey’s “Honest Significant Difference”):

tuk <- TukeyHSD(diet_anova)
tuk
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = weight.loss ~ Diet, data = diet)

$Diet
          diff        lwr      upr     p adj
2-1 -0.2740741 -1.8806155 1.332467 0.9124737
3-1  1.8481481  0.2416067 3.454690 0.0201413
3-2  2.1222222  0.5636481 3.680796 0.0047819

This compares the three pairs of groups and reports, for each pair, the adjusted p-value for the hypothesis that there is no difference in the response variable between the two groups.
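
Base R also provides a plot method for TukeyHSD objects, which is a convenient way to visualize these comparisons: pairs whose family-wise confidence interval does not cross zero are the ones flagged as significant.

# plot the family-wise confidence intervals for all pairwise differences
plot(tuk)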

10.3.2 Example of plant growth data

Example taken from: One-Way ANOVA Test in R

my_data <- PlantGrowth # import built-in data
head(my_data)
  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
4   6.11  ctrl
5   4.50  ctrl
6   4.61  ctrl
# Show the levels
my_data %>% dplyr::select(group) %>% distinct()
  group
1  ctrl
2  trt1
3  trt2
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
# A tibble: 3 × 4
  group count  mean    sd
  <fct> <int> <dbl> <dbl>
1 ctrl     10  5.03 0.583
2 trt1     10  4.66 0.794
3 trt2     10  5.53 0.443
my_data %>% 
  ggplot() + 
  aes(y = weight, x = group, 
      fill = group) + 
  geom_boxplot()

Exercise: perform ANOVA and Tukey’s HSD and interpret the results.

10.3.3 Two-way ANOVA

One can compare the effects of two different factors simultaneously and see whether considering both explains more of the variance than either one alone. This is equivalent to multiple linear regression with two interacting variables. How would you interpret these results?

diet.fisher <- aov(weight.loss ~ Diet * gender, data = diet)
summary(diet.fisher)
            Df Sum Sq Mean Sq F value  Pr(>F)   
Diet         2   60.5  30.264   5.629 0.00541 **
gender       1    0.2   0.169   0.031 0.85991   
Diet:gender  2   33.9  16.952   3.153 0.04884 * 
Residuals   70  376.3   5.376                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2 observations deleted due to missingness
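
As in the one-way case, the same two-way model can be fit as a linear regression with an interaction term; a minimal sketch (diet_lm2 is just an illustrative name):

# equivalent linear model with an interaction between Diet and gender
diet_lm2 <- lm(weight.loss ~ Diet * gender, data = diet)
summary(diet_lm2)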

10.4 Investigate the UC salaries dataset

# read the data
# Original URL
dt <- read_csv("https://raw.githubusercontent.com/dailybruin/uc-salaries/master/data/uc_salaries.csv", 
               col_names = c("first_name", "last_name", "title", "a", "pay", 
                             "loc", "year", "b", "c", "d")) %>% 
  dplyr::select(first_name, last_name, title, loc, pay)
Rows: 175000 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): first_name, last_name, title, loc
dbl (6): a, pay, year, b, c, d

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# get only profs
dt <- dt %>% filter(title %in% c("PROF-AY", "ASSOC PROF-AY", "ASST PROF-AY", 
                                 "PROF-AY-B/E/E", "PROF-HCOMP", "ASST PROF-AY-B/E/E", 
                                 "ASSOC PROF-AY-B/E/E", "ASSOC PROF-HCOMP", "ASST PROF-HCOMP"))
# simplify titles
dt <- dt %>% mutate(title = ifelse(grepl("ASST", title), "Assistant", title))
dt <- dt %>% mutate(title = ifelse(grepl("ASSOC", title), "Associate", title))
dt <- dt %>% mutate(title = ifelse(grepl("PROF", title), "Full", title))
# remove those making less than 50k (probably there only for a period)
dt <- dt %>% filter(pay > 50000)
glimpse(dt)
Rows: 4,795
Columns: 5
$ first_name <chr> "CHRISTOPHER U", "HENRY DON ISAAC", "ADAM R", "KEVORK N.", …
$ last_name  <chr> "ABANI", "ABARBANEL", "ABATE", "ABAZAJIAN", "ABBAS", "ABBAS…
$ title      <chr> "Full", "Full", "Assistant", "Assistant", "Full", "Full", "…
$ loc        <chr> "Riverside", "San Diego", "San Francisco", "Irvine", "Irvin…
$ pay        <dbl> 151200.00, 160450.08, 85305.01, 82400.04, 168699.96, 286823…
  1. Plot the distributions of pay by location and title. Are they approximately normal? If not, transform the data.
  2. Run ANOVA for pay as dependent on the two factors separately; report the variance between means, the variance within groups, and the p-value for the null hypothesis.
  3. Run Tukey’s test for multiple comparison of means to report which group(s) are substantially different from the rest, if any.
  4. Run a two-way ANOVA for both location and title and provide an interpretation.

10.4.1 A word of caution about unbalanced designs

When we have a different number of samples in each category, we might encounter some problems, as the order in which we enter the terms might matter: for example, run aov(pay ~ title + loc) vs. aov(pay ~ loc + title), and see that the sum of squares for the two models differ. In some cases, this might lead to the puzzling results—depending on how we enter the model, we might determine that a treatment has an effect or not. Turns out, there are three different ways to account for the sum-of-squares in ANOVA, all testing slightly different hypotheses. For balanced designs, they all return the same answer, but if you have different sizes, please read here.