In this note, we introduce several commands in R that can be used to perform ANOVA for comparing group means. It should be pointed out that there are many options in these commands that can be used to do more sophisticated analyses, but we will only cover topics that have been introduced in Stat 200.

Two-Sample t-Test

In Spring 2014, 858 Stat 100 students responded to a survey in which one of the questions was:
On a scale of 0–10 rate where you fall: “0” means you strongly favor gay marriage… and “10” means you strongly oppose gay marriage.

We want to know if there is any difference between males' and females' attitudes toward gay marriage.

First, let’s load the data into R. The csv file has been uploaded to our website here. We can load it directly into R using the command

survey <- read.csv("http://courses.atlas.illinois.edu/spring2016/STAT/STAT200/RProgramming/data/Stat100_2014spring_survey02.csv")

The description of the data can be found on this webpage. The response on the gay marriage question is in the column labelled ‘gayMarriage’ and the gender information is in the column ‘gender’. We can plot histograms for males and females to see if there is any obvious difference:

library(lattice)
histogram(~gayMarriage | gender, data=survey, layout=c(1,2))

Here we use R’s lattice graphics function histogram() to create conditional plots. The layout=c(1,2) option specifies that the histograms should be plotted in a single column with two rows.

Another useful plot is the box plot. If you don’t know or forget what a box plot is, here is a pdf file of the box plot note from Stat 100. You can also watch a Stat 100 video lecture on box plots. Box plots can be generated by the command plot():

plot(gayMarriage ~ gender, data=survey,las=1)

When the x-axis variable is a factor, the plot() command generates a box plot for each category¹. Note that the thick lines are medians. It is useful to indicate the means on the plots as well, since the t-test compares the group means, not the medians. We can use tapply() to calculate the group means and points() to add them to the box plots:

means <- tapply(survey$gayMarriage, survey$gender, mean)
plot(gayMarriage ~ gender, data=survey,las=1)
points(means, col="red", pch="+")

The red +’s denote the means in the box plots. It appears from the histograms and box plots that females have lower values of the ‘gayMarriage’ variable, suggesting that females tend to favor gay marriage more than males do.

We can perform a t-test to see if the group means for ‘gayMarriage’ show a significant difference between males and females. The command t.test() can be used for this analysis. In particular, if x is a binary factor variable and y is a numeric/integer vector,

t.test(y ~ x, data=data_name, alternative=c("two.sided","less","greater"), var.equal=TRUE)
compares the means of y in the two groups specified by the factor variable x. The optional parameter alternative specifies what the alternative hypothesis is. The default is "two.sided". In this case, the hypotheses are
null H0: the means are the same for the two groups
alternative Ha: the means are different for the two groups

The option var.equal=TRUE tells t.test() to assume that the two groups have the same variance, which is the assumption adopted in Stat 200 and most introductory statistics courses. If the option var.equal=TRUE is omitted, t.test() performs a more complicated analysis (Welch’s t-test) that does not assume equal variances; for reference, that version would be invoked as follows (output not shown):
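t.test(gayMarriage ~ gender, data=survey)   # Welch's t-test: not the Stat 200 method

We will always use the option var.equal=TRUE in this course. Let’s apply the function to our data: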

t.test(gayMarriage ~ gender, data=survey, var.equal=TRUE)

    Two Sample t-test

data:  gayMarriage by gender
t = -5.5244, df = 856, p-value = 4.388e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.7074607 -0.8122462
sample estimates:
mean in group Female   mean in group Male 
            2.140820             3.400673 

Here gayMarriage is y and gender is x. Since ‘gender’ is a two-level factor variable with ‘Female’ as the reference level, the difference between the means is \(mean(gayMarriage_{Female})-mean(gayMarriage_{Male})\). The result shows that the difference between the means for females and males is highly significant. The p-value is \(4.4\times 10^{-8}\), and the 95% confidence interval of \(mean(gayMarriage_{Female})-mean(gayMarriage_{Male})\) is (-1.707, -0.812). This means that females tend to favor gay marriage more than males do. The p-value is two-sided since we don’t specify the alternative parameter and it takes the default value.
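Since the sign convention depends on which level is the reference, we can confirm that ‘Female’ is the reference level by listing the factor levels (assuming ‘gender’ was read in as a factor, as above); the first level listed is the reference:

levels(survey$gender)   # "Female" is listed first, so it is the reference level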

If we want to test whether the mean for females is less than males, we set the alternative parameter to “less”:

t.test(gayMarriage ~ gender, data=survey, alternative="less", var.equal=TRUE)

    Two Sample t-test

data:  gayMarriage by gender
t = -5.5244, df = 856, p-value = 2.194e-08
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
       -Inf -0.8843343
sample estimates:
mean in group Female   mean in group Male 
            2.140820             3.400673 

When we set alternative="less", the alternative hypothesis becomes Ha: the mean for females is less than that for males. We see that the p-value is half of the previous (two-sided) value; this halving holds because the observed difference is in the direction of the alternative.

Another way to call the t.test() function for a two-sample t-test is

t.test(y1,y2, alternative=c("two.sided","less","greater"), var.equal=TRUE)

Here y1 and y2 are the integer/numeric vectors whose means are to be compared. For example, the command t.test(gayMarriage ~ gender, data=survey, var.equal=TRUE) is the same as

with(survey, t.test(gayMarriage[gender=="Female"], gayMarriage[gender=="Male"], 
       var.equal=TRUE))

    Two Sample t-test

data:  gayMarriage[gender == "Female"] and gayMarriage[gender == "Male"]
t = -5.5244, df = 856, p-value = 4.388e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.7074607 -0.8122462
sample estimates:
mean of x mean of y 
 2.140820  3.400673 

Note that we use the with(survey, …) syntax in order to access the variables ‘gayMarriage’ and ‘gender’ in the ‘survey’ data frame without the need to attach the prefix ‘survey$’.
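For comparison, without with() the same command would be the more verbose

t.test(survey$gayMarriage[survey$gender=="Female"],
       survey$gayMarriage[survey$gender=="Male"], var.equal=TRUE)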


The t.test() function described above performs the calculation as stated on P.108 of the Stat 200 note: \[t = \frac{{\rm ObsDiff}}{SE^+_{\rm difference}}\] \[SE^+_{\rm difference} = SD^+_{\rm errors}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\] \[SD^+_{\rm errors}=\sqrt{\frac{SSE}{n-g}}\] \[SSE = \sum_{i=1}^n (y_i-\hat{y}_i)^2= \sum_{i=1}^{n_1} (y_{1i}-\bar{y}_1)^2 + \sum_{i=1}^{n_2} (y_{2i}-\bar{y}_2)^2\] where \(n=n_1+n_2\) is the total number of observations and \(g\) is the number of groups. In our case, \(n_1\) is the number of females and \(n_2\) is the number of males and \(g=2\). It is instructive to write our own code for the t-test and confirm that we get the exact same answer as the t.test() result:

n1 <- sum(survey$gender=="Female")
n2 <- sum(survey$gender=="Male")
n <- n1+n2
means <- tapply(survey$gayMarriage, survey$gender, mean)
ObsDiff <- means["Female"] - means["Male"]
SSE <- with(survey, sum( (gayMarriage[gender=="Female"]-means["Female"])^2) 
            + sum( (gayMarriage[gender=="Male"]-means["Male"])^2) )
SDplus_err <- sqrt(SSE/(n-2))
SEplus_diff <- SDplus_err*sqrt(1/n1 + 1/n2)
t <- unname(ObsDiff/SEplus_diff)
t
[1] -5.524405

As expected, we obtain the same t-value as returned by the t.test() function. The one-sided and two-sided p-values can be computed by the pt() function:

pval1 <- pt(t,n-2)     # one-sided (lower-tail) p-value; valid here since t < 0
pval2 <- 2*pt(t,n-2)   # two-sided p-value; in general, use 2*pt(-abs(t), n-2)
c(pval1,pval2)
[1] 2.193757e-08 4.387514e-08
which are again the same as those returned by t.test().
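As a further check, we can reproduce the 95% confidence interval reported by t.test(): it is ObsDiff plus or minus the 0.975 quantile of the t distribution with n-2 degrees of freedom times the standard error:

t_crit <- qt(0.975, n-2)
unname(c(ObsDiff - t_crit*SEplus_diff, ObsDiff + t_crit*SEplus_diff))
[1] -1.7074607 -0.8122462

matching the interval returned by t.test().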

F-Test

The F-test is used to compare the means among more than two groups. Consider Example 2 on P.104 of the Stat 200 note, where we want to know if there are any differences between ethnic groups on their attitudes towards gay marriage. Let’s first make a few plots.

histogram(~gayMarriage | ethnicity, data=survey)

means <- tapply(survey$gayMarriage, survey$ethnicity, mean)
plot(gayMarriage ~ ethnicity, data=survey, las=1)
points(means, col="red", pch="+")

There appear to be differences among the groups. R’s aov() function can be used to perform an F-test. The syntax is

aov(y ~ x, data=data_name)

Here x is a factor variable that can contain more than two levels. For our problem, ‘gayMarriage’ is y and ‘ethnicity’ is x:

result <- aov(gayMarriage ~ ethnicity, data=survey)
summary(result)
             Df Sum Sq Mean Sq F value   Pr(>F)    
ethnicity     4    330   82.54   8.165 1.84e-06 ***
Residuals   853   8623   10.11                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the F value is 8.165 and the corresponding p-value is \(1.84\times 10^{-6}\). Hence the difference is highly significant. We conclude that the mean of at least one group is different from the others.

In the summary output, the ‘Sum Sq’ column displays the sums of squares. In the ‘ethnicity’ row, it is the sum of squares between groups, or SSB in the Stat 200 notation. In the ‘Residuals’ row, it is the sum of squares within groups, or SSW in the Stat 200 notation. The ‘Mean Sq’ column is calculated by dividing the value in the ‘Sum Sq’ column by the Df. In Stat 200 notation, the ‘Mean Sq’ value in the ‘ethnicity’ row is MSB and the ‘Mean Sq’ value in the ‘Residuals’ row is MSW. The F value is calculated by F = MSB/MSW.
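These quantities can also be extracted programmatically: summary(result) is a list whose first element is the ANOVA table, stored as a data frame. A minimal sketch:

tab <- summary(result)[[1]]             # the ANOVA table as a data frame
tab[1, "Mean Sq"] / tab[2, "Mean Sq"]   # MSB/MSW, reproducing the F value 8.165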


The F-test calculation is described on P.106 of the Stat 200 note: \[F = \frac{MSB}{MSW}=\frac{SSB/(g-1)}{SSW/(n-g)}\] \[SSB = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\] \[SSW = \sum_{i=1}^n (y_i-\hat{y}_i)^2\] Here \(n\) is the total number of observations, \(g\) is the number of groups, \(\bar{y}\) is the overall mean and \(\hat{y}_i\) is the predicted value for the ith observation. The value of \(\hat{y}_i\) is simply the mean of the group to which the ith observation belongs. We now go through the coding exercise to reproduce the aov() result from the formulae:

y <- survey$gayMarriage
x <- survey$ethnicity
n <- nrow(survey)
g <- length(levels(x))  # number of groups
bar_y <- mean(y)
means <- tapply(y,x,mean)
# Set hat_y[i] to the average of the group to which i belongs
hat_y <- rep(NA, n)
for (group in levels(x)) {
  hat_y[x==group] <- means[group]
}
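# Note: the loop can be replaced by a one-line vectorized lookup,
#   hat_y <- unname(means[x])
# since indexing the named vector 'means' by the factor x uses the
# integer level codes, which match the order of tapply()'s output.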
SSB <- sum((hat_y - bar_y)^2)
SSW <- sum((y - hat_y)^2)
SST <- (n-1)*var(y)
c(SSB,SSW,SST)
[1]  330.1631 8623.2600 8953.4231
# Sanity check: SST = SSW + SSB ? 
(SST - SSW - SSB)/SST
[1] -6.348792e-18
F <- (SSB/(g-1)) / (SSW/(n-g))
F
[1] 8.164809

As expected, we obtain the same F value. We have also verified that the equality SST = SSW + SSB holds. In addition, SSB and SSW are the same numbers as in the ‘Sum Sq’ column of the summary(result) output above. To compute the associated p-value, we note that the numerator and denominator degrees of freedom for the F statistic are \(g-1\) and \(n-g\), respectively. The p-value is the area under the F curve to the right of the F value, which can be calculated using the pf() function:

pf(F,g-1,n-g, lower.tail=FALSE)
[1] 1.83638e-06
This is the same p-value returned by the aov() function. Finally, the square of the correlation between \(\hat{y}\) and \(y\) is given by \(R^2=SSB/SST\), which is 0.0368756 in our example.
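We can verify this number directly from the quantities computed above:

SSB/SST
[1] 0.03687562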

From the box plots of gayMarriage versus ethnicity, it doesn’t seem that there is a significant difference between ‘Asian’ and ‘Black’, nor among ‘Hispanic’, ‘Mixed/Other’ and ‘White’. We can test the significance using the aov() function on subsets of the data.

We first create two subset conditions:

AB <- with(survey, ethnicity=="Asian" | ethnicity=="Black")
HMW <- !AB

Here AB and HMW are logical vectors. AB is TRUE if ethnicity is ‘Asian’ or ‘Black’ and FALSE otherwise. HMW is the reverse of AB: it is TRUE if ethnicity is ‘Hispanic’, ‘Mixed_Other’ or ‘White’ and FALSE otherwise. As in lm(), the optional parameter subset can be used to specify a subset of the data on which to perform the analysis. For the Asian-Black groups, we have

result_AB <- aov(gayMarriage ~ ethnicity, data=survey, subset=AB)
summary(result_AB)
             Df Sum Sq Mean Sq F value Pr(>F)
ethnicity     1    3.6   3.649   0.327  0.568
Residuals   266 2969.1  11.162               

The p-value is 0.568, suggesting that there is no significant difference between ‘Asian’ and ‘Black’ on the attitude towards gay marriage. For the Hispanic-Mixed-White groups, we have

result_HMW <- aov(gayMarriage ~ ethnicity, data=survey, subset=HMW)
summary(result_HMW)
             Df Sum Sq Mean Sq F value Pr(>F)
ethnicity     2     16   7.781   0.808  0.446
Residuals   587   5654   9.632               

This large p-value (0.446) also indicates that there is no significant difference between the Hispanic, Mixed, and White groups on the attitude towards gay marriage.

t-Tests for Multiple Post Hoc Comparisons

From the previous calculation, we conclude that at least one ethnic group differs from the others in attitude towards gay marriage. We next want to see which pairs of groups show significant differences, so we need to do t-tests between pairs of groups. Before we do the tests, let’s guess what the answers may be. Looking at the box plots of gayMarriage versus ethnicity, it appears that Asians and Blacks show no significant difference; Whites, Hispanics and Mixed/Other are also similar. But Asians and Blacks appear to be significantly different from Whites, Hispanics and Mixed/Other.

We can use the pairwise.t.test() function to perform pairwise t-tests. The syntax is

pairwise.t.test(y, x, p.adjust.method=method, alternative=c("two.sided", "less", "greater"))

Here y is a numeric/integer vector, x is a factor variable, and method is the name of the method used to adjust the p-values (R’s partial matching lets us abbreviate the argument to p.adjust, as in the call below). There are 8 methods, but we will only use "bonferroni" for the Bonferroni correction. You can look at the other 7 methods by typing ?p.adjust. Just like t.test(), the default value of the alternative parameter is "two.sided", but you can change it to "less" or "greater". For our problem, y is survey$gayMarriage, x is survey$ethnicity, and we will use the default "two.sided" alternative. The result is

pairwise.t.test(survey$gayMarriage, survey$ethnicity, p.adjust="bonferroni")

    Pairwise comparisons using t tests with pooled SD 

data:  survey$gayMarriage and survey$ethnicity 

            Asian   Black   Hispanic Mixed_Other
Black       1.00000 -       -        -          
Hispanic    0.00107 0.00152 -        -          
Mixed_Other 0.08779 0.05313 1.00000  -          
White       0.00054 0.00273 1.00000  1.00000    

P value adjustment method: bonferroni 

The adjusted p-values given here are the same as those on P.109 of the Stat 200 note:

Asian-Black: 1 (not significant)
Asian-Hispanic: 0.00107 (significant)
Asian-Mixed/Other: 0.08779 (not significant)
Asian-White: 0.00054 (significant)
Black-Hispanic: 0.00152 (significant)
Black-Mixed/Other: 0.05313 (not significant)
Black-White: 0.00273 (significant)
Hispanic-Mixed/Other: 1 (not significant)
Hispanic-White: 1 (not significant)
Mixed/Other-White: 1 (not significant)

So we see that our guesses above are mostly correct. They are wrong only on the Black-Mixed/Other and Asian-Mixed/Other comparisons, where the adjusted p-values exceed the 5% cutoff.

ANOVA and Linear Regression

It can be shown that performing an ANOVA analysis for comparing group means is mathematically equivalent to doing a linear regression on a factor variable. For example, let’s look at the result of the linear model predicting ‘gayMarriage’ from ‘gender’:

fit <- lm(gayMarriage ~ gender, data=survey)
summary(fit)

Call:
lm(formula = gayMarriage ~ gender, data = survey)

Residuals:
   Min     1Q Median     3Q    Max 
-3.401 -2.141 -1.771  1.599  7.859 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.1408     0.1342  15.956  < 2e-16 ***
genderMale    1.2599     0.2281   5.524 4.39e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.178 on 856 degrees of freedom
Multiple R-squared:  0.03443,   Adjusted R-squared:  0.0333 
F-statistic: 30.52 on 1 and 856 DF,  p-value: 4.388e-08

Since the reference level of ‘gender’ is “Female”, the intercept is the mean of ‘gayMarriage’ for females. The slope is the difference between the means of males and females, i.e. \(mean(gayMarriage_{Male})-mean(gayMarriage_{Female})\). The difference is 1.2599 with a t-value of 5.524. The corresponding p-value is \(4.39\times 10^{-8}\). This is the same (two-sided) p-value returned by t.test() shown above. The 95% confidence interval of \(mean(gayMarriage_{Male})-mean(gayMarriage_{Female})\) is the 95% CI for the slope:

confint(fit, "genderMale")
               2.5 %   97.5 %
genderMale 0.8122462 1.707461

This is again consistent with the t.test() result.
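If we prefer the Female minus Male direction reported by t.test(), one option (a sketch; ‘gender2’ is just a column name we make up here) is to change the reference level with relevel() and refit:

survey$gender2 <- relevel(survey$gender, ref="Male")
fit2 <- lm(gayMarriage ~ gender2, data=survey)
coef(fit2)["gender2Female"]   # -1.2599: mean(Female) - mean(Male)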

The lm() function can also be used to compare group means among more than two groups. For example, gayMarriage ~ ethnicity:

fit <- lm(gayMarriage ~ ethnicity, data=survey)
summary(fit)

Call:
lm(formula = gayMarriage ~ ethnicity, data = survey)

Residuals:
   Min     1Q Median     3Q    Max 
-3.643 -2.261 -1.849  1.609  8.152 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)            3.3913     0.2344  14.468  < 2e-16 ***
ethnicityBlack         0.2516     0.4187   0.601 0.548115    
ethnicityHispanic     -1.5428     0.3963  -3.893 0.000107 ***
ethnicityMixed_Other  -1.4146     0.5386  -2.627 0.008779 ** 
ethnicityWhite        -1.1301     0.2784  -4.059 5.37e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.18 on 853 degrees of freedom
Multiple R-squared:  0.03688,   Adjusted R-squared:  0.03236 
F-statistic: 8.165 on 4 and 853 DF,  p-value: 1.836e-06

We see that the F statistic and the associated p-value are the same as in the aov() result above. In fact, the help page ?aov clearly states that aov() provides a wrapper to lm for fitting linear models. This means that the aov() function actually calls the lm() function. The main difference between lm() and aov() is in the output: the aov() output is expressed in the traditional language of the analysis of variance rather than that of linear models.

Since ‘Asian’ is the reference level of the ‘ethnicity’ variable, the slopes in the linear model represent the differences between the means of the other groups and ‘Asian’. The p-value associated with each slope is the unadjusted p-value of the comparison between ‘Asian’ and that group. To get the Bonferroni-adjusted p-values, the p-values of the slopes need to be multiplied by \(g(g-1)/2\), which is 10 for \(g=5\). Indeed, the adjusted p-values in the ‘Asian’ column returned by the pairwise.t.test() function shown above are exactly 10 times the p-values of the slopes in the linear model (except for the Asian-Black comparison, where multiplying the slope’s p-value by 10 exceeds 1 and so the adjusted p-value is capped at 1).
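We can check this relationship with the p.adjust() function, which applies the Bonferroni correction (and caps the result at 1). Here we extract the slopes’ p-values from the summary and adjust for n=10 comparisons:

p_slopes <- summary(fit)$coefficients[-1, 4]    # unadjusted p-values of the slopes
p.adjust(p_slopes, method="bonferroni", n=10)   # reproduces the 'Asian' column above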

Randomization Test

Here we demonstrate how to use R to perform the randomization tests described in Chapter 27 of the Stat 200 note. The idea is to scramble the y variable while keeping the factor variable x fixed, repeat the experiment many times, and calculate \(R^2\) of the scrambled data each time. In the end, we calculate the fraction of times the value of \(R=\sqrt{R^2}\) for the scrambled data exceeds the R of the original data. This fraction is an estimate of the p-value from the randomization test.

There are probably packages in R that can be used to perform randomization tests, but we will write our own code here for demonstration. We first write a function to compute \(R^2\) for any given x and y. The easiest way is to call the lm() function:

computeR2 <- function(y,x) {
  summary(lm(y~x))$r.squared
}

Recall that the summary(lm(…)) function returns a number of diagnostics for the linear model. Among them is \(R^2\), which can be extracted using the command summary(lm(…))$r.squared.

Next we apply computeR2() to the survey data, with y being ‘gayMarriage’ and x being ‘ethnicity’, and check that we reproduce the \(R^2\) calculated above:

R20 <- computeR2(survey$gayMarriage, survey$ethnicity)
R20
[1] 0.03687562

This is the value we calculated above. The randomization test involves scrambling the y variable and calculating \(R^2\), repeated many times. We can use the sample() function to scramble the y variable, and replicate() to repeat the experiment many times. We do it only 5000 times here to save time, since lm() is quite slow for this task:

set.seed(63741891)
R2 <- replicate(5000, computeR2(sample(survey$gayMarriage), survey$ethnicity))

This gives us 5000 values of \(R^2\). The fraction of times R exceeds the original R is the same as the fraction of times \(R^2\) exceeds the original \(R^2\). It can be calculated by

mean(R2 > R20)
[1] 0

This means that in the 5000 experiments we tried, none of the R values for the scrambled data exceeds the original R. Thus we conclude that the p-value is less than 1/5000, or \(2\times 10^{-4}\). This is consistent with the p-value calculated from the F statistic, which is \(1.84\times 10^{-6}\). The following is a histogram of R for the scrambled data. The vertical red line marks the value of the original R.

hist(sqrt(R2),freq=FALSE,breaks=50, xlim=c(0,0.2), xlab="R")
abline(v=sqrt(R20), col="red")

This indicates that getting the original R by random sampling is highly unlikely, and hence we conclude that at least one group is significantly different from the others in attitude towards gay marriage.

We now do the randomization test for the Hispanic-Mixed-White groups, for which the calculation above gives a p-value of 0.446. This can easily be done by subsetting the data:

x <- survey$ethnicity[HMW]
y <- survey$gayMarriage[HMW]

We first compute the original \(R^2\):

R2_HMW <- computeR2(y,x)
R2_HMW
[1] 0.002744753

Next we calculate \(R^2\) for the scrambled data. We again do it 5000 times:

set.seed(32859013)
R2 <- replicate(5000, computeR2(sample(y),x))

Here is the distribution of R:

hist(sqrt(R2), freq=FALSE, breaks=50, xlab="R")
abline(v=sqrt(R2_HMW), col="red")

The p-value is estimated by the fraction of times the values of \(R^2\) are greater than R2_HMW:

mean(R2 > R2_HMW)
[1] 0.4562

We see that this estimated p-value is close to that calculated from the F statistic. We have only done the calculation with 5000 experiments; the accuracy improves as the number of experiments increases. With \(10^6\) experiments, we find that the estimated p-value is 0.447, very close to the value from the F statistic.

The computeR2() function shown above uses the lm() function to compute \(R^2\). As you have learned from a Lon Capa homework problem, lm() is slow for randomization tests because it calculates many other unnecessary quantities. We can write much faster code by computing \(R^2\) directly from the formula \(R^2=SSB/SST\). We discuss the details of our code optimization in this additional note for students interested in this topic.
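As a rough sketch of the idea (the optimized version is in the additional note; computeR2_fast is just a name we use here), \(R^2\) can be computed with two calls to tapply(), avoiding the overhead of lm():

computeR2_fast <- function(y, x) {
  ybar <- mean(y)
  means <- tapply(y, x, mean)       # group means
  ns <- tapply(y, x, length)        # group sizes
  SSB <- sum(ns*(means - ybar)^2)   # sum of squares between groups
  SST <- sum((y - ybar)^2)          # total sum of squares
  SSB/SST
}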

Summary

Even though there seem to be many calculations involved in this note, we have only introduced a few R commands. Here is a quick summary of the commands and ideas.

t.test(y ~ x, data=..., var.equal=TRUE) performs a two-sample t-test comparing the means of y in the two groups defined by the factor x; the alternative parameter chooses a one-sided or two-sided test.
aov(y ~ x, data=...) performs an F-test comparing the means of y among the groups defined by x; apply summary() to the result to see the ANOVA table.
pairwise.t.test(y, x, p.adjust="bonferroni") performs post hoc pairwise t-tests with Bonferroni-adjusted p-values.
ANOVA is mathematically equivalent to linear regression on a factor variable, so lm() reproduces the t-test and F-test results.
A randomization test can be coded with sample() and replicate(): scramble y, recompute \(R^2\) many times, and estimate the p-value as the fraction of scrambled \(R^2\) values exceeding the original.

1. To make a box plot of ‘gayMarriage’ without splitting it into categories, use boxplot(survey$gayMarriage).