Factor Variables

In this note, we demonstrate how to use the lm() function with categorical variables. We again use the Stat 100 Survey 2, Fall 2015 (combined) data we have been working with. First, let’s reload the data from the saved RData file.

load('Stat100_Survey2_Fall2015.RData')

This loads the data frame survey, which stores the survey data. As mentioned before, R’s factor variables are designed to represent categorical data. In our data set, the gender column is categorical: each value is either male or female. Right now the column is a character vector, as you can verify with class(survey$gender). We want to change it to a factor variable, with “male” set as the first level:

survey$gender <- factor(survey$gender, levels=c("male","female"))

To preserve this change so that we won’t have to repeat it the next time we load the data, we overwrite ‘Stat100_Survey2_Fall2015.RData’:

save(survey, file='Stat100_Survey2_Fall2015.RData')

Recall that last time we fit a linear model predicting students’ party hours/week from the average number of drinks/week:

fit <- lm(partyHr ~ drinks, data=survey)
summary(fit)

Call:
lm(formula = partyHr ~ drinks, data = survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.609  -2.315  -0.774   1.685  47.685 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.31505    0.16409   14.11   <2e-16 ***
drinks       0.54588    0.01397   39.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.492 on 1135 degrees of freedom
Multiple R-squared:  0.5737,    Adjusted R-squared:  0.5733 
F-statistic:  1527 on 1 and 1135 DF,  p-value: < 2.2e-16

Now we want to know if there is any difference between male and female students. We can include gender in the model by the command

fit1 <- lm(partyHr ~ drinks + gender, data=survey)
summary(fit1)

Call:
lm(formula = partyHr ~ drinks + gender, data = survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.483  -2.514  -0.734   1.601  48.260 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.74033    0.27000   6.446  1.7e-10 ***
drinks        0.55486    0.01433  38.724  < 2e-16 ***
genderfemale  0.77379    0.28915   2.676  0.00756 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.48 on 1134 degrees of freedom
Multiple R-squared:  0.5764,    Adjusted R-squared:  0.5756 
F-statistic: 771.4 on 2 and 1134 DF,  p-value: < 2.2e-16

We see that in addition to the intercept and slope for ‘drinks’, there is a third variable ‘genderfemale’. When lm() encounters a factor variable with two levels, it creates a new variable based on the second level. In our example, the second level is female, and genderfemale is created. It is a binary variable that takes the value 1 if the value of ‘gender’ is female, and 0 if the value of ‘gender’ is not female. The fitted model can be written as \[{\rm party\ hours/week} = 1.74033 + 0.55486\, ({\rm drinks/week}) + 0.77379\, ({\rm female})\] where for simplicity we use female instead of genderfemale to denote the binary variable. In this model, male and female have the same slope (0.55486) for drinks but different intercepts. For male, the intercept is 1.74033. For female, the intercept is 1.74033 + 0.77379 = 2.51412.
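
If you want to see this 0/1 coding explicitly, two quick checks are possible (optional; not part of the original analysis). The first inspects the design matrix that lm() built; the second constructs the dummy variable by hand, using a temporary vector female of our own naming, and refits the model, which should reproduce the coefficients of fit1.

head(model.matrix(fit1))      # shows the genderfemale column of 0s and 1s that lm() created

# build the dummy variable by hand and refit; coefficients should match fit1
female <- as.integer(survey$gender == "female")
lm(partyHr ~ drinks + female, data=survey)$coefficients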

Note that since we set “male” to be the first level (also called the base level or reference level), the intercept corresponds to the intercept for “male”. We can change the base level by the relevel() command:

survey$gender2 <- relevel(survey$gender,"female")
survey$gender2[1:10]
 [1] female female female male   female female male   female female male  
Levels: female male

We see that in survey$gender2, “female” is the base level. Let’s fit a linear model using gender2:

fit2 <- lm(partyHr ~ drinks + gender2, data=survey)
summary(fit2)

Call:
lm(formula = partyHr ~ drinks + gender2, data = survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.483  -2.514  -0.734   1.601  48.260 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.51412    0.17976  13.986  < 2e-16 ***
drinks       0.55486    0.01433  38.724  < 2e-16 ***
gender2male -0.77379    0.28915  -2.676  0.00756 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.48 on 1134 degrees of freedom
Multiple R-squared:  0.5764,    Adjusted R-squared:  0.5756 
F-statistic: 771.4 on 2 and 1134 DF,  p-value: < 2.2e-16

Since “female” is the base level in gender2, the intercept returned by the lm() is the intercept for female. The binary variable is now called gender2male, which is 1 if the value is “male” in gender2 and 0 if the value is not “male” in gender2. We see that the slope for drinks remains the same as before. The value of the intercept is 2.51412, which is the intercept for female we calculated above. The intercept for male is 2.51412 + (-0.77379) = 1.74033, consistent with the previous result.
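
As a quick sanity check (not in the original note), the male intercept can be recovered directly from the coefficients of fit2:

# intercept for female plus the gender2male adjustment gives the male intercept (1.74033)
coef(fit2)["(Intercept)"] + coef(fit2)["gender2male"]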

Interaction

In R, the interaction term is represented by a colon ‘:’. To fit a model for partyHr versus drinks, gender, and the interaction between drinks and gender, we can use the command

fit3 <- lm(partyHr ~ drinks + gender + drinks:gender, data=survey)
summary(fit3)

Call:
lm(formula = partyHr ~ drinks + gender + drinks:gender, data = survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.969  -2.047  -0.860   1.517  47.953 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          2.36823    0.29222   8.104 1.37e-15 ***
drinks               0.49202    0.01851  26.584  < 2e-16 ***
genderfemale        -0.32163    0.35330  -0.910    0.363    
drinks:genderfemale  0.15161    0.02875   5.274 1.60e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.428 on 1133 degrees of freedom
Multiple R-squared:  0.5865,    Adjusted R-squared:  0.5854 
F-statistic: 535.7 on 3 and 1133 DF,  p-value: < 2.2e-16

Actually, R has a shorthand notation to represent drinks + gender + drinks:gender, which is drinks*gender. So we can simply type

fit3 <- lm(partyHr ~ drinks*gender, data=survey)
summary(fit3)

Call:
lm(formula = partyHr ~ drinks * gender, data = survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.969  -2.047  -0.860   1.517  47.953 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          2.36823    0.29222   8.104 1.37e-15 ***
drinks               0.49202    0.01851  26.584  < 2e-16 ***
genderfemale        -0.32163    0.35330  -0.910    0.363    
drinks:genderfemale  0.15161    0.02875   5.274 1.60e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.428 on 1133 degrees of freedom
Multiple R-squared:  0.5865,    Adjusted R-squared:  0.5854 
F-statistic: 535.7 on 3 and 1133 DF,  p-value: < 2.2e-16

to get the same fit. The coefficients mean that the fitted model is given by the equation \[ {\rm party\ hours/week} = 2.3682 + 0.492\, ({\rm drinks/week}) -0.3216\, ({\rm female}) + 0.1516\, ({\rm drinks/week}) \cdot ({\rm female})\] Because of the presence of the interaction term, both the slope and the intercept differ between male and female. Also note that in this fit the coefficient of genderfemale, with a p-value of 0.363, is not statistically significant, so there is little evidence that the intercepts for male and female differ. The interaction term, however, is highly significant, indicating that the slopes for drinks do differ between male and female.

As you have learned in Stat 200, the regression equation above can be split into separate equations for male and female. The male equation is obtained by substituting ‘female=0’ into the equation: \[{\rm party\ hours/week} = 2.3682 + 0.492\, ({\rm drinks/week}) \ \ \ {\rm for\ male}.\] The female equation is obtained by substituting ‘female=1’ into the equation: \[{\rm party\ hours/week} = 2.3682 + 0.492\, ({\rm drinks/week})-0.3216 + 0.1516\, ({\rm drinks/week}) \ \ \ {\rm for\ female},\] which simplifies to \[{\rm party\ hours/week} =2.0466 + 0.6436\, ({\rm drinks/week}) \ \ \ {\rm for\ female}.\] Combining the two equations, we can write \[ {\rm party\ hours/week} = \left\{ \begin{array}{ll} 2.3682 + 0.492\, ({\rm drinks/week}) & {\rm for\ male} \\ 2.0466 + 0.6436\, ({\rm drinks/week}) & {\rm for\ female}\end{array} \right. \] Including the interaction term is equivalent to splitting the data into male and female and fitting a linear model for each group. We can confirm this by actually splitting the data into the two groups and fitting a linear model to each. Here is the code:

survey_male <- survey[survey$gender=="male",]
survey_female <- survey[survey$gender=="female",]
fit_male <- lm(partyHr ~ drinks, data=survey_male)
fit_female <- lm(partyHr ~ drinks, data=survey_female)
fit_male$coefficients
(Intercept)      drinks 
   2.368229    0.492022 
fit_female$coefficients
(Intercept)      drinks 
  2.0466026   0.6436317 

As expected, the regression coefficients for each group are the same as what we found above.

Let’s now plot the data with regression lines:

plot(partyHr ~ drinks, data=survey, col=gender, xlab="Drinks/week", 
     ylab="Party hours/week", pch=19, las=1)
abline(fit_male$coefficients, col=1, lwd=2)
abline(fit_female$coefficients, col=2, lwd=2)

Since the levels in gender are “male” and “female”, with “male” as the first level, male data are plotted with col=1, which is black; female data are plotted with col=2, which is red.
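
If you want a legend mapping the colors to the gender levels, one option (an optional embellishment not in the original plot; the "topleft" position is just a guess at where there is empty space) is:

# labels come from the factor levels, in the same order used for col=1:2
legend("topleft", legend=levels(survey$gender), col=1:2, pch=19)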

We can also make separate plots for male and female. As we mentioned before, this type of plot is easily made with the lattice package’s xyplot(). We can add regression lines to the split plots using a panel function. The details are described in this video. Here we just show you the command:

library(lattice)
xyplot(partyHr ~ drinks | gender, data=survey, pch=16,
  panel = function(x, y, ...) {
       panel.xyplot(x, y, ...)
       panel.lmline(x, y, col = "red")
  })

We can also plot residuals. To do it with xyplot, it is better to add the residuals to the data frame before plotting:

survey$residuals <- fit3$residuals
xyplot(residuals ~ drinks | gender, data=survey, pch=16,
  panel = function(x, y, ...) {
       panel.xyplot(x, y, ...)
       panel.abline(h=0)
  })

Since survey did not have a column named “residuals”, the command survey$residuals <- fit3$residuals created a new column in the data frame. The residuals can then be plotted just like any other column.

Does including gender and an interaction term make a better model? If you compare the summary statistics of fit and fit3, you will see that the new model doesn’t improve things much: the R2 is only slightly larger and the residual standard error is only slightly smaller. The residual plots above also look much like those of the simple regression model, so adding gender and the interaction term does not substantially reduce the residuals.
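
If you want a formal comparison of the two nested models, one standard option (not shown in the original note) is a partial F-test with anova(). A small p-value says the added terms explain a statistically significant amount of variation, which can coexist with only a modest practical improvement in R2.

# compare partyHr ~ drinks against partyHr ~ drinks*gender
anova(fit, fit3)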

Factor Variables are Smart

Here is another demonstration that factor variables can be used to fit two groups of data without splitting the data. We are going to work backward here. We will create two groups of data and then combine them.

First we create two groups of artificial data x1, y1 and x2, y2:

x1 <- seq(0,1,length.out=100)
x2 <- seq(2.5,3.5,length.out=100)
set.seed(5619783)
y1 <- 5 + 2.2*x1 + rnorm(100,sd=0.4)
y2 <- 11.2 -2.2*x2 + rnorm(100,sd=0.4)

Each of x1, x2, y1, and y2 is a numeric vector of length 100. y1 is related to x1 by a linear relationship with noise rnorm(100,sd=0.4) added to it, and y2 is related to x2 by another linear relationship with noise added. Let’s now fit a linear model for each group:

fit1 <- lm(y1 ~ x1)
fit2 <- lm(y2 ~ x2)
fit1$coefficients
(Intercept)          x1 
   4.947955    2.327810 
fit2$coefficients
(Intercept)          x2 
  11.392317   -2.265344 

Now we combine the two data sets:

x <- c(x1,x2)
y <- c(y1,y2)

The first 100 elements of x are x1 and the next 100 elements are x2; similarly for y. To label the two groups, we create a factor vector group of length 200, with the first 100 elements labeled “1” and the next 100 elements labeled “2”. There are at least two ways to create the group variable. The first way is

group <- rep(c(1,2), each=100)
group <- as.factor(group)

Another, simpler way is to use the gl() function:

group <- gl(2,100)
group
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
[106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Levels: 1 2

Briefly, gl(n,k) creates a factor vector with n levels, each repeated k times. By default the levels are labeled ‘1’, ‘2’, …, ‘n’, but this can be changed by setting the labels parameter. See ?gl for details.
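
For example (a throwaway illustration not used in the analysis below), gl(2,3) with custom labels produces a factor of length 6 whose first three elements are "control" and last three are "treatment":

gl(2, 3, labels=c("control","treatment"))   # levels: control, treatment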

The group variable sets the first 100 elements to be in level ‘1’ and the next 100 elements to be in level ‘2’. We can plot the combined data:

plot(y ~ x, col=as.integer(group), pch=19, las=1)

Here group 1 data are plotted with col=1, which is black, and group 2 data with col=2, which is red. Clearly the two groups are widely separated, and each has its own intercept and slope when a linear model is fit to it. If we simply fit a linear model to the combined data, the fit won’t be good:

fit_combined <- lm(y ~ x)
summary(fit_combined)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.84096 -0.55405 -0.02143  0.51098  1.83781 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.35741    0.09791   64.93   <2e-16 ***
x           -0.57333    0.04511  -12.71   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8189 on 198 degrees of freedom
Multiple R-squared:  0.4493,    Adjusted R-squared:  0.4465 
F-statistic: 161.5 on 1 and 198 DF,  p-value: < 2.2e-16
plot(y ~ x, col=as.integer(group), pch=19, las=1)
abline(fit_combined, col="green", lwd=2)

When we include the factor variable group with the interaction term, we have

fit_group <- lm(y ~ x*group)
summary(fit_group)

Call:
lm(formula = y ~ x * group)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.02744 -0.30744 -0.00761  0.25988  1.33553 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.94796    0.08571   57.73   <2e-16 ***
x            2.32781    0.14809   15.72   <2e-16 ***
group2       6.44436    0.45451   14.18   <2e-16 ***
x:group2    -4.59315    0.20942  -21.93   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4318 on 196 degrees of freedom
Multiple R-squared:  0.8484,    Adjusted R-squared:  0.8461 
F-statistic: 365.7 on 3 and 196 DF,  p-value: < 2.2e-16

We see that the intercept and the slope for ‘x’ are exactly the same as those for the group 1 data calculated above with fit1. To find the intercept for group 2, we add the coefficient of ‘group2’ to the intercept. To find the slope for group 2, we add the coefficients of ‘x’ and ‘x:group2’. We can now extract the coefficients for each group:

intercept_group1 <- fit_group$coefficients[1]
slope_group1 <- fit_group$coefficients[2]
intercept_group2 <- fit_group$coefficients[1] + fit_group$coefficients[3]
slope_group2 <- fit_group$coefficients[2] + fit_group$coefficients[4]
c(intercept_group1, slope_group1)
(Intercept)           x 
   4.947955    2.327810 
c(intercept_group2, slope_group2)
(Intercept)           x 
  11.392317   -2.265344 

As expected, we get the same numbers as calculated by fit1 and fit2. Let’s plot the regression lines:

plot(y ~ x, col=as.integer(group), pch=19, las=1)
abline(fit_combined, col="green", lwd=2)
abline(intercept_group1, slope_group1, col=1, lwd=2)
abline(intercept_group2, slope_group2, col=2, lwd=2)

Clearly, the model with ‘group’ and the interaction term (black and red lines) is better than the simple lm(y ~ x) fit used in fit_combined (green line). Comparing the summary statistics of fit_combined and fit_group, we see that the residual standard error decreases and R2 increases substantially.
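
To pull those two statistics out of the summaries directly (an optional side-by-side check, not in the original note):

# residual standard error and R-squared for the two fits
c(sigma_combined = summary(fit_combined)$sigma,     sigma_group = summary(fit_group)$sigma)
c(R2_combined    = summary(fit_combined)$r.squared, R2_group    = summary(fit_group)$r.squared)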

Factor Variables with More Than Two Levels

Recall that religion is a factor variable in the survey data frame:

survey$religion[1:10]
 [1] Other Religion Christian      Christian      Christian     
 [5] Christian      Other Religion Agnostic       Christian     
 [9] Christian      Agnostic      
8 Levels: Christian Jewish Muslim Hindu Buddhist ... Atheist

It has 8 levels with Christian as the base level. Suppose we fit ‘drinks’ versus ‘religion’:

fit_drinks <- lm(drinks ~ religion, data=survey)
summary(fit_drinks)

Call:
lm(formula = drinks ~ religion, data = survey)

Residuals:
   Min     1Q Median     3Q    Max 
-9.723 -6.086 -2.923  2.464 44.201 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)              7.5361     0.3824  19.706  < 2e-16 ***
religionJewish           2.1870     1.2324   1.775  0.07622 .  
religionMuslim          -5.3542     1.6881  -3.172  0.00156 ** 
religionHindu           -3.5664     1.6881  -2.113  0.03485 *  
religionBuddhist        -4.6130     1.5600  -2.957  0.00317 ** 
religionOther Religion   0.1862     1.1770   0.158  0.87435    
religionAgnostic        -1.4500     0.8585  -1.689  0.09151 .  
religionAtheist         -1.7376     0.9011  -1.928  0.05407 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.445 on 1129 degrees of freedom
Multiple R-squared:  0.02619,   Adjusted R-squared:  0.02015 
F-statistic: 4.338 on 7 and 1129 DF,  p-value: 9.372e-05

The formula drinks ~ religion looks like a simple regression with one variable, so you might expect one intercept and one slope. However, the output has one intercept and 7 slopes! This is because religion is a factor variable with 8 levels. When lm() is applied to a factor variable with k levels, it creates k-1 binary variables corresponding to the last k-1 levels. In the example above, we have the 7 binary variables religionJewish, religionMuslim, religionHindu, religionBuddhist, religionOther Religion, religionAgnostic and religionAtheist. The value of religionJewish is 1 if ‘religion’ is ‘Jewish’ and 0 otherwise, and similarly for the other binary variables. Note that there is no binary variable for ‘Christian’ because it is the base level: if ‘religion’ is ‘Christian’, all 7 binary variables are 0.
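
You can inspect this coding directly with contrasts() (an optional check, not in the original note); it returns an 8-by-7 matrix of 0s and 1s, one row per level and one column per binary variable, with the Christian row all zeros:

contrasts(survey$religion)   # treatment contrasts: rows = levels, columns = the 7 binary variables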

The regression equation is \[ {\rm drinks/week} = 7.5361 + 2.187 f_{\rm Jewish} - 5.3542 f_{\rm Muslim} - 3.5664 f_{\rm Hindu} - 4.613 f_{\rm Buddhist}\] \[+ 0.1862 f_{\rm other}- 1.45 f_{\rm Agnostic} - 1.7376 f_{\rm Atheist}\] Note that we use \(f_{\rm Jewish}\) to denote the binary variable religionJewish and so on.

If ‘religion’ is ‘Christian’, all of the \(f\) variables are 0 and the regression equation predicts drinks/week = 7.5361. This is the average drinks/week for students who are Christian. If ‘religion’ is ‘Hindu’, then \(f_{\rm Hindu}=1\) and all other \(f\)’s are 0, and the regression equation predicts \[{\rm drinks/week} = 7.5361 - 3.5664 = 3.9697\] This is the average drinks/week for students who are Hindu. We see that the slope of each binary variable is the difference between the predicted value for that group and the predicted value for ‘Christian’. If the difference is 0, the average drinks/week for that group is the same as that for Christians. Recall that the t-values and p-values are based on the null hypothesis that the slope is 0. The p-value of the slope for ‘Other Religion’ is 0.87, so there is no evidence that the average drinks/week for students in ‘Other Religion’ differs from that of Christians.

In this particular example, we see that the slope of each binary variable measures the difference between that group’s average and the base-level average. If we change the base level, we change the meaning of the slopes. For example, we can define another factor variable with the base level set to ‘Atheist’:

rel <- relevel(survey$religion, "Atheist")

In the model

fit_drinks_rel <- lm(survey$drinks ~ rel)

the slopes now measure the difference between each group’s average drinks/week and that of Atheists.

summary(fit_drinks_rel)

Call:
lm(formula = survey$drinks ~ rel)

Residuals:
   Min     1Q Median     3Q    Max 
-9.723 -6.086 -2.923  2.464 44.201 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.7985     0.8159   7.107  2.1e-12 ***
relChristian        1.7376     0.9011   1.928  0.05407 .  
relJewish           3.9246     1.4277   2.749  0.00607 ** 
relMuslim          -3.6167     1.8355  -1.970  0.04904 *  
relHindu           -1.8288     1.8355  -0.996  0.31929    
relBuddhist        -2.8754     1.7185  -1.673  0.09456 .  
relOther Religion   1.9237     1.3801   1.394  0.16363    
relAgnostic         0.2876     1.1210   0.257  0.79757    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.445 on 1129 degrees of freedom
Multiple R-squared:  0.02619,   Adjusted R-squared:  0.02015 
F-statistic: 4.338 on 7 and 1129 DF,  p-value: 9.372e-05

What if we don’t want to compare the group averages with any particular group? The trick is to remove the intercept from the linear model. This can be done with the command

fit_drinks_nointercept <- lm(drinks ~ religion - 1, data=survey)

The -1 in the formula tells lm() not to include an intercept. As a result, 8 binary variables are created, one for each level:

summary(fit_drinks_nointercept)

Call:
lm(formula = drinks ~ religion - 1, data = survey)

Residuals:
   Min     1Q Median     3Q    Max 
-9.723 -6.086 -2.923  2.464 44.201 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
religionChristian        7.5361     0.3824  19.706  < 2e-16 ***
religionJewish           9.7231     1.1715   8.300 2.95e-16 ***
religionMuslim           2.1818     1.6442   1.327   0.1848    
religionHindu            3.9697     1.6442   2.414   0.0159 *  
religionBuddhist         2.9231     1.5124   1.933   0.0535 .  
religionOther Religion   7.7222     1.1131   6.938 6.71e-12 ***
religionAgnostic         6.0861     0.7686   7.918 5.73e-15 ***
religionAtheist          5.7985     0.8159   7.107 2.10e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.445 on 1129 degrees of freedom
Multiple R-squared:  0.3581,    Adjusted R-squared:  0.3536 
F-statistic: 78.73 on 8 and 1129 DF,  p-value: < 2.2e-16

Now each slope represents the group average, and the p-value is based on the null hypothesis that the average is 0. We can confirm that the slopes indeed represent the group averages:

mean(survey$drinks[survey$religion=="Jewish"])
[1] 9.723077
mean(survey$drinks[survey$religion=="Hindu"])
[1] 3.969697


If you remember, there is an easy method to calculate the group average using tapply():

tapply(survey$drinks, survey$religion, mean)
     Christian         Jewish         Muslim          Hindu       Buddhist 
      7.536066       9.723077       2.181818       3.969697       2.923077 
Other Religion       Agnostic        Atheist 
      7.722222       6.086093       5.798507 

As expected, we get the same numbers as the slopes.

As a final example, let’s fit a model predicting party hours/week from drinks/week and religion, including the interaction terms:

fit4 <- lm(partyHr ~ drinks*religion, data=survey)
summary(fit4)

Call:
lm(formula = partyHr ~ drinks * religion, data = survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.441  -2.252  -0.724   1.577  47.748 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    2.25219    0.22566   9.980  < 2e-16 ***
drinks                         0.57867    0.01796  32.220  < 2e-16 ***
religionJewish                 2.19415    0.87679   2.502  0.01247 *  
religionMuslim                -0.25330    0.94845  -0.267  0.78947    
religionHindu                 -0.67182    0.94365  -0.712  0.47665    
religionBuddhist               0.79895    0.90360   0.884  0.37679    
religionOther Religion         1.10213    0.68665   1.605  0.10876    
religionAgnostic               0.29806    0.50146   0.594  0.55237    
religionAtheist               -0.82884    0.49998  -1.658  0.09765 .  
drinks:religionJewish         -0.17047    0.06840  -2.492  0.01285 *  
drinks:religionMuslim          0.07461    0.22802   0.327  0.74357    
drinks:religionHindu           0.09955    0.12392   0.803  0.42192    
drinks:religionBuddhist       -0.33301    0.17390  -1.915  0.05574 .  
drinks:religionOther Religion -0.15693    0.05237  -2.996  0.00279 ** 
drinks:religionAgnostic       -0.10869    0.04670  -2.327  0.02012 *  
drinks:religionAtheist        -0.03392    0.04276  -0.793  0.42776    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.46 on 1121 degrees of freedom
Multiple R-squared:  0.585, Adjusted R-squared:  0.5795 
F-statistic: 105.4 on 15 and 1121 DF,  p-value: < 2.2e-16

The reference level in ‘religion’ is ‘Christian’. The regression equation can be written as \[\mbox{party hours/week} = 2.2522 + 0.5787 (\mbox{drinks/week}) + 2.1941 f_{\rm Jewish} - 0.2533 f_{\rm Muslim} - 0.6718 f_{\rm Hindu}\] \[+ 0.799 f_{\rm Buddhist} + 1.1021 f_{\rm other} + 0.2981 f_{\rm Agnostic} - 0.8288 f_{\rm Atheist}\] \[- 0.1705 f_{\rm Jewish}\cdot (\mbox{drinks/week}) + 0.0746 f_{\rm Muslim}\cdot (\mbox{drinks/week})\] \[+ 0.0996 f_{\rm Hindu}\cdot (\mbox{drinks/week})- 0.333 f_{\rm Buddhist}\cdot (\mbox{drinks/week})\] \[-0.1569 f_{\rm other}\cdot (\mbox{drinks/week}) -0.1087 f_{\rm Agnostic}\cdot (\mbox{drinks/week})\] \[-0.0339 f_{\rm Atheist}\cdot (\mbox{drinks/week})\] Simple regression equations for each of the 8 groups can be reconstructed from this equation. For example, for Christians, \[f_{\rm Jewish}=f_{\rm Muslim}=f_{\rm Hindu}=f_{\rm Buddhist}=f_{\rm other}=f_{\rm Agnostic}=f_{\rm Atheist}=0\] and the regression equation becomes \[\mbox{party hours/week} = 2.2522 + 0.5787 (\mbox{drinks/week}) \ \ \ \ \ \mbox{for Christians}\] For Atheists, \[f_{\rm Atheist}=1 \ \ \ , \ \ \ f_{\rm Jewish}=f_{\rm Muslim}=f_{\rm Hindu}=f_{\rm Buddhist}=f_{\rm other}=f_{\rm Agnostic}=0\] and the regression equation becomes \[\mbox{party hours/week} = 2.2522 + 0.5787 (\mbox{drinks/week}) -0.8288 - 0.0339(\mbox{drinks/week}) \ \ \ \ \ \mbox{for Atheists}\] or \[\mbox{party hours/week} = 1.4234 + 0.5448 (\mbox{drinks/week}) \ \ \ \ \ \mbox{for Atheists}\] Regression equations for other groups can be reconstructed in the same way. We see that the slope -0.8288 in \(-0.8288 f_{\rm Atheist}\) is the difference in intercepts between Christians and Atheists, and the slope -0.0339 in the interaction term \(-0.0339 f_{\rm Atheist}\cdot (\mbox{drinks/week})\) is the difference in the slope of (drinks/week) between Christians and Atheists. The slopes associated with the other binary variables and interaction terms can be interpreted in the same way.
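
As with fit_group earlier, the intercept and slope for any group can be extracted from the coefficients of fit4. A minimal sketch for the Atheist group (the coefficient names are taken from the summary output above):

cf <- coef(fit4)
intercept_atheist <- cf["(Intercept)"] + cf["religionAtheist"]     # 2.2522 - 0.8288 = 1.4234
slope_atheist     <- cf["drinks"] + cf["drinks:religionAtheist"]   # 0.5787 - 0.0339 = 0.5448
c(intercept_atheist, slope_atheist)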

Looking at the summary statistics, it doesn’t seem that this model is better than the simple regression model fit <- lm(partyHr ~ drinks, data=survey) we had before. We can make a few plots to gain more insight:

xyplot(partyHr ~ drinks | religion, data=survey, pch=16,
  panel = function(x, y, ...) {
       panel.xyplot(x, y, ...)
       panel.lmline(x, y, col = "red")
  })

survey$residuals <- fit4$residuals
xyplot(residuals ~ drinks | religion, data=survey, pch=16,
       panel = function(x, y, ...) {
       panel.xyplot(x, y, ...)
       panel.abline(h=0)
  })

It appears that there is still a lot of scatter even after we split the data into the religion groups. This shows that while the number of drinks/week correlates with party hours/week, the correlation is far from perfect even within each religion group. There are other important factors, not included in our model, that contribute to the scatter.

Prediction and Confidence Intervals

As in simple linear regression, we can use the predict() function to predict the outcome of a new data set. For example,

predict(fit4, newdata=data.frame(drinks=c(1,5,3,8), religion=c("Hindu", "Atheist", "Christian", "Jewish")))
       1        2        3        4 
2.258601 4.147103 3.988215 7.712010 

As before, the new data have to be put into a data frame. In fit4, we predict partyHr from drinks and religion, so we have to supply both drinks and religion in the new data frame.
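
predict() can also return interval estimates for these predictions through its interval argument; for example (not shown in the original note), 95% prediction intervals for the same new data:

newdat <- data.frame(drinks=c(1,5,3,8),
                     religion=c("Hindu", "Atheist", "Christian", "Jewish"))
predict(fit4, newdata=newdat, interval="prediction")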

The confint() function can be used to calculate the confidence intervals just as in simple regression:

confint(fit4)
                                   2.5 %       97.5 %
(Intercept)                    1.8094315  2.694958039
drinks                         0.5434341  0.613912469
religionJewish                 0.4738206  3.914478748
religionMuslim                -2.1142264  1.607632446
religionHindu                 -2.5233460  1.179705850
religionBuddhist              -0.9739865  2.571888029
religionOther Religion        -0.2451310  2.449388097
religionAgnostic              -0.6858482  1.281975765
religionAtheist               -1.8098322  0.152158549
drinks:religionJewish         -0.3046802 -0.036249942
drinks:religionMuslim         -0.3727786  0.521997885
drinks:religionHindu          -0.1435821  0.342687872
drinks:religionBuddhist       -0.6742098  0.008184675
drinks:religionOther Religion -0.2596937 -0.054168254
drinks:religionAgnostic       -0.2003240 -0.017064044
drinks:religionAtheist        -0.1178269  0.049978439

The default confidence level is 95%, but it can be changed by the optional parameter level. For example, to calculate the 99% confidence interval for the ‘drinks’ coefficient, we use the command

confint(fit4, 'drinks', level=0.99)
           0.5 %    99.5 %
drinks 0.5323323 0.6250143

Model Formulae

Note that the formula syntax used by lm() is somewhat different from the written regression equation. For example, the command

lm(y ~ x)

means that a linear model of the form \(y=\beta_0 + \beta_1 x\) is to be fitted (if x is not a factor variable). The command

lm(y ~ x-1)

means that a linear model of the form \(y=\beta_1 x\) is to be fitted. The -1 means to exclude the intercept. Another way to exclude the intercept is

lm(y ~ 0+x)

It means exactly the same as lm(y ~ x-1).

The following table is a summary of the formulae and model fits.

Formula Model
lm(y ~ x) \(y=\beta_0+\beta_1 x\)
lm(y ~ x-1) or lm(y ~ 0+x) \(y=\beta_1 x\)
lm(y ~ x1 + x2) \(y=\beta_0 + \beta_1 x1 + \beta_2 x2\)
lm(y ~ x1 + x1:x2) \(y=\beta_0 + \beta_1 x1 + \beta_2 x1\cdot x2\)
lm(y ~ x1*x2) \(y=\beta_0 + \beta_1 x1 + \beta_2 x2 + \beta_3 x1\cdot x2\)
lm(y ~ x1*x2*x3) \(y=\beta_0 + \beta_1 x1 + \beta_2 x2 + \beta_3 x3 + \beta_4 x1\cdot x2 + \beta_5 x1\cdot x3 + \beta_6 x2 \cdot x3 + \beta_7 x1\cdot x2 \cdot x3\)

It is assumed that x, x1 and x2 above are not factor variables. If x1 is a factor variable with, say, 3 levels, two binary variables associated with x1 will be created and there will be extra terms.
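
A quick way to see those extra terms is to look at the design matrix of a small artificial example (a standalone sketch; the variables f and x2 here are made up and unrelated to the survey data):

f  <- gl(3, 2, labels=c("a","b","c"))   # a 3-level factor: a a b b c c
x2 <- rnorm(6)
model.matrix(~ f + x2)                  # columns: (Intercept), fb, fc, x2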

You may wonder what to do if we want to fit a model of the form \(y=\beta_0+\beta_1 (x1+x2)\). We clearly can’t use lm(y~x1+x2) because it means something different. Other than creating a new variable to hold x1+x2, one way to do it is to use the I() function:

lm(y ~ I(x1+x2))

This tells lm() to treat x1+x2 as a single variable. Similarly, if we want to fit the model \[y=\beta_1 (x1 + 4 x1\cdot x2) + \beta_2 (x2)^3,\] we should type

lm(y ~ I(x1 + 4*x1*x2) + I(x2^3) - 1)

or

lm(y ~ 0 + I(x1 + 4*x1*x2) + I(x2^3))

Again the -1 or 0+ is to exclude the intercept. If it is omitted, the model \(y=\beta_0 + \beta_1 (x1 + 4 x1\cdot x2) + \beta_2 (x2)^3\) will be fitted.