In this note, we demonstrate using the lm()
function on categorical variables. We again use the Stat 100 Survey 2, Fall 2015 (combined) data we have been working on for demonstration. First, let’s reload the data from the saved rda file.
load('Stat100_Survey2_Fall2015.rda')
The data frame survey
is loaded, which stores the survey data. As mentioned before, R’s factor variables are designed to represent categorical data. In our data set, the gender
column is a categorical variable: it is either male or female. Right now, the column is a character vector as you can see if you type the command class(survey$gender)
. We want to change it to a factor variable, with “male” set to the first level:
survey$gender <- factor(survey$gender, levels=c("male","female"))
We want to preserve this change so that we won’t have to do this again the next time we load the file, so we overwrite the ‘Stat100_Survey2_Fall2015.rda’:
save(survey, file='Stat100_Survey2_Fall2015.rda')
Recall that last time we fit a linear model predicting student’s party hours/week from the average number of drinks/week:
fit <- lm(partyHr ~ drinks, data=survey)
summary(fit)
Call:
lm(formula = partyHr ~ drinks, data = survey)
Residuals:
Min 1Q Median 3Q Max
-22.623 -2.315 -0.777 1.685 47.685
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.31522 0.16404 14.11 <2e-16 ***
drinks 0.54615 0.01395 39.14 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.491 on 1136 degrees of freedom
Multiple R-squared: 0.5742, Adjusted R-squared: 0.5738
F-statistic: 1532 on 1 and 1136 DF, p-value: < 2.2e-16
Now we want to know if there is any difference between male and female students. We can include gender in the model by the command
fit1 <- lm(partyHr ~ drinks + gender, data=survey)
summary(fit1)
Call:
lm(formula = partyHr ~ drinks + gender, data = survey)
Residuals:
Min 1Q Median 3Q Max
-22.494 -2.515 -0.736 1.601 48.262
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.73761 0.26983 6.440 1.77e-10 ***
drinks 0.55513 0.01431 38.794 < 2e-16 ***
genderfemale 0.77765 0.28891 2.692 0.00721 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.478 on 1135 degrees of freedom
Multiple R-squared: 0.5769, Adjusted R-squared: 0.5762
F-statistic: 773.8 on 2 and 1135 DF, p-value: < 2.2e-16
We see that in addition to the intercept and slope for ‘drinks’, there is a third variable ‘genderfemale’. When lm()
encounters a factor variable with two levels, it creates a new variable based on the second level. In our example, the second level is female, and ‘genderfemale’ is created. It is a binary variable that takes the value 1 if the value of ‘gender’ is female, and 0 if the value of ‘gender’ is not female. The fitted model can be written as \[{\rm party\ hours/week} = 1.73761 + 0.55513\, ({\rm drinks/week})
+ 0.77765\, ({\rm female})\] where for simplicity we use ‘female’ instead of ‘genderfemale’ to denote the binary variable. In this model, male and female have the same slope (0.55513) for ‘drinks’ but different intercepts. For male, the intercept is 1.73761. For female, the intercept is 1.73761 + 0.77765 = 2.51526.
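As a quick sanity check (an optional step, not part of the original analysis), we can ask fit1 to predict party hours at drinks = 0 for both genders; the two predictions are just the two intercepts:
predict(fit1, newdata=data.frame(drinks=0, gender=c("male","female")))   # returns the male and female intercepts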
Note that since we set “male” to be the first level (also called the base level or reference level), the intercept corresponds to the intercept for “male”. We can change the base level by the relevel()
command:
gender2 <- relevel(survey$gender,"female")
gender2[1:10]
[1] female female female male female female male female female male
Levels: female male
We see that in gender2
, “female” is the base level. Let’s fit a linear model using gender2
:
fit2 <- lm(survey$partyHr ~ survey$drinks + gender2)
summary(fit2)
Call:
lm(formula = survey$partyHr ~ survey$drinks + gender2)
Residuals:
Min 1Q Median 3Q Max
-22.494 -2.515 -0.736 1.601 48.262
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.51525 0.17967 13.999 < 2e-16 ***
survey$drinks 0.55513 0.01431 38.794 < 2e-16 ***
gender2male -0.77765 0.28891 -2.692 0.00721 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.478 on 1135 degrees of freedom
Multiple R-squared: 0.5769, Adjusted R-squared: 0.5762
F-statistic: 773.8 on 2 and 1135 DF, p-value: < 2.2e-16
Since “female” is the base level in gender2, the intercept returned by lm() is the intercept for female. The binary variable is now called ‘gender2male’: it is 1 if the value of ‘gender2’ is “male” and 0 otherwise. We see that the slope for ‘drinks’ remains the same as before. The value of the intercept is 2.51525, which is the intercept for female we calculated above. The intercept for male is 2.51525 + (-0.77765) = 1.7376, consistent with the previous result.
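We can also verify the relationship between the two parametrizations directly from the stored coefficients (a small check using the coefficient names shown in the two summaries above):
coef(fit2)["(Intercept)"] + coef(fit2)["gender2male"]    # intercept for male
coef(fit1)["(Intercept)"] + coef(fit1)["genderfemale"]   # intercept for female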
In R, an interaction term is represented by a colon ‘:’. To fit a model for ‘partyHr’ versus ‘drinks’, ‘gender’ and the interaction between ‘drinks’ and ‘gender’, we can use the command
fit3 <- lm(partyHr ~ drinks + gender + drinks:gender, data=survey)
summary(fit3)
Call:
lm(formula = partyHr ~ drinks + gender + drinks:gender, data = survey)
Residuals:
Min 1Q Median 3Q Max
-19.969 -2.046 -0.860 1.515 47.954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3682 0.2921 8.108 1.33e-15 ***
drinks 0.4920 0.0185 26.595 < 2e-16 ***
genderfemale -0.3220 0.3531 -0.912 0.362
drinks:genderfemale 0.1519 0.0287 5.292 1.45e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.426 on 1134 degrees of freedom
Multiple R-squared: 0.5871, Adjusted R-squared: 0.586
F-statistic: 537.5 on 3 and 1134 DF, p-value: < 2.2e-16
Actually, R has a shorthand notation to represent drinks + gender + drinks:gender
, which is drinks*gender
. So we can simply type
fit3 <- lm(partyHr ~ drinks*gender, data=survey)
summary(fit3)
Call:
lm(formula = partyHr ~ drinks * gender, data = survey)
Residuals:
Min 1Q Median 3Q Max
-19.969 -2.046 -0.860 1.515 47.954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3682 0.2921 8.108 1.33e-15 ***
drinks 0.4920 0.0185 26.595 < 2e-16 ***
genderfemale -0.3220 0.3531 -0.912 0.362
drinks:genderfemale 0.1519 0.0287 5.292 1.45e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.426 on 1134 degrees of freedom
Multiple R-squared: 0.5871, Adjusted R-squared: 0.586
F-statistic: 537.5 on 3 and 1134 DF, p-value: < 2.2e-16
to get the same fit. The coefficients in the output mean that the fitted model is given by the equation \[ {\rm party\ hours/week} = 2.3682 + 0.492\, ({\rm drinks/week}) -0.322\, ({\rm female}) + 0.1519\, ({\rm drinks/week}) \cdot ({\rm female})\] Because of the interaction term, both the slope and the intercept differ between male and female. Note, however, that the coefficient of ‘genderfemale’ has a p-value of 0.362 and is not statistically significant, suggesting that the intercepts for male and female may not actually be very different. The slopes for ‘drinks’, on the other hand, do differ significantly (the interaction term has a p-value of 1.45e-07).
As you have learned in Stat 200, the regression equation above can be split into separate equations for male and female: \[ {\rm party\ hours/week} = \left\{ \begin{array}{ll} 2.3682 + 0.492\, ({\rm drinks/week}) & {\rm for\ male} \\ 2.0462 + 0.6439\, ({\rm drinks/week}) & {\rm for\ female}\end{array} \right. \] Including the interaction term is equivalent to splitting the data into male and female and fitting a separate linear model to each group. We can confirm this by actually splitting the data into two groups and fitting a linear model to each. Here is the code:
survey_male <- survey[survey$gender=="male",]
survey_female <- survey[survey$gender=="female",]
fit_male <- lm(partyHr ~ drinks, data=survey_male)
fit_female <- lm(partyHr ~ drinks, data=survey_female)
fit_male$coefficients
(Intercept) drinks
2.368229 0.492022
fit_female$coefficients
(Intercept) drinks
2.0462722 0.6438758
As expected, the regression coefficients for each group are the same as what we find above.
Let’s now plot the data with regression lines:
plot(partyHr ~ drinks, data=survey, col=as.integer(gender), xlab="Drinks/week",
ylab="Party hours/week", pch=19, las=1)
abline(fit_male$coefficients, col=1, lwd=2)
abline(fit_female$coefficients, col=2, lwd=2)
Since the levels in ‘gender’ are “male” and “female”, with “male” as the first level, as.integer(gender) converts “male” to the integer 1 and “female” to the integer 2. So the male data are plotted with col=1, which is black; the female data are plotted with col=2, which is red.
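To make the plot easier to read, we may also add a legend mapping the two colors to the factor levels (an optional cosmetic step, run right after the plotting commands above):
legend("topleft", legend=levels(survey$gender), col=1:2, pch=19)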
We can also make separate plots for male and female. As we mentioned before, this type of plot is easily made using the lattice package’s xyplot(). We can add regression lines to the split plots using a panel function. The details are described in this video. Here we just show you the command:
library(lattice)
xyplot(partyHr ~ drinks | gender, data=survey,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.lmline(x, y, col = "red")
})
We can also plot residuals. To do it with xyplot
, it is better to add the residuals to the data frame before plotting:
survey$residuals <- fit3$residuals
xyplot(residuals ~ drinks | gender, data=survey,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.abline(h=0)
})
Since survey did not have a column named “residuals”, the command survey$residuals <- fit3$residuals created a new column in the data frame. The residuals can then be plotted just like any other column.
Does including ‘gender’ and an interaction term make it a better model? If you compare the summary statistics of fit and fit3, you will see that the new model doesn’t improve things much: the R² of the new model is only slightly larger and the residual standard error is only slightly smaller. The plots above also show that the residuals of the new model are not much different from those of the simple regression model. So adding ‘gender’ and the interaction term does not substantially reduce the residuals.
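One way to make this comparison formal (a standard nested-model F test, not something done above) is to pass both fitted models to anova():
anova(fit, fit3)   # F test: does adding gender and the interaction significantly improve the fit?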
Here is another demonstration that factor variables can be used to fit two groups of data without splitting the data. We are going to work backward here. We will create two groups of data and then combine them.
First we create two groups of artificial data x1, y1 and x2, y2:
x1 <- seq(0,1,length.out=100)
x2 <- seq(2.5,3.5,length.out=100)
set.seed(5619783)
y1 <- 5 + 2.2*x1 + rnorm(100,sd=0.4)
y2 <- 11.2 -2.2*x2 + rnorm(100,sd=0.4)
Each of x1, x2, y1, and y2 is a numeric vector of length 100; y1 is related to x1 by a linear relationship with Gaussian noise rnorm(100, sd=0.4) added to it, and y2 is related to x2 by another linear relationship with the same kind of noise added. Let’s now fit a linear model for each group:
fit1 <- lm(y1 ~ x1)
fit2 <- lm(y2 ~ x2)
fit1$coefficients
(Intercept) x1
4.947955 2.327810
fit2$coefficients
(Intercept) x2
11.392317 -2.265344
Now we combine the two data sets:
x <- c(x1,x2)
y <- c(y1,y2)
The first 100 elements of x are x1 and the next 100 are x2; similarly for y. To label the two groups, we create a factor vector group of length 200, with the first 100 elements labeled “1” and the next 100 labeled “2”. There are at least two ways to create the group variable. The first is
group <- rep(c(1,2), each=100)
group <- as.factor(group)
Another, simpler, way is to use the gl() function:
group <- gl(2,100)
group
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
[106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Levels: 1 2
Briefly, gl(n,k) creates a factor vector with n levels, each repeated k times. By default, the levels are labeled ‘1’, ‘2’, …, ‘n’, but this can be changed by setting the labels parameter. See ?gl for details.
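For example, if we wanted more descriptive labels, we could have created the same grouping with something like the following (group_labeled is just an illustrative name):
group_labeled <- gl(2, 100, labels=c("group1","group2"))   # same structure as group, nicer labels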
The group
variable sets the first 100 elements to be in level ‘1’ and the next 100 elements to be in level ‘2’. We can plot the combined data:
plot(y ~ x, col=as.integer(group), pch=19, las=1)
Here the group 1 data are plotted with col=1, which is black, and the group 2 data are plotted with col=2, which is red. Clearly the two groups are widely separated, and each has its own intercept and slope when we fit a linear model to it. If we simply fit a linear model to the combined data, the fit won’t be good:
fit_combined <- lm(y ~ x)
summary(fit_combined)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.84096 -0.55405 -0.02143 0.51098 1.83781
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.35741 0.09791 64.93 <2e-16 ***
x -0.57333 0.04511 -12.71 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8189 on 198 degrees of freedom
Multiple R-squared: 0.4493, Adjusted R-squared: 0.4465
F-statistic: 161.5 on 1 and 198 DF, p-value: < 2.2e-16
plot(y ~ x, col=as.integer(group), pch=19, las=1)
abline(fit_combined, col="green", lwd=2)
When we include the factor variable group
with the interaction term, we have
fit_group <- lm(y ~ x*group)
summary(fit_group)
Call:
lm(formula = y ~ x * group)
Residuals:
Min 1Q Median 3Q Max
-1.02744 -0.30744 -0.00761 0.25988 1.33553
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.94796 0.08571 57.73 <2e-16 ***
x 2.32781 0.14809 15.72 <2e-16 ***
group2 6.44436 0.45451 14.18 <2e-16 ***
x:group2 -4.59315 0.20942 -21.93 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4318 on 196 degrees of freedom
Multiple R-squared: 0.8484, Adjusted R-squared: 0.8461
F-statistic: 365.7 on 3 and 196 DF, p-value: < 2.2e-16
We see that the intercept and slope for ‘x’ are exactly the same as those for the group 1 data calculated above with fit1. To find the intercept for group 2, we add the intercept and the coefficient of ‘group2’. To find the slope for group 2, we add the coefficients of ‘x’ and ‘x:group2’. We can now extract the coefficients for each group:
intercept_group1 <- fit_group$coefficients[1]
slope_group1 <- fit_group$coefficients[2]
intercept_group2 <- fit_group$coefficients[1] + fit_group$coefficients[3]
slope_group2 <- fit_group$coefficients[2] + fit_group$coefficients[4]
c(intercept_group1, slope_group1)
(Intercept) x
4.947955 2.327810
c(intercept_group2, slope_group2)
(Intercept) x
11.392317 -2.265344
As expected, we get the same numbers as calculated by fit1
and fit2
. Let’s plot the regression lines:
plot(y ~ x, col=as.integer(group), pch=19, las=1)
abline(fit_combined, col="green", lwd=2)
abline(intercept_group1, slope_group1, col=1, lwd=2)
abline(intercept_group2, slope_group2, col=2, lwd=2)
Clearly, the model with ‘group’ and the interaction term (black and red lines) is better than the simple lm(y ~ x) fit of fit_combined (green line). Comparing the summary statistics of fit_combined and fit_group, we see that the residual standard error decreases and R² increases substantially.
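If you want those numbers side by side, the summary object stores them; here is a small convenience snippet (the component names r.squared and sigma are documented in ?summary.lm):
c(simple = summary(fit_combined)$r.squared, grouped = summary(fit_group)$r.squared)   # R-squared
c(simple = summary(fit_combined)$sigma, grouped = summary(fit_group)$sigma)           # residual standard error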
Recall that religion
is a factor variable in the survey
data frame:
survey$religion[1:10]
[1] Other Religion Christian Christian Christian
[5] Christian Other Religion Agnostic Christian
[9] Christian Agnostic
8 Levels: Christian Jewish Muslim Hindu Buddhist ... Atheist
It has 8 levels with Christian as the base level. Suppose we fit ‘drinks’ versus ‘religion’:
fit_drinks <- lm(drinks ~ religion, data=survey)
summary(fit_drinks)
Call:
lm(formula = drinks ~ religion, data = survey)
Residuals:
Min 1Q Median 3Q Max
-9.723 -6.086 -2.923 2.445 44.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.5548 0.3822 19.767 < 2e-16 ***
religionJewish 2.1682 1.2325 1.759 0.07881 .
religionMuslim -5.3730 1.6883 -3.182 0.00150 **
religionHindu -3.5851 1.6883 -2.123 0.03393 *
religionBuddhist -4.6318 1.5603 -2.969 0.00306 **
religionOther Religion 0.1674 1.1771 0.142 0.88694
religionAgnostic -1.4687 0.8585 -1.711 0.08741 .
religionAtheist -1.7563 0.9012 -1.949 0.05155 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.447 on 1130 degrees of freedom
Multiple R-squared: 0.02631, Adjusted R-squared: 0.02028
F-statistic: 4.362 on 7 and 1130 DF, p-value: 8.745e-05
The formula drinks ~ religion
looks like a simple regression with one variable. You might expect one intercept and one slope. However, the output has one intercept and 7 slopes! This is because religion
is a factor variable with 8 levels. When lm() is applied to a factor variable with k levels, it creates k-1 binary variables, one for each level other than the base level. In the example above, we have 7 binary variables religionJewish
, religionMuslim
, religionHindu
, religionBuddhist
, religionOther Religion
, religionAgnostic
and religionAtheist
. The value of religionJewish
is 0 if the value of ‘religion’ is not ‘Jewish’ and 1 if ‘religion’ is ‘Jewish’, and similarly for the other binary variables. Note that there is no binary variable for ‘Christian’ because it is the base level. If ‘religion’ is ‘Christian’, all of the 7 binary variables are 0.
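You can see the coding scheme behind these 7 binary variables by inspecting the factor’s contrast matrix (a quick check that is not needed for the fit itself):
contrasts(survey$religion)   # rows are the 8 levels; columns are the 7 binary variables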
The regression equation is \[ {\rm drinks/week} = 7.5548 + 2.1682 f_{\rm Jewish} - 5.3730 f_{\rm Muslim} -3.5851 f_{\rm Hindu}
-4.6318 f_{\rm Buddhist}\] \[+ 0.1674 f_{\rm other}-1.4687 f_{\rm Agnostic} -1.7563 f_{\rm Atheist}\] Note that we use \(f_{\rm Jewish}\) to denote the binary variable religionJewish
and so on.
If ‘religion’ is ‘Christian’, all of the \(f\) variables are 0 and the regression equation predicts drinks/week = 7.5548. This is the average drinks/week for students who are Christians. If ‘religion’ is ‘Hindu’, we have \(f_{\rm Hindu}=1\) and all other \(f\)’s are 0, and the regression equation predicts \[{\rm drinks/week} = 7.5548 - 3.5851 = 3.9697\] This is the average drinks/week for students who are Hindus. We see that the slope of each binary variable is the difference between the predicted value for that group and the predicted value for ‘Christian’. If the difference is 0, the average drinks/week for that group is the same as that for Christians. Recall that the t-values and p-values are based on the null hypothesis that the slope is 0. The slope for ‘Other Religion’ has a p-value of about 0.89, suggesting that the average drinks/week for students in ‘Other Religion’ is not significantly different from that of Christians.
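We can check this arithmetic directly from the stored coefficients (a quick check; the coefficient names are the ones printed in the summary above):
coef(fit_drinks)["(Intercept)"] + coef(fit_drinks)["religionHindu"]   # predicted drinks/week for Hindus
mean(survey$drinks[survey$religion=="Hindu"])                         # group average, should match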
In this particular example, the slope of each binary variable is the difference between that group’s average and the base-level average. If we change the base level, we change the meaning of the slopes. For example, we can define another factor variable with the base level set to ‘Atheist’:
rel <- relevel(survey$religion, "Atheist")
In the model
fit_drinks_rel <- lm(survey$drinks ~ rel)
the slopes now give the difference between each group’s average drinks/week and that of Atheists.
summary(fit_drinks_rel)
Call:
lm(formula = survey$drinks ~ rel)
Residuals:
Min 1Q Median 3Q Max
-9.723 -6.086 -2.923 2.445 44.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.7985 0.8161 7.105 2.12e-12 ***
relChristian 1.7563 0.9012 1.949 0.05155 .
relJewish 3.9246 1.4279 2.748 0.00608 **
relMuslim -3.6167 1.8359 -1.970 0.04908 *
relHindu -1.8288 1.8359 -0.996 0.31939
relBuddhist -2.8754 1.7188 -1.673 0.09462 .
relOther Religion 1.9237 1.3804 1.394 0.16372
relAgnostic 0.2876 1.1212 0.257 0.79761
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.447 on 1130 degrees of freedom
Multiple R-squared: 0.02631, Adjusted R-squared: 0.02028
F-statistic: 4.362 on 7 and 1130 DF, p-value: 8.745e-05
What if we don’t want to compare the group averages with any particular group? The trick is to remove the intercept from the linear model. This can be done with the command
fit_drinks_nointercept <- lm(drinks ~ religion - 1, data=survey)
The -1
in the formula tells the lm()
function not to include an intercept. The result is that 8 binary variables are created:
summary(fit_drinks_nointercept)
Call:
lm(formula = drinks ~ religion - 1, data = survey)
Residuals:
Min 1Q Median 3Q Max
-9.723 -6.086 -2.923 2.445 44.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
religionChristian 7.5548 0.3822 19.767 < 2e-16 ***
religionJewish 9.7231 1.1718 8.298 2.99e-16 ***
religionMuslim 2.1818 1.6445 1.327 0.1849
religionHindu 3.9697 1.6445 2.414 0.0159 *
religionBuddhist 2.9231 1.5127 1.932 0.0536 .
religionOther Religion 7.7222 1.1133 6.936 6.77e-12 ***
religionAgnostic 6.0861 0.7688 7.916 5.80e-15 ***
religionAtheist 5.7985 0.8161 7.105 2.12e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.447 on 1130 degrees of freedom
Multiple R-squared: 0.3588, Adjusted R-squared: 0.3542
F-statistic: 79.02 on 8 and 1130 DF, p-value: < 2.2e-16
Now each slope represents the group average, and the p-value is based on the null hypothesis that the average is 0. We can confirm that the slopes indeed represent the group averages:
mean(survey$drinks[survey$religion=="Jewish"])
[1] 9.723077
mean(survey$drinks[survey$religion=="Hindu"])
[1] 3.969697
…
If you remember, there is an easy method to calculate the group average using tapply()
:
tapply(survey$drinks, survey$religion, mean)
Christian Jewish Muslim Hindu Buddhist
7.554828 9.723077 2.181818 3.969697 2.923077
Other Religion Agnostic Atheist
7.722222 6.086093 5.798507
As expected, we get the same numbers as the slopes.
As a final example, let’s fit a model for predicting party hours/week from drinks/week and religion, including the interaction terms:
fit4 <- lm(partyHr ~ drinks*religion, data=survey)
summary(fit4)
Call:
lm(formula = partyHr ~ drinks * religion, data = survey)
Residuals:
Min 1Q Median 3Q Max
-17.441 -2.253 -0.727 1.577 47.747
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.25261 0.22557 9.986 < 2e-16 ***
drinks 0.57900 0.01793 32.285 < 2e-16 ***
religionJewish 2.19373 0.87646 2.503 0.01246 *
religionMuslim -0.25371 0.94809 -0.268 0.78905
religionHindu -0.67224 0.94330 -0.713 0.47621
religionBuddhist 0.79853 0.90326 0.884 0.37685
religionOther Religion 1.10171 0.68639 1.605 0.10876
religionAgnostic 0.29765 0.50127 0.594 0.55278
religionAtheist -0.82925 0.49979 -1.659 0.09735 .
drinks:religionJewish -0.17079 0.06837 -2.498 0.01264 *
drinks:religionMuslim 0.07428 0.22793 0.326 0.74455
drinks:religionHindu 0.09923 0.12387 0.801 0.42325
drinks:religionBuddhist -0.33334 0.17383 -1.918 0.05541 .
drinks:religionOther Religion -0.15726 0.05235 -3.004 0.00272 **
drinks:religionAgnostic -0.10902 0.04668 -2.336 0.01968 *
drinks:religionAtheist -0.03425 0.04274 -0.801 0.42309
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.458 on 1122 degrees of freedom
Multiple R-squared: 0.5856, Adjusted R-squared: 0.58
F-statistic: 105.7 on 15 and 1122 DF, p-value: < 2.2e-16
The reference level in ‘religion’ is ‘Christian’. The regression equation can be written as \[\mbox{party hours/week} = 2.25261 + 0.579 (\mbox{drinks/week}) + 2.19373 f_{\rm Jewish} -0.25371 f_{\rm Muslim} -0.67224 f_{\rm Hindu}\] \[+ 0.79853 f_{\rm Buddhist} + 1.10171 f_{\rm other} + 0.29765 f_{\rm Agnostic} -0.82925 f_{\rm Atheist}\] \[- 0.17079 f_{\rm Jewish}\cdot (\mbox{drinks/week}) + 0.07428 f_{\rm Muslim}\cdot (\mbox{drinks/week})\] \[+ 0.09923 f_{\rm Hindu}\cdot (\mbox{drinks/week})-0.33334 f_{\rm Buddhist}\cdot (\mbox{drinks/week})\] \[-0.15726 f_{\rm other}\cdot (\mbox{drinks/week}) -0.10902 f_{\rm Agnostic}\cdot (\mbox{drinks/week})\] \[-0.03425 f_{\rm Atheist}\cdot (\mbox{drinks/week})\] Simple regression equations for each of the 8 groups can be reconstructed from this equation. For example, for Christians, \[f_{\rm Jewish}=f_{\rm Muslim}=f_{\rm Hindu}=f_{\rm Buddhist}=f_{\rm other}=f_{\rm Agnostic}=f_{\rm Atheist}=0\] and the regression equation becomes \[\mbox{party hours/week} = 2.25261 + 0.579 (\mbox{drinks/week}) \ \ \ \ \ \mbox{for Christians}\] For Atheists, \[f_{\rm Atheist}=1 \ \ \ , \ \ \ f_{\rm Jewish}=f_{\rm Muslim}=f_{\rm Hindu}=f_{\rm Buddhist}=f_{\rm other}=f_{\rm Agnostic}=0\] and the regression equation becomes \[\mbox{party hours/week} = 2.25261 + 0.579 (\mbox{drinks/week}) -0.82925 -0.03425(\mbox{drinks/week}) \ \ \ \ \ \mbox{for Atheists}\] or \[\mbox{party hours/week} = 1.42336 + 0.54475 (\mbox{drinks/week}) \ \ \ \ \ \mbox{for Atheists}\] Regression equations for the other groups can be reconstructed in the same way. We see that the coefficient -0.82925 of \(f_{\rm Atheist}\) is the difference between the Atheist and Christian intercepts, and the coefficient -0.03425 of the interaction term \(f_{\rm Atheist}\cdot (\mbox{drinks/week})\) is the difference between the Atheist and Christian slopes for (drinks/week). The coefficients of the other binary variables and interaction terms can be interpreted in the same way.
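As a check on this arithmetic, the intercept and slope for Atheists can also be extracted from fit4 (the coefficient names are those printed in the summary above):
coef(fit4)["(Intercept)"] + coef(fit4)["religionAtheist"]     # intercept for Atheists
coef(fit4)["drinks"] + coef(fit4)["drinks:religionAtheist"]   # slope for Atheists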
Looking at the summary statistics, this model does not seem much better than the simple regression model fit <- lm(partyHr ~ drinks, data=survey) we had before. We can make a few plots to gain more insight:
xyplot(partyHr ~ drinks | religion, data=survey,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.lmline(x, y, col = "red")
})
survey$residuals <- fit4$residuals
xyplot(residuals ~ drinks | religion, data=survey,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.abline(h=0)
})
It appears that there is still a lot of scatter even after we split the data into the religion groups. This shows that while drinks/week correlates with party hours/week, the correlation is far from perfect even within each religion group. There are other important factors, not included in our model, that contribute to the scatter.
Note that the formula in the lm()
syntax is somewhat different from the regression formula. For example, the command
lm(y ~ x)
means that a linear model of the form \(y=\beta_0 + \beta_1 x\) is to be fitted (if x is not a factor variable). The command
lm(y ~ x-1)
means that a linear model of the form \(y=\beta_0 x\) is to be fitted. The -1
means to exclude the intercept. The following table is a summary of the commands and model fits.
Command | Model
---|---
lm(y ~ x) | \(y=\beta_0+\beta_1 x\)
lm(y ~ x-1) | \(y=\beta_0 x\)
lm(y ~ x1 + x2) | \(y=\beta_0 + \beta_1 x_1 + \beta_2 x_2\)
lm(y ~ x1 + x1:x2) | \(y=\beta_0 + \beta_1 x_1 + \beta_2 x_1\cdot x_2\)
lm(y ~ x1*x2) | \(y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1\cdot x_2\)
lm(y ~ x1*x2*x3) | \(y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1\cdot x_2 + \beta_5 x_1\cdot x_3 + \beta_6 x_2 \cdot x_3 + \beta_7 x_1\cdot x_2 \cdot x_3\)
It is assumed that x, x1 and x2 above are not factor variables. If x1 is a factor variable with, say, 3 levels, two binary variables associated with x1 will be created and there will be extra terms.
You may wonder what to do if we want to fit a model of the form \(y=\beta_0+\beta_1 (x_1+x_2)\). We clearly can’t use lm(y ~ x1+x2) because that means something different. Other than creating a new variable for x1+x2, one way to do it is to use the I() function:
lm(y ~ I(x1+x2))
This tells lm() that x1+x2 should be treated as a single variable. Similarly, if we want to fit the model \[y=\beta_0 (x_1 + 4 x_1\cdot x_2) + \beta_1 x_2^3,\] we should type
lm(y ~ I(x1 + 4*x1*x2) + I(x2^3) - 1)
Again, the -1 excludes the intercept. If it is omitted, the model \(y=\beta_0 + \beta_1 (x_1 + 4 x_1\cdot x_2) + \beta_2 x_2^3\) will be fitted.
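As a simple illustration (using artificial data and variable names u1, u2 and v made up just for this sketch), we can verify that I(u1+u2) is treated as a single predictor and returns one intercept and one slope:
set.seed(1)
u1 <- runif(50)
u2 <- runif(50)
v <- 1 + 2*(u1 + u2) + rnorm(50, sd=0.1)   # true relationship: v = 1 + 2*(u1+u2) + noise
lm(v ~ I(u1 + u2))$coefficients            # one intercept and one slope for the combined variable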