In this note, we demonstrate using the lm()
function on categorical variables. We again use the Stat 100 Survey 2, Fall 2015 (combined) data we have been working on for demonstration. First, let’s reload the data from the saved rda file.
load('Stat100_Survey2_Fall2015.rda')
The data frame survey
is loaded, which stores the survey data. As mentioned before, R’s factor variables are designed to represent categorical data. In our data set, the gender
column is a categorical variable: it is either male or female. Right now, the column is a character vector as you can see if you type the command class(survey$gender)
. We want to change it to a factor variable, with “male” set to the first level:
survey$gender <- factor(survey$gender, levels=c("male","female"))
We want to preserve this change so that we won’t have to do this again the next time we load the file, so we overwrite the ‘Stat100_Survey2_Fall2015.rda’:
save(survey, file='Stat100_Survey2_Fall2015.rda')
Recall that last time we fit a linear model predicting student’s party hours/week from the average number of drinks/week:
fit <- lm(partyHr ~ drinks, data=survey)
summary(fit)
Call:
lm(formula = partyHr ~ drinks, data = survey)
Residuals:
Min 1Q Median 3Q Max
-22.623 -2.315 -0.777 1.685 47.685
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.31522 0.16404 14.11 <2e-16 ***
drinks 0.54615 0.01395 39.14 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.491 on 1136 degrees of freedom
Multiple R-squared: 0.5742, Adjusted R-squared: 0.5738
F-statistic: 1532 on 1 and 1136 DF, p-value: < 2.2e-16
Now we want to know if there is any difference between male and female students. We can include gender in the model by the command
fit1 <- lm(partyHr ~ drinks + gender, data=survey)
summary(fit1)
Call:
lm(formula = partyHr ~ drinks + gender, data = survey)
Residuals:
Min 1Q Median 3Q Max
-22.494 -2.515 -0.736 1.601 48.262
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.73761 0.26983 6.440 1.77e-10 ***
drinks 0.55513 0.01431 38.794 < 2e-16 ***
genderfemale 0.77765 0.28891 2.692 0.00721 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.478 on 1135 degrees of freedom
Multiple R-squared: 0.5769, Adjusted R-squared: 0.5762
F-statistic: 773.8 on 2 and 1135 DF, p-value: < 2.2e-16
We see that in addition to the intercept and slope for ‘drinks’, there is a third variable ‘genderfemale’. When lm()
encounters a factor variable with two levels, it creates a new variable based on the second level. In our example, the second level is female, and ‘genderfemale’ is created. It is a binary variable that takes the value 1 if the value of ‘gender’ is female, and 0 if the value of ‘gender’ is not female. The fitted model can be written as \[{\rm party\ hours/week} = 1.73761 + 0.55513\, ({\rm drinks/week})
+ 0.77765\, ({\rm female})\] where for simplicity we use ‘female’ instead of ‘genderfemale’ to denote the binary variable. In this model, male and female have the same slope (0.55513) for ‘drinks’ but different intercepts. For male, the intercept is 1.73761. For female, the intercept is 1.73761 + 0.77765 = 2.51526.
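As a quick sanity check (an optional step, not part of the original analysis), we can ask fit1 to predict party hours at drinks = 0 for both genders; the two predictions are just the two intercepts:
predict(fit1, newdata=data.frame(drinks=0, gender=c("male","female")))   # returns the male and female intercepts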
Note that since we set “male” to be the first level (also called the base level or reference level), the intercept corresponds to the intercept for “male”. We can change the base level by the relevel()
command:
gender2 <- relevel(survey$gender,"female")
gender2[1:10]
[1] female female female male female female male female female male
Levels: female male
We see that in gender2
, “female” is the base level. Let’s fit a linear model using gender2
:
fit2 <- lm(survey$partyHr ~ survey$drinks + gender2)
summary(fit2)
Call:
lm(formula = survey$partyHr ~ survey$drinks + gender2)
Residuals:
Min 1Q Median 3Q Max
-22.494 -2.515 -0.736 1.601 48.262
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.51525 0.17967 13.999 < 2e-16 ***
survey$drinks 0.55513 0.01431 38.794 < 2e-16 ***
gender2male -0.77765 0.28891 -2.692 0.00721 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.478 on 1135 degrees of freedom
Multiple R-squared: 0.5769, Adjusted R-squared: 0.5762
F-statistic: 773.8 on 2 and 1135 DF, p-value: < 2.2e-16
Since “female” is the base level in gender2, the intercept returned by lm() is the intercept for female. The binary variable is now called ‘gender2male’: it is 1 if the value of ‘gender2’ is “male” and 0 otherwise. We see that the slope for ‘drinks’ remains the same as before. The value of the intercept is 2.51525, which is the intercept for female we calculated above. The intercept for male is 2.51525 + (-0.77765) = 1.7376, consistent with the previous result.
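We can also verify the relationship between the two parametrizations directly from the stored coefficients (a small check using the coefficient names shown in the two summaries above):
coef(fit2)["(Intercept)"] + coef(fit2)["gender2male"]    # intercept for male
coef(fit1)["(Intercept)"] + coef(fit1)["genderfemale"]   # intercept for female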
In R, an interaction term is represented by a colon ‘:’. To fit a model for ‘partyHr’ versus ‘drinks’, ‘gender’ and the interaction between ‘drinks’ and ‘gender’, we can use the command
fit3 <- lm(partyHr ~ drinks + gender + drinks:gender, data=survey)
summary(fit3)
Call:
lm(formula = partyHr ~ drinks + gender + drinks:gender, data = survey)
Residuals:
Min 1Q Median 3Q Max
-19.969 -2.046 -0.860 1.515 47.954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3682 0.2921 8.108 1.33e-15 ***
drinks 0.4920 0.0185 26.595 < 2e-16 ***
genderfemale -0.3220 0.3531 -0.912 0.362
drinks:genderfemale 0.1519 0.0287 5.292 1.45e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.426 on 1134 degrees of freedom
Multiple R-squared: 0.5871, Adjusted R-squared: 0.586
F-statistic: 537.5 on 3 and 1134 DF, p-value: < 2.2e-16
Actually, R has a shorthand notation to represent drinks + gender + drinks:gender
, which is drinks*gender
. So we can simply type
fit3 <- lm(partyHr ~ drinks*gender, data=survey)
summary(fit3)
Call:
lm(formula = partyHr ~ drinks * gender, data = survey)
Residuals:
Min 1Q Median 3Q Max
-19.969 -2.046 -0.860 1.515 47.954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3682 0.2921 8.108 1.33e-15 ***
drinks 0.4920 0.0185 26.595 < 2e-16 ***
genderfemale -0.3220 0.3531 -0.912 0.362
drinks:genderfemale 0.1519 0.0287 5.292 1.45e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.426 on 1134 degrees of freedom
Multiple R-squared: 0.5871, Adjusted R-squared: 0.586
F-statistic: 537.5 on 3 and 1134 DF, p-value: < 2.2e-16
to get the same fit. The coefficients in the output mean that the fitted model is given by the equation \[ {\rm party\ hours/week} = 2.3682 + 0.492\, ({\rm drinks/week}) -0.322\, ({\rm female}) + 0.1519\, ({\rm drinks/week}) \cdot ({\rm female})\] Because of the interaction term, both the slope and the intercept differ between male and female. Note, however, that the coefficient of ‘genderfemale’ has a p-value of 0.362 and is not statistically significant, suggesting that the intercepts for male and female may not actually be very different. The slopes for ‘drinks’, on the other hand, do differ significantly (the interaction term has a p-value of 1.45e-07).
As you have learned in Stat 200, the regression equation above can be split into separate equations for male and female: \[ {\rm party\ hours/week} = \left\{ \begin{array}{ll} 2.3682 + 0.492\, ({\rm drinks/week}) & {\rm for\ male} \\ 2.0462 + 0.6439\, ({\rm drinks/week}) & {\rm for\ female}\end{array} \right. \] Including the interaction term is equivalent to splitting the data into male and female and fitting a separate linear model to each group. We can confirm this by actually splitting the data into two groups and fitting a linear model to each. Here is the code:
survey_male <- survey[survey$gender=="male",]
survey_female <- survey[survey$gender=="female",]
fit_male <- lm(partyHr ~ drinks, data=survey_male)
fit_female <- lm(partyHr ~ drinks, data=survey_female)
fit_male$coefficients
(Intercept) drinks
2.368229 0.492022
fit_female$coefficients
(Intercept) drinks
2.0462722 0.6438758
As expected, the regression coefficients for each group are the same as what we find above.
Let’s now plot the data with regression lines:
plot(partyHr ~ drinks, data=survey, col=as.integer(gender), xlab="Drinks/week",
ylab="Party hours/week", pch=19, las=1)
abline(fit_male$coefficients, col=1, lwd=2)
abline(fit_female$coefficients, col=2, lwd=2)
Since the levels in ‘gender’ are “male” and “female”, with “male” as the first level, as.integer(gender) converts “male” to the integer 1 and “female” to the integer 2. So the male data are plotted with col=1, which is black; the female data are plotted with col=2, which is red.
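To make the plot easier to read, we may also add a legend mapping the two colors to the factor levels (an optional cosmetic step, run right after the plotting commands above):
legend("topleft", legend=levels(survey$gender), col=1:2, pch=19)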
We can also make separate plots for male and female. As we mentioned before, this type of plot is easily made using the lattice package’s xyplot(). We can add regression lines to the split plots using a panel function. The details are described in this video. Here we just show you the command:
library(lattice)
xyplot(partyHr ~ drinks | gender, data=survey,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.lmline(x, y, col = "red")
})
We can also plot residuals. To do it with xyplot
, it is better to add the residuals to the data frame before plotting:
survey$residuals <- fit3$residuals
xyplot(residuals ~ drinks | gender, data=survey,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.abline(h=0)
})
Since survey did not have a column named “residuals”, the command survey$residuals <- fit3$residuals created a new column in the data frame. The residuals can then be plotted just like any other column.
Does including ‘gender’ and an interaction term make it a better model? If you compare the summary statistics of fit and fit3, you will see that the new model doesn’t improve things much: the R² of the new model is only slightly larger and the residual standard error is only slightly smaller. The plots above also show that the residuals of the new model are not much different from those of the simple regression model. So adding ‘gender’ and the interaction term does not substantially reduce the residuals.
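One way to make this comparison formal (a standard nested-model F test, not something done above) is to pass both fitted models to anova():
anova(fit, fit3)   # F test: does adding gender and the interaction significantly improve the fit?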
Here is another demonstration that factor variables can be used to fit two groups of data without splitting the data. We are going to work backward here. We will create two groups of data and then combine them.
First we create two groups of artificial data x1, y1 and x2, y2:
x1 <- seq(0,1,length.out=100)
x2 <- seq(2.5,3.5,length.out=100)
set.seed(5619783)
y1 <- 5 + 2.2*x1 + rnorm(100,sd=0.4)
y2 <- 11.2 -2.2*x2 + rnorm(100,sd=0.4)
Each of x1, x2, y1, and y2 is a numeric vector of length 100; y1 is related to x1 by a linear relationship with Gaussian noise rnorm(100, sd=0.4) added to it, and y2 is related to x2 by another linear relationship with the same kind of noise added. Let’s now fit a linear model for each group:
fit1 <- lm(y1 ~ x1)
fit2 <- lm(y2 ~ x2)
fit1$coefficients
(Intercept) x1
4.947955 2.327810
fit2$coefficients
(Intercept) x2
11.392317 -2.265344
Now we combine the two data sets:
x <- c(x1,x2)
y <- c(y1,y2)
The first 100 elements of x are x1 and the next 100 are x2; similarly for y. To label the two groups, we create a factor vector group of length 200, with the first 100 elements labeled “1” and the next 100 labeled “2”. There are at least two ways to create the group variable. The first is
group <- rep(c(1,2), each=100)
group <- as.factor(group)
Another, simpler, way is to use the gl() function:
group <- gl(2,100)
group
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
[106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Levels: 1 2
Briefly, gl(n,k) creates a factor vector with n levels, each repeated k times. By default, the levels are labeled ‘1’, ‘2’, …, ‘n’, but this can be changed by setting the labels parameter. See ?gl for details.
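For example, if we wanted more descriptive labels, we could have created the same grouping with something like the following (group_labeled is just an illustrative name):
group_labeled <- gl(2, 100, labels=c("group1","group2"))   # same structure as group, nicer labels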
The group
variable sets the first 100 elements to be in level ‘1’ and the next 100 elements to be in level ‘2’. We can plot the combined data:
plot(y ~ x, col=as.integer(group), pch=19, las=1)
Here the group 1 data are plotted with col=1, which is black, and the group 2 data are plotted with col=2, which is red. Clearly the two groups are widely separated, and each has its own intercept and slope when we fit a linear model to it. If we simply fit a linear model to the combined data, the fit won’t be good:
fit_combined <- lm(y ~ x)
summary(fit_combined)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.84096 -0.55405 -0.02143 0.51098 1.83781
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.35741 0.09791 64.93 <2e-16 ***
x -0.57333 0.04511 -12.71 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8189 on 198 degrees of freedom
Multiple R-squared: 0.4493, Adjusted R-squared: 0.4465
F-statistic: 161.5 on 1 and 198 DF, p-value: < 2.2e-16
plot(y ~ x, col=as.integer(group), pch=19, las=1)
abline(fit_combined, col="green", lwd=2)
When we include the factor variable group
with the interaction term, we have
fit_group <- lm(y ~ x*group)
summary(fit_group)
Call:
lm(formula = y ~ x * group)
Residuals:
Min 1Q Median 3Q Max
-1.02744 -0.30744 -0.00761 0.25988 1.33553
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.94796 0.08571 57.73 <2e-16 ***
x 2.32781 0.14809 15.72 <2e-16 ***
group2 6.44436 0.45451 14.18 <2e-16 ***
x:group2 -4.59315 0.20942 -21.93 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4318 on 196 degrees of freedom
Multiple R-squared: 0.8484, Adjusted R-squared: 0.8461
F-statistic: 365.7 on 3 and 196 DF, p-value: < 2.2e-16
We see that the intercept and slope for ‘x’ are exactly the same as those for the group 1 data calculated above with fit1. To find the intercept for group 2, we add the intercept and the coefficient of ‘group2’. To find the slope for group 2, we add the coefficients of ‘x’ and ‘x:group2’. We can now extract the coefficients for each group:
intercept_group1 <- fit_group$coefficients[1]
slope_group1 <- fit_group$coefficients[2]
intercept_group2 <- fit_group$coefficients[1] + fit_group$coefficients[3]
slope_group2 <- fit_group$coefficients[2] + fit_group$coefficients[4]
c(intercept_group1, slope_group1)
(Intercept) x
4.947955 2.327810
c(intercept_group2, slope_group2)
(Intercept) x
11.392317 -2.265344
As expected, we get the same numbers as calculated by fit1
and fit2
. Let’s plot the regression lines:
plot(y ~ x, col=as.integer(group), pch=19, las=1)
abline(fit_combined, col="green", lwd=2)
abline(intercept_group1, slope_group1, col=1, lwd=2)
abline(intercept_group2, slope_group2, col=2, lwd=2)
Clearly, the model with ‘group’ and the interaction term (black and red lines) is better than the simple lm(y ~ x) fit of fit_combined (green line). Comparing the summary statistics of fit_combined and fit_group, we see that the residual standard error decreases and R² increases substantially.
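If you want those numbers side by side, the summary object stores them; here is a small convenience snippet (the component names r.squared and sigma are documented in ?summary.lm):
c(simple = summary(fit_combined)$r.squared, grouped = summary(fit_group)$r.squared)   # R-squared
c(simple = summary(fit_combined)$sigma, grouped = summary(fit_group)$sigma)           # residual standard error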
Recall that religion
is a factor variable in the survey
data frame:
survey$religion[1:10]
[1] Other Religion Christian Christian Christian
[5] Christian Other Religion Agnostic Christian
[9] Christian Agnostic
8 Levels: Christian Jewish Muslim Hindu Buddhist ... Atheist
It has 8 levels with Christian as the base level. Suppose we fit ‘drinks’ versus ‘religion’:
fit_drinks <- lm(drinks ~ religion, data=survey)
summary(fit_drinks)
Call:
lm(formula = drinks ~ religion, data = survey)
Residuals:
Min 1Q Median 3Q Max
-9.723 -6.086 -2.923 2.445 44.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.5548 0.3822 19.767 < 2e-16 ***
religionJewish 2.1682 1.2325 1.759 0.07881 .
religionMuslim -5.3730 1.6883 -3.182 0.00150 **
religionHindu -3.5851 1.6883 -2.123 0.03393 *
religionBuddhist -4.6318 1.5603 -2.969 0.00306 **
religionOther Religion 0.1674 1.1771 0.142 0.88694
religionAgnostic -1.4687 0.8585 -1.711 0.08741 .
religionAtheist -1.7563 0.9012 -1.949 0.05155 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.447 on 1130 degrees of freedom
Multiple R-squared: 0.02631, Adjusted R-squared: 0.02028
F-statistic: 4.362 on 7 and 1130 DF, p-value: 8.745e-05
The formula drinks ~ religion
looks like a simple regression with one variable. You might expect one intercept and one slope. However, the output has one intercept and 7 slopes! This is because religion
is a factor variable with 8 levels. When lm() is applied to a factor variable with k levels, it creates k-1 binary variables, one for each level other than the base level. In the example above, we have 7 binary variables religionJewish
, religionMuslim
, religionHindu
, religionBuddhist
, religionOther Religion
, religionAgnostic
and religionAtheist
. The value of religionJewish
is 0 if the value of ‘religion’ is not ‘Jewish’ and 1 if ‘religion’ is ‘Jewish’, and similarly for the other binary variables. Note that there is no binary variable for ‘Christian’ because it is the base level. If ‘religion’ is ‘Christian’, all of the 7 binary variables are 0.
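You can see the coding scheme behind these 7 binary variables by inspecting the factor’s contrast matrix (a quick check that is not needed for the fit itself):
contrasts(survey$religion)   # rows are the 8 levels; columns are the 7 binary variables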
The regression equation is \[ {\rm drinks/week} = 7.5548 + 2.1682 f_{\rm Jewish} - 5.3730 f_{\rm Muslim} -3.5851 f_{\rm Hindu}
-4.6318 f_{\rm Buddhist}\] \[+ 0.1674 f_{\rm other}-1.4687 f_{\rm Agnostic} -1.7563 f_{\rm Atheist}\] Note that we use \(f_{\rm Jewish}\) to denote the binary variable religionJewish
and so on.
If ‘religion’ is ‘Christian’, all of the \(f\) variables are 0 and the regression equation predicts drinks/week = 7.5548. This is the average drinks/week for students who are Christians. If ‘religion’ is ‘Hindu’, we have \(f_{\rm Hindu}=1\) and all other \(f\)’s are 0, and the regression equation predicts \[{\rm drinks/week} = 7.5548 - 3.5851 = 3.9697\] This is the average drinks/week for students who are Hindus. We see that the slope of each binary variable is the difference between the predicted value for that group and the predicted value for ‘Christian’. If the difference is 0, the average drinks/week for that group is the same as that for Christians. Recall that the t-values and p-values are based on the null hypothesis that the slope is 0. The slope for ‘Other Religion’ has a p-value of about 0.89, suggesting that the average drinks/week for students in ‘Other Religion’ is not significantly different from that of Christians.
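We can check this arithmetic directly from the stored coefficients (a quick check; the coefficient names are the ones printed in the summary above):
coef(fit_drinks)["(Intercept)"] + coef(fit_drinks)["religionHindu"]   # predicted drinks/week for Hindus
mean(survey$drinks[survey$religion=="Hindu"])                         # group average, should match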
In this particular example, the slope of each binary variable is the difference between that group’s average and the base-level average. If we change the base level, we change the meaning of the slopes. For example, we can define another factor variable with the base level set to ‘Atheist’:
rel <- relevel(survey$religion, "Atheist")
In the model
fit_drinks_rel <- lm(survey$drinks ~ rel)
the slopes now give the difference between each group’s average drinks/week and that of Atheists.
summary(fit_drinks_rel)
Call:
lm(formula = survey$drinks ~ rel)
Residuals:
Min 1Q Median 3Q Max
-9.723 -6.086 -2.923 2.445 44.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.7985 0.8161 7.105 2.12e-12 ***
relChristian 1.7563 0.9012 1.949 0.05155 .
relJewish 3.9246 1.4279 2.748 0.00608 **
relMuslim -3.6167 1.8359 -1.970 0.04908 *
relHindu -1.8288 1.8359 -0.996 0.31939
relBuddhist -2.8754 1.7188 -1.673 0.09462 .
relOther Religion 1.9237 1.3804 1.394 0.16372
relAgnostic 0.2876 1.1212 0.257 0.79761
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.447 on 1130 degrees of freedom
Multiple R-squared: 0.02631, Adjusted R-squared: 0.02028
F-statistic: 4.362 on 7 and 1130 DF, p-value: 8.745e-05
What if we don’t want to compare the group averages with any particular group? The trick is to remove the intercept from the linear model. This can be done with the command
fit_drinks_nointercept <- lm(drinks ~ religion - 1, data=survey)
The -1
in the formula tells the lm()
function not to include an intercept. The result is that 8 binary variables are created:
summary(fit_drinks_nointercept)
Call:
lm(formula = drinks ~ religion - 1, data = survey)
Residuals:
Min 1Q Median 3Q Max
-9.723 -6.086 -2.923 2.445 44.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
religionChristian 7.5548 0.3822 19.767 < 2e-16 ***
religionJewish 9.7231 1.1718 8.298 2.99e-16 ***
religionMuslim 2.1818 1.6445 1.327 0.1849
religionHindu 3.9697 1.6445 2.414 0.0159 *
religionBuddhist 2.9231 1.5127 1.932 0.0536 .
religionOther Religion 7.7222 1.1133 6.936 6.77e-12 ***
religionAgnostic 6.0861 0.7688 7.916 5.80e-15 ***
religionAtheist 5.7985 0.8161 7.105 2.12e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.447 on 1130 degrees of freedom
Multiple R-squared: 0.3588, Adjusted R-squared: 0.3542
F-statistic: 79.02 on 8 and 1130 DF, p-value: < 2.2e-16
Now each slope represents the group average, and the p-value is based on the null hypothesis that the average is 0. We can confirm that the slopes indeed represent the group averages:
mean(survey$drinks[survey$religion=="Jewish"])
[1] 9.723077
mean(survey$drinks[survey$religion=="Hindu"])
[1] 3.969697
…
If you remember, there is an easy method to calculate the group average using tapply()
:
tapply(survey$drinks, survey$religion, mean)
Christian Jewish Muslim Hindu Buddhist
7.554828 9.723077 2.181818 3.969697 2.923077
Other Religion Agnostic Atheist
7.722222 6.086093 5.798507
As expected, we get the same numbers as the slopes.
As a final example, let’s fit a model for predicting party hours/week from drinks/week and religion, including the interaction terms:
fit4 <- lm(partyHr ~ drinks*religion, data=survey)
summary(fit4)
Call:
lm(formula = partyHr ~ drinks * religion, data = survey)
Residuals:
Min 1Q Median 3Q Max
-17.441 -2.253 -0.727 1.577 47.747
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.25261 0.22557 9.986 < 2e-16 ***
drinks 0.57900 0.01793 32.285 < 2e-16 ***
religionJewish 2.19373 0.87646 2.503 0.01246 *
religionMuslim -0.25371 0.94809 -0.268 0.78905
religionHindu -0.67224 0.94330 -0.713 0.47621
religionBuddhist 0.79853 0.90326 0.884 0.37685
religionOther Religion 1.10171 0.68639 1.605 0.10876
religionAgnostic 0.29765 0.50127 0.594 0.55278
religionAtheist -0.82925 0.49979 -1.659 0.09735 .
drinks:religionJewish -0.17079 0.06837 -2.498 0.01264 *
drinks:religionMuslim 0.07428 0.22793 0.326 0.74455
drinks:religionHindu 0.09923 0.12387 0.801 0.42325
drinks:religionBuddhist -0.33334 0.17383 -1.918 0.05541 .
drinks:religionOther Religion -0.15726 0.05235 -3.004 0.00272 **
drinks:religionAgnostic -0.10902 0.04668 -2.336 0.01968 *
drinks:religionAtheist -0.03425 0.04274 -0.801 0.42309
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.458 on 1122 degrees of freedom
Multiple R-squared: 0.5856, Adjusted R-squared: 0.58
F-statistic: 105.7 on 15 and 1122 DF, p-value: < 2.2e-16
The reference level in ‘religion’ is ‘Christian’. The regression equation can be written as \[\mbox{party hours/week} = 2.25261 + 0.579 (\mbox{drinks/week}) + 2.19373 f_{\rm Jewish} -0.25371 f_{\rm Muslim} -0.67224 f_{\rm Hindu}\] \[+ 0.79853 f_{\rm Buddhist} + 1.10171 f_{\rm other} + 0.29765 f_{\rm Agnostic} -0.82925 f_{\rm Atheist}\] \[- 0.17079 f_{\rm Jewish}\cdot (\mbox{drinks/week}) + 0.07428 f_{\rm Muslim}\cdot (\mbox{drinks/week})\] \[+ 0.09923 f_{\rm Hindu}\cdot (\mbox{drinks/week})-0.33334 f_{\rm Buddhist}\cdot (\mbox{drinks/week})\] \[-0.15726 f_{\rm other}\cdot (\mbox{drinks/week}) -0.10902 f_{\rm Agnostic}\cdot (\mbox{drinks/week})\] \[-0.03425 f_{\rm Atheist}\cdot (\mbox{drinks/week})\] Simple regression equations for each of the 8 groups can be reconstructed from this equation. For example, for Christians, \[f_{\rm Jewish}=f_{\rm Muslim}=f_{\rm Hindu}=f_{\rm Buddhist}=f_{\rm other}=f_{\rm Agnostic}=f_{\rm Atheist}=0\] and the regression equation becomes \[\mbox{party hours/week} = 2.25261 + 0.579 (\mbox{drinks/week}) \ \ \ \ \ \mbox{for Christians}\] For Atheists, \[f_{\rm Atheist}=1 \ \ \ , \ \ \ f_{\rm Jewish}=f_{\rm Muslim}=f_{\rm Hindu}=f_{\rm Buddhist}=f_{\rm other}=f_{\rm Agnostic}=0\] and the regression equation becomes \[\mbox{party hours/week} = 2.25261 + 0.579 (\mbox{drinks/week}) -0.82925 -0.03425(\mbox{drinks/week}) \ \ \ \ \ \mbox{for Atheists}\] or \[\mbox{party hours/week} = 1.42336 + 0.54475 (\mbox{drinks/week}) \ \ \ \ \ \mbox{for Atheists}\] Regression equations for the other groups can be reconstructed in the same way. We see that the coefficient -0.82925 of \(f_{\rm Atheist}\) is the difference between the Atheist and Christian intercepts, and the coefficient -0.03425 of the interaction term \(f_{\rm Atheist}\cdot (\mbox{drinks/week})\) is the difference between the Atheist and Christian slopes for (drinks/week). The coefficients of the other binary variables and interaction terms can be interpreted in the same way.
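As a check on this arithmetic, the intercept and slope for Atheists can also be extracted from fit4 (the coefficient names are those printed in the summary above):
coef(fit4)["(Intercept)"] + coef(fit4)["religionAtheist"]     # intercept for Atheists
coef(fit4)["drinks"] + coef(fit4)["drinks:religionAtheist"]   # slope for Atheists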
Looking at the summary statistics, this model does not seem much better than the simple regression model fit <- lm(partyHr ~ drinks, data=survey) we had before. We can make a few plots to gain more insight:
xyplot(partyHr ~ drinks | religion, data=survey,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.lmline(x, y, col = "red")
})
survey$residuals <- fit4$residuals
xyplot(residuals ~ drinks | religion, data=survey,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.abline(h=0)
})
It appears that there is still a lot of scatter even after we split the data into the religion groups. This shows that while drinks/week correlates with party hours/week, the correlation is far from perfect even within each religion group. There are other important factors, not included in our model, that contribute to the scatter.
Note that the formula in the lm()
syntax is somewhat different from the regression formula. For example, the command
lm(y ~ x)
means that a linear model of the form \(y=\beta_0 + \beta_1 x\) is to be fitted (if x is not a factor variable). The command
lm(y ~ x-1)
means that a linear model of the form \(y=\beta_0 x\) is to be fitted. The -1
means to exclude the intercept. The following table is a summary of the commands and model fits.
Command | Model
---|---
lm(y ~ x) | \(y=\beta_0+\beta_1 x\)
lm(y ~ x-1) | \(y=\beta_0 x\)
lm(y ~ x1 + x2) | \(y=\beta_0 + \beta_1 x_1 + \beta_2 x_2\)
lm(y ~ x1 + x1:x2) | \(y=\beta_0 + \beta_1 x_1 + \beta_2 x_1\cdot x_2\)
lm(y ~ x1*x2) | \(y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1\cdot x_2\)
lm(y ~ x1*x2*x3) | \(y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1\cdot x_2 + \beta_5 x_1\cdot x_3 + \beta_6 x_2 \cdot x_3 + \beta_7 x_1\cdot x_2 \cdot x_3\)
It is assumed that x, x1 and x2 above are not factor variables. If x1 is a factor variable with, say, 3 levels, two binary variables associated with x1 will be created and there will be extra terms.
You may wonder what to do if we want to fit a model of the form \(y=\beta_0+\beta_1 (x_1+x_2)\). We clearly can’t use lm(y ~ x1+x2) because that means something different. Other than creating a new variable for x1+x2, one way to do it is to use the I() function:
lm(y ~ I(x1+x2))
This tells lm() that x1+x2 should be treated as a single variable. Similarly, if we want to fit the model \[y=\beta_0 (x_1 + 4 x_1\cdot x_2) + \beta_1 x_2^3,\] we should type
lm(y ~ I(x1 + 4*x1*x2) + I(x2^3) - 1)
Again, the -1 excludes the intercept. If it is omitted, the model \(y=\beta_0 + \beta_1 (x_1 + 4 x_1\cdot x_2) + \beta_2 x_2^3\) will be fitted.
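As a simple illustration (using artificial data and variable names u1, u2 and v made up just for this sketch), we can verify that I(u1+u2) is treated as a single predictor and returns one intercept and one slope:
set.seed(1)
u1 <- runif(50)
u2 <- runif(50)
v <- 1 + 2*(u1 + u2) + rnorm(50, sd=0.1)   # true relationship: v = 1 + 2*(u1+u2) + noise
lm(v ~ I(u1 + u2))$coefficients            # one intercept and one slope for the combined variable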