The lm() Function

Linear regression is one of the most important classes of models in statistics. In R, the built-in function lm() is the workhorse for fitting linear models, with many bells and whistles. In this note we use the Stat 100 Survey 2, Fall 2015 (combined) data we worked on previously to demonstrate the use of lm() to perform a simple linear regression. We will explore the lm() function further in future lessons.

Since we have saved the data previously, we can reload it using the command

load('Stat100_Survey2_Fall2015.rda')

Using the ls() command, we see that the data frame survey has been loaded into the workspace. To remind ourselves of the variables in the data, we type

names(survey)
 [1] "gender"             "genderID"           "greek"             
 [4] "homeTown"           "ethnicity"          "religion"          
 [7] "religious"          "ACT"                "GPA"               
[10] "partyHr"            "drinks"             "sexPartners"       
[13] "relationships"      "firstKissAge"       "favPeriod"         
[16] "hoursCallParents"   "socialMedia"        "texts"             
[19] "good_well"          "parentRelationship" "workHr"            
[22] "percentTuition"     "career"            

Recall that we plotted the number of drinks per week versus party hours per week. Here, let’s reverse the plot:

plot(partyHr ~ drinks, data=survey, pch=19, xlab="Drinks/week", ylab="Party hours/week")

It is apparent that as the number of drinks increases, the party hours tend to increase as well. To fit a linear model predicting party hours from drinks, we use the following R command

fit <- lm(partyHr ~ drinks, data=survey)

If you type the command class(fit), you will see that the data type is ‘lm’, something we haven’t seen before. If we type fit, it auto-prints the following

fit

Call:
lm(formula = partyHr ~ drinks, data = survey)

Coefficients:
(Intercept)       drinks  
     2.3152       0.5462  

The output first shows the call used to fit the model, and then the coefficients of the fit. In our case, the intercept is 2.3152 and the slope of “drinks” is 0.5462. In other words, the linear model is \[ {\rm party\ hours} = 2.3152 + 0.5462 \times {\rm drinks}\] We can also get the coefficients by typing

fit$coefficients
(Intercept)      drinks 
  2.3152229   0.5461528 

The variable fit$coefficients is a numeric vector of length two, storing the values of the intercept and slope. We can obtain more information about the fitted model using the summary() function:

summary(fit)

Call:
lm(formula = partyHr ~ drinks, data = survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.623  -2.315  -0.777   1.685  47.685 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.31522    0.16404   14.11   <2e-16 ***
drinks       0.54615    0.01395   39.14   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.491 on 1136 degrees of freedom
Multiple R-squared:  0.5742,    Adjusted R-squared:  0.5738 
F-statistic:  1532 on 1 and 1136 DF,  p-value: < 2.2e-16

In addition to the values of the intercept and slope, summary(fit) also displays their estimated standard errors, the \(t\) values and the two-sided p-values \(P(>|t|)\). Here the \(t\) values are the values of the parameters (intercept and slope) divided by their estimated standard errors. The two-sided p-values are calculated by assuming that the \(t\) values follow the \(t\) distribution with \(n-2\) degrees of freedom. Beside each p-value we may see several stars. The number of stars denotes the significance of the coefficient being nonzero. As the ‘Signif. codes’ underneath the coefficients show, 3 stars ‘***’ means the p-value is between 0 and 0.001; 2 stars ‘**’ means the p-value is between 0.001 and 0.01; one star ‘*’ means the p-value is between 0.01 and 0.05; ‘.’ means the p-value is between 0.05 and 0.1; and a space ‘ ’ means the p-value is between 0.1 and 1. The summary() function also provides information about the residuals of the fit, \(R^2\) and the residual standard error.
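
If we want to use these numbers in further calculations rather than read them off the printout, they can be extracted from the object returned by summary(). A quick sketch: the coefficient table shown above is stored in the ‘coefficients’ component of the summary object.

coef_table <- summary(fit)$coefficients      # matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef_table["drinks", "Std. Error"]           # e.g. extract the standard error of the slope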

We can now use the abline() function to add the regression line to the plot, with intercept a=fit$coefficients[1] and slope b=fit$coefficients[2]. In fact, we can simply pass the fitted object to abline() to add the regression line:

plot(partyHr ~ drinks, data=survey, pch=19, xlab="Drinks/week", ylab="Party hours/week")
abline(fit, col="red")

Before we continue, let’s check that the lm() function gives the results we expect. You have learned from Stat 100 and Stat 200 that for a simple linear model \(y=\beta_0+\beta_1 x\), the slope and intercept are given by \[\beta_1 = r \frac{SD_y}{SD_x} \ \ , \ \ \beta_0 = \bar{y}-\beta_1 \bar{x} ,\] where \(SD_x\) and \(SD_y\) are the standard deviations of \(x\) and \(y\), \(\bar{x}\) and \(\bar{y}\) are the means of \(x\) and \(y\), and \(r\) is the correlation coefficient between \(x\) and \(y\). Let’s confirm the results of lm():

r <- cor(survey$drinks,survey$partyHr)
SDy_div_SDx <- sd(survey$partyHr)/sd(survey$drinks)
slope_drinks <- r*SDy_div_SDx
intercept <- mean(survey$partyHr) - slope_drinks*mean(survey$drinks)
c(intercept,slope_drinks)
[1] 2.3152229 0.5461528

These are exactly the values returned by the lm() function. We know that \(R^2\) in simple regression is the same as the square of the correlation coefficient. Let’s check it:

r^2
[1] 0.5742104
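
The same quantity is stored in the summary object, so we can also compare the two numbers programmatically (a quick sketch):

summary(fit)$r.squared                    # R-squared stored in the summary object
all.equal(r^2, summary(fit)$r.squared)    # should be TRUE (up to round-off)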

It again agrees with the \(R^2\) returned by summary(fit). You also learned that the standard error for the slope is given by \[SE^+_{\rm slope}=\sqrt{\frac{1-r^2}{n-2}}\, \frac{SD_y}{SD_x}\] Let’s check it:

n <- nrow(survey)
SE_slope <- sqrt((1-r^2)/(n-2))*SDy_div_SDx
SE_slope
[1] 0.01395362

We again see that it agrees with the value returned by summary(fit). The \(t\) value for the slope is

t_slope <- slope_drinks/SE_slope
t_slope
[1] 39.14057

and the two-sided p-value can be computed by the pt() function (with df=n-2):

2*pt(-t_slope,n-2)
[1] 7.535915e-213

These are all consistent with the values returned by summary(fit).

Confidence Interval for Intercept and Slope

Having obtained the coefficients and their standard errors, we can calculate a confidence interval (CI) for each coefficient. The assumption is that the estimated coefficients, standardized by their standard errors, follow the \(t\) distribution with \(n-2\) degrees of freedom.

As an example, let’s calculate the 95% CI for the slope. We first need to know the t-value, \(t_\alpha\), corresponding to the 95% CI.

As shown in the graph above (see footnote 1 for the code that generates it), if the middle area under the t-curve is 95%, the area of each tail is (100%-95%)/2 or 2.5%. The value of \(t_\alpha\) can be calculated using the qt() function:

t_alpha <- qt(0.025,n-2,lower.tail=FALSE)
t_alpha
[1] 1.962054

This is also equal to -qt(0.025,n-2) since the t-curve is symmetric. The CI of the slope is given by \({\rm slope} \pm t_\alpha SE^+_{\rm slope}\). We can use slope_drinks and SE_slope we computed above to calculate the CI:

slope_95CI <- c(slope_drinks - t_alpha*SE_slope, slope_drinks + t_alpha*SE_slope)
slope_95CI
[1] 0.5187750 0.5735306

In fact, we don’t need to do this calculation ourselves. R already has a built-in function confint() that does it for us. However, it is useful to go through this exercise to understand what the function does. The confint() function takes the arguments confint(object, parm, level=0.95, …). For our purpose here, we will ignore the … arguments. The object can be the object returned by the lm() function, e.g. our fit. parm specifies the parameters for which the CI is to be calculated, and level is the confidence level; the default is 0.95 (95%). To find the 95% CI for the slope, we can type

confint(fit,'drinks')
          2.5 %    97.5 %
drinks 0.518775 0.5735306

We can also use confint(fit,2) since the slope for ‘drinks’ is the second coefficient. We see that these numbers are the same as those we calculated above.

If the parm argument is omitted, confint() computes the CIs for all parameters:

confint(fit)
               2.5 %    97.5 %
(Intercept) 1.993377 2.6370692
drinks      0.518775 0.5735306

We can change the confidence level by adjusting the level parameter. For example, to compute the 80% CI, we type

confint(fit,level=0.8)
                 10 %      90 %
(Intercept) 2.1048808 2.5255650
drinks      0.5282601 0.5640455

Predicted Values

Having obtained the intercept and slope for the regression, we can use them to predict the outcome. For example, we construct the following function to take an input vector ‘drinks’ and predict the party hours based on the regression coefficients ‘beta’:

predict_partyHrfromDrinks <- function(drinks, beta) {
  beta[1] + beta[2]*drinks
}

Here the input parameter beta is assumed to be a numeric vector of length 2, with beta[1] being the intercept and beta[2] being the slope. As an example, we can predict the party hours for the first 10 students in the survey data by the following command:

predict_partyHrfromDrinks(survey$drinks[1:10],fit$coefficients)
 [1]  5.045987  7.776751  3.407528 10.507515  3.407528  2.861376  3.953681
 [8] 10.507515  2.861376  6.138292

Actually, there is an easier way to get the answer. The predicted values have already been calculated and are stored in fit$fitted.values. To see the first 10 fitted values, we simply type

fit$fitted.values[1:10]
        1         2         3         4         5         6         7 
 5.045987  7.776751  3.407528 10.507515  3.407528  2.861376  3.953681 
        8         9        10 
10.507515  2.861376  6.138292 

and we see that they are the same values as calculated above, with names given by rownames(survey), which are simply the row numbers.
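
To check the agreement for all students rather than just the first 10, we can compare the two vectors directly (a quick sketch):

my_predicted <- predict_partyHrfromDrinks(survey$drinks, fit$coefficients)
max( abs(my_predicted - fit$fitted.values) )   # should be essentially 0 (round-off only)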

We can use our function predict_partyHrfromDrinks to predict the party hours from new data. For example,

drinks <- c(8,5.5,0.2,2)
predict_partyHrfromDrinks(drinks,fit$coefficients)
[1] 6.684445 5.319063 2.424453 3.407528

It turns out that the predict() function, together with fit, can also be used on new data, but the new data must be put into a data frame, and this data frame must contain the independent variable ‘drinks’. We then pass the new data frame through the newdata argument to get the predicted outcome. For example,

new_drinks <- data.frame(drinks = c(8,5.5,0.2,2))
rownames(new_drinks) <- c("John","Mary","Jill","Patrick")
predict(fit, newdata=new_drinks)
    John     Mary     Jill  Patrick 
6.684445 5.319063 2.424453 3.407528 

Note that when we add row names to the data frame, the vector returned by predict() also carries these names. The new data frame can contain columns other than ‘drinks’, and predict() will ignore those other columns:

new_data <- data.frame(gender=c("male","female","female","male"),
                       drinks = c(8,5.5,0.2,2), partyHr = c(1,2,3,4))
rownames(new_data) <- c("John","Mary","Jill","Patrick")
predict(fit, newdata=new_data)
    John     Mary     Jill  Patrick 
6.684445 5.319063 2.424453 3.407528 

Residuals and Degrees of Freedom

The fit variable generated by the lm() function also contains the residuals of the fitted data. The vector fit$residuals stores the residuals of all the data in the data frame. To see the first 10 residuals, we can type

fit$residuals[1:10]
         1          2          3          4          5          6 
-4.0459868 -3.7767508  1.5924715 14.4924853  2.5924715 -2.8613757 
         7          8          9         10 
-0.9536813  4.4924853  0.1386243 -3.1382924 

Recall that a residual is defined as the difference between the actual value and the predicted value. We can compute our own residuals and compare them with those returned by fit$residuals:

my_residuals <- survey$partyHr - predict_partyHrfromDrinks(survey$drinks,fit$coefficients)
max( abs(my_residuals - fit$residuals) )
[1] 9.232615e-13

So we see that my_residuals and fit$residuals are practically the same. The tiny difference is due to numerical round-off error.

The residuals have two important properties: they sum to zero and they are uncorrelated with the independent variable \(x\). These properties follow from the fact that the regression coefficients \(\beta_0\) and \(\beta_1\) are constructed to minimize the sum of the squared residuals (SSE). This is called the least-squares prescription. The mathematical proof requires calculus. If you are interested and have the math background, you can read this pdf note.

Here we skip the math and simply use R to check these assertions with our fitted model fit:

sum(fit$residuals)
[1] -6.03767e-13
cor(fit$residuals,survey$drinks)
[1] 9.475895e-17

We see that the claims are indeed supported by our numerical calculations. Since there are \(n=1138\) observations in the data frame survey, there are also \(n\) residuals. However, these residuals are not independent: they must sum to 0 and they must be uncorrelated with ‘drinks’. As a result, there are only \(n-2=1136\) independent residuals. If you tell me any \(n-2\) of the residuals, I can figure out the remaining two from the requirements that they must sum to 0 and be uncorrelated with ‘drinks’. For this reason, we say that the residuals have \(n-2\) degrees of freedom. The residual degrees of freedom of our fitted model are stored in fit$df.residual:

fit$df.residual
[1] 1136
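
To make the “\(n-2\) degrees of freedom” statement concrete, here is a small sketch that recovers the last two residuals from the first \(n-2\) using only the two constraints. (The sum constraint together with the zero-correlation constraint implies sum(x*e) = 0, which gives a 2×2 linear system; it is solvable as long as the last two values of ‘drinks’ are not equal.)

xd  <- survey$drinks
res <- fit$residuals
# sum(res) = 0 and cor(xd, res) = 0 together imply sum(xd*res) = 0.
# Solve these two equations for the last two residuals, given the first n-2.
rhs <- c( -sum(res[1:(n-2)]), -sum(xd[1:(n-2)] * res[1:(n-2)]) )
A   <- rbind( c(1, 1), c(xd[n-1], xd[n]) )   # requires xd[n-1] != xd[n]
solve(A, rhs)                                # recovered res[n-1] and res[n]
res[c(n-1, n)]                               # actual values, for comparison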

Note that the residuals sum to 0 and are uncorrelated with \(x\) only if the regression coefficients are given by the correct formulae \(\beta_1=r SD_y/SD_x\) and \(\beta_0=\bar{y}-\beta_1 \bar{x}\). Other values of \(\beta_0\) and \(\beta_1\) won’t have these properties. To demonstrate this, we construct a function to calculate the sum of residuals and the correlation for a general set of \(\beta_0\) and \(\beta_1\):

sumRes_cor <- function(beta,x,y) {
  residuals <- y - (beta[1] + beta[2]*x)
  output <- c( sum(residuals), cor(x,residuals) )
  names(output) <- c("sum(residuals)","cor(x,residuals)")
  output
}

In this function, y is the dependent variable, x is the independent variable, beta[1] is the intercept and beta[2] is the slope. Note that we assign names to the output vector to make the output easier to read. Let’s first check that it gives the expected results 0 and 0 when the correct values of beta are given:

sumRes_cor(fit$coefficients, survey$drinks, survey$partyHr)
  sum(residuals) cor(x,residuals) 
    2.109113e-11     6.030347e-15 

Now let’s try a few other values of \(\beta_0\) and \(\beta_1\) to see what happens:

sumRes_cor(c(0,0),survey$drinks, survey$partyHr)
  sum(residuals) cor(x,residuals) 
    6904.0000000        0.7577667 
sumRes_cor(c(2,0.5),survey$drinks, survey$partyHr)
  sum(residuals) cor(x,residuals) 
    719.50000000       0.09766536 
sumRes_cor(c(0.5,1),survey$drinks, survey$partyHr)
  sum(residuals) cor(x,residuals) 
    -1482.000000        -0.694407 

As expected, they are not zero. You can try more values if you want.
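
The fitted coefficients are special because they minimize the SSE (the least-squares prescription mentioned above). We can also see this numerically, without calculus, by handing the SSE to a general-purpose optimizer and checking that the minimizer agrees with lm(). Here is a sketch using optim(); the starting values c(0, 0) are arbitrary:

sse <- function(beta, x, y)  sum( (y - beta[1] - beta[2]*x)^2 )
optim(c(0, 0), sse, x = survey$drinks, y = survey$partyHr)$par
fit$coefficients    # the optimizer's result should be very close to these values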

SSE, SST, SSM and RSE

Given a set of data points \((x_1,y_1), (x_2,y_2), (x_3,y_3),\cdots (x_n,y_n)\), and a linear model \[\hat{y}_i = \beta_0 + \beta_1 x_i\] we define the quantities SST, SSM and SSE as \[SST = \sum_{i=1}^n (y_i-\bar{y})^2\] \[SSM = \sum_{i=1}^n (\hat{y}_i-\bar{y})^2\] \[SSE = \sum_{i=1}^n (y_i-\hat{y}_i)^2\]

Note that SST is independent of the model parameters \(\beta_0\) and \(\beta_1\): it is given by \(SST=nSD_y^2\) from the definition of the standard deviation (here \(SD_y\) is the population standard deviation, computed with a divisor of \(n\)). The other two quantities, SSM and SSE, depend on the parameters. You have learned that if \(\beta_0\) and \(\beta_1\) are given by the regression formulae, then SST=SSM+SSE, \(SSM=nr^2SD_y^2\) and \(SSE=n(1-r^2)SD_y^2\). The mathematical proof can be found in this note for mathematically inclined students. Here we will confirm these claims using R.

Below is a simple function that calculates SSE, SSM and the difference SST-SSE-SSM.

computeSS <- function(beta,x,y) {
  n <- length(x)
  SDy2 <- var(y)*(n-1)/n
  SST <- n*SDy2
  yhat <- beta[1] + beta[2]*x
  SSM <- sum((yhat-mean(y))^2)
  SSE <- sum((y-yhat)^2)
  output <- c(SSE,SSM,SST-SSE-SSM)
  names(output) <- c("SSE","SSM","SST-SSE-SSM")
  output
}

Let’s try it on our survey data and the fit coefficients and see if it works:

fitSS <- computeSS(fit$coefficients, survey$drinks, survey$partyHr)
fitSS
         SSE          SSM  SST-SSE-SSM 
2.290873e+04 3.089420e+04 3.201421e-10 

We see that the equality SST=SSM+SSE holds. Let’s now confirm that the values of SSM and SSE are given by \(nr^2SD_y^2\) and \(n(1-r^2)SD_y^2\):

SDy2 <- var(survey$partyHr)*(n-1)/n
c(fitSS["SSM"] - n*r^2*SDy2, fitSS["SSE"] - n*(1-r^2)*SDy2)
          SSM           SSE 
-3.310561e-10  1.455192e-11 

This confirms the claims.
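
These identities also explain the “Multiple R-squared” reported by summary(fit): dividing \(SSM=nr^2SD_y^2\) by \(SST=nSD_y^2\) gives \(R^2=SSM/SST\). A quick check (sketch):

SST <- fitSS["SSM"] + fitSS["SSE"]
unname( fitSS["SSM"] / SST )    # should match r^2 (and summary(fit)$r.squared)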

The residual standard error (RSE) is defined as the square root of SSE divided by the degrees of freedom: \[RSE = \sqrt{SSE/(n-2)}\] Having calculated SSE, we can easily compute the RSE:

RSE <- sqrt( fitSS["SSE"]/(n-2))
RSE
     SSE 
4.490672 

This is the same as the “residual standard error” returned by summary(fit) (see the output above). It estimates the standard deviation of the residuals, i.e. the typical scatter of the data about the regression line.
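
The RSE is also stored in the summary object, so we can check our hand calculation against it directly (a quick sketch):

summary(fit)$sigma                            # residual standard error from summary()
all.equal( unname(RSE), summary(fit)$sigma )  # should be TRUE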

Note that if \(\beta_0\) and \(\beta_1\) take values other than those given by the regression formulae, the equality SST=SSM+SSE no longer holds. We can verify that by trying a few other values:

computeSS(c(1,2), survey$drinks, survey$partyHr)
        SSE         SSM SST-SSE-SSM 
   327398.0    499862.9   -773458.0 
computeSS(c(-1,3), survey$drinks, survey$partyHr)
        SSE         SSM SST-SSE-SSM 
     855207     1140806    -1942210 
computeSS(c(5,1), survey$drinks, survey$partyHr)
        SSE         SSM SST-SSE-SSM 
    82555.0    141886.0   -170638.1 

Residual Plots

Residual plots are useful for assessing the validity of a linear model. The basic assumption of the linear model is that the data \((x,y)\) can be approximated by \(y=\beta_0 +\beta_1 x + \epsilon\), where the residual \(\epsilon\) is random and independent of \(x\). If the residual \(\epsilon\), plotted as a function of \(x\), shows a pattern, it suggests that the linear model is not an accurate description of the data and that the relationship between \(x\) and \(y\) may be nonlinear. As an example, consider the following simple data:

set.seed(367134)
x <- seq(0,2,0.01)
y <- (x-0.5)^2 + 0.05*rnorm(length(x))
plot(x,y, pch=20)

The relationship between \(x\) and \(y\) is clearly nonlinear. If we fit a linear model and plot the residuals vs x, we see a pattern:

fit1 <- lm(y~x)
plot(x, fit1$residuals, pch=20, ylab="residuals")
abline(h=0)

This shows that the residual function \(\epsilon(x)\) is not completely random, indicating that the linear model is not an accurate description of the data.

Finally, we should caution that the estimates of the standard errors, p-values, and confidence intervals are based on the assumptions that (1) the residual \(\epsilon\) follows a normal distribution, and (2) the variance of the residuals is independent of the value of the independent variable \(x\). If these assumptions are not true, the estimated values may be used for reference but cannot be taken seriously. The first assumption can be relaxed if the sample size is sufficiently large, because of the central limit theorem. The second assumption, also known as homoscedasticity, is essential. The absence of homoscedasticity is called heteroscedasticity.
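
A standard way to eyeball assumption (1) is a normal quantile-quantile plot of the residuals: if the points fall roughly on the reference line, the normality assumption is plausible. A minimal sketch for our fitted model:

qqnorm(fit$residuals, pch=20)    # sample quantiles of the residuals vs normal quantiles
qqline(fit$residuals, col="red") # reference line for a normal distribution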

We can get a sense of how well the homoscedasticity assumption holds from the residual plots. For example, in the linear model predicting party hours from the number of drinks, the residual plot is

plot(survey$drinks, fit$residuals, pch=20, xlab="Drinks/week", ylab="Residuals", las=1)
abline(h=0)

This plot shows that the homoscedasticity assumption does not hold well, although the deviation is not too big. Another useful plot is residuals versus the fitted values (\(\hat{y}\)):

plot(fit$fitted.values, fit$residuals, pch=20, xlab="Predict", ylab="residual", las=1)
abline(h=0)

This plot is essentially the same as the previous one. This is expected because \(\hat{y}=\beta_0+\beta_1 x\), so plotting against \(\hat{y}\) simply rescales the horizontal axis. You will see that this plot is useful when we deal with regression with multiple variables: instead of plotting the residuals versus each of the independent variables, we usually just plot the residuals versus the predicted values, which are a linear combination of all the independent variables.
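
Incidentally, R can generate this kind of plot for us: the plot() method for lm objects produces several diagnostic plots, and the option which=1 selects the residuals-versus-fitted plot (a sketch):

plot(fit, which=1)    # built-in residuals vs fitted plot, with a smoothed trend curve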

From the plots, we conclude that the homoscedasticity assumption does not hold very well. However, the deviation does not seem so large that we should completely ignore the diagnostics returned by the linear model. Therefore, for this data set the standard errors, p-values and confidence intervals can be used as reference numbers, but they should not be taken too seriously.

Summary

In this note, we introduce a few commands related to the lm() function. We do not just tell you how to use them, but also include a number of side calculations to explain where the numbers come from. Here we summarize all the commands and ideas we have covered in this note.



Footnote:

1. In case you want to know how the graph is generated, here is the code:

p <- -qt(0.025,n-2)
xmax <- 4
npoly <- 100
xr <- seq(p,xmax,length.out=npoly)
yr <- dt(xr,n-2)
xmid <- seq(-p,p,length.out=npoly)
ymid <- dt(xmid,n-2)
curve(dt(x,n-2), xlim=c(-xmax,xmax), xlab="",ylab="", xaxt="n", yaxt="n")
polygon(c(p,xr,xmax),c(0,yr,0),col="red")
polygon(c(-p,-xr,-xmax),c(0,yr,0),col="red")
polygon(c(-p,xmid,p),c(0,ymid,0),col="skyblue")
text(0,0.2,"95%")
text(2.7,0.035,"2.5%")
text(-2.7,0.035,"2.5%")
mtext(expression(-t[alpha]), side=1, at=-p, line=1, cex=1.3)
mtext(expression(t[alpha]), side=1, at=p, line=1, cex=1.3)

The polygon() function is used to shade the areas under the curve; text() is used to add text inside the plot; mtext() is used to add text in a margin of the plot; expression() is used to write mathematical symbols.