Maximum likelihood estimation (MLE) is a general method in statistics for estimating the parameters of a statistical model. In this note, we will not discuss MLE in its general form. Instead, we will consider a simple case of MLE that is relevant to logistic regression.
Consider a box with only two types of tickets: tickets with ‘1’ written on them and tickets with ‘0’ written on them. Let p be the fraction of ‘1’ tickets in the box. The value of p is unknown. Suppose 100 tickets are drawn from the box and 20 of them are ‘1’. What is the best estimate for the value of p? We imagine there are many tickets in the box, so it doesn’t matter whether the tickets are drawn with or without replacement. In the context of MLE, p is the parameter in the model we are trying to estimate.
We can ask the question: given p, what is the probability that we get 20 tickets with ‘1’ from 100 draws? This is not a difficult question. The probability that we get a ‘1’ ticket in each draw is p, and the probability that we get a ‘0’ ticket is (1-p). So the probability that we get 20 ‘1’ tickets and 80 ‘0’ tickets in 100 draws is[1] \[L(p) = P(20 | p) = p^{20}(1-p)^{80}\] This is a function of the unknown parameter p, called the likelihood function. It is the probability of getting the data if the parameter is equal to p. The maximum likelihood estimate for the parameter is the value of p that maximizes the likelihood function.
Instead of working with the likelihood function \(L(p)\), it is more convenient to work with the logarithm of \(L\): \[\ln L(p) = 20 \ln p + 80 \ln(1-p)\] where \(\ln\) denotes the natural logarithm (base e). The equation follows from two identities of logarithms: \[\log_c (ab) = \log_c (a) + \log_c (b) \ \ \ , \ \ \ \log_c (a^x) = x \log_c (a)\] Here \(\log_c\) is the logarithm function with base c, where c can be any positive real number other than 1.
Since \(\ln(x)\) is an increasing function of \(x\), maximizing \(L(p)\) is the same as maximizing \(\ln L(p)\). So finding the maximum likelihood estimate for \(p\) boils down to finding the value of \(p\) that maximizes the function \(20\ln p + 80\ln(1-p)\). We can plot the function to see what it looks like:
curve(20*log(x)+80*log(1-x), xlim=c(0,1),ylim=c(-150,-50), las=1, xlab="p", ylab="ln L(p)")
Note that in R (and in most programming languages), log denotes the natural logarithm ln. The plot shows that the maximum occurs around p = 0.2. If you know calculus, you will know how to do the maximization analytically. The answer is that the maximum likelihood estimate for p is p = 20/100 = 0.2.
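If you want to check the answer numerically without calculus, R's built-in optimize() function can locate the maximum of \(\ln L(p)\); here is a minimal sketch:
# Numerically locate the value of p that maximizes ln L(p) = 20 ln(p) + 80 ln(1-p)
optimize(function(p) 20*log(p) + 80*log(1-p), interval=c(0.001, 0.999), maximum=TRUE)
The reported maximum should be very close to p = 0.2.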
In general, it can be shown that if we get \(n_1\) tickets with ‘1’ from N draws, the maximum likelihood estimate for p is \[p = \frac{n_1}{N}\] In other words, the estimate for the fraction of ‘1’ tickets in the box is the fraction of ‘1’ tickets we get from the N draws. We can express the result in a more abstract way that will turn out to be useful later.
Suppose \(y_i\) is a variable encoding the result of the ith draw. We set \(y_i=1\) if the ticket in the ith draw is ‘1’. We set \(y_i=0\) if the ticket in the ith draw is ‘0’. After N draws, we have the variables \(y_1, y_2, \cdots, y_N\). The number of ‘1’ tickets in N draws is \[n_1 = \sum_{i=1}^N y_i\] and so the maximum likelihood estimate for p is \[p=\frac{n_1}{N} = \frac{1}{N}\sum_{i=1}^N y_i = \bar{y}\] In other words, the maximum likelihood estimate for p is the mean of the \(y\) variable from the N draws.
We can also express the likelihood function \(L(p)\) in terms of \(y_i\). If \(y_i=1\), we get the ‘1’ ticket in the ith draw and the probability is p. If \(y_i=0\), we get the ‘0’ ticket and the probability is (1-p). Combining these two results, we have the probability of getting \(y_i\) being \[P(y_i | p) = p^{y_i} (1-p)^{1-y_i}\] We can check that the formula gives the correct answers for the two cases. If \(y_i=1\), \(p^{y_i}=p^1=p\) and \((1-p)^{1-y_i}=(1-p)^0=1\). It follows that \(p^{y_i} (1-p)^{1-y_i}=p\), consistent with our previous result. If \(y_i=0\), \(p^{y_i}=p^0=1\) and \((1-p)^{1-y_i}=(1-p)^1=(1-p)\) and so \(p^{y_i} (1-p)^{1-y_i}=(1-p)\), which is also consistent with our previous result. So we conclude that the formula works for both \(y_i=1\) and \(y_i=0\). The likelihood function is the probability that we get \(y_1, y_2, \cdots, y_N\) from N draws. Since each draw is independent, we use the multiplication rule to calculate the joint probability, or the likelihood function: \[L(p) = P(y_1, y_2, \cdots, y_N | p) = [p^{y_1}(1-p)^{1-y_1}] [p^{y_2}(1-p)^{1-y_2}] \cdots [p^{y_N}(1-p)^{1-y_N}]\] Using the product notation, we can write \[L(p) = \prod_{i=1}^N p^{y_i} (1-p)^{1-y_i}\] The log-likelihood is given by \[\ln L(p) = \sum_{i=1}^N [y_i \ln p + (1-y_i) \ln (1-p)]\] The result says that the value of p that maximizes the log-likelihood function above is \(p=n_1/N=\bar{y}\).
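As another numerical check (on a hypothetical simulated data set), the log-likelihood written in terms of \(y_i\) is indeed maximized at \(p=\bar{y}\):
# Simulate 100 draws from a box with true fraction 0.3
set.seed(1)
y_sim <- rbinom(100, 1, 0.3)
loglike_p <- function(p) sum(y_sim*log(p) + (1-y_sim)*log(1-p))
# The maximizer found numerically should agree with mean(y_sim)
optimize(loglike_p, interval=c(0.001, 0.999), maximum=TRUE)$maximum
mean(y_sim)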
Now consider a slightly more general case of drawing tickets from two boxes. We label them box 1 and box 2. In each box, there are only two types of tickets: those with ‘1’ written on them and those with ‘0’ written on them. The fractions of ‘1’ tickets in the two boxes are \(p_1\) and \(p_2\). These are the unknown parameters we want to determine using MLE.
Suppose N tickets are drawn from the two boxes. We record the result in two variables \(x\) and \(y\). The variable \(y\) is the same as before: \(y_i=1\) if the ticket in the ith draw is ‘1’; \(y_i=0\) if the ticket in the ith draw is ‘0’. The variable \(x\) is a categorical (factor) variable. It records whether each draw is from box 1 or box 2. That is, \(x_i=\)“box 1” if the ith draw is from box 1; \(x_i=\)“box 2” if the ith draw is from box 2. Following similar math, one can show that the log-likelihood function is given by \[\ln L(p_1, p_2) = \sum_{i=1}^N \{ y_i \ln p(x_i) + (1-y_i) \ln [1-p(x_i)] \}\] where \[p(x_i) = \left \{ \begin{array}{ll} p_1 & \mbox{ if } x_i = \mbox{"box 1"} \\ p_2 & \mbox{ if } x_i = \mbox{"box 2"} \end{array} \right. \] In MLE, the parameters are determined by finding the values of \(p_1\) and \(p_2\) that maximize \(\ln L\). Again, if you know calculus, it won’t be difficult to solve the maximization problem. Here we skip the math and just give you the solution: \[p_1 = \frac{n_1 (\mbox{box 1})}{N(\mbox{box 1})} = \overline{y(\mbox{box 1})} \ \ \ , \ \ \ p_2 =\frac{n_1 (\mbox{box 2})}{N(\mbox{box 2})} = \overline{y(\mbox{box 2})}\] Here \(n_1 (\mbox{box 1})\) denotes the total number of ‘1’ tickets drawn from box 1, \(N(\mbox{box 1})\) denotes the total number of draws from box 1, and \(\overline{y(\mbox{box 1})}\) means the average of \(y\) over the draws from box 1. The variables \(n_1 (\mbox{box 2})\), \(N(\mbox{box 2})\) and \(\overline{y(\mbox{box 2})}\) are the same quantities associated with box 2.
In words, the maximum likelihood estimate for the fraction of ‘1’ tickets in each box is the fraction of ‘1’ tickets drawn from that box. In R, the calculation of \(p_1\) and \(p_2\) is done by splitting the vector y into two groups and then computing the group means, which can be done conveniently using the tapply() function.
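As a quick illustration on hypothetical simulated data (the box labels, sample size and true fractions below are made up), the group means recover the true fractions:
# Simulate 200 draws from two boxes with true fractions 0.2 and 0.6
set.seed(42)
x_box <- factor(sample(c("box 1","box 2"), size=200, replace=TRUE))
p_true <- c("box 1"=0.2, "box 2"=0.6)
y_box <- rbinom(200, 1, p_true[as.character(x_box)])
# The maximum likelihood estimates of p1 and p2 are the group means
tapply(y_box, x_box, mean)
The two group means should be close to 0.2 and 0.6.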
Consider a more general case where the tickets are drawn from \(k\) boxes (\(k > 2\)). Here again, each box contains only ‘0’ and ‘1’ tickets. The unknown parameters are \(p_1, p_2, \cdots, p_k\), the fractions of ‘1’ tickets in boxes \(1, 2, ..., k\). We want to determine the values of these parameters using MLE from the results of N draws from these boxes.
The \(y\) variable is still the same as before: \(y_i=1\) if the ticket in the ith draw is ‘1’; \(y_i=0\) if the ticket in the ith draw is ‘0’. Our factor variable \(x\) now contains \(k\) levels: \(x_i=\)“box 1” if the ith draw is from box 1; \(x_i=\)“box 2” if the ith draw is from box 2; …; \(x_i=\)“box k” if the ith draw is from box k. The log-likelihood function still takes the same form \[\ln L(p_1, p_2, \cdots, p_k) = \sum_{i=1}^N \{ y_i \ln p(x_i) + (1-y_i) \ln [1-p(x_i)] \}\] The only difference is in the value of \(p(x_i)\): \(p(x_i) = p_j\) (\(j=1, 2, \cdots, k\)) if \(x_i=\)“box j”. The maximization can be done analytically using calculus. It probably wouldn’t surprise you that the maximum likelihood estimates for the parameters are given by \[p_j = \frac{n_1 (\mbox{box j})}{N(\mbox{box j})} = \overline{y(\mbox{box j})} \ \ , \ \ j=1,2,\cdots, k.\] In R, the calculation of \(p_1, p_2, \cdots, p_k\) amounts to splitting the y vector by the factor variable x and then computing the group means, which again can be done conveniently using the tapply() function.
Before we go on to discuss an even more general case, it is useful to consider a few examples to demonstrate the use of these box models.
A credit card company is naturally interested in predicting the risk of a customer defaulting on his/her credit card payment. We will analyze a simulated data set, freely available from the ISLR package for the book An Introduction to Statistical Learning. The data have been uploaded to our website and can be loaded into R directly using the command
Default <- read.csv("http://courses.atlas.illinois.edu/spring2016/STAT/STAT200/RProgramming/data/Default.csv")
The data contain 10,000 observations and 4 columns. The first column, ‘default’, is a two-level (No/Yes) factor variable indicating whether the customer defaulted on their debt. The second column, ‘student’, is also a two-level (No/Yes) factor variable indicating whether the customer is a student. The third column, ‘balance’, is the average balance that the customer has remaining on their credit card after making their monthly payment. The last column, ‘income’, lists the income of the customer. Here is a quick look at the beginning of the data:
head(Default)
default student balance income
1 No No 729.5265 44361.625
2 No Yes 817.1804 12106.135
3 No No 1073.5492 31767.139
4 No No 529.2506 35704.494
5 No No 785.6559 38463.496
6 No Yes 919.5885 7491.559
The simplest model has only one parameter: the fraction of customers who defaulted on their debt. Mapping to the one-box model, we imagine the customers (current and future) represent tickets in a box. The customers who defaulted are the tickets with ‘1’ written on them; the customers who did not default are the tickets with ‘0’ written on them. The \(y\) variable is 0 for customers who did not default and 1 for customers who defaulted. We want to know the fraction of customers who will default, based on the data we have on the current customers.
We first create the \(y\) vector:
y <- as.integer(Default$default=="Yes")
The expression Default$default=="Yes" creates a logical vector of length 10,000 (the length of Default$default), whose values are TRUE for customers who defaulted and FALSE for customers who did not default. When the as.integer() function is applied to a logical vector, it turns TRUE into 1 and FALSE into 0. Thus, the vector y represents the desired y variable. The maximum likelihood estimate of the fraction is the average of y:
mean(y)
[1] 0.0333
This shows that 3.33% of the current customers defaulted on their debt. It is also the maximum likelihood estimate for the fraction of all customers (current and future) who have defaulted or will default on their debt.
The one-box model is too crude. It only tells you the fraction of customers who will default, but doesn’t tell you what kind of customers are likely to default. Let’s consider a refined model where we divide the customers into two categories: students and non-students. Mapping to the two-box model, we imagine student customers represent tickets from box 1 and non-student customers represent tickets from box 2. The parameter \(p_1\) is the fraction of student customers who defaulted, and \(p_2\) is the fraction of non-student customers who defaulted. The factor variable \(x\) is therefore Default$student. The maximum likelihood estimates for \(p_1\) and \(p_2\) are the group means:
tapply(y, Default$student, mean)
No Yes
0.02919501 0.04313859
This shows that 4.3% of students defaulted and 2.9% of non-students defaulted. So students appear to be more likely to default on their debt compared to non-students. We can perform a two-sample t-test to see if the difference is significant:
t.test(y ~ Default$student, var.equal=TRUE)
Two Sample t-test
data: y by Default$student
t = -3.5439, df = 9998, p-value = 0.000396
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.021656006 -0.006231145
sample estimates:
mean in group No mean in group Yes
0.02919501 0.04313859
We see from the small p-value that the difference is significant.
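Since y is a 0/1 variable, the comparison can also be done with a two-sample proportion test (an alternative to the t-test above, not part of the original analysis); a minimal sketch:
# Compare the default rates of students and non-students with prop.test()
tab <- table(Default$student, y)   # rows: student No/Yes; columns: y = 0/1
prop.test(tab[, 2], rowSums(tab))  # successes = defaults, trials = group sizes
With such a large sample, this test should lead to the same conclusion as the t-test.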
Next we want to know if the fraction of defaults depends on the credit balance. We can look at the following box plots:
plot(balance ~ default, data=Default, las=1)
The box plots show that customers who defaulted tend to have a larger balance. We want to find out how the fraction of defaults depends on the credit balance. The probability that a customer defaults can be considered a function of balance, p(balance). The problem is that ‘balance’ is not a factor variable, but a continuous variable that can in principle take an infinite number of values. The multiple-box model described above cannot be applied to this variable. We will study a method to tackle a continuous variable in the next section. Here we will construct a factor variable from ‘balance’ by breaking the variable into many intervals.
Let’s first look at the values in the ‘balance’ vector:
summary(Default$balance)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 481.7 823.6 835.4 1166.0 2654.0
The minimum is 0 and the maximum is 2654. We can create a factor variable that splits the range [0,2654] into 10 equally spaced intervals using the cut() function:
balance_cut1 <- cut(Default$balance, breaks=10)
levels(balance_cut1)
[1] "(-2.65,265]" "(265,531]" "(531,796]"
[4] "(796,1.06e+03]" "(1.06e+03,1.33e+03]" "(1.33e+03,1.59e+03]"
[7] "(1.59e+03,1.86e+03]" "(1.86e+03,2.12e+03]" "(2.12e+03,2.39e+03]"
[10] "(2.39e+03,2.66e+03]"
balance_cut1 is a factor vector of length 10,000 (the same length as Default$balance) with 10 levels. Just like hist(), the parameter breaks can be an integer (specifying the number of intervals) or a numeric vector (specifying the break points). The names of the levels in balance_cut1 are shown above. Values in ‘balance’ between -2.65 and 265 are assigned to level 1, named “(-2.65,265]”; values in ‘balance’ between 265 and 531 are assigned to level 2, named “(265,531]”; and so on. We can see the number of observations in each level using the table() function:
table(balance_cut1)
balance_cut1
(-2.65,265] (265,531] (531,796]
1315 1501 1975
(796,1.06e+03] (1.06e+03,1.33e+03] (1.33e+03,1.59e+03]
2044 1568 941
(1.59e+03,1.86e+03] (1.86e+03,2.12e+03] (2.12e+03,2.39e+03]
447 157 44
(2.39e+03,2.66e+03]
8
As can be seen, one disadvantage of breaking the variable into equally spaced intervals is that the intervals contain unequal numbers of observations. The computation of a group mean becomes less accurate if we don’t have enough data in an interval. In the extreme case where there are no observations in an interval, the mean cannot be calculated at all.
One way to overcome the difficulty is to split the range into intervals containing equal numbers of observations instead of equally spaced intervals. This can be done using quantile(), which computes percentiles. For example, we can split the range into 10 intervals with equal numbers of observations. We first compute the 0th, 10th, 20th, 30th, …, 90th, and 100th percentiles using the quantile() function:
quantiles <- quantile(Default$balance, probs=seq(0,1,length=11))
In the command above, seq(0,1,length=11) generates 11 equally spaced numbers between 0 and 1, so it returns 0, 0.1, 0.2, 0.3, …, 1.[2] The option probs=seq(0,1,length=11) tells R to compute the 0th, 10th, 20th, …, 100th percentiles of Default$balance, corresponding to probs = 0, 0.1, 0.2, …, 1. The result is
quantiles
0% 10% 20% 30% 40% 50% 60%
0.0000 180.5753 395.3784 561.6230 697.1570 823.6370 951.2505
70% 80% 90% 100%
1086.1853 1249.7877 1471.6253 2654.3226
We can now use quantiles and cut() to create the following factor variable.
balance_cut2 = cut(Default$balance, breaks=quantiles)
levels(balance_cut2)
[1] "(0,181]" "(181,395]" "(395,562]"
[4] "(562,697]" "(697,824]" "(824,951]"
[7] "(951,1.09e+03]" "(1.09e+03,1.25e+03]" "(1.25e+03,1.47e+03]"
[10] "(1.47e+03,2.65e+03]"
Just like balance_cut1, there are 10 levels in balance_cut2, but they are not equally spaced. Instead, they should contain equal numbers of observations because the break points are created from the percentiles. Let’s check to see if that’s the case:
table(balance_cut2)
balance_cut2
(0,181] (181,395] (395,562]
501 1000 1000
(562,697] (697,824] (824,951]
1000 1000 1000
(951,1.09e+03] (1.09e+03,1.25e+03] (1.25e+03,1.47e+03]
1000 1000 1000
(1.47e+03,2.65e+03]
1000
We see that there are 1000 observations in levels 2-10, but only 501 observations in level 1. The total number of observations is 9501. We are missing 499 observations! What is going on? Clearly, something is wrong in the first level. The name of the first level is “(0,181]”. This is the problem: ‘balance’=0 is not included in this level. These are the missing 499 observations! Looking at the help page of cut, ?cut, we see that the parameter include.lowest is set to FALSE by default. This means that the lowest ‘breaks’ value is not included. To fix the problem, we just need to set this parameter to TRUE:
balance_cut2 = cut(Default$balance, breaks=quantiles, include.lowest=TRUE)
table(balance_cut2)
balance_cut2
[0,181] (181,395] (395,562]
1000 1000 1000
(562,697] (697,824] (824,951]
1000 1000 1000
(951,1.09e+03] (1.09e+03,1.25e+03] (1.25e+03,1.47e+03]
1000 1000 1000
(1.47e+03,2.65e+03]
1000
Now we see that the first level is named “[0,181]” and there are 1000 observations in this level.[3] We have successfully created 10 intervals with an equal number of observations in each.
Mapping to the box model, we imagine the customers in the 10 ‘balance’ intervals represent tickets in 10 boxes. There are 10 parameters \(p_1\), \(p_2\), …, \(p_{10}\) corresponding to the fractions of customers in the intervals who defaulted on their debt. The maximum likelihood estimates of the parameters are simply the group means of y:
p <- tapply(y, balance_cut2, mean)
p
[0,181] (181,395] (395,562]
0.000 0.000 0.000
(562,697] (697,824] (824,951]
0.001 0.002 0.000
(951,1.09e+03] (1.09e+03,1.25e+03] (1.25e+03,1.47e+03]
0.007 0.016 0.038
(1.47e+03,2.65e+03]
0.269
This shows that the fraction of defaults generally increases as ‘balance’ increases. About 27% of customers with ‘balance’ greater than 1470 defaulted. We can visualize the result by making a plot. To do that, we first create a numeric vector of length 10 storing the midpoints of the intervals:
balance_cut2_midpoints = rep(NA,10) # initialize a vector of length 10
for (i in 1:10) {
balance_cut2_midpoints[i] = (quantiles[i] + quantiles[i+1])/2
}
balance_cut2_midpoints
[1] 90.28763 287.97686 478.50070 629.38999 760.39700 887.44371
[7] 1018.71785 1167.98650 1360.70651 2062.97392
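As an aside, the same midpoints can be computed without a loop; a minimal sketch of a vectorized alternative:
# Average each pair of consecutive percentiles: head() drops the last break point, tail() drops the first
# (this version keeps the percentile names attached; the values are the same as above)
balance_cut2_midpoints <- (head(quantiles, -1) + tail(quantiles, -1)) / 2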
The first point, 90.29, is the average of the 0th and 10th percentiles (0 and 180.58); the second point is the average of the 10th and 20th percentiles; and so on. We create the midpoints so that we can plot the parameters \(p_i\) at the midpoints of the intervals. Here is the plot:
plot(y ~ Default$balance, las=1, pch=20, xlab="Balance")
points(balance_cut2_midpoints, p, col="red", pch=20)
abline(v=quantiles, col="blue")
The black points are data: points at \(y=1\) represent customers who defaulted; points at \(y=0\) represent customers who did not default. The vertical blue lines are the 0th, 10th, 20th, …, 90th and 100th percentiles of ‘balance’, indicating the boundaries of the 10 intervals. The red points are the maximum likelihood estimate for the fractions of defaulted customers in the intervals.
Obviously, the credit card company will pay special attention to the customers in the last interval. To get more detailed information about these customers, we split the last interval into several smaller intervals. The last interval corresponds to the customers between the 90th and 100th percentiles of ‘balance’. We can split this interval by specifying break points at the 92nd, 94th, 96th, 98th and 100th percentiles:
quan_last <- quantile(Default$balance, probs=c(0.92,0.94,0.96,0.98,1))
We can combine the percentiles by taking the first 10 elements of quantiles together with quan_last:
quan_combined <- c(quantiles[1:10], quan_last)
quan_combined
0% 10% 20% 30% 40% 50% 60%
0.0000 180.5753 395.3784 561.6230 697.1570 823.6370 951.2505
70% 80% 90% 92% 94% 96% 98%
1086.1853 1249.7877 1471.6253 1534.6969 1615.2418 1724.3766 1867.3134
100%
2654.3226
The new variable quan_combined stores the 0th, 10th, 20th, …, 90th, 92nd, 94th, 96th, 98th and 100th percentiles of ‘balance’. We can now construct a 14-box model corresponding to the 14 intervals of ‘balance’ by creating a new factor variable:
balance_cut3 <- cut(Default$balance, breaks=quan_combined, include.lowest=TRUE)
table(balance_cut3)
balance_cut3
[0,181] (181,395] (395,562]
1000 1000 1000
(562,697] (697,824] (824,951]
1000 1000 1000
(951,1.09e+03] (1.09e+03,1.25e+03] (1.25e+03,1.47e+03]
1000 1000 1000
(1.47e+03,1.53e+03] (1.53e+03,1.62e+03] (1.62e+03,1.72e+03]
200 200 200
(1.72e+03,1.87e+03] (1.87e+03,2.65e+03]
200 200
We see that the first 9 intervals are the same as before, but the last 5 intervals have 200 observations each. In this 14-box model, the maximum likelihood estimates for the fractions in the intervals are the group means:
p <- tapply(y, balance_cut3, mean)
p
[0,181] (181,395] (395,562]
0.000 0.000 0.000
(562,697] (697,824] (824,951]
0.001 0.002 0.000
(951,1.09e+03] (1.09e+03,1.25e+03] (1.25e+03,1.47e+03]
0.007 0.016 0.038
(1.47e+03,1.53e+03] (1.53e+03,1.62e+03] (1.62e+03,1.72e+03]
0.125 0.125 0.160
(1.72e+03,1.87e+03] (1.87e+03,2.65e+03]
0.290 0.645
This shows how the fraction of defaults changes in the last few intervals. We can make another plot summarizing the result.
nintervals = length(levels(balance_cut3)) # Number of intervals
balance_cut3_midpoints <- rep(NA,nintervals) # initialize a vector of length nintervals
# Compute the midpoints of intervals
for (i in 1:nintervals) {
balance_cut3_midpoints[i] <- (quan_combined[i] + quan_combined[i+1])/2
}
# Plot
plot(y ~ Default$balance, las=1, pch=20, xlab="Balance")
points(balance_cut3_midpoints, p, col="red", pch=20)
abline(v=quan_combined, col="blue")
The plot above might remind you of the plot on the second page of this note on linear regression. In that plot, a continuous variable is split into 15 intervals and the average of the y variable is computed in each interval. The linear regression fits a straight line to the data in place of the averages in the intervals. In our case, a straight line won’t be a good fit to the data. Instead, we want to fit a curve that goes from 0 to 1. In logistic regression an S-shaped curve is fitted to the data in place of the averages in the intervals.
In MLE, we want to maximize the log-likelihood function: \[\ln L(\{ p(x) \}) = \sum_{i=1}^N \{ y_i \ln p(x_i) + (1-y_i) \ln [1-p(x_i)] \}\] If \(x\) is a factor variable with k levels, \(\{ p(x) \}\) contains k values corresponding to the k parameters \(p_1, p_2,\cdots,p_k\). If \(x\) is a continuous variable, \(\{ p(x) \}\) in principle has infinite possible values. In this case, we can specify a functional form for \(p(x)\) containing a few parameters. In (one-variable) logistic regression, we specify the function having the form \[p(x) = p(x; \beta_0,\beta_1) = \frac{e^{\beta_0 + \beta_1 x}}{1+e^{\beta_0+\beta_1 x}} = \frac{1}{1+e^{-\beta_0-\beta_1 x}}\] This is called the logistic function or sigmoid function. The curve is S-shaped. We can make plots to take a look at the logistic curves. First, we write a function that computes the logistic curve.
# Define the logistic function
logit <- function(x,beta0,beta1) {
1/(1+exp(-beta0 - beta1*x))
}
Next we fix \(\beta_1=1\) and see how the curve changes with different values of \(\beta_0\):
# Plot the logistic function with beta1=1 and 3 different values of beta0.
curve(logit(x,0,1),xlim=c(-10,10), lwd=2,ylab="Logistic Function", las=1)
curve(logit(x,-2,1),xlim=c(-10,10), lwd=2, lty=2, col="red", add=TRUE)
curve(logit(x,-2,1),xlim=c(-10,10), lwd=2, lty=3, col="blue", add=TRUE)
legend(-10,1,c(expression(paste(beta[0]," = 0")),expression(paste(beta[0]," = -2")),expression(paste(beta[0]," = 2"))), lwd=2, lty=1:3, col=c("black","red","blue"))
We see that changing \(\beta_0\) simply shifts the curve horizontally (the midpoint of the curve, where the function equals 0.5, occurs at \(x=-\beta_0/\beta_1\)). Now we fix \(\beta_0=0\) and see how the curve changes with different values of \(\beta_1\):
# Plot the logistic function with beta0=0 and 3 different +ve values of beta1.
curve(logit(x,0,1),xlim=c(-10,10), lwd=2,ylab="Logistic Function", las=1)
curve(logit(x,0,2),xlim=c(-10,10), lwd=2, lty=2, col="red", add=TRUE)
curve(logit(x,0,0.5),xlim=c(-10,10), lwd=2, lty=3, col="blue", add=TRUE)
legend(-10,0.9,c(expression(paste(beta[1]," = 1")),expression(paste(beta[1]," = 2")),expression(paste(beta[1]," = 0.5"))), lwd=2, lty=1:3, col=c("black","red","blue"))
# Plot the logistic function with beta0=0 and 3 different -ve values of beta1.
curve(logit(x,0,-1),xlim=c(-10,10), lwd=2,ylab="Logistic Function", las=1)
curve(logit(x,0,-2),xlim=c(-10,10), lwd=2, lty=2, col="red", add=TRUE)
curve(logit(x,0,-0.5),xlim=c(-10,10), lwd=2, lty=3, col="blue", add=TRUE)
legend(5,0.9,c(expression(paste(beta[1]," = -1")),expression(paste(beta[1]," = -2")),expression(paste(beta[1]," = -0.5"))), lwd=2, lty=1:3, col=c("black","red","blue"))
All three curves with positive \(\beta_1\) smoothly increase from nearly 0 to nearly 1. For negative values of \(\beta_1\), the curves decrease smoothly from nearly 1 to nearly 0. The magnitude \(|\beta_1|\) determines how steep the curve is in the transition from one extreme value to the other extreme value.
The logistic function is constructed so that the log odds, \(\ln [p/(1-p)]\), is a linear function of \(x\). There are two parameters \(\beta_0\) and \(\beta_1\) in this function. They are determined by maximizing the log-likelihood function \[\ln L(\beta_0,\beta_1) = \sum_{i=1}^N \{ y_i \ln p(x_i; \beta_0,\beta_1) + (1-y_i) \ln [1-p(x_i;\beta_0,\beta_1)] \}\] The maximization equations can be derived using calculus. However, unlike linear regression, the equations of logistic regression are nonlinear and cannot be solved analytically. They must be solved numerically using a computer. Fortunately, there are robust algorithms for solving these equations numerically.[4]
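As an aside, the statement that the log odds is linear in \(x\) follows directly from the definition of the logistic function: \[\frac{p(x)}{1-p(x)} = \frac{e^{\beta_0+\beta_1 x}/(1+e^{\beta_0+\beta_1 x})}{1/(1+e^{\beta_0+\beta_1 x})} = e^{\beta_0+\beta_1 x} \ \ \Rightarrow \ \ \ln \frac{p(x)}{1-p(x)} = \beta_0 + \beta_1 x\]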
In R, the function glm() stands for generalized linear model. Logistic regression belongs to the family of generalized linear models, so glm() can be used to perform a logistic regression. The syntax is similar to lm(). We will study the function in more detail next week. Here, we demonstrate how it can be used to obtain the parameters \(\beta_0\) and \(\beta_1\).
Let’s use logistic regression to fit the credit card data. We want to fit a logistic curve for the fraction of defaults as a function of ‘balance’. So the x variable is Default$balance and the y variable is the vector y we created above, indicating whether a customer defaulted (y=1) or not (y=0). To find \(\beta_0\) and \(\beta_1\), we use the following syntax.
fit <- glm(y ~ Default$balance, family=binomial)
The glm() function can be used to fit a family of generalized linear models. The parameter family=binomial tells glm() that we want to fit a logistic curve to the data. The coefficients \(\beta_0\) and \(\beta_1\) can be obtained by typing fit or summary(fit).
fit
Call: glm(formula = y ~ Default$balance, family = binomial)
Coefficients:
(Intercept) Default$balance
-10.651331 0.005499
Degrees of Freedom: 9999 Total (i.e. Null); 9998 Residual
Null Deviance: 2921
Residual Deviance: 1596 AIC: 1600
summary(fit)
Call:
glm(formula = y ~ Default$balance, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2697 -0.1465 -0.0589 -0.0221 3.7589
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
Default$balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1596.5 on 9998 degrees of freedom
AIC: 1600.5
Number of Fisher Scoring iterations: 8
The parameter \(\beta_0\) = -10.65 is the value labeled ‘(Intercept)’ and \(\beta_1\) = 0.0055 is the value labeled ‘Default$balance’. We can add the logistic curve with these values of \(\beta_0\) and \(\beta_1\) to the plot of y vs ‘balance’ above:
# Re-create the y vs 'balance' plot
plot(y ~ Default$balance, las=1, pch=20, xlab="Balance")
points(balance_cut3_midpoints, p, col="red", pch=20)
abline(v=quan_combined, col="blue")
# Add a logistic curve with beta0 = fit$coefficients[1], beta1 = fit$coefficients[2]
curve(logit(x,fit$coefficients[1],fit$coefficients[2]), col="green", add=TRUE)
We see that the logistic model provides a good fit (green curve) to the result of the 14-box model, and it achieves the fit with only two parameters, \(\beta_0\) and \(\beta_1\).
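Once the coefficients are available, the fitted curve can be used to estimate the probability of default at any given balance. Here is a minimal sketch (the balance values below are made up for illustration):
# Estimated probability of default at a few hypothetical balance values
new_balance <- c(1000, 1500, 2000)
logit(new_balance, fit$coefficients[1], fit$coefficients[2])
# The built-in plogis() computes the same logistic function
plogis(fit$coefficients[1] + fit$coefficients[2]*new_balance)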
[1] Actually, the expression should be multiplied by a factor if we don’t care about the order of getting ‘1’ and ‘0’. If you are familiar with combinatorics, you will know that the factor is \({100 \choose 20} = 100!/(20! 80!)=5.359834\times 10^{20}\). But this factor does not affect the result we are interested in and so is omitted. ↩
[2] The option length=11 should really be length.out=11, but I abbreviated it to take advantage of R’s partial matching of names. ↩
[3] We deliberately make mistakes in the note to demonstrate the usefulness of the help pages. A help page is to an R command as a dictionary is to a word. The ability to look up a help page is essential in learning R, just as the ability to look up a dictionary is essential in learning a human language. If you find it difficult to understand a help page, try Google. ↩
[4] We can also use the Monte Carlo method to determine the approximate values of the parameters. First, we write a function to calculate the log-likelihood:
# Define the log-likelihood function
# ln L = sum_i [y_i log(p(x_i,beta0,beta1)) + (1-y_i) log(1-p(x_i,beta0,beta1)) ]
loglike <- function(beta0,beta1,x,y) {
# Define the logistic function
logit <- function(x,beta0,beta1) {
1/(1 + exp(-beta0-beta1*x))
}
p <- logit(x,beta0,beta1)
sum( y*log(p) + (1-y)*log(1-p) )
}
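As a quick sanity check (using the coefficient values reported by glm() in the main text), the log-likelihood at the fitted coefficients should be about minus half of the residual deviance reported by glm():
# ln L at the glm() coefficients; should be roughly -1596.5/2, i.e. about -798
loglike(-10.651331, 0.005499, Default$balance, y)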
Next we write a function to implement the Monte Carlo method to find the maximum of the log likelihood function. The following code is modified from the Monte Carlo note. The function takes 5 parameters: N, beta0_range, beta1_range, x and y. The logic is exactly the same as the minimization code. We choose N random values of \(\beta_0\) and \(\beta_1\) in the specified ranges, calculate ln(likelihood) for these N values of (\(\beta_0\), \(\beta_1\)), and then pick the parameters that give the maximum value of ln(likelihood).
# Find the coefficients of logistic regression using the Monte Carlo method.
# N is the number of beta's to choose
# beta0_range is a numeric vector of length 2:
# choose beta0 in the range (beta0_range[1], beta0_range[2])
# beta1_range is a numeric vector of length 2:
# choose beta1 in the range (beta1_range[1], beta1_range[2])
#
# Find beta0 and beta1 by maximizing the log-likelihood function
#
find_betaMC <- function(N, beta0_range, beta1_range, x, y) {
# Generate N random values of beta0 and beta1 in the specified ranges.
many_beta0 <- runif(N, beta0_range[1], beta0_range[2])
many_beta1 <- runif(N, beta1_range[1], beta1_range[2])
# Calculate the ln(likelihood) for these beta0 and beta1
loglikelihood <- mapply(loglike, many_beta0,many_beta1, MoreArgs=list(x=x,y=y))
# Search for maximum
ind_max <- which.max(loglikelihood)
output <- c(many_beta0[ind_max],many_beta1[ind_max],loglikelihood[ind_max])
names(output) <- c("beta0", "beta1", "ln(likelihood)")
output
}
We can use this code to determine the parameters for the credit card data. Guessing the ranges of \(\beta_0\) and \(\beta_1\) requires some work. You can gain some idea of what ranges to try by making a few plots, or by performing crude searches. After some trial and error, we find that the likely range for \(\beta_0\) is (-100,0) and the likely range for \(\beta_1\) is (0,0.1). We will let the reader complete the last steps of running the code to determine the approximate values of \(\beta_0\) and \(\beta_1\), and compare them to the exact solution. The procedure is very similar to that in the Monte Carlo note. ↩
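For readers who want to check their work, one possible invocation using the ranges above might look like this (the exact output varies from run to run and is not shown here):
# Monte Carlo search with 10,000 random (beta0, beta1) pairs in the suggested ranges
set.seed(123)
find_betaMC(10000, c(-100, 0), c(0, 0.1), Default$balance, y)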