Introduction

Given a set of data points \((x_1,y_1), (x_2,y_2), (x_3,y_3),\cdots (x_n,y_n)\), we attempt to find a straight line \(y=\beta_0+\beta_1x\) that “best fits” the data. Before we find a solution, we first have to define what “best fit” means. One useful measure of the “goodness of fit” is the sum of the squared residuals, SSE: \[SSE = \sum_{i=1}^n [y_i-(\beta_0+\beta_1 x_i)]^2\] The least-squares prescription is to find \(\beta_0\) and \(\beta_1\) that minimize \(SSE\). This minimization problem can be solved analytically. The solution is the familiar pair of regression formulae: \[\beta_1 = r\, SD_y/SD_x \ \ , \ \ \beta_0 = \bar{y}-\beta_1 \bar{x}\] The detailed math can be found in this note for mathematically inclined students.
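These formulae are easy to evaluate directly in R. Here is a minimal sketch with made-up vectors x and y (any paired numeric data will do):

# Evaluate the regression formulae directly
# (x and y below are made-up numbers for illustration)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
beta1 <- cor(x, y)*sd(y)/sd(x)    # slope: r * SD_y / SD_x
beta0 <- mean(y) - beta1*mean(x)  # intercept: ybar - beta1*xbar
c(beta0, beta1)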

In this note, we introduce an alternative method to solve this minimization problem: the Monte Carlo method. The Monte Carlo method is a broad class of computational algorithms that use random sampling to obtain numerical results. It is widely used to solve problems that are too complicated to tackle by other techniques.

You will see that the Monte Carlo method is easy to understand and straightforward to implement.

Monte Carlo Method for Finding the Minimum

We use the Stat 100 Survey 2, Fall 2015 (combined) data for this demonstration. We first load the data from our website:

survey <- read.csv('http://courses.atlas.illinois.edu/spring2016/STAT/STAT200/RProgramming/data/Stat100_Survey2_Fall2015.csv')

Previously, we used the lm() function to find the regression coefficients predicting the party hours/week from drinks/week. Let’s reproduce the result here:

fit <- lm(partyHr ~ drinks, data=survey)
fit$coefficients
(Intercept)      drinks 
  2.3152229   0.5461528 

These values of intercept and slope are the solution to the minimization problem. Any other values of intercept and slope will increase the value of SSE.

Let’s now try to determine the intercept and slope using the Monte Carlo method, and we will compare the Monte Carlo result to the exact solution. Our goal is to find the intercept and slope that minimize the SSE. We first write a function, which we call computeSSE, to compute the SSE for any given values of intercept beta0 and slope beta1:

computeSSE <- function(beta0,beta1, x, y) {
  sum( (y - (beta0 + beta1*x))^2 )
}

The naive way of performing the minimization is to put a large number of different values of beta0 and beta1 into the computeSSE() function and see which pair gives the smallest value. This is the basic idea of the Monte Carlo method. The only thing that the Monte Carlo method adds is that the values you try should be chosen randomly. The random prescription guarantees that the error of the result you get after \(N\) trials will scale according to \(1/\sqrt{N}\) for large \(N\).

Let’s see how we can implement the Monte Carlo method for our minimization problem. First we want to pick the ranges for the values of the intercept and slope. By making a few plots, we find that a reasonable choice is that the intercept should be between 0 and 5, and the slope should be between 0 and 1. The next step is to randomly pick a value for the intercept beta0 in the range (0,5) and a value for the slope beta1 in the range (0,1). Call the computeSSE() function with the chosen beta0 and beta1 and record the SSE. Then pick another set of random beta’s and record the SSE. Repeat these steps as many times as we want. Finally, we look at the recorded SSEs and pick the smallest one. The beta’s that give this SSE are our best estimates for the intercept and slope. Implementing these steps by hand is too tedious, so we want to make the computer do it.

We first want to generate a large set of (beta0,beta1). Since we decided to try beta0 in the range (0,5) and beta1 in the range (0,1), we can use the runif() function to generate many random numbers uniformly distributed in the specified ranges. Let’s now generate ten thousand different random values of beta0 and beta1:

set.seed(972349003)
n <- 1e4
many_beta0 <- runif(n,0,5)
many_beta1 <- runif(n,0,1)

Next we want to feed these beta0 and beta1 to the computeSSE() function. The simplest method is to use a for-loop:

# Initialize the vector SSE1 to NA's
SSE1 <- rep(NA,n)
for (i in 1:n) {
  SSE1[i] <- computeSSE(many_beta0[i], many_beta1[i], survey$drinks, survey$partyHr)
}

It is sometimes suggested that programmers should avoid using for-loops. The main concern is that for-loops tend to slow down the code. Let’s try to use one of the ‘*apply’ loop functions instead. Since we want to loop over two parameters, beta0 and beta1, mapply() is the loop function to use. Note that computeSSE() has 4 arguments but we only want to loop over the first two. How do we tell mapply() to fix the last two arguments? It is time to consult ?mapply. According to the help page, we can use the parameter MoreArgs to specify a list of other arguments to computeSSE(). So in the following function call, we tell mapply() to always pass survey$drinks to x and survey$partyHr to y while looping over beta0 and beta1.

SSE2 <- mapply(computeSSE, many_beta0,many_beta1, 
              MoreArgs=list(x=survey$drinks, y=survey$partyHr))

Both SSE1 and SSE2 are vectors of length 10,000, storing the SSE values from the 10,000 pairs of beta0 and beta1. Let’s verify that they are exactly the same by looking at the maximum absolute difference between SSE1 and SSE2:

max(abs(SSE1-SSE2))
[1] 0

This shows that the two vectors are identical. Does using a for-loop slow down the code? I have done a test comparing the for-loop with the mapply() call, and I find that the two methods take about the same amount of time. Using a for-loop does not significantly slow down the code, at least not for this particular problem, so which method to use is a matter of style. In the following, we will use the mapply() method.
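If you want to check the timing yourself, system.time() measures the elapsed time of an R expression. A minimal sketch (the exact timings will vary from machine to machine):

# Time the for-loop version
system.time({
  SSE1 <- rep(NA, n)
  for (i in 1:n) {
    SSE1[i] <- computeSSE(many_beta0[i], many_beta1[i], survey$drinks, survey$partyHr)
  }
})
# Time the mapply() version
system.time(
  SSE2 <- mapply(computeSSE, many_beta0, many_beta1,
                 MoreArgs=list(x=survey$drinks, y=survey$partyHr))
)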

Having computed the 10,000 values of SSE, we want to pick the minimum SSE and extract the beta’s that produce the minimum:

min_ind <- which.min(SSE2)
SSEmin <- SSE2[min_ind]
SSEmin
[1] 22909
min_ind
[1] 6442

Here which.min() returns the index of the (first) minimum of a numeric vector. We see that the minimum SSE in our 10,000 samples is 22909 and the index of this minimum is 6442. The best estimate of beta by the Monte Carlo (MC) method is

beta_MC <- c(many_beta0[min_ind],many_beta1[min_ind])
beta_MC
[1] 2.3331831 0.5456774

So the MC estimate of the intercept is 2.3332 and that of the slope is 0.5457, both close to the exact values calculated above.
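We can look at the deviations directly by subtracting the exact coefficients stored in fit:

# Deviation of the MC estimate from the exact least-squares solution;
# from the numbers above, the intercept differs by about 0.018
# and the slope by about -0.0005
beta_MC - fit$coefficients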

A Monte Carlo Code

In the following, we gather the pieces of code together to write a single function find_betaMC(), which uses the Monte Carlo method to find the regression coefficients.

# Find the coefficients of simple regression using the Monte Carlo method.
# N is the number of beta's to choose
# intercept_range is a numeric vector of length 2:
#    choose the intercept in the range (intercept_range[1], intercept_range[2])
# slope_range is a numeric vector of length 2:
#    choose the slope in the range (slope_range[1], slope_range[2])
#
# Find beta0 (intercept) and beta1 (slope) by minimizing
# SSE = sum( (y - (beta0 + beta1*x))^2 )
#
find_betaMC <- function(N, intercept_range, slope_range, x, y) {
  
  # Function that calculates SSE
  computeSSE <- function(beta0,beta1,x,y) {
     sum( (y - (beta0 + beta1*x))^2 )
  }
  
  many_beta0 <- runif(N, intercept_range[1],intercept_range[2])
  many_beta1 <- runif(N, slope_range[1], slope_range[2])
  SSE <- mapply(computeSSE, many_beta0,many_beta1, MoreArgs=list(x=x,y=y))
  ind_min <- which.min(SSE)
  output <- c(many_beta0[ind_min],many_beta1[ind_min],SSE[ind_min])
  names(output) <- c("intercept", "slope", "SSEmin")
  output
}

To use this function, we supply the ranges for the intercept and slope, specify N (the number of beta’s to try), and pass the data vectors x and y. Let’s see if we can use it to reproduce our previous result:

set.seed(972349003)
find_betaMC(1e4, c(0,5), c(0,1), survey$drinks, survey$partyHr)
   intercept        slope       SSEmin 
2.333183e+00 5.456774e-01 2.290900e+04 

This verifies our new function.

Error Estimate

There are many ways to improve our MC result. For example, we now know the intercept is around 2.3 and the slope is around 0.5. We can try again by narrowing our search ranges. Instead of trying values for the intercept between 0 and 5, we can now choose values between 2.2 and 2.4. For the slope, we can choose values between 0.5 and 0.6. We can also estimate the errors of the coefficients by calling the function several times. Let’s do it 20 times with the narrowed search ranges:

set.seed(518963489)
betasMC <- replicate(20, find_betaMC(1e4, c(2.2,2.4), c(0.5,0.6), 
                                    survey$drinks, survey$partyHr))
betasMC
                  [,1]         [,2]         [,3]         [,4]         [,5]
intercept 2.313440e+00 2.313270e+00 2.313831e+00 2.313157e+00 2.312638e+00
slope     5.461167e-01 5.461139e-01 5.463803e-01 5.461782e-01 5.463245e-01
SSEmin    2.290873e+04 2.290873e+04 2.290873e+04 2.290873e+04 2.290873e+04
                  [,6]         [,7]         [,8]         [,9]        [,10]
intercept 2.314295e+00 2.311277e+00     2.313852 2.314935e+00 2.313437e+00
slope     5.461093e-01 5.463177e-01     0.546262 5.464701e-01 5.462268e-01
SSEmin    2.290873e+04 2.290874e+04 22908.728328 2.290874e+04 2.290873e+04
                 [,11]        [,12]        [,13]        [,14]        [,15]
intercept 2.313221e+00 2.315265e+00 2.314810e+00 2.316396e+00 2.316661e+00
slope     5.462548e-01 5.461733e-01 5.460436e-01 5.463901e-01 5.460771e-01
SSEmin    2.290873e+04 2.290873e+04 2.290873e+04 2.290874e+04 2.290873e+04
                 [,16]        [,17]        [,18]        [,19]        [,20]
intercept 2.312759e+00 2.312442e+00 2.316049e+00 2.315074e+00 2.314623e+00
slope     5.462848e-01 5.463502e-01 5.463111e-01 5.463942e-01 5.460668e-01
SSEmin    2.290873e+04 2.290873e+04 2.290873e+04 2.290874e+04 2.290873e+04

This gives us information about the variation of the coefficients determined by the MC method. In addition, by replicating the process 20 times, we have effectively performed a calculation with \(N=2\times 10^5\). We can therefore combine the 20 runs by searching for the minimum of the returned SSEmin; the corresponding beta values are the result of a calculation with \(N=2\times 10^5\):

ind_min <- which.min(betasMC["SSEmin",])
betaMC_combined <- betasMC[1:2,ind_min]
betaMC_combined
intercept     slope 
2.3152652 0.5461733 

You may wonder why we replicate 20 calculations with \(N=10^4\) when we could just do one calculation with \(N=2\times 10^5\). The answer is that we can use the information from the 20 calculations to estimate the errors of the MC result. The Monte Carlo method uses random numbers to obtain numerical results, so each time we do the calculation we get slightly different numbers (unless we reset the random seed). The MC results follow a probability distribution with mean equal to the exact solution and variances that scale as \(1/N\) for large \(N\). Since we have performed 20 calculations with \(N=10^4\), we can estimate the variances for \(N=10^4\) using the var() function:

var_intercept_1e4 <- var(betasMC["intercept",])
var_slope_1e4 <- var(betasMC["slope",])
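Here var() computes the sample variance with the \(1/(n-1)\) denominator, which we can quickly confirm:

# var() uses the unbiased 1/(n-1) formula
v <- betasMC["intercept",]
all.equal(sum((v - mean(v))^2)/(length(v) - 1), var(v))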

Note that we do not include the factor \(19/20\) in the variances, because we are interested in estimating the variances of the underlying probability distribution, not the variances of our sample. So we use the formula \[ s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 \] as an unbiased estimator for the variance of the underlying probability distribution. Since the variances go down as \(1/N\), the variances for \(N=2\times 10^5\) can be estimated by dividing the variances for \(N=10^4\) by 20. Hence the standard deviations of our error estimates for \(N=2\times 10^5\) are

sd_intercept_2e5 <- sqrt(var_intercept_1e4/20)
sd_slope_2e5 <- sqrt(var_slope_1e4/20)
c(sd_intercept_2e5, sd_slope_2e5)
[1] 3.139118e-04 2.826804e-05

Based on the estimated standard deviations, we can attach an error estimate to the MC result:
intercept = 2.31527 ± 0.00031
slope = 0.54617 ± 3e-05
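If you prefer to produce this summary in code, sprintf() can format the estimates with their error bars (a minimal sketch):

# Format the MC estimates with one-standard-deviation error bars
sprintf("intercept = %.5f +/- %.5f", betaMC_combined["intercept"], sd_intercept_2e5)
sprintf("slope = %.5f +/- %.0e", betaMC_combined["slope"], sd_slope_2e5)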
The actual deviations from the exact solution are

betaMC_combined - fit$coefficients
   intercept        slope 
4.225226e-05 2.054540e-05 

We see that the exact values are within our estimated ranges. Note that we use the standard deviations as our error estimates. This is equivalent to a 68% confidence interval: for each of the intercept and the slope, there is a 68% chance that the exact value falls within our estimated range.

Nonlinear Curve Fitting

So far we have been considering linear models. It is often the case that the data depend on parameters in a nonlinear fashion. Standard numerical techniques have been developed for nonlinear curve fitting, but these methods tend to be complicated. The Monte Carlo method again offers an alternative to the standard techniques, and with it fitting a nonlinear model is just as easy as fitting a linear model.

Here we use a biological example mentioned in this book to demonstrate the Monte Carlo method. This example studies the change in the blood pressure of three animals when various doses of a drug were injected into them. The csv file of the data has been uploaded to our website. We can load it directly into R using the command

response <- read.csv('http://courses.atlas.illinois.edu/spring2016/STAT/STAT200/RProgramming/data/drug_response.csv', comment.char='#')

The option comment.char='#' tells R that '#' is a comment character: everything on a line after a '#' is ignored. The data frame response has three columns. The first column (logDose) is log(Dose), the logarithm (base 10) of the drug dose. The second column (dBP) is the change in blood pressure in units of mmHg. The third column (Animal) labels the three animals (“A”, “B” or “C”).

We want to fit a model predicting the change in blood pressure from log(Dose). Let’s first plot the data:

plot(dBP ~ logDose, data=response, col=as.integer(Animal), las=1, pch=19,
     xlab="log(Dose)", ylab="Change in Blood Pressure (mmHg)")

The column ‘Animal’ is a factor variable with levels “A”, “B” and “C”. Thus, as.integer(Animal) returns 1 for animal A, 2 for animal B and 3 for animal C. The data for animal A are plotted with col=1, which is black, data for animal B are plotted with col=2, which is red, data for animal C are plotted with col=3, which is green.

The plot clearly indicates that the change in blood pressure is not a linear function of log(Dose). The standard “dose-response” model is called the Hill equation, the four-parameter logistic equation, or the variable slope sigmoidal equation. All three names refer to the same model, in which the equation is given in the form \[y = b + \frac{t-b}{1+(Dose50/Dose)^s}\] Here \(y\) is the change in pressure, and \(b\), \(t\), \(s\) and \(Dose50\) are the 4 parameters in the model; they all take positive values. What do these parameters mean? Let’s examine the equation closely. In the absence of the drug, \(Dose=0\) and so \(Dose50/Dose \rightarrow \infty\). The Hill equation gives \(y=b\). Thus, \(b\) is the change in pressure, if any, in the absence of the drug. When \(Dose\) is very large, \(Dose50/Dose \rightarrow 0\) and the Hill equation gives \(y=t\). So \(t\) represents the change in pressure produced by an infinite amount of drug dose. When \(Dose=Dose50\), \(Dose50/Dose=1\) and the Hill equation gives \(y=(t+b)/2\). Hence \(Dose50\) is the amount of dose that produces a change in pressure halfway between the top response \(t\) and the bottom response \(b\). It can be shown that the function \(y\) increases with \(Dose\) in the Hill equation. So at zero dose, \(y=b\). As \(Dose\) increases, \(y\) increases. At \(Dose=Dose50\), \(y=(b+t)/2\). At very high dose, \(y\) saturates to the value \(t\). The parameter \(s\) is called the Hill slope; it measures the steepness of the dose-response curve.
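We can check these limiting behaviors numerically. The sketch below uses made-up parameter values, chosen purely for illustration:

# Numerical check of the limiting behaviors of the Hill equation
# (the values of b, t, s and Dose50 are made up for illustration)
hill_dose <- function(Dose, b, t, s, Dose50) {
  b + (t - b)/(1 + (Dose50/Dose)^s)
}
b <- 2; t <- 26; s <- 0.8; Dose50 <- 1e-6
hill_dose(1e-15, b, t, s, Dose50)   # tiny dose: y is close to b = 2
hill_dose(Dose50, b, t, s, Dose50)  # Dose = Dose50: y = (t+b)/2 = 14
hill_dose(1e5, b, t, s, Dose50)     # huge dose: y is close to t = 26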

Since the independent variable is \(x=\log(Dose)\), we have \(Dose=10^x\). Also define \(x50=\log(Dose50)\), so that \(Dose50=10^{x50}\). The Hill equation can then be rewritten in the form \[y=b+\frac{t-b}{1+10^{s(x50-x)}}\] It is useful to make plots to visualize the Hill equation. Even though there are 4 parameters, we can simplify matters somewhat in the following plot by setting \(b/t=0.1\). Then the Hill equation becomes \[\frac{y}{t} = \frac{b}{t} + \frac{1-b/t}{1+10^{s(x50-x)}}\] and so \(y/t\) depends only on \(x-x50=\log(Dose/Dose50)\) and \(s\).

# Define a function that calculates y/t: y/t = hill(xp,s,bt) 
# Here xp=log(Dose/Dose50)=x-x50, bt = b/t
hill <- function(xp,s,bt) {
  bt + (1-bt)/(1+10^(-s*xp))
}
curve(hill(xp,0.2,0.1),xlim=c(-7,7), ylim=c(0,1), xlab="log(Dose/Dose50) = x-x50", 
      ylab="y/t",las=1,main="Hill function with b/t=0.1",xname="xp",lwd=2)
curve(hill(xp,0.5,0.1),col="red",lty=2,add=TRUE,xname="xp",lwd=2)
curve(hill(xp,1,0.1), col="blue", lty=3, add=TRUE,xname="xp",lwd=2)
curve(hill(xp,2,0.1), col="green", lty=4, add=TRUE,xname="xp",lwd=2)
legend(-6,0.9,c("s=0.2","s=0.5","s=1.0","s=2.0"),col=c("black","red","blue","green"), 
                lty=1:4,lwd=2)

On the left, \(\log(Dose/Dose50)=-7\) corresponds to a very low dose and \(y/t \rightarrow b/t=0.1\). On the right, \(\log(Dose/Dose50)=7\) corresponds to a very high dose and \(y/t\) approaches 1. At \(Dose=Dose50\), \(\log(Dose/Dose50)=0\) and \(y/t=(b+t)/2t=(b/t+1)/2=0.55\). The parameter \(s\) determines how steeply \(y\) transitions from \(b\) to \(t\). You will see this type of S-shaped curve again when you study logistic regression later.

Let’s now go back to our curve fitting problem. When there is no drug, we don’t expect any response and so we set \(b=0\) and the model simplifies to
\[y = \frac{t}{1+10^{s(x50-x)}}\] There are three parameters, \(t\), \(s\) and \(x50\), that we need to fit. The least-squares prescription is to seek a solution that minimizes the SSE: \[SSE = \sum_{i=1}^n \left[y_i - \frac{t}{1+10^{s(x50-x_i)}}\right]^2\] We can easily modify our previous Monte Carlo code to tackle this problem:

# Find the coefficients t, s, and x50 in the Hill equation
# N is the number of sets of parameters to choose
# t_range is a numeric vector of length 2:
#    choose t in the range (t_range[1], t_range[2])
# s_range is a numeric vector of length 2:
#    choose s in the range (s_range[1], s_range[2])
# x50_range is a numeric vector of length 2:
#    choose x50 in the range (x50_range[1], x50_range[2])
#
# Find t, s and x50 by minimizing
# SSE = sum( (y - t/(1+10^((x50-x)*s)))^2 )
#
find_paraMC <- function(N, t_range, x50_range, s_range, x, y) {
  # Function that calculates SSE
  computeSSE <- function(t,x50,s, x,y) {
    sum( (y - t/(1 + 10^((x50-x)*s)))^2 )
  }
  
  many_t <- runif(N, t_range[1],t_range[2])
  many_x50 <- runif(N, x50_range[1],x50_range[2])
  many_s <- runif(N, s_range[1], s_range[2])
  SSE <- mapply(computeSSE, many_t,many_x50, many_s, MoreArgs=list(x=x,y=y))
  ind_min <- which.min(SSE)
  output <- c(many_t[ind_min],many_x50[ind_min],many_s[ind_min],SSE[ind_min])
  names(output) <- c("t", "x50", "s", "SSEmin")
  output
}

You can see that the above MC code is very similar to the previous one. There are now 3 parameters instead of 2, and the computeSSE() function is modified accordingly. Other than these changes, the algorithm is the same as before.

We can use the same strategy as before to narrow down the parameters. First, we do a crude search with N=1000, repeated 10 times. Based on the data plot and the meaning of the parameters, we can try \(t\) in the range (20,30), \(x50\) in the range (-6,-4), and \(s\) in the range (0.1,2):

set.seed(2786113)
replicate(10, find_paraMC(1000, c(20,30), c(-6,-4), c(0.1,2), response$logDose, response$dBP))
              [,1]       [,2]        [,3]        [,4]        [,5]
t       26.7183481  26.746015  27.3568580  25.9807774  27.1567137
x50     -5.8253698  -5.909356  -5.9010307  -5.9243184  -5.8050150
s        0.9265765   0.767706   0.9433985   0.8971518   0.7944455
SSEmin 120.7075970 103.666315 118.4112932 110.5552911 112.5350886
              [,6]       [,7]       [,8]        [,9]       [,10]
t       27.6985852  27.157380  26.635938  26.3115247  27.2986251
x50     -5.8619926  -5.861725  -5.851299  -5.9733473  -5.8482039
s        0.8302337   0.868535   0.877770   0.8559603   0.6973422
SSEmin 109.1946547 106.887090 111.113259 105.1023764 109.4166994

By looking at the variations of these MC results for \(t\), \(x50\) and \(s\), we can narrow the search by choosing \(t\) in the range (26,28) and \(s\) in (0.6,1), while keeping \(x50\) in (-6,-4). We next search in these ranges and run the code 100 times with N=10000.

set.seed(2786113)
coefMC <- replicate(100, find_paraMC(1e4, c(26,28), c(-6,-4), c(0.6,1), 
                            response$logDose, response$dBP))

We can afford to do more iterations in this problem because the data set is very small and it doesn’t take a long time to run the simulation. We now combine the 100 values of the coefficients to obtain a result for \(N=10^6\):

ind_min <- which.min(coefMC["SSEmin",])
coefMC_combined <- coefMC[1:3,ind_min]
coefMC_combined
         t        x50          s 
26.9636455 -5.9090758  0.8083159 

We use the same technique as above to estimate the error of the coefficients:

var_1e4 <- rep(NA,3)
for ( i in 1:3) {
   var_1e4[i] <- var(coefMC[i,]) 
}
sd_1e6 <- sqrt(var_1e4/100)
sd_1e6
[1] 0.0092955419 0.0008429068 0.0011839262
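As before, whether to use a for-loop or a loop function is a matter of style; the same variances can also be obtained in one line with apply():

# Equivalent one-liner: apply var() across the first three rows of coefMC
# (the result is a named vector, unlike the loop version)
var_1e4 <- apply(coefMC[1:3, ], 1, var)
sd_1e6 <- sqrt(var_1e4/100)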

These numbers give us an estimate of the errors of the coefficients determined by the MC method. R has a built-in function nls() for nonlinear least-squares fitting. See ?nls for more information. Here we simply state the fitting result returned by the function: \(t=26.9702\), \(x50=-5.908\) and \(s=0.8048\). The MC values are close to these exact numbers.
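For reference, a call to nls() for this model might look like the following sketch; the starting values, which nls() requires, are taken from our MC estimates:

# Fit the Hill model (with b=0) using nls(); starting values from the MC result
fit_nls <- nls(dBP ~ t/(1 + 10^((x50 - logDose)*s)), data=response,
               start=list(t=27, x50=-5.9, s=0.8))
coef(fit_nls)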

Finally, we plot the fitted curve together with the data:

plot(dBP ~ logDose, data=response, col=as.integer(Animal), las=1, pch=19,
     xlab="log(Dose)", ylab="Change in Blood Pressure (mmHg)")
t <- coefMC_combined[1]
x50 <- coefMC_combined[2]
s <- coefMC_combined[3]
curve(t/(1+10^((x50-x)*s)), col="blue",add=TRUE)

We see that the curve fits the data fairly well.

This completes our demonstration of using the Monte Carlo method to solve a minimization problem. The Monte Carlo method is straightforward to implement and can be very useful for solving complicated problems.