8  Ordinary Least Squares (OLS) Estimation

In this set of notes, you will learn how the coefficients from the fitted regression equation are estimated from the data. Recall that in the previous set of notes, we used the riverview.csv data to examine whether education level is related to income (see the data codebook). To begin, we will load several libraries and import the data into an object called city. We will also fit a model by regressing income on education level and storing those results in an object called lm.a.

# Load libraries
library(dplyr)
library(ggplot2)
library(readr)

# Import data
city = read_csv(file = "https://raw.githubusercontent.com/zief0002/modeling/main/data/riverview.csv")

#View data
city
# Fit regression model
lm.a = lm(income ~ 1 + education, data = city)
lm.a

Call:
lm(formula = income ~ 1 + education, data = city)

Coefficients:
(Intercept)    education  
     11.321        2.651  

The fitted regression equation is

\[ \hat{\mathrm{Income}_i} = 11.321 + 2.651(\mathrm{Education~Level}_i) \]


8.1 Ordinary Least Squares Estimation

How does R determine the coefficient values of \(\hat{\beta}_0=11.321\) and \(\hat{\beta}_1=2.651\)? These values are estimated from the data using a method called Ordinary Least Squares (OLS). To understand how OLS works, consider the following toy data set of five observations:

Table 8.1: Toy data set with predictor (X) and outcome (Y) for five observations.
Xi Yi
30 63
10 44
30 40
50 68
20 25

Which of the following two models fits these data better?

  • Model A: \(~~\hat{Y_i} = 28 + 0.8(X_i)\)
  • Model B: \(~~\hat{Y_i} = 20 + 1(X_i)\)

We could plot the data and both lines and try to determine which seems to fit better.

Scatterplot of the observed toy data and the OLS fitted regression line for Model A.

Scatterplot of the observed toy data and the OLS fitted regression line for Model B.


8.2 Data–Model Fit

In this case, the lines are similar and it is difficult to make a determination of which fits the data better by eyeballing the two plots. Instead of guessing which model fits better, we can actually quantify the fit for the data by computing the residuals (errors) for each model and then comparing both sets of residuals; larger errors indicate a worse fitting model (i.e., more misfit to the data).

Remember, to compute the residuals, we will first need to compute the predicted value (\(\hat{Y}_i\)) for each of the five observations for both models.

Table 8.2: Observed values, predicted values and residuals for Model A.
Xi Yi i ε̂i
30 63 52 11
10 44 36 8
30 40 52 -12
50 68 68 0
20 25 44 -19
Table 8.3: Observed values, predicted values and residuals for Model B.
Xi Yi i ε̂i
30 63 50 13
10 44 30 14
30 40 50 -10
50 68 70 -2
20 25 40 -15

Eyeballing the numeric values of the residuals is also problematic. The size of the residuals is similar for both Models. Also, the eyeballing method would be impractical for larger datasets. So, we have to further quantify the model fit (or misfit). The way we do that in practice is to consider the total amount of error across all the observations. Unfortunately, we cannot just sum the residuals to get the total because some of our residuals are negative and some are positive. To alleviate this problem, we first square the residuals, then we sum them.

\[ \begin{split} \mathrm{Total~Error} &= \sum\hat{\epsilon}_i^2 \\ &= \sum \left( Y_i - \hat{Y}_i\right)^2 \end{split} \]

This is called a sum of squared residuals or sum of squared error (SSE; good name, isn’t it). Computing the squared residuals for Model A and Model B we get:

Table 8.4: Observed values, predicted values, residuals, and squared residuals for Model A.
Xi Yi i ε̂i ε̂i2
30 63 52 11 121
10 44 36 8 64
30 40 52 -12 144
50 68 68 0 0
20 25 44 -19 361
Table 8.5: Observed values, predicted values, residuals, and squared residuals for Model B.
Xi Yi i ε̂i ε̂i2
30 63 50 13 169
10 44 30 14 196
30 40 50 -10 100
50 68 70 -2 4
20 25 40 -15 225

Summing these squared values for each model we obtain:

  • Model A: SSE = 690
  • Model B: SSE = 694

Once we have quantified the model misfit, we can choose the model that has the least amount of error. Since Model A has a lower SSE than Model B, we would conclude that Model A is the better fitting model to the data.


8.2.1 Visualizing the SSE

To further understand the sum of squared error, we can examine a visual representation of the SSE for Model A. Recall that visually, the residual is the vertical distance between an observation and the fitted value (which lie on the fitted line). The residual indicates how different these two quantities are on the Y-metric. In the formula we squared each of the residuals. Visually, this is equivalent to producing the area of a square that has a side length equal to the absolute value of the residual.

This plot visually displays the residual values as line segments with negative residuals shown as dashed lines.

This plot visually displays the squared residuals as the area of a square with side length equal to the absolute value of the residual.
Figure 8.1: Scatterplot of the observed toy data and the OLS fitted regression line for Model A.

The SSE is simply the total area encompassed by all of the squares. Note that the observation that is directly on the line has a residual of 0 and thus does not contribute a quantity to the SSE. If you computed the SSE for a line with different intercept or slope values, the SSE will be different. The plot below shows what this might look like for the flat line produced by \(~~\hat{Y_i} = 50\).

Scatterplot of the observed toy data and the fitted flat line with Y-intercept of 50. The plot visually shows the squared residuals as the area of a square with side length equal to the absolute value of the residual.

Powell & Lehe (2015) created an interactive website to help understand how the SSE is impacted by changing the intercept or slope of a line. You can also see how individual observations impact the SSE value.


8.3 “Best” Fitting Model

In the vocabulary of statistical estimation, the process we just used to adopt Model A over Model B was composed of two parts:

  • Quantification of Model Fit: We quantify how well (or not well) the estimated model fits the data; and
  • Optimization: We find the “best” model based on that quantification. (This boils down to finding the model that produces the biggest or smallest measure of model fit.)

In our example we used the SSE as the quantification of model fit, and then we optimized by selecting the model with the lower SSE. When we use lm() to fit a regression analysis to the data, R needs to consider not just two models like we did in our example, but all potential models (i.e., any intercept and slope). The model coefficients that lm() returns are the “best” in that no other values for intercept or slope would produce a lower SSE. The model returned has the lowest SSE possible thus least squares. For our toy dataset, the model that produces the smallest residuals is

\[ \hat{Y}_i = 28.682 + 8.614(X_i) \]

This model gives the following predicted values and residuals:

Table 8.6: Observed values, predicted values, residuals, and squared residuals for the ‘best’ fitting model.
Xi Yi i ε̂i ε̂i2
30 63 49.61364 13.386364 179.1947
10 44 33.47727 10.522727 110.7278
30 40 49.61364 -9.613636 92.4220
50 68 65.75000 2.250000 5.0625
20 25 41.54545 -16.545455 273.7521

The SSE is 661.16. This is the smallest SSE possible for a linear model. Any other value for the slope or intercept would result in a higher SSE.


8.3.1 Mathematical Optimization

Finding the intercept and slope that give the lowest SSE is referred to as an optimization problem in the field of mathematics. Optimization is such an important (and sometimes difficult) problem that there have been several mathematical and computational optimization methods that have been developed over the years. You can read more about mathematical optimization on Wikipedia if you are interested.

One common mathematical method to find the minimum SSE involves calculus. We would write the SSE as a function of\(\beta_0\) and \(\beta_1\), compute the partial derivatives (w.r.t. each of the coefficients), set these equal to zero, and solve to find the values of the coefficients. The lm() function actually uses an optimization method called QR decomposition to obtain the regression coefficients. The actual mechanics and computation of these methods are beyond the scope of this course. We will just trust that the lm() function is doing things correctly in this course.


8.4 Computing the SSE for the Model Fitted to the Riverview Data

Since the regression model is based on the lowest SSE, it is often useful to compute and report the model’s SSE. We can use R to compute the SSE by carrying out the computations underlying the formula for SSE. Recall that the SSE is

\[ \mathrm{SSE} = \sum \left( Y_i - \hat{Y}_i\right)^2 \]

We need to compute the:

  1. Predicted values (\(\hat{Y}_i\));
  2. Residuals (\(e_i\));
  3. Squared residuals (\(e_i^2\)); and finally,
  4. Sum of the squared residuals (\(\sum e_i^2\)).

From the Riverview data set we have the observed X (education level) and Y (income) values, and from the fitted lm() we have the intercept and slope estimates for the ‘best’ fitting regression model.

# Compute the SSE
city |>
  mutate(
    y_hat = 11.321 + 2.651 * education,  # Step 1: Compute the predicted values of Y
    errors = income - y_hat,             # Step 2: Compute the residuals
    sq_errors = errors ^ 2               # Step 3: Compute the squared residuals
  ) |>
  summarize(
    SSE = sum(sq_errors)                 # Step 4: Compute the sum of the squared residuals
  )

The SSE gives us information about the variation in Y (the outcome variable) that is left over (residual) after we fit the regression model. Since the regression model is a function of X, the SSE tells us about the variation in Y that is left over after we remove the variation associated with, or accounted for by X. In our example it tells us about the residual variation in incomes after we account for employee education level.

In practice, we often report the SSE, but we do not interpret the actual value. The value of the SSE is more useful when comparing models that have been fitted using the same outcome. When researchers are considering different models, the SSEs from these models are compared to determine which model produces the least amount of misfit to the data (similar to what we did earlier).


8.5 References

Powell, V., & Lehe, L. (2015). Ordinary least squares regression: Explained visually. Explained Visually: A Setosa Project. https://setosa.io/ev/ordinary-least-squares-regression/