Chapter 19 Statistical Application: Estimating Regression Coefficients

In this chapter, you will learn how matrix algebra is used to compute regression coefficients.


19.1 Regression Model

Recall the model equation we use in linear regression:

\[ Y_i = \beta_0 + \beta_1(X_{1_i}) + \beta_2(X_{2_i}) + \ldots + \beta_k(X_{k_i}) + \epsilon_i \]

where the response variable (Y) is represented as a linear function of the set of predictors (\(X_1,X_2,\ldots,X_k\)) and a residual (\(\epsilon\)). Equation terms with an i subscript vary across subjects. Terms without an i subscript are the same (fixed) across subjects.

Recall that this notation is a mathematical expression of the n subject-specific equations:

\[ \begin{split} Y_1 &= \beta_0 + \beta_1(X_{1_1}) + \beta_2(X_{2_1}) + \ldots + \beta_k(X_{k_1}) + \epsilon_1 \\ Y_2 &= \beta_0 + \beta_1(X_{1_2}) + \beta_2(X_{2_2}) + \ldots + \beta_k(X_{k_2}) + \epsilon_2 \\ Y_3 &= \beta_0 + \beta_1(X_{1_3}) + \beta_2(X_{2_3}) + \ldots + \beta_k(X_{k_3}) + \epsilon_3 \\ \vdots &~ ~~~~~\vdots ~~~~~~~~~~\vdots~~~~~~~~~~~~~~~~~~~\vdots~~~~~~~~~~~~~~\vdots~~~~~~~~~~~~~\vdots~~~~~~~~~~~\vdots \\ Y_n &= \beta_0 + \beta_1(X_{1_n}) + \beta_2(X_{2_n}) + \ldots + \beta_k(X_{k_n}) + \epsilon_n \end{split} \]

These can be arranged into a set of vectors and matrices,

\[ \begin{bmatrix}Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n\end{bmatrix} = \begin{bmatrix}\beta_0(1) + \beta_1(X_{1_1}) + \beta_2(X_{2_1}) + \ldots + \beta_k(X_{k_1}) \\ \beta_0(1) + \beta_1(X_{1_2}) + \beta_2(X_{2_2}) + \ldots + \beta_k(X_{k_2}) \\ \beta_0(1) + \beta_1(X_{1_3}) + \beta_2(X_{2_3}) + \ldots + \beta_k(X_{k_3}) \\ \vdots \\ \beta_0(1) + \beta_1(X_{1_n}) + \beta_2(X_{2_n}) + \ldots + \beta_k(X_{k_n})\end{bmatrix} + \begin{bmatrix}\epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n\end{bmatrix} \]

We can re-write this as,

\[ \begin{bmatrix}Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n\end{bmatrix} = \begin{bmatrix}1 & X_{1_1} & X_{2_1} & \ldots & X_{k_1} \\ 1 & X_{1_2} & X_{2_2} & \ldots & X_{k_2} \\ 1 & X_{1_3} & X_{2_3} & \ldots & X_{k_3} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & X_{1_n} & X_{2_n} & \ldots & X_{k_n}\end{bmatrix} \begin{bmatrix}\beta_0 \\ \beta_1 \\\beta_2 \\ \vdots \\ \beta_k\end{bmatrix}+ \begin{bmatrix}\epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n\end{bmatrix} \]

Naming these vectors and matrices, we can use matrix notation to compactly write the regression model as,

\[ \underset{n \times 1}{\mathbf{y}} = \underset{n \times (k+1)}{\mathbf{X}}~\underset{(k+1) \times 1}{\mathbf{b}} + \underset{n \times 1}{\mathbf{e}} \]

In the equation above,

  • y is a vector of the outcome values,
  • X is a matrix referred to as the design matrix (a.k.a. the model matrix or data matrix),
  • b is a vector of regression coefficients, and
  • e is a vector of the residuals.
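
To make this notation concrete in R, here is a minimal sketch of building y and the design matrix X for a small hypothetical data set with two predictors (the variable names and values below are made up purely for illustration):

# Hypothetical outcome and two predictors (values made up for illustration)
y = c(4, 7, 5, 9)
x1 = c(1, 3, 2, 5)
x2 = c(10, 12, 9, 15)

# Design matrix: a column of ones (for the intercept) followed by the predictors
X = cbind(1, x1, x2)

# Equivalently, model.matrix() builds the design matrix from a model formula
X = model.matrix(~ 1 + x1 + x2)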


19.2 Estimating the Regression Coefficients

When we fit a regression model to a dataset, one of the goals is to estimate the regression coefficients, \(\beta_0, \beta_1, \beta_2,\ldots,\beta_k\). Based on the regression equation,

\[ \mathbf{y} = \mathbf{Xb} + \mathbf{e} \]

our goal is to solve for the terms in the b vector. (Note that here, and moving forward, the dimensions of each matrix/vector are omitted when we write the regression model.) How this is done depends on the estimation method used. For the rest of this chapter, we will assume that Ordinary Least Squares (OLS) is being used to estimate the coefficients.

In OLS estimation, we want to find the coefficient values that produce the smallest sum of squared residuals. To do this, we first re-write the regression equation to isolate the residual vector:

\[ \mathbf{e} = \mathbf{y} - \mathbf{Xb} \]

The sum of squared residuals can be expressed in matrix notation as \(\mathbf{e}^{\intercal}\mathbf{e}\). This implies:

\[ \mathbf{e}^{\intercal}\mathbf{e} = (\mathbf{y} - \mathbf{Xb})^{\intercal} (\mathbf{y} - \mathbf{Xb}) \]

Using the rules of transposes, namely that \((\mathbf{Xb})^{\intercal} = \mathbf{b}^{\intercal}\mathbf{X}^{\intercal}\), and expanding the right-hand side, we get:

\[ \mathbf{y}^{\intercal}\mathbf{y} - \mathbf{b}^{\intercal}\mathbf{X}^{\intercal}\mathbf{y} - \mathbf{y}^{\intercal}\mathbf{X}\mathbf{b} + \mathbf{b}^{\intercal}\mathbf{X}^{\intercal}\mathbf{X}\mathbf{b} \]

Each of these terms is a \(1\times 1\) matrix,1 which implies that each term is equal to its transpose. We can therefore re-write the third term, \(\mathbf{y}^{\intercal}\mathbf{X}\mathbf{b}\), as its transpose, \(\mathbf{b}^{\intercal}\mathbf{X}^{\intercal}\mathbf{y}\). Re-writing, we get:

\[ \mathbf{y}^{\intercal}\mathbf{y} - \mathbf{b}^{\intercal}\mathbf{X}^{\intercal}\mathbf{y} - \mathbf{b}^{\intercal}\mathbf{X}^{\intercal}\mathbf{y} + \mathbf{b}^{\intercal}\mathbf{X}^{\intercal}\mathbf{X}\mathbf{b} \]

Combining the two middle terms,

\[ \mathbf{y}^{\intercal}\mathbf{y} - 2\mathbf{b}^{\intercal}\mathbf{X}^{\intercal}\mathbf{y} + \mathbf{b}^{\intercal}\mathbf{X}^{\intercal}\mathbf{X}\mathbf{b} \]
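
As a quick numerical check (using made-up values for illustration; any conformable design matrix, outcome vector, and coefficient vector will do), the expanded expression gives the same value as computing \(\mathbf{e}^{\intercal}\mathbf{e}\) directly:

# Made-up design matrix, outcome vector, and (arbitrary) coefficient vector
X_demo = cbind(1, c(2, 4, 6, 8))
y_demo = c(3, 5, 4, 9)
b_demo = c(1, 0.5)

# Residual vector and its sum of squares: e'e
e_demo = y_demo - X_demo %*% b_demo
t(e_demo) %*% e_demo

# Expanded form: y'y - 2b'X'y + b'X'Xb (gives the same value)
t(y_demo) %*% y_demo - 2 * t(b_demo) %*% t(X_demo) %*% y_demo +
  t(b_demo) %*% t(X_demo) %*% X_demo %*% b_demo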

To find the values of the elements in b that minimize the sum of squared residuals, we differentiate this expression with respect to b.

\[ \frac{\partial}{\partial\mathbf{b}}\Big(\mathbf{y}^{\intercal}\mathbf{y} - 2\mathbf{b}^{\intercal}\mathbf{X}^{\intercal}\mathbf{y} + \mathbf{b}^{\intercal}\mathbf{X}^{\intercal}\mathbf{X}\mathbf{b}\Big) \]

Although calculus, especially calculus on matrices, is beyond the scope of this book, Fox (2009) gives the interested reader some mathematical background on optimization (i.e., minimizing). For now, you just need to understand that we can optimize a function by computing its derivative, setting the derivative equal to zero, and solving for any remaining unknowns.
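
As a small illustration of this idea in the scalar case (a single predictor with no intercept, so there is only one unknown coefficient), minimizing the sum of squared residuals proceeds in exactly this way:

\[ \begin{split} f(b) &= \sum_{i=1}^n \big(Y_i - bX_i\big)^2 \\[2ex] f^{\prime}(b) &= -2\sum_{i=1}^n X_i\big(Y_i - bX_i\big) \\[2ex] 0 &= -2\sum X_iY_i + 2b\sum X_i^2 \\[2ex] b &= \frac{\sum X_iY_i}{\sum X_i^2} \end{split} \]

The matrix derivation that follows is the multi-coefficient version of this same argument.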

Differentiating the matrix expression, we get:

\[ -2\mathbf{X}^{\intercal}\mathbf{y} + 2\mathbf{X}^{\intercal}\mathbf{Xb} \]

We set this equal to zero and solve for b.

\[ \begin{split} -2\mathbf{X}^{\intercal}\mathbf{y} + 2\mathbf{X}^{\intercal}\mathbf{Xb} &= 0 \\[2ex] 2\mathbf{X}^{\intercal}\mathbf{Xb} &= 2\mathbf{X}^{\intercal}\mathbf{y} \\[2ex] \mathbf{X}^{\intercal}\mathbf{Xb} &= \mathbf{X}^{\intercal}\mathbf{y} \end{split} \]

This representation of the OLS equations:

\[ \mathbf{X}^{\intercal}\mathbf{Xb} = \mathbf{X}^{\intercal}\mathbf{y} \]

is referred to as the Normal Equations. It is the basis for many methods of solving for b.


To isolate b, we pre-multiply both sides of the equation by \((\mathbf{X}^{\intercal}\mathbf{X})^{-1}\), assuming this inverse exists (a condition we return to in Section 19.3).

\[ \begin{split} (\mathbf{X}^{\intercal}\mathbf{X})^{-1}(\mathbf{X}^{\intercal}\mathbf{X})\mathbf{b} &= (\mathbf{X}^{\intercal}\mathbf{X})^{-1}(\mathbf{X}^{\intercal}\mathbf{y}) \\[2ex] \mathbf{I}\mathbf{b} &= (\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal}\mathbf{y} \\[2ex] \mathbf{b} &= (\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal}\mathbf{y} \end{split} \]

The vector of regression coefficients can be obtained from:

\[ \mathbf{b} = (\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal}\mathbf{y} \]

This implies that the vector of regression coefficients can be obtained directly through manipulation of the design matrix and the vector of outcomes. In other words, the OLS coefficients are a direct function of the data.


19.2.1 Example Using Data

We will use the following toy data set to illustrate how regression is carried out via matrix algebra.

Table 19.1: Example set of education data.

 ID   SAT   GPA   Self-Esteem    IQ
  1   560   3.0            11   112
  2   780   3.9            10   143
  3   620   2.9            19   124
  4   600   2.7             7   129
  5   720   3.7            18   130
  6   380   2.4            13    82

Say we wanted to fit a regression model using SAT and Self-Esteem to predict variation in GPA. We can estimate the regression coefficients by creating the design matrix (X) and the vector of outcomes (y), and then using matrix algebra.

# Create vector of outcomes
y = c(3.0, 3.9, 2.9, 2.7, 3.7, 2.4)

# Create design matrix
X = matrix(
  data = c(
    rep(1, 6), 
    560, 780, 620, 600, 720, 380, 
    11, 10, 19, 7, 18, 13
    ),
  ncol = 3
)

# View X
X
     [,1] [,2] [,3]
[1,]    1  560   11
[2,]    1  780   10
[3,]    1  620   19
[4,]    1  600    7
[5,]    1  720   18
[6,]    1  380   13
# Estimate coefficients
b = solve(t(X) %*% X) %*% t(X) %*% y
b
            [,1]
[1,] 0.659213517
[2,] 0.003805501
[3,] 0.009186998
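
Before comparing to lm(), note that the fitted values and residuals also follow directly from the matrix formulation. A brief sketch using the objects just created:

# Fitted values: y-hat = Xb
y_hat = X %*% b

# Residuals: e = y - y-hat; their sum of squares is what OLS minimized
e = y - y_hat
sum(e^2)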

Let’s compare this to the estimates produced by the lm() function.

# Create data frame
d = data.frame(
  GPA = c(3.0, 3.9, 2.9, 2.7, 3.7, 2.4),
  SAT = c(560, 780, 620, 600, 720, 380),
  Self = c(11, 10, 19, 7, 18, 13)
)

# Fit model
lm.1 = lm(GPA ~ 1 + SAT + Self, data = d)

# View coefficients
coef(lm.1)
(Intercept)         SAT        Self 
0.659213517 0.003805501 0.009186998 

The matrix algebra and lm() function produce identical coefficient estimates.
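
One computational side note: explicitly inverting \(\mathbf{X}^{\intercal}\mathbf{X}\) is useful for exposition, but it is generally more numerically stable to solve the Normal Equations directly, or to use a QR decomposition of the design matrix (which is what lm() does internally). A minimal sketch of the former, using the X and y objects created above:

# Solve (X'X)b = X'y without explicitly forming the inverse
solve(crossprod(X), crossprod(X, y))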


19.3 The \(\mathbf{X}^\intercal\mathbf{X}\) Matrix

In computing estimates, the matrix that must be inverted is \(\mathbf{X}^\intercal\mathbf{X}\). In simple regression (with a single predictor) this is:

\[ \begin{split} \mathbf{X}^\intercal\mathbf{X} &= \begin{bmatrix}1 & 1 & 1 & \ldots & 1\\ X_1 & X_2 & X_3 & \ldots & X_n\end{bmatrix} \begin{bmatrix}1 & X_1\\ 1 & X_2\\ 1 & X_3\\ \vdots & \vdots\\ 1& X_n\end{bmatrix} \\[2ex] &= \begin{bmatrix}n & \sum X_i\\ \sum X_i & \sum X_i^2\end{bmatrix} \end{split} \]

The determinant can then be found as:

\[ \begin{split} \begin{vmatrix}n & \sum X_i\\ \sum X_i & \sum X_i^2\end{vmatrix} &= n\sum X_i^2 - \bigg(\sum X_i\bigg)^2 \\[2ex] &= n\sum X_i^2 - (n\bar{X})^2 \\[2ex] &= n \bigg(\sum X_i^2 - n\bar{X}^2\bigg) \\[2ex] &= n \sum(X_i - \bar{X})^2 \end{split} \]

This will be a positive value as long as there is variation in X, which implies that the inverse of \(\mathbf{X}^\intercal\mathbf{X}\) should exist. Note that when there is very little variation in X, \(\mathbf{X}^\intercal\mathbf{X}\) may be computationally singular, in which case the inverse cannot be computed reliably.
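
A quick numerical check of this identity, using the SAT scores from Table 19.1 as the single predictor:

# SAT scores from Table 19.1
sat = c(560, 780, 620, 600, 720, 380)
n = length(sat)

# Determinant of X'X computed both ways; the two expressions agree
n * sum(sat^2) - sum(sat)^2
[1] 579600
n * sum((sat - mean(sat))^2)
[1] 579600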

When the design matrix includes more than one predictor, the same logic applies: the inverse of \(\mathbf{X}^\intercal\mathbf{X}\) can be computed as long as there is variation in the predictors and no column of the design matrix is a linear combination of the other columns.
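
To see the failure mode concretely, we can append a column to the example design matrix that is an exact linear combination of an existing column (this extra column is added purely for illustration):

# Append a redundant predictor that is an exact linear combination (2 * SAT)
X_bad = cbind(X, 2 * X[ , 2])

# The numerical rank is 3, not 4, so X'X cannot be inverted;
# solve(crossprod(X_bad)) would fail with a "singular" error
qr(X_bad)$rank
[1] 3

(When given a redundant predictor like this, lm() drops the aliased term and reports NA for its coefficient.)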


References

Fox, J. (2009). A mathematical primer for social statistics. Sage.

  1. You should verify that the dimension of each term is \(1\times 1\).