📝 Principal Components Analysis via Singular Value Decomposition

Notes

A brief introduction to principal components analysis via singular value decomposition. Example taken from Chatterjee & Hadi (2012).

andy
10-05-2021

In 1964, the US Congress passed the Civil Rights Act and also ordered a survey of school districts to evaluate the availability of equal educational opportunity in public education. The results of this survey were reported in Coleman et al. (1966) and Mosteller & Moynihan (1972). We will use a subset of the data collected (in equal-education-opportunity.csv) to mimic one of the original regression analyses performed: examining whether the level of school facilities was an important predictor of student achievement after accounting for the variation in faculty credentials and peer influence.

The data come from a random sample of 70 schools in 1965. The variables, which have all been mean-centered and standardized, include:

- achievement: a measure of student achievement
- faculty: a measure of faculty credentials
- peer: a measure of peer influence
- school: a measure of school facilities

# Load libraries
library(broom)
library(corrr)
library(tidyverse)
library(tidymodels)
library(patchwork)

# Read in data
eeo = read_csv("~/Documents/github/epsy-8264/data/equal-education-opportunity.csv")
head(eeo)
# A tibble: 6 × 4
  achievement faculty    peer school
        <dbl>   <dbl>   <dbl>  <dbl>
1      -0.431   0.608  0.0351  0.166
2       0.800   0.794  0.479   0.534
3      -0.925  -0.826 -0.620  -0.786
4      -2.19   -1.25  -1.22   -1.04 
5      -2.85    0.174 -0.185   0.142
6      -0.662   0.202  0.128   0.273
# Source the residual_plots() function from the script file
source("../../scripts/residual_plots.R")

The problem we faced in the last set of notes was that the predictors in the model were collinear, so we encountered computational issues when trying to estimate the effects and their standard errors. One method for dealing with collinearity among a set of predictors is to combine the predictors into a smaller set of orthogonal measures (called principal components) that can be used in place of the original predictors. These measures will not have collinearity problems (they are orthogonal to one another), but they account for slightly less variance than the original set of predictors.


Singular Value Decomposition

Another decomposition method that creates orthogonal unit vectors (i.e., a basis) is singular value decomposition (SVD). This method decomposes a matrix A into the product of three matrices, namely:

\[ \mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{V}^{\intercal} \]

where U and V are orthogonal matrices and D is a diagonal matrix.
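To make this concrete, here is a minimal sketch in R using a small toy matrix (the matrix A below is made up for illustration and is not part of the EEO data). The svd() function returns the three pieces of the decomposition, and multiplying them back together recovers the original matrix.

# Toy matrix for illustration (not part of the EEO data)
A = matrix(c(2, 1, 0, 1, 3, 1, 0, 1, 2), nrow = 3)

# SVD of A
decomp = svd(A)

# U and V have orthonormal columns; d holds the diagonal elements of D
round(crossprod(decomp$u), 10)  # U'U = I
round(crossprod(decomp$v), 10)  # V'V = I

# Multiplying U D V' recovers A (up to rounding)
decomp$u %*% diag(decomp$d) %*% t(decomp$v)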

PCA using SVD: Matrix Algebra

To carry out a principal components analysis, we need to use SVD to decompose the matrix \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\), where \(\mathbf{X}_{\mathrm{Predictor}}\) is a matrix of the predictors being used in the PCA. Below, we create the \(\mathbf{X}_{\mathrm{Predictor}}\) matrix and then use the svd() function to carry out the singular value decomposition of the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix.

# Create matrix of predictors
X_p = as.matrix(eeo[ , c("faculty", "peer", "school")])

# SVD decomposition
sv_decomp = svd(t(X_p) %*% X_p)

# View results
sv_decomp
$d
[1] 209.3295610   2.7303093   0.5833274

$u
           [,1]       [,2]       [,3]
[1,] -0.6174223  0.6698093 -0.4124865
[2,] -0.5243779 -0.7413256 -0.4188844
[3,] -0.5863595 -0.0423298  0.8089442

$v
           [,1]       [,2]       [,3]
[1,] -0.6174223  0.6698093 -0.4124865
[2,] -0.5243779 -0.7413256 -0.4188844
[3,] -0.5863595 -0.0423298  0.8089442

\[ \begin{split} \mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}} &= \mathbf{U}\mathbf{D}\mathbf{V}^{\intercal} \\[1em] \begin{bmatrix}81.123 & 66.518 & 75.512 \\ 66.518 & 59.163 & 64.251 \\ 75.512 & 64.251 & 72.358\end{bmatrix} &= \begin{bmatrix}-0.617 & 0.670 & -0.412 \\ -0.524 & -0.741 & -0.419 \\ -0.586 & -0.042 & 0.809\end{bmatrix} \begin{bmatrix}209.330 & 0 & 0 \\ 0 & 2.730 & 0 \\ 0 & 0 & 0.583 \end{bmatrix} \begin{bmatrix}-0.617 & -0.524 & -0.586 \\ 0.670 & -0.741 & -0.042 \\ -0.412 & -0.419 & 0.809 \end{bmatrix} \end{split} \]
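As a quick sanity check (a sketch using the objects created above), we can multiply the three matrices returned by svd() and confirm that the product reproduces the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix.

# Reconstruct X'X by computing U D V'
round(sv_decomp$u %*% diag(sv_decomp$d) %*% t(sv_decomp$v), 3)

# Compare to the original X'X matrix
round(t(X_p) %*% X_p, 3)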


Understanding What the Matrix Algebra is Doing

Mathematically, since any matrix can be decomposed using SVD, we can also decompose \(\mathbf{X}_{\mathrm{Predictor}} = \mathbf{U}\mathbf{D}\mathbf{V}^{\intercal}\). Then we can write the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix as:

\[ \mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}} = (\mathbf{U}\mathbf{D}\mathbf{V}^{\intercal})^{\intercal} (\mathbf{U}\mathbf{D}\mathbf{V}^{\intercal}) \]

Re-expressing this we get:

\[ \mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}} = \mathbf{V}\mathbf{D}^{\intercal}\mathbf{U}^{\intercal}\mathbf{U}\mathbf{D}\mathbf{V}^{\intercal} \]

Because U is an orthogonal matrix, \(\mathbf{U}^{\intercal}\mathbf{U} = \mathbf{I}\). And since D is a diagonal matrix, \(\mathbf{D}^{\intercal}\mathbf{D} = \mathbf{D}^2\), so reducing this expression gives:

\[ \mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}} = \mathbf{V}\mathbf{D}^2\mathbf{V}^{\intercal} \]

The matrices V and \(\mathbf{V}^{\intercal}\) are both orthogonal basis matrices that act to change the coordinate system by rotating the original basis vectors used in the predictor space. The \(\mathbf{D}^2\) matrix diagonalizes the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix, which amounts to finding the major axes of the data ellipse along which the data vary.
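A short numerical sketch (again using the objects created above) shows this directly: rotating the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix with V produces a diagonal matrix whose diagonal elements are the values in sv_decomp$d.

# Rotating X'X into the principal-axis coordinate system diagonalizes it
round(t(sv_decomp$v) %*% t(X_p) %*% X_p %*% sv_decomp$v, 3)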


Using the SVD Decomposition as PCA

As was pointed out in the previous section, the important matrices from the SVD are the V matrix (rotation matrix) and the D matrix (diagonalization matrix). The columns of the V matrix are the principal components:

\[ \mathbf{V} = \begin{bmatrix}-0.617 & 0.670 & -0.412 \\ -0.524 & -0.741 & -0.419 \\ -0.586 & -0.042 & 0.809 \end{bmatrix} \]

Note that these are the same principal components we obtained using the eigendecomposition. The values in the D matrix are mathematically related to the eigenvalues. Because we based the SVD on the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix, the diagonal elements of D are the eigenvalues! That is,

\[ \lambda_i = d_{ii} \]
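As a quick check (a sketch using the earlier objects), comparing the values from the SVD to the eigenvalues from an eigendecomposition of the same matrix shows they are identical.

# Eigenvalues of X'X ...
eigen(t(X_p) %*% X_p)$values

# ... are identical to the values in D from the SVD
sv_decomp$d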

The eigenvalues (which are the variances) can then be used to compute the proportion of variance for each of the principal components.

# Compute proportion of variance
sv_decomp$d / sum(sv_decomp$d)
[1] 0.984416916 0.012839862 0.002743222

The principal component scores can be obtained by postmultiplying the mean-centered predictor matrix by the V matrix.

# Mean center each predictor
X_p[, 1] = X_p[, 1] - mean(X_p[, 1])
X_p[, 2] = X_p[, 2] - mean(X_p[, 2])
X_p[, 3] = X_p[, 3] - mean(X_p[, 3])

# Compute PC scores
pc_scores = X_p %*% sv_decomp$v
head(pc_scores)
            [,1]        [,2]        [,3]
[1,] -0.41777127  0.37690208 -0.11724716
[2,] -0.98071768  0.15636966 -0.08255266
[3,]  1.36960233 -0.05831171 -0.02181284
[4,]  2.09547335  0.10933210  0.19860746
[5,] -0.02027426  0.25039535  0.13486066
[6,] -0.27859048  0.03203317  0.09791201

Using the prcomp() Function to Carry out PCA using SVD

In practice, we will use R functions (not the matrix algebra) to carry out a PCA using SVD. The prcomp() function carries out PCA using SVD decomposition. Different elements of the prcomp() object can then be accessed to obtain output from the PCA.

# Fit the PCA using SVD decomposition
svd_pca = eeo %>%
  select(faculty, peer, school) %>%
  scale(center = TRUE, scale = FALSE) %>% # center data
  prcomp()

# View standard deviations and rotation matrix (eigenvector matrix)
svd_pca 
Standard deviations (1, .., p=3):
[1] 1.7401965 0.1989041 0.0908621

Rotation (n x k) = (3 x 3):
               PC1         PC2        PC3
faculty -0.6173237  0.67025631 -0.4119077
peer    -0.5241853 -0.74086406 -0.4199407
school  -0.5866355 -0.04332342  0.8086915

Again, to compute the variances, we square the standard deviations output by the function, and then use those variances to compute the proportion of variance accounted for by each of the principal components.

# Compute variances
var_pc = svd_pca$sdev ^ 2
var_pc
[1] 3.028283805 0.039562836 0.008255921
# Compute variance accounted for
var_pc / sum(var_pc)
[1] 0.98445476 0.01286135 0.00268389

We can also obtain the scores on each principal component for each observation in the sample using the augment() function from the {broom} package.

# Obtain PC scores
augment(svd_pca)
# A tibble: 70 × 4
   .rownames .fittedPC1 .fittedPC2 .fittedPC3
   <chr>          <dbl>      <dbl>      <dbl>
 1 1            -0.418      0.377    -0.117  
 2 2            -0.981      0.156    -0.0827 
 3 3             1.37      -0.0582   -0.0214 
 4 4             2.10       0.109     0.199  
 5 5            -0.0203     0.250     0.135  
 6 6            -0.279      0.0319    0.0979 
 7 7            -0.0577     0.229    -0.00757
 8 8            -0.712      0.217     0.0974 
 9 9             1.08      -0.0198   -0.0380 
10 10           -1.41       0.167     0.0983 
# … with 60 more rows


Scaling the Predictors

In practice, it is important to scale all the predictors used in the PCA. This is especially true when the variables are measured in different metrics or have varying degrees of magnitude; in these cases, not scaling the predictors often produces results in which the variables with the largest scales dominate the PCA. Scaling is also helpful when the predictors are measured on qualitatively different scales (e.g., one is measured in dollars and another in years of education).

While it isn’t necessary to also center the predictors, it is common to simply standardize the set of predictors being used in the PCA (i.e., convert them to z-scores); thus both centering and scaling them. To do this, we will pipe the selected predictors into the scale() function prior to piping into the prcomp() function.

# Fit the PCA using SVD decomposition on standardized predictors
svd_pca_z = eeo %>%
  select(faculty, peer, school) %>%
  scale(center = TRUE, scale = TRUE) %>%
  prcomp()

# View standard deviations and rotation matrix (eigenvector matrix)
svd_pca_z
Standard deviations (1, .., p=3):
[1] 1.71813654 0.20011873 0.08921511

Rotation (n x k) = (3 x 3):
               PC1         PC2        PC3
faculty -0.5761385  0.67939712 -0.4544052
peer    -0.5754361 -0.73197527 -0.3648089
school  -0.5804634  0.05130072  0.8126687
# tidy version of the rotation matrix (good for graphing)
svd_pca_z %>%
  tidy(matrix = "rotation")
# A tibble: 9 × 3
  column     PC   value
  <chr>   <dbl>   <dbl>
1 faculty     1 -0.576 
2 faculty     2  0.679 
3 faculty     3 -0.454 
4 peer        1 -0.575 
5 peer        2 -0.732 
6 peer        3 -0.365 
7 school      1 -0.580 
8 school      2  0.0513
9 school      3  0.813 
# View sds, variance accounted for
svd_pca_z %>%
  tidy(matrix = "eigenvalues")
# A tibble: 3 × 4
     PC std.dev percent cumulative
  <dbl>   <dbl>   <dbl>      <dbl>
1     1  1.72   0.984        0.984
2     2  0.200  0.0134       0.997
3     3  0.0892 0.00265      1    
# Obtain PC scores
pc_scores = augment(svd_pca_z)
pc_scores
# A tibble: 70 × 4
   .rownames .fittedPC1 .fittedPC2 .fittedPC3
   <chr>          <dbl>      <dbl>      <dbl>
 1 1            -0.366      0.366     -0.123 
 2 2            -0.950      0.149     -0.0847
 3 3             1.34      -0.0633    -0.0197
 4 4             2.09       0.129      0.193 
 5 5             0.0152     0.267      0.127 
 6 6            -0.269      0.0437     0.0952
 7 7            -0.0275     0.230     -0.0128
 8 8            -0.672      0.231      0.0905
 9 9             1.06      -0.0261    -0.0369
10 10           -1.37       0.182      0.0925
# … with 60 more rows

Here the results from using the standardized predictors are not that different from the previous results since the variables were already reported as z-scores to begin with. The variance accounted for by the principal components is comparable (within rounding); the standardization of the predictors does not change this. The actual principal components in the V (rotation) matrix are different because of the centering and scaling, but the interpretations are the same as when we used the decomposition based on the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix. Similarly, because of the centering and scaling, the PC scores are different.


Behind the Scenes

By standardizing the predictors, we are carrying out the SVD on the correlation matrix of the predictors rather than on the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix. To see this, recall that the correlation matrix of the predictors (\(\mathbf{R_X}\)) is based on the standardized predictors, namely,

\[ \mathbf{R_X} = \frac{1}{n-1} \big(\mathbf{X}^{\intercal}_{\mathrm{Standardized~Predictor}}\mathbf{X}_{\mathrm{Standardized~Predictor}}\big) \]

The \(1/(n-1)\) component is a scalar and is pulled out so that the decomposition is carried out on the \(\mathbf{X}^{\intercal}_{\mathrm{Standardized~Predictor}}\mathbf{X}_{\mathrm{Standardized~Predictor}}\) matrix:

\[ \frac{1}{n-1} \big(\mathbf{X}^{\intercal}_{\mathrm{Standardized~Predictor}}\mathbf{X}_{\mathrm{Standardized~Predictor}}\big) = \frac{1}{n-1}\mathbf{U}\mathbf{D}\mathbf{V}^{\intercal} \]

Similarly, we can also decompose the covariance matrix of the predictors (\(\boldsymbol\Sigma_{\mathbf{X}}\)), which is based on the centered predictors,

\[ \boldsymbol\Sigma_{\mathbf{X}} = \frac{1}{n-1} \big( \mathbf{X}^{\intercal}_{\mathrm{Centered~Predictor}}\mathbf{X}_{\mathrm{Centered~Predictor}}\big) \]

In this case, the decomposition is carried out on the \(\mathbf{X}^{\intercal}_{\mathrm{Centered~Predictor}}\mathbf{X}_{\mathrm{Centered~Predictor}}\) matrix. To do this with prcomp(), we would set the scale= argument to FALSE in the scale() function while leaving center=TRUE (which is what we did in the first prcomp() fit above).
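A brief sketch (using the objects created above) makes these connections concrete: the eigenvalues of the predictors' correlation matrix match the squared standard deviations from the PCA on the standardized predictors, and the eigenvalues of their covariance matrix match the squared standard deviations from the PCA on the centered predictors.

# Eigenvalues of the correlation matrix match the variances (squared sdevs)
# from the PCA of the standardized predictors
R_x = cor(eeo[ , c("faculty", "peer", "school")])
eigen(R_x)$values
svd_pca_z$sdev ^ 2

# Eigenvalues of the covariance matrix match the variances from the PCA
# of the centered (but unscaled) predictors
S_x = cov(eeo[ , c("faculty", "peer", "school")])
eigen(S_x)$values
svd_pca$sdev ^ 2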


Using the Principal Components in a Regression Model

Remember, we undertook the PCA because of the collinearity in the original predictors. Rather than using the original predictor values in our regression, we can use the scores from the PCA. Since we created the principal components to be orthogonal, this should alleviate any collinearity problems.

# Add the PC scores to the original data
eeo2 = eeo %>%
  bind_cols(pc_scores)

# View data
eeo2
# A tibble: 70 × 8
   achievement faculty    peer  school .rownames .fittedPC1 .fittedPC2
         <dbl>   <dbl>   <dbl>   <dbl> <chr>          <dbl>      <dbl>
 1      -0.431   0.608  0.0351  0.166  1            -0.366      0.366 
 2       0.800   0.794  0.479   0.534  2            -0.950      0.149 
 3      -0.925  -0.826 -0.620  -0.786  3             1.34      -0.0633
 4      -2.19   -1.25  -1.22   -1.04   4             2.09       0.129 
 5      -2.85    0.174 -0.185   0.142  5             0.0152     0.267 
 6      -0.662   0.202  0.128   0.273  6            -0.269      0.0437
 7       2.64    0.242 -0.0902  0.0497 7            -0.0275     0.230 
 8       2.36    0.594  0.218   0.519  8            -0.672      0.231 
 9      -0.913  -0.616 -0.490  -0.632  9             1.06      -0.0261
10       0.594   0.994  0.622   0.934  10           -1.37       0.182 
# … with 60 more rows, and 1 more variable: .fittedPC3 <dbl>
# Fit model using PC scores
lm.pc = lm(achievement ~ 1 + .fittedPC1 + .fittedPC2 + .fittedPC3, data = eeo2)

# Check for collinearity -- correlations
eeo2 %>%
  select(starts_with(".fitted")) %>%
  correlate()
# A tibble: 3 × 4
  term       .fittedPC1 .fittedPC2 .fittedPC3
  <chr>           <dbl>      <dbl>      <dbl>
1 .fittedPC1  NA         -4.02e-15   4.20e-16
2 .fittedPC2  -4.02e-15  NA          6.68e-16
3 .fittedPC3   4.20e-16   6.68e-16  NA       
# Check for collinearity -- VIF
car::vif(lm.pc)
.fittedPC1 .fittedPC2 .fittedPC3 
         1          1          1 

Examining some of the collinearity diagnostics, we see that the predictors in this model are completely uncorrelated and the VIF values are 1, indicating that the SEs for these coefficients are exactly as large as they would be if the predictors were independent (which they are). Looking at the model- and coefficient-level output:

# Model-level output
glance(lm.pc)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl>
1     0.206         0.170  2.07      5.72 0.00153     3  -148.  306.
# … with 4 more variables: BIC <dbl>, deviance <dbl>,
#   df.residual <int>, nobs <int>
# Coefficient-level output
tidy(lm.pc, conf.int = TRUE)
# A tibble: 4 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   0.0192     0.247    0.0776 0.938      -0.475     0.513
2 .fittedPC1   -0.568      0.145   -3.91   0.000217   -0.857    -0.278
3 .fittedPC2   -0.881      1.25    -0.708  0.482      -3.37      1.61 
4 .fittedPC3   -3.22       2.79    -1.15   0.253      -8.80      2.35 

The model-level output from this model is exactly the same as the model-level output from the model fitted with the original predictors. The coefficient-level output is where we start to see differences. Because there is no collinearity in this model, we now see a statistically significant result at the coefficient level (for PC1). The other thing to remember is that each principal component is a composite variable composed of all three predictors and may have a more substantive interpretation. We need to use those substantive interpretations when interpreting the coefficients.

Again, these interpretations may not be satisfactory—it all depends on whether the loadings offer a reasonable interpretation of the composite variable.


Dimension Reduction

One of the most useful features of PCA, aside from alleviating collinearity issues, is that it can reduce the size of the predictor space in a regression model, thereby reducing the complexity of the model and improving statistical power. To do this, we consider the variance accounted for by each of the principal components.

In our example, the first principal component accounted for most of the variance in the original predictor space (approximately 98%). The other two principal components did not account for that much variation, which suggests that they may not be necessary. We can capture most of the variation in the original three predictors by just using the first principal component.

# Fit reduced model using PC1
lm.pc.2 = lm(achievement ~ 1 + .fittedPC1, data = eeo2)

# Model-level output
glance(lm.pc.2)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl>
1     0.184         0.172  2.07      15.4 0.000209     1  -149.  304.
# … with 4 more variables: BIC <dbl>, deviance <dbl>,
#   df.residual <int>, nobs <int>
# Coefficient-level output
tidy(lm.pc.2, conf.int = TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   0.0192     0.247    0.0777 0.938      -0.474     0.512
2 .fittedPC1   -0.568      0.145   -3.92   0.000209   -0.857    -0.279

In this model, the \(R^2\) value is slightly smaller (0.184 rather than 0.206) since the first principal component did not account for all of the variance in the original predictors. However, this difference is negligible, and the model still explains a statistically significant amount of variation in achievement scores, \(F(1,68)=15.4\), \(p=0.0002\).

From the coefficient-level output, we see that the magnitudes of the intercept and slope are comparable to those in the model where all three composites were used. The standard error and p-value for the composite in this model are slightly smaller than when we used all of the predictors. This reflects the additional two degrees of freedom in the error term gained by reducing the number of predictors. (In this example, the differences are negligible.)
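If we want to see what the reduced model implies about the original standardized predictors, we can translate the composite's coefficient back through the rotation matrix. This is a small sketch using the objects created above: because the PC scores are the standardized predictors postmultiplied by the columns of the rotation matrix, the implied coefficients on the standardized predictors are the first principal component multiplied by its estimated slope.

# Implied coefficients on the standardized predictors from the reduced model:
# first principal component (loadings) multiplied by the PC1 slope
b1 = coef(lm.pc.2)[[".fittedPC1"]]
svd_pca_z$rotation[ , 1] * b1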


References

Chatterjee, S., & Hadi, A. S. (2012). Regression analysis by example. Wiley.
Coleman, J. S., Campbell, E. Q., Hobson, C. J., McPartland, J., Mood, A. M., Weinfield, F. D., & York, R. L. (1966). Equality of educational opportunity. U.S. Government Printing Office.
Mosteller, F., & Moynihan, D. P. (1972). On equality of educational opportunity. Random House.