đź“ť Principal Components Analysis via Singular Value Decomposition
In this set of notes, we will give a brief introduction to principal components analysis via singular value decomposition. We will continue to use the equal-education-opportunity.csv data from Chatterjee & Hadi (2012) to evaluate the availability of equal educational opportunity in public education. The goal of the regression analysis is to examine whether the level of school facilities was an important predictor of student achievement after accounting for the variation in faculty credentials and peer influence.
A script file for the analyses in these notes is also available:
# Load libraries
library(broom)
library(corrr)
library(tidyverse)
library(tidymodels)
library(patchwork)
# Read in data
eeo = read_csv("https://raw.githubusercontent.com/zief0002/pensive-giraffe/main/data/equal-education-opportunity.csv")
head(eeo)
# Source the residual_plots() function from the script file
source("../scripts/residual_plots.R")
Singular Value Decomposition
Another decomposition method that creates orthogonal unit vectors (i.e., an orthonormal basis) is singular value decomposition (SVD). This method decomposes a matrix A into the product of three matrices, namely:
\[ \mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{V}^{\intercal} \]
where U and V are orthogonal matrices and D is a diagonal matrix.
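As a quick illustration of these properties, here is a small sketch that decomposes an arbitrary, made-up matrix and checks that the U and V returned by svd() are orthogonal and that the three matrices reproduce the original matrix.

# Arbitrary 3x3 matrix (made up for illustration only)
A = matrix(c(2, 1, 0, 0, 3, 1, 1, 0, 4), nrow = 3)

# Singular value decomposition
decomp = svd(A)

# U and V are orthogonal: their crossproducts are identity matrices
round(t(decomp$u) %*% decomp$u, 10)
round(t(decomp$v) %*% decomp$v, 10)

# U D V' reproduces A
decomp$u %*% diag(decomp$d) %*% t(decomp$v)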
PCA using SVD: Matrix Algebra
To carry out a principal components analysis, we need to use SVD to decompose the matrix \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\), where \(\mathbf{X}_{\mathrm{Predictor}}\) is a matrix of the predictors being used in the PCA. Below, we create the \(\mathbf{X}_{\mathrm{Predictor}}\) matrix and then use the svd() function to decompose the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix using singular value decomposition.
# Create matrix of predictors
X_p = as.matrix(eeo[ , c("faculty", "peer", "school")])
# SVD decomposition
sv_decomp = svd(t(X_p) %*% X_p)
# View results
sv_decomp
$d
[1] 209.3295610 2.7303093 0.5833274
$u
[,1] [,2] [,3]
[1,] -0.6174223 0.6698093 -0.4124865
[2,] -0.5243779 -0.7413256 -0.4188844
[3,] -0.5863595 -0.0423298 0.8089442
$v
[,1] [,2] [,3]
[1,] -0.6174223 0.6698093 -0.4124865
[2,] -0.5243779 -0.7413256 -0.4188844
[3,] -0.5863595 -0.0423298 0.8089442
\[ \begin{split} \mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}} &= \mathbf{U}\mathbf{D}\mathbf{V}^{\intercal} \\[1em] \begin{bmatrix}81.123 & 66.518 & 75.512 \\ 66.518 & 59.163 & 64.251 \\ 75.512 & 64.251 & 72.358\end{bmatrix} &= \begin{bmatrix}-0.617 & 0.670 & -0.412 \\ -0.524 & -0.741 & -0.419 \\ -0.586 & -0.042 & 0.809\end{bmatrix} \begin{bmatrix}209.330 & 0 & 0 \\ 0 & 2.730 & 0 \\ 0 & 0 & 0.583 \end{bmatrix} \begin{bmatrix}-0.617 & -0.524 & -0.586 \\ 0.670 & -0.741 & -0.042 \\ -0.412 & -0.419 & 0.809 \end{bmatrix} \end{split} \]
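We can verify this numerically by multiplying the three matrices returned by svd() back together; within rounding, the product reproduces the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix.

# Check that U D V' reproduces the decomposed matrix
sv_decomp$u %*% diag(sv_decomp$d) %*% t(sv_decomp$v)

# Compare to the original matrix product
t(X_p) %*% X_p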
Understanding What the Matrix Algebra is Doing
Mathematically, since any matrix can be decomposed using SVD, we can also decompose \(\mathbf{X}_{\mathrm{Predictor}} = \mathbf{U}\mathbf{D}\mathbf{V}^{\intercal}\). Then we can write the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix as:
\[ \mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}} = (\mathbf{U}\mathbf{D}\mathbf{V}^{\intercal})^{\intercal} (\mathbf{U}\mathbf{D}\mathbf{V}^{\intercal}) \]
Re-expressing this we get:
\[ \mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}} = \mathbf{V}\mathbf{D}^{\intercal}\mathbf{U}^{\intercal}\mathbf{U}\mathbf{D}\mathbf{V}^{\intercal} \]
Since U is an orthogonal matrix, \(\mathbf{U}^{\intercal}\mathbf{U} = \mathbf{I}\), and since D is a diagonal matrix, \(\mathbf{D}^{\intercal}\mathbf{D} = \mathbf{D}^2\). Reducing this expression gives:
\[ \mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}} = \mathbf{V}\mathbf{D}^2\mathbf{V}^{\intercal} \]
The matrices V and \(\mathbf{V}^{\intercal}\) are both orthogonal basis matrices that ultimately act to change the coordinate system by rotating the original basis vectors used in the predictor space. The \(\mathbf{D}^2\) matrix diagonalizes the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix, which amounts to finding the major axes of the data ellipse along which the data vary.
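To see this relationship numerically, we can decompose the \(\mathbf{X}_{\mathrm{Predictor}}\) matrix itself. A minimal check, using the X_p matrix created above (before it is mean centered): the squared singular values from this decomposition should match the d values we obtained from decomposing \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\), and its V matrix should match as well (up to sign).

# Decompose X_p itself
svd_Xp = svd(X_p)

# Squared singular values should match sv_decomp$d
svd_Xp$d ^ 2

# V matrix should match sv_decomp$v (up to sign)
svd_Xp$v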
Using the SVD Decomposition as PCA
As was pointed out in the previous section, the important matrices from the SVD are the V matrix (rotation matrix) and the D matrix (diagonalization matrix). The columns of the V matrix are the principal components:
\[ \mathbf{V} = \begin{bmatrix}-0.617 & 0.670 & -0.412 \\ -0.524 & -0.741 & -0.419 \\ -0.586 & -0.042 & 0.809 \end{bmatrix} \]
Note these are the same principal components we obtained using the eigendecomposition. The values in the D matrix are mathematically related to the eigenvalues. Because we based the SVD on the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix, the diagonal elements of D are the eigenvalues! That is,
\[ \lambda_i = d_{ii} \]
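Since the notes on the eigendecomposition produced these same quantities, a quick check is to apply eigen() to the same matrix; the eigenvalues should equal the d values and the eigenvectors should match the columns of V, possibly with flipped signs.

# Eigendecomposition of X'X for comparison
eigen(t(X_p) %*% X_p)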
The eigenvalues (which are the variances) can then be used to compute the proportion of variance for each of the principal components.
# Compute proportion of variance
sv_decomp$d / sum(sv_decomp$d)
[1] 0.984416916 0.012839862 0.002743222
The principal component scores can be obtained by postmultiplying the mean centered predictor matrix by the V matrix.
# Mean center each predictor
X_p[, 1] = X_p[, 1] - mean(X_p[, 1])
X_p[, 2] = X_p[, 2] - mean(X_p[, 2])
X_p[, 3] = X_p[, 3] - mean(X_p[, 3])
# Compute PC scores
pc_scores = X_p %*% sv_decomp$v
head(pc_scores)
[,1] [,2] [,3]
[1,] -0.41777127 0.37690208 -0.11724716
[2,] -0.98071768 0.15636966 -0.08255266
[3,] 1.36960233 -0.05831171 -0.02181284
[4,] 2.09547335 0.10933210 0.19860746
[5,] -0.02027426 0.25039535 0.13486066
[6,] -0.27859048 0.03203317 0.09791201
Using the prcomp() Function to Carry out PCA using SVD
In practice, we will use R functions (not the matrix algebra) to carry out a PCA using SVD. The prcomp() function carries out PCA using SVD decomposition. Different elements of the prcomp() object can then be accessed to obtain output from the PCA.
# Fit the PCA using SVD decomposition
svd_pca = eeo %>%
  select(faculty, peer, school) %>%
  scale(center = TRUE, scale = FALSE) %>% # center data
  prcomp()
# View standard deviations and rotation matrix (eigenvector matrix)
svd_pca
Standard deviations (1, .., p=3):
[1] 1.7401965 0.1989041 0.0908621
Rotation (n x k) = (3 x 3):
PC1 PC2 PC3
faculty -0.6173237 0.67025631 -0.4119077
peer -0.5241853 -0.74086406 -0.4199407
school -0.5866355 -0.04332342 0.8086915
Again, to compute the variances we can square the standard deviations output by the function and then use those variances to compute the variance accounted for by the principal components.
# Compute variances
var_pc = svd_pca[[1]] ^ 2
var_pc
[1] 3.028283805 0.039562836 0.008255921
# Compute variance accounted for
var_pc / sum(var_pc)
[1] 0.98445476 0.01286135 0.00268389
We can also obtain the scores on each principal component for each observation in the sample using the augment() function from the {broom} package.
# Obtain PC scores
augment(svd_pca)
In general, it is more efficient to use singular value decomposition than eigendecomposition when carrying out a PCA. As such, the use of prcomp() rather than princomp() is recommended in practice.
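For a quick comparison (a sketch), princomp() carries out the eigendecomposition-based PCA; its loadings should match the prcomp() rotation up to sign, although its standard deviations differ slightly because it uses a divisor of n rather than n - 1.

# Eigendecomposition-based PCA for comparison
princomp(eeo[ , c("faculty", "peer", "school")])$loadings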
Scaling the Predictors
In practice, it is important to scale all the predictors used in the PCA. This is especially true when the variables are measured in different metrics or have varying degrees of magnitude; without scaling, variables with large magnitudes will often dominate the PCA. Scaling is also helpful when the predictors are measured on qualitatively different scales (e.g., one is measured in dollars and another in years of education).
Recall that the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix for a set of standardized predictors is, up to the scaling factor \(\frac{1}{n-1}\), the correlation matrix of the predictors. Thus, the SVD is effectively being carried out on the correlation matrix rather than the raw \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix.
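A quick numerical check of this idea (a sketch; the object X_z is created just for this check): standardize the three predictors and compare the scaled cross-product to the correlation matrix.

# Standardize the predictors
X_z = scale(eeo[ , c("faculty", "peer", "school")])

# (1/(n-1)) * Z'Z ...
t(X_z) %*% X_z / (nrow(X_z) - 1)

# ...is the correlation matrix of the predictors
cor(eeo[ , c("faculty", "peer", "school")])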
While it isn’t necessary to also center the predictors, it is common to simply standardize the set of predictors being used in the PCA (i.e., convert them to z-scores), thus both centering and scaling them. To do this, we will pipe the selected predictors into the scale() function prior to piping into the prcomp() function.
# Fit the PCA using SVD decomposition on standardized predictors
svd_pca_z = eeo %>%
  select(faculty, peer, school) %>%
  scale(center = TRUE, scale = TRUE) %>%
  prcomp()
# View standard deviations and rotation matrix (eigenvector matrix)
svd_pca_z
Standard deviations (1, .., p=3):
[1] 1.71813654 0.20011873 0.08921511
Rotation (n x k) = (3 x 3):
PC1 PC2 PC3
faculty -0.5761385 0.67939712 -0.4544052
peer -0.5754361 -0.73197527 -0.3648089
school -0.5804634 0.05130072 0.8126687
# tidy version of the rotation matrix (good for graphing)
svd_pca_z %>%
  tidy(matrix = "rotation")
# View sds, variance accounted for
svd_pca_z %>%
  tidy(matrix = "eigenvalues")
# Obtain PC scores
pc_scores = augment(svd_pca_z)
pc_scores
Here the results from using the standardized predictors are not that different from the previous results since the variables were already reported as z-scores to begin with. The variance accounted for by the principal components is comparable (within rounding); the standardization of the predictors does not change this. The actual principal components in the V (rotation) matrix are different because of the centering and scaling, but the interpretations are the same as when we used the decomposition based on the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix. Similarly, because of the centering and scaling, the PC scores are different.
Behind the Scenes
By standardizing the predictors, we are carrying out the SVD on the correlation matrix of the predictors rather than on the \(\mathbf{X}^{\intercal}_{\mathrm{Predictor}}\mathbf{X}_{\mathrm{Predictor}}\) matrix. To see this, recall that the correlation matrix of the predictors (\(\mathbf{R_X}\)) is based on the standardized predictors, namely,
\[ \mathbf{R_X} = \frac{1}{n-1} \big(\mathbf{X}^{\intercal}_{\mathrm{Standardized~Predictor}}\mathbf{X}_{\mathrm{Standardized~Predictor}}\big) \]
The \(1/(n-1)\) component is a scalar and is pulled out so that the decomposition is carried out on the \(\mathbf{X}^{\intercal}_{\mathrm{Standardized~Predictor}}\mathbf{X}_{\mathrm{Standardized~Predictor}}\) matrix:
\[ \frac{1}{n-1} \big(\mathbf{X}^{\intercal}_{\mathrm{Standardized~Predictor}}\mathbf{X}_{\mathrm{Standardized~Predictor}}\big) = \frac{1}{n-1}\mathbf{U}\mathbf{D}\mathbf{V}^{\intercal} \]
Similarly, we can also decompose the covariance matrix of the predictors (\(\boldsymbol\Sigma_{\mathbf{X}}\)), which is based on the centered predictors,
\[ \boldsymbol\Sigma_{\mathbf{X}} = \frac{1}{n-1} \big( \mathbf{X}^{\intercal}_{\mathrm{Centered~Predictor}}\mathbf{X}_{\mathrm{Centered~Predictor}}\big) \]
In this case, the decomposition is carried out on the \(\mathbf{X}^{\intercal}_{\mathrm{Centered~Predictor}}\mathbf{X}_{\mathrm{Centered~Predictor}}\) matrix. To do this using prcomp(), we would change the scale= argument to FALSE in the scale() function, while still leaving center=TRUE.
This implies that you can also carry out PCA on summaries of the predictors (rather than the raw data) by decomposing either the covariance matrix of the predictors or the correlation matrix of the predictors. This can be useful, for example, when trying to reproduce the results of a PCA from a published paper, in which authors will often report summaries of the data used (e.g., correlation matrices) but not the raw data.
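For example, here is a minimal sketch of this idea using only the correlation matrix of the three predictors (computed here with cor(), but it could just as well come from a published table): the SVD of that matrix reproduces the same principal components as svd_pca_z, up to sign, along with the proportion of variance each accounts for. The PC scores, however, cannot be recovered without the raw data.

# PCA from a summary of the data: decompose the correlation matrix directly
R_X = cor(eeo[ , c("faculty", "peer", "school")])
sv_R = svd(R_X)

# Principal components (same as the svd_pca_z rotation, up to sign)
sv_R$v

# Variances of the components (eigenvalues of R_X) and proportion of variance
sv_R$d
sv_R$d / sum(sv_R$d)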
Using the Principal Components in a Regression Model
Remember, we undertook the PCA because of the collinearity in the original predictors. Rather than using the original predictor values in our regression, we can use the scores from the PCA. Since we created the principal components to be orthogonal, this should alleviate any collinearity problems.
# Add the PC scores to the original data
eeo2 = eeo %>%
  bind_cols(pc_scores)
# View data
eeo2
# Fit model using PC scores
lm.pc = lm(achievement ~ 1 + .fittedPC1 + .fittedPC2 + .fittedPC3, data = eeo2)
# Check for collinearity -- correlations
eeo2 %>%
  select(starts_with(".fitted")) %>%
  correlate()
# Check for collinearity -- VIF
car::vif(lm.pc)
.fittedPC1 .fittedPC2 .fittedPC3
1 1 1
Examining these collinearity diagnostics, we see that the predictors in this model are completely uncorrelated and that the VIF values are all 1, indicating that the SEs for these coefficients are exactly as large as they would be if the predictors were independent (which they are). Looking at the model- and coefficient-level output:
# Model-level output
glance(lm.pc)
# Coefficient-level output
tidy(lm.pc, conf.int = TRUE)
The model-level output from this model is exactly the same as the model-level output from the model fitted with the original predictors (a quick check of this appears after the list below). The coefficient-level output is where we start to see differences. Because there is no collinearity in this model, we now see a statistically significant result at the coefficient level (for PC1). The other thing to remember here is that each principal component is a composite variable composed of all three predictors and perhaps has a more substantive interpretation. We need to use those substantive interpretations when interpreting the coefficients:
- The intercept is the predicted average achievement for cases where all the composite variables are 0.
- The positive slope associated with the first composite indicates that higher values on this composite are associated with higher achievement, on average. In other words, higher values on all three predictors are associated with higher achievement, on average.
- The negative slope associated with the second composite indicates that higher values on this composite are associated with lower achievement, on average. In other words, larger contrasts between faculty credentials and peer influence are associated with lower achievement, on average.
- The positive slope associated with the third composite indicates that higher values on this composite are associated with higher achievement, on average. In other words, larger contrasts between school facilities and faculty credentials/peer influence are associated with higher achievement, on average.
Again, these interpretations may not be satisfactory—it all depends on whether the loadings offer a reasonable interpretation of the composite variable.
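To verify the earlier claim that the model-level output matches that from the model fitted with the original predictors, here is a quick check (a sketch, assuming the original model regressed achievement on faculty, peer, and school):

# Model with the original predictors (for comparison)
lm.orig = lm(achievement ~ 1 + faculty + peer + school, data = eeo)

# Model-level output should match that of lm.pc
glance(lm.orig)
glance(lm.pc)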
Dimension Reduction
One of the most useful qualities of a PCA, aside from fixing collinearity issues, is that it can reduce the size of the predictor space in a regression model, thereby reducing the complexity of the model and improving statistical power. To do this, we consider the variance accounted for by each of the principal components.
In our example, the first principal component accounted for most of the variance in the original predictor space (approximately 98%). The other two principal components did not account for that much variation, which suggests that they may not be necessary. We can capture most of the variation in the original three predictors by just using the first principal component.
# Fit reduced model using PC1
lm.pc.2 = lm(achievement ~ 1 + .fittedPC1, data = eeo2)
# Model-level output
glance(lm.pc.2)
# Coefficient-level output
tidy(lm.pc.2, conf.int = TRUE)
In this model, the model \(R^2\) value is slightly smaller (0.183 rather than 0.206) since the first principal component did not account for all of the variance in the original predictors. However, this difference is negligible, and the model still explains a statistically significant amount of variation in achievement scores, \(F(1,68)=15.25\), \(p=0.0002\).
From the coefficient-level output, we see that the magnitude of the intercept and slope are comparable to those in the model where all three composites were used. The standard error and p-value for the composite in this model are slightly smaller than when we used all of the predictors. This reflects the two additional degrees of freedom in the error term that we gained by reducing the number of predictors. (In this example, the differences are negligible.)
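Since the two models are nested, one way to quantify how little is lost by dropping the last two components (a quick check, not part of the original analysis) is a nested-model comparison:

# Compare the reduced model (PC1 only) to the full PC model
anova(lm.pc.2, lm.pc)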