Regression from Summary Measures
When we have raw data, we can fit a regression model using the lm() function and obtain model- and coefficient-level summaries using the glance() and tidy() functions, respectively. However, there are times when we might want to compute regression coefficients and standard errors but are not given the raw data. This is common, for example, in articles, which often report summaries rather than providing the raw data. In this set of notes, we will examine how we can fit a regression model to summaries of the data, rather than to the raw data itself.
So long as we have certain statistical information, we can compute the regression coefficients and standard errors without having the raw data. To do this, we need the sample size, means, standard deviations, and correlations between variables. For example, consider the following summary table:

Measure | 1. | 2. | 3. | 4. | 5. |
---|---|---|---|---|---|
1. Achievement | 50 (10) | --- | --- | --- | --- |
2. Ability | 0.737 | 100 (15) | --- | --- | --- |
3. Motivation | 0.255 | 0.205 | 50 (10) | --- | --- |
4. Previous Coursework | 0.615 | 0.498 | 0.375 | 4 (2) | --- |
5. Family Background | 0.417 | 0.417 | 0.190 | 0.372 | 0 (1) |
Note: Means (standard deviations) are presented on the main diagonal; correlations are presented below the diagonal.
Say we wanted to use this information to fit a regression model to predict variation in achievement using the other four predictors. Mathematically,
\[ \begin{split} \mathrm{Achievement}_i = &\beta_0 + \beta_1(\mathrm{Ability}_i) + \beta_2(\mathrm{Motivation}_i) + \\ &\beta_3(\mathrm{Coursework~}_i) + \beta_4(\mathrm{Family~Background}_i) + \epsilon_i \end{split} \]
To do this, we are going to create the correlation matrix, the vector of attribute means, and the vector of attribute standard deviations, and also set the sample size (n = 1000).
# Create correlation matrix
corrs = matrix(
  data = c(
    1.000, 0.737, 0.255, 0.615, 0.417,
    0.737, 1.000, 0.205, 0.498, 0.417,
    0.255, 0.205, 1.000, 0.375, 0.190,
    0.615, 0.498, 0.375, 1.000, 0.372,
    0.417, 0.417, 0.190, 0.372, 1.000
  ),
  nrow = 5
)
# View correlation matrix
corrs
[,1] [,2] [,3] [,4] [,5]
[1,] 1.000 0.737 0.255 0.615 0.417
[2,] 0.737 1.000 0.205 0.498 0.417
[3,] 0.255 0.205 1.000 0.375 0.190
[4,] 0.615 0.498 0.375 1.000 0.372
[5,] 0.417 0.417 0.190 0.372 1.000
# Create mean vector
means = c(50, 100, 50, 4, 0)

# Create sd vector
sds = c(10, 15, 10, 2, 1)

# Set sample size
n = 1000
Computing Regression Coefficients from Summary Data
To compute the coefficients from raw data, recall that we solve the normal equations based on:
\[ \mathbf{X}^\intercal\mathbf{Xb} = \mathbf{X}^\intercal\mathbf{y} \]
Recall that if we mean center the variables in a regression analysis, the coefficients for this set of predictors would be exactly the same as if we used the raw predictors. (The intercept would differ, but ignore that for now.) That is, if we solved the normal equations based on
\[ \mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{X}_{\mathrm{Dev}}\mathbf{b} = \mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{y}_{\mathrm{Dev}} \]
the elements of b associated with the predictors in this set of equations would be the same as those from solving for b in \(\mathbf{X}^\intercal\mathbf{Xb} = \mathbf{X}^\intercal\mathbf{y}\).
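This invariance is easy to verify empirically. The following minimal sketch (using R's built-in mtcars data purely for illustration; these variables are not part of our example) fits the same model to the raw and to the mean-centered variables and compares the coefficients:

# Fit the model to the raw data
lm_raw = lm(mpg ~ 1 + hp + wt, data = mtcars)

# Mean-center the outcome and the predictors, then refit
mtcars_dev = data.frame(scale(mtcars[ , c("mpg", "hp", "wt")], center = TRUE, scale = FALSE))
lm_dev = lm(mpg ~ 1 + hp + wt, data = mtcars_dev)

# The hp and wt coefficients are identical; only the intercepts differ
coef(lm_raw)
coef(lm_dev)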
So, if our goal is to find the coefficients associated with the predictors, it doesn’t matter if we use the raw or mean centered predictors. This is useful since there is a direct relationship between the \(\mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{X}_{\mathrm{Dev}}\) matrix and the variance-covariance matrix of the raw predictors:
\[ \mathrm{Cov}(X,X) = \frac{\mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{X}_{\mathrm{Dev}}}{n-1} \]
Similarly, there is a direct relationship between the \(\mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{y}_{\mathrm{Dev}}\) matrix and the vector of covariances between the raw predictors and the raw outcome:
\[ \mathrm{Cov}(X,y) = \frac{\mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{y}_{\mathrm{Dev}}}{n-1} \]
This means we can write our second set of normal equations (using the mean centered variables) as:
\[ \begin{split} \mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{X}_{\mathrm{Dev}}\mathbf{b} &= \mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{y}_{\mathrm{Dev}} \\[1em] \mathrm{Cov}(X,X) (n-1)\mathbf{b} &= \mathrm{Cov}(X,y)(n-1) \\[1em] \mathrm{Cov}(X,X) \mathbf{b} &= \mathrm{Cov}(X,y) \end{split} \]
Which we can use to solve for b:
\[ \mathbf{b} = \mathrm{Cov}(X,X)^{-1}\mathrm{Cov}(X,y) \]
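As a sanity check on this result, here is a small sketch (again using mtcars purely as a stand-in for raw data) confirming that the covariance-based solution reproduces the slopes that lm() estimates from the raw data:

# Outcome vector and predictor matrix from the raw data
y = mtcars$mpg
X = as.matrix(mtcars[ , c("hp", "wt")])

# Slopes from the variance-covariance matrices
solve(cov(X)) %*% cov(X, y)

# Slopes from lm(), for comparison
coef(lm(mpg ~ 1 + hp + wt, data = mtcars))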
Going From the Correlations to Covariances
The summary measures we have give correlations, not covariances. However, it is quite easy to convert a correlation matrix to a covariance matrix. From the chapter Statistical Application: SSCP, Variance–Covariance, and Correlation Matrices in Matrix Algebra for Educational Scientists, we know that:
\[ \mathbf{R} = \mathbf{S} \boldsymbol\Sigma \mathbf{S} \]
where R is the correlation matrix, \(\boldsymbol\Sigma\) is the covariance matrix, and S is a diagonal scaling matrix whose diagonal elements are the reciprocals of the standard deviations of the variables in the covariance matrix. Re-arranging this:
\[ \boldsymbol\Sigma = \mathbf{S}^{-1} \mathbf{R} \mathbf{S}^{-1} \]
Recall that the inverse of a diagonal matrix is another diagonal matrix whose diagonal elements are the reciprocals of the original. Thus the diagonal elements of the inverse of our scaling matrix will be the standard deviations of the variables. Using this formula, we can now convert the given correlation matrix to a covariance matrix.
# Compute the scaling matrix S^(-1)
S_inv = diag(sds)
S_inv
[,1] [,2] [,3] [,4] [,5]
[1,] 10 0 0 0 0
[2,] 0 15 0 0 0
[3,] 0 0 10 0 0
[4,] 0 0 0 2 0
[5,] 0 0 0 0 1
# Compute covariance matrix
covs = S_inv %*% corrs %*% S_inv
covs
[,1] [,2] [,3] [,4] [,5]
[1,] 100.00 110.550 25.50 12.300 4.170
[2,] 110.55 225.000 30.75 14.940 6.255
[3,] 25.50 30.750 100.00 7.500 1.900
[4,] 12.30 14.940 7.50 4.000 0.744
[5,] 4.17 6.255 1.90 0.744 1.000
Adding in the row and column names:
  | Achievement | Ability | Motivation | Previous Coursework | Family Background |
---|---|---|---|---|---|
Achievement | 100.00 | 110.550 | 25.50 | 12.300 | 4.170 |
Ability | 110.55 | 225.000 | 30.75 | 14.940 | 6.255 |
Motivation | 25.50 | 30.750 | 100.00 | 7.500 | 1.900 |
Previous Coursework | 12.30 | 14.940 | 7.50 | 4.000 | 0.744 |
Family Background | 4.17 | 6.255 | 1.90 | 0.744 | 1.000 |
Remember, the diagonal elements in the covariance matrix are variances of each variable and the off-diagonal elements are the covariances between two variables. For example, the first diagonal element is 100, which is the variance of the achievement variable. The remaining elements in the first column indicate the covariances between achievement and ability, achievement and motivation, achievement and previous coursework, and achievement and family background.
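For instance, each covariance can be recovered directly from the reported summaries as the correlation multiplied by the two standard deviations:

\[ \mathrm{Cov}(\mathrm{Achievement}, \mathrm{Ability}) = 0.737 \times 10 \times 15 = 110.55 \]

which matches the value in the first column of the covariance matrix above.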
Finding the Predictor Coefficients
To compute the coefficients for the predictors, we need to obtain two things:
- Cov(X, X): The variance-covariance matrix of the predictors, and
- Cov(X, y): The vector that contains the covariances between the outcome and each predictor. (Note: This vector does not include the variance of the outcome, only the covariances with the predictors.)
In our example, the elements in the first column (or row) of the variance-covariance matrix other than the variance of the outcome are the values in the vector Cov(X, y). The elements in the other rows/columns make up the elements of the Cov(X, X) matrix.
  | Achievement | Ability | Motivation | Previous Coursework | Family Background |
---|---|---|---|---|---|
Achievement | 100.00 | 110.550 | 25.50 | 12.300 | 4.170 |
Ability | 110.55 | 225.000 | 30.75 | 14.940 | 6.255 |
Motivation | 25.50 | 30.750 | 100.00 | 7.500 | 1.900 |
Previous Coursework | 12.30 | 14.940 | 7.50 | 4.000 | 0.744 |
Family Background | 4.17 | 6.255 | 1.90 | 0.744 | 1.000 |
Here the covariances between achievement and each predictor (the first column, excluding the variance of achievement) compose Cov(X, y), and the remaining 4 × 4 block of predictor variances and covariances constitutes Cov(X, X). We can create this vector and matrix using indexing.
# Create Cov(X, y)
cov_xy = covs[-1 , 1]
cov_xy
[1] 110.55 25.50 12.30 4.17
# Create Cov(X, X)
cov_xx = covs[-1 , -1]
cov_xx
[,1] [,2] [,3] [,4]
[1,] 225.000 30.75 14.940 6.255
[2,] 30.750 100.00 7.500 1.900
[3,] 14.940 7.50 4.000 0.744
[4,] 6.255 1.90 0.744 1.000
Then we can use these to compute the predictor coefficients.
# Compute predictor coefficients
b = solve(cov_xx) %*% cov_xy
b
[,1]
[1,] 0.36737330
[2,] 0.01257721
[3,] 1.55001342
[4,] 0.69497334
Thus the coefficient estimates are:
- \(\hat\beta_{\mathrm{Ability}} = 0.367\)
- \(\hat\beta_{\mathrm{Motivation}} = 0.013\)
- \(\hat\beta_{\mathrm{Previous~Coursework}} = 1.550\)
- \(\hat\beta_{\mathrm{Family~Background}} = 0.695\)
Computing the Intercept
To compute the intercept, we can use the following:
\[ \hat\beta_0 = \bar{y} - \bar{\mathbf{x}}^\intercal\mathbf{b} \]
where \(\bar{y}\) is the mean of the outcome, \(\bar{\mathbf{x}}\) is the vector of predictor means, and b is the associated vector of predictor coefficients.
In our example,
# Compute intercept
b_0 = means[1] - t(means[2:5]) %*% b
b_0
[,1]
[1,] 6.433756
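Substituting the summary means and the (rounded) slope estimates into the intercept formula confirms this value:

\[ \hat\beta_0 = 50 - \big(100(0.3674) + 50(0.0126) + 4(1.5500) + 0(0.6950)\big) = 50 - 43.57 \approx 6.43 \]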
Thus the fitted equation for the unstandardized model is:
\[ \begin{split} \widehat{\mathrm{Achievement}_i} = &6.434 + 0.367(\mathrm{Ability}_i) + 0.013(\mathrm{Motivation}_i) + \\ &1.550(\mathrm{Coursework~}_i) + 0.695(\mathrm{Family~Background}_i) \end{split} \]
Standard Errors for the Coefficients
Using the raw data, we estimated the standard error for the coefficients by taking the square root of the diagonal elements of the variance-covariance matrix of the coefficients, where the variance-covariance matrix of the coefficients was computed as:
\[ \hat\sigma^2_{e} (\mathbf{X}^\intercal\mathbf{X})^{-1} \]
We can use the summary values we have to also obtain the \(\hat\sigma^2_{e}\) estimate and to compute the elements of the \(\mathbf{X}^\intercal\mathbf{X}\) matrix.
The estimate of the residual variance is found using:
\[ \hat\sigma^2_{e} = \hat\sigma^2_y (1 - R^2_{\mathrm{Adj}}) \]
where \(\hat\sigma^2_y\) is the variance of the outcome and \(R^2_{\mathrm{Adj}}\) is the Adjusted \(R^2\) value which is found using:
\[ R^2_{\mathrm{Adj}} = 1 - \frac{\mathrm{df}_{\mathrm{Total}}}{\mathrm{df}_{\mathrm{Residual}}}(1 -R^2) \]
To find the \(R^2\) value, we can invert the correlation matrix and compute one minus the reciprocal of the diagonal elements of this inverted matrix. The results are the \(R^2\) values for the given term as an outcome variable and all other terms in the correlation matrix as predictors. (This works because the jth diagonal element of the inverted correlation matrix is \(1/(1 - R^2_j)\), the variance inflation factor for variable j.)
# Invert correlation matrix
corr_inv = solve(corrs)

# Compute the R2 values
r2 = 1 - 1 / diag(corr_inv)
r2
[1] 0.6289704 0.5591756 0.1439290 0.4421816 0.2201437
Here the \(R^2\) value from regressing achievement on all the other variables in the correlation matrix (i.e., ability, motivation, previous coursework, and family background) is 0.629. We can then use this to compute the adjusted \(R^2\) and subsequently use that to compute \(\hat\sigma^2_e\).
# Compute adjusted R2
# df_Total = n - 1 = 999; df_Residual = n - 4 - 1 = 995
r2_adj = 1 - (999 / 995) * (1 - r2[1])
r2_adj
[1] 0.6274788
# Compute estimated residual variance
s2_e = sds[1]^2 * (1 - r2_adj)
s2_e
[1] 37.25212
Our estimate for \(\hat\sigma^2_{e}\) is 37.25.
Computing the \(\mathbf{X}^\intercal\mathbf{X}\) Matrix
The elements in the first row and column of the \(\mathbf{X}^\intercal\mathbf{X}\) matrix are functions of the sample size and the means of the predictor variables. Namely, the first element in the first row and column is n, and the remaining elements in that row and column are created as \(n\mathbf{M_x}\), where n is the sample size and \(\mathbf{M_x}\) is the vector of predictor means. The remaining elements constitute a submatrix defined as:
\[ (\mathbf{X}^\intercal\mathbf{X})_{\mathrm{Sub}} = (n-1)\mathrm{Cov}(X,X) + n(\mathbf{M_x})(\mathbf{M_x}^\intercal) \]
where n is the sample size, Cov(X,X) is the variance-covariance matrix of the predictors, and \(\mathbf{M_x}\) is again the vector of predictor means. We can then create the \(\mathbf{X}^\intercal\mathbf{X}\) matrix as:
\[ \require{color} \begin{split} \mathbf{X}^\intercal\mathbf{X} &= \begin{bmatrix}{\color[rgb]{0.044147,0.363972,0.636955}n} & {\color[rgb]{0.044147,0.363972,0.636955}n\mathbf{M_x}}\\{\color[rgb]{0.044147,0.363972,0.636955}n\mathbf{M_x}} & {\color[rgb]{0.748052,0.381230,0.589698}(\mathbf{X}^\intercal\mathbf{X})_{\mathrm{Sub}}}\end{bmatrix} \\[2ex] &= \begin{bmatrix}{\color[rgb]{0.044147,0.363972,0.636955}n} & {\color[rgb]{0.044147,0.363972,0.636955}n\overline{X1}} & {\color[rgb]{0.044147,0.363972,0.636955}n\overline{X2}} & {\color[rgb]{0.044147,0.363972,0.636955}n\overline{X3}} & {\color[rgb]{0.044147,0.363972,0.636955}\ldots} & {\color[rgb]{0.044147,0.363972,0.636955}n\overline{Xk}}\\ {\color[rgb]{0.044147,0.363972,0.636955}n\overline{X1}} & & & & & \\ {\color[rgb]{0.044147,0.363972,0.636955}n\overline{X2}} & & & & & \\ {\color[rgb]{0.044147,0.363972,0.636955}n\overline{X3}} & & & {\color[rgb]{0.748052,0.381230,0.589698}(\mathbf{X}^\intercal\mathbf{X})_{\mathrm{Sub}}} & & \\ {\color[rgb]{0.044147,0.363972,0.636955}\vdots} & & & & & \\ {\color[rgb]{0.044147,0.363972,0.636955}n\overline{Xk}} & & & & & \end{bmatrix} \end{split} \]
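As a quick numeric check of the submatrix formula, the first diagonal element of \((\mathbf{X}^\intercal\mathbf{X})_{\mathrm{Sub}}\) (the sum of squares for ability) should be

\[ (n-1)\hat\sigma^2_{\mathrm{Ability}} + n\bar{X}_{\mathrm{Ability}}^2 = 999(225) + 1000(100^2) = 10{,}224{,}775 \]

which matches the value computed in Step 1 below.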
To create this matrix using R, we will:
1. Compute the submatrix \((\mathbf{X}^\intercal\mathbf{X})_{\mathrm{Sub}}\).
2. Bind the \(n\mathbf{M_x}\) vector to the top of the submatrix.
3. Bind the vector that contains n and \(n\mathbf{M_x}\) to the left of the resulting matrix from Step 2.
The resulting matrix is the \(\mathbf{X}^\intercal\mathbf{X}\) matrix. For our example,
# Step 1: Create submatrix
sub_mat = 999 * cov_xx + 1000 * means[2:5] %*% t(means[2:5])
sub_mat
[,1] [,2] [,3] [,4]
[1,] 10224775.000 5030719.2 414925.060 6248.745
[2,] 5030719.250 2599900.0 207492.500 1898.100
[3,] 414925.060 207492.5 19996.000 743.256
[4,] 6248.745 1898.1 743.256 999.000
# Step 2: Bind n(M_x) to top of submatrix
mat_2 = rbind(1000 * means[2:5], sub_mat)
mat_2
[,1] [,2] [,3] [,4]
[1,] 100000.000 50000.0 4000.000 0.000
[2,] 10224775.000 5030719.2 414925.060 6248.745
[3,] 5030719.250 2599900.0 207492.500 1898.100
[4,] 414925.060 207492.5 19996.000 743.256
[5,] 6248.745 1898.1 743.256 999.000
# Step 3: Bind vector to left of Step 2 matrix
XtX = cbind(c(1000, 1000 * means[2:5]), mat_2)
XtX
[,1] [,2] [,3] [,4] [,5]
[1,] 1000 100000.000 50000.0 4000.000 0.000
[2,] 100000 10224775.000 5030719.2 414925.060 6248.745
[3,] 50000 5030719.250 2599900.0 207492.500 1898.100
[4,] 4000 414925.060 207492.5 19996.000 743.256
[5,] 0 6248.745 1898.1 743.256 999.000
We can then scale this matrix using our previous estimate of the residual variance to obtain the variance-covariance matrix for the coefficients and compute the coefficient standard errors.
# Compute var-cov matrix of b
cov_b = s2_e * solve(XtX)
cov_b
[,1] [,2] [,3] [,4] [,5]
[1,] 2.86183238 -0.021078176856 -0.018522600965 0.052341868 0.1280945885
[2,] -0.02107818 0.000240314903 -0.000001964506 -0.000713772 -0.0009683908
[3,] -0.01852260 -0.000001964506 0.000435428791 -0.000763097 -0.0002472826
[4,] 0.05234187 -0.000713772031 -0.000763096999 0.014297546 -0.0047228461
[5,] 0.12809459 -0.000968390764 -0.000247282552 -0.004722846 0.0473303248
# Compute SEs
se = sqrt(diag(cov_b))
se
[1] 1.69169512 0.01550209 0.02086693 0.11957235 0.21755534
These are the standard errors for the intercept and each predictor. Here I add the coefficients and standard errors to a data frame and use them to compute t-values, associated p-values, and confidence intervals.
# Load library
library(tidyverse)
# Create regression table
data.frame(
  Predictor = c("Intercept", "Ability", "Motivation", "Previous Coursework", "Family Background"),
  B = c(b_0, b),
  SE = se
) |>
  mutate(
    t = round(B / SE, 3),
    p = round(2 * pt(-abs(t), df = 995), 5),
    CI = paste0("(", round(B + qt(.025, df = 995) * SE, 3), ", ", round(B + qt(.975, df = 995) * SE, 3), ")")
  )
Standardized Regression from Summaries
We can also compute the standardized regression coefficients and related values in a similar manner; in fact, it is easier, since there is no intercept to compute in a standardized regression. In a standardized regression, the coefficients are based on the centered and scaled data. Similarly, the correlation matrix is a scaled version of the covariance matrix (which represents the centered data). This allows us to use correlation matrices in the set of normal equations:
\[ \mathrm{Cor}(X,X)\mathbf{b^*} = \mathrm{Cor}(X,y) \]
Here we use the notation \(\mathbf{b^*}\) to differentiate these predictor estimates from the unstandardized estimates. We can solve this for \(\mathbf{b^*}\):
\[ \mathbf{b^*} = \mathrm{Cor}(X,X)^{-1}\mathrm{Cor}(X,y) \]
This reminds us that the regression coefficients in a standardized regression are correlations (or partial correlations in a multiple regression) between a predictor and the outcome. Here we use the original correlation matrix to compute the standardized regression coefficients:
# Compute standardized coefficients
b_star = solve(corrs[-1 , -1]) %*% corrs[1 , 2:5]
b_star
[,1]
[1,] 0.55105995
[2,] 0.01257721
[3,] 0.31000268
[4,] 0.06949733
Writing the fitted equation:
\[ \begin{split} \hat{z}_{\mathrm{Achievement}_i} = &0.551(z_{\mathrm{Ability}_i}) + 0.012(z_{\mathrm{Motivation}_i}) + \\ &0.310(z_{\mathrm{Coursework~}_i}) + 0.069(z_{\mathrm{Family~Background}_i}) \end{split} \]
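These standardized estimates are consistent with the unstandardized slopes computed earlier: each standardized slope equals the unstandardized slope multiplied by the ratio of the predictor's standard deviation to the outcome's standard deviation. For example, for ability,

\[ b^*_{\mathrm{Ability}} = 0.367 \times \frac{15}{10} \approx 0.551 \]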
We can also compute the associated standard errors using the following:
\[ \mathrm{SE}_{\mathrm{Std}} = \mathrm{SE}_{\mathrm{Unstd}} \bigg(\frac{b_{\mathrm{Std}}}{b_{\mathrm{Unstd}}}\bigg) \]
In our example,
# Compute SEs
se[-1] * b_star / b
[,1]
[1,] 0.02325314
[2,] 0.02086693
[3,] 0.02391447
[4,] 0.02175553
Here we had to remove the first element in the se object since it was related to the intercept. Using the estimates from the standardized regression, the t-values and associated p-values are identical to those from the unstandardized model (both the estimate and its standard error are rescaled by the same factor, so their ratio is unchanged). In practice, if you are reporting standardized coefficients, it is typical to report the unstandardized estimates and standard errors, the standardized estimates, and the t- and p-values.
Predictor | B | SE | Std. B | t | p |
---|---|---|---|---|---|
Ability | 0.37 | 0.02 | 0.551 | 18.50 | 0.00000 |
Motivation | 0.01 | 0.02 | 0.013 | 0.50 | 0.61719 |
Previous Coursework | 1.55 | 0.12 | 0.310 | 12.92 | 0.00000 |
Family Background | 0.69 | 0.22 | 0.069 | 3.14 | 0.00174 |
Intercept | 6.43 | 1.69 |  | 3.80 | 0.00015 |
Note: R-squared = 0.629; Residual Standard Error = 6.10
Simulating Data Using Summary Values
We can also simulate a set of data that mimics the correlation matrix, means, and standard deviations that were reported in the summaries. To do this, we need to assume that the variables are multivariate normally distributed. If we make this strong assumption, then we can simulate using the mvrnorm() function from the {MASS} package. In this function we provide the sample size, the vector of means (for the outcome and predictors), and the variance-covariance matrix of all the variables.
MASS::mvrnorm(n = 1000, mu = means, Sigma = covs)
This will draw a sample from a population where the variables have those means and that variance-covariance matrix. This means the data we draw will likely not have exactly the same means, SDs, or correlations. If we want the data to reproduce the summary values exactly, we add the argument empirical = TRUE. Here we do this and assign the data to the sim_dat object.
# Set seed for reproducibility
set.seed(1)
# Simulate the data
sim_dat = MASS::mvrnorm(n = 1000, mu = means, Sigma = covs, empirical = TRUE)
This is a matrix object, so we convert it to a data frame and also assign column names so the data are easier to use in our functions.
# Convert to data frame
sim_dat = data.frame(sim_dat)
# Change column names
names(sim_dat) = c("achieve", "ability", "motivation", "coursework", "fam_back")
# View data
head(sim_dat)
Now we can check that the summaries are being reproduced.
# Compute means and SDs
sim_dat |>
  summarize(
    across(.cols = everything(), .fns = list(Mean = mean, SD = sd))
  ) |>
  round(3)

# Compute correlation matrix
sim_dat |>
  corrr::correlate()
Finally, we should be able to reproduce the regression estimates we have computed.
# Fit unstandardized model
lm_unstd = lm(achieve ~ 1 + ability + motivation + coursework + fam_back, data = sim_dat)
# Load broom library
library(broom)
# Model-level output
glance(lm_unstd) |>
print(width = Inf)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.629 0.627 6.10 422. 1.89e-212 4 -3225. 6463. 6492.
deviance df.residual nobs
<dbl> <int> <int>
1 37066. 995 1000
# Coefficient-level output
tidy(lm_unstd)
# Standardized model
sim_dat |>
  scale() |>
  data.frame() |>
  lm(formula = achieve ~ 1 + ability + motivation + coursework + fam_back) |>
  tidy() |>
  filter(term != "(Intercept)")