🎶 Regression from Summary Measures

Published October 13, 2022


When we have raw data, we can fit a regression model using the lm() function and obtain model- and coefficient-level summaries using the glance() and tidy() functions, respectively. However, there are times we might want to compute regression coefficients and standard errors but are not given the raw data. This is common, for example, with published articles, which often report summary statistics rather than providing the raw data. In this set of notes, we will examine how we can fit a regression model to summaries of the data, rather than to the raw data itself.
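As a reminder, that raw-data workflow looks something like the sketch below, assuming a hypothetical data frame named grades that contains the outcome and predictors:

# Load library
library(broom)

# Fit model to the (hypothetical) raw data
lm_raw = lm(achieve ~ 1 + ability + motivation, data = grades)

# Model-level summaries (R2, residual SE, F-test, etc.)
glance(lm_raw)

# Coefficient-level summaries (estimates, SEs, t- and p-values)
tidy(lm_raw)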

So long as we have certain statistical information, we can compute the regression coefficients and standard errors without having the raw data. Specifically, we need the sample size, means, standard deviations, and correlations between the variables. For example, consider the following summary table:

Correlation matrix for five attributes measured on n = 1,000 students. Means (standard deviations) for each attribute are provided on the main diagonal.

Measure                      1.        2.        3.       4.      5.
1. Achievement          50 (10)       ---       ---      ---     ---
2. Ability                0.737  100 (15)       ---      ---     ---
3. Motivation             0.255     0.205   50 (10)      ---     ---
4. Previous Coursework    0.615     0.498     0.375    4 (2)     ---
5. Family Background      0.417     0.417     0.190    0.372   0 (1)

Say we wanted to use this information to fit a regression model to predict variation in achievement using the other four predictors. Mathematically,

\[ \begin{split} \mathrm{Achievement}_i = &\beta_0 + \beta_1(\mathrm{Ability}_i) + \beta_2(\mathrm{Motivation}_i) + \\ &\beta_3(\mathrm{Coursework~}_i) + \beta_4(\mathrm{Family~Background}_i) + \epsilon_i \end{split} \]

To do this we are going to create the correlation matrix, the vector of attribute means, and the vector of attribute standard deviations.

# Create correlation matrix
corrs = matrix(
  data = c(
    1.000, 0.737, 0.255, 0.615, 0.417,
    0.737, 1.000, 0.205, 0.498, 0.417,
    0.255, 0.205, 1.000, 0.375, 0.190,
    0.615, 0.498, 0.375, 1.000, 0.372,
    0.417, 0.417, 0.190, 0.372, 1.000
  ),
  nrow = 5
)

# View correlation matrix
corrs
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] 1.000 0.737 0.255 0.615 0.417
[2,] 0.737 1.000 0.205 0.498 0.417
[3,] 0.255 0.205 1.000 0.375 0.190
[4,] 0.615 0.498 0.375 1.000 0.372
[5,] 0.417 0.417 0.190 0.372 1.000
# Create mean vector
means = c(50, 100, 50, 4, 0)

# Create sd vector
sds = c(10, 15, 10, 2, 1)

# Set sample size
n = 1000


Computing Regression Coefficients from Summary Data

To compute the coefficients from raw data, recall that we solve the normal equations based on:

\[ \mathbf{X}^\intercal\mathbf{Xb} = \mathbf{X}^\intercal\mathbf{y} \]

Recall that if we mean center the variables in a regression analysis, the coefficients for this set of predictors would be exactly the same as if we used the raw predictors. (The intercept would differ, but ignore that for now.) That is, if we solved the normal equations based on

\[ \mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{X}_{\mathrm{Dev}}\mathbf{b} = \mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{y}_{\mathrm{Dev}} \]

the elements of b associated with the predictors in this set of equations would be the same as those from solving for b in \(\mathbf{X}^\intercal\mathbf{Xb} = \mathbf{X}^\intercal\mathbf{y}\).
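To see this equivalence in action, here is a minimal sketch using the built-in mtcars data (not part of our summary-table example): the slopes are the same whether we fit the model to the raw variables or to their mean-centered versions.

# Fit using the raw variables
raw_fit = lm(mpg ~ 1 + hp + wt, data = mtcars)

# Fit using mean-centered (deviation) scores for all variables
centered = data.frame(
  mpg = mtcars$mpg - mean(mtcars$mpg),
  hp  = mtcars$hp  - mean(mtcars$hp),
  wt  = mtcars$wt  - mean(mtcars$wt)
)
dev_fit = lm(mpg ~ 1 + hp + wt, data = centered)

# The hp and wt slopes match; only the intercept changes (it becomes 0)
coef(raw_fit)
coef(dev_fit)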

So, if our goal is to find the coefficients associated with the predictors, it doesn’t matter if we use the raw or mean centered predictors. This is useful since there is a direct relationship between the \(\mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{X}_{\mathrm{Dev}}\) matrix and the variance-covariance matrix of the raw predictors:

\[ \mathrm{Cov}(X,X) = \frac{\mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{X}_{\mathrm{Dev}}}{n-1} \]

Similarly, there is a direct relationship between the \(\mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{y}_{\mathrm{Dev}}\) matrix and the vector of covariances between the raw predictors and the raw outcome:

\[ \mathrm{Cov}(X,y) = \frac{\mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{y}_{\mathrm{Dev}}}{n-1} \]
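These relationships are easy to verify with raw data. A small sketch, again using mtcars (with hp and wt standing in for the predictors and mpg for the outcome):

# Deviation (mean-centered) scores for the predictors and outcome
X_dev = scale(mtcars[ , c("hp", "wt")], center = TRUE, scale = FALSE)
y_dev = mtcars$mpg - mean(mtcars$mpg)
n = nrow(mtcars)

# Matches cov(mtcars[ , c("hp", "wt")])
crossprod(X_dev) / (n - 1)

# Matches cov(mtcars[ , c("hp", "wt")], mtcars$mpg)
crossprod(X_dev, y_dev) / (n - 1)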

This means we can write our second set of normal equations (using the mean centered variables) as:

\[ \begin{split} \mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{X}_{\mathrm{Dev}}\mathbf{b} &= \mathbf{X}_{\mathrm{Dev}}^\intercal\mathbf{y}_{\mathrm{Dev}} \\[1em] \mathrm{Cov}(X,X) (n-1)\mathbf{b} &= \mathrm{Cov}(X,y)(n-1) \\[1em] \mathrm{Cov}(X,X) \mathbf{b} &= \mathrm{Cov}(X,y) \end{split} \]

Which we can use to solve for b:

\[ \mathbf{b} = \mathrm{Cov}(X,X)^{-1}\mathrm{Cov}(X,y) \]
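Before applying this to the summary measures, we can sanity-check the result with raw data. A brief sketch with mtcars shows that solving this system reproduces the slopes that lm() returns:

# Covariance matrix of the predictors and their covariances with the outcome
cov_XX = cov(mtcars[ , c("hp", "wt")])
cov_Xy = cov(mtcars[ , c("hp", "wt")], mtcars$mpg)

# b = Cov(X,X)^(-1) Cov(X,y) ...
solve(cov_XX) %*% cov_Xy

# ... matches the slopes from lm()
coef(lm(mpg ~ 1 + hp + wt, data = mtcars))[-1]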


Going From the Correlations to Covariances

The summary measures we have give correlations, not covariances. However, it is quite easy to convert a correlation matrix to a covariance matrix. From the chapter Statistical Application: SSCP, Variance–Covariance, and Correlation Matrices in Matrix Algebra for Educational Scientists, we know that:

\[ \mathbf{R} = \mathbf{S} \boldsymbol\Sigma \mathbf{S} \]

where R is the correlation matrix, \(\boldsymbol\Sigma\) is the covariance matrix, and S is a diagonal scaling matrix with diagonal elements equal to the reciprocal of the standard deviations of each of the variables in the covariance matrix. Re-arranging this:

\[ \boldsymbol\Sigma = \mathbf{S}^{-1} \mathbf{R} \mathbf{S}^{-1} \]

Recall that the inverse of a diagonal matrix is another diagonal matrix whose diagonal elements are the reciprocals of the original diagonal elements. Thus the diagonal elements of the inverse of our scaling matrix will be the standard deviations of the variables. Using this formula, we can now convert the given correlation matrix to a covariance matrix.

# Compute the scaling matrix S^(-1)
S_inv = diag(sds)
S_inv
     [,1] [,2] [,3] [,4] [,5]
[1,]   10    0    0    0    0
[2,]    0   15    0    0    0
[3,]    0    0   10    0    0
[4,]    0    0    0    2    0
[5,]    0    0    0    0    1
# Compute covariance matrix
covs = S_inv %*% corrs %*% S_inv
covs
       [,1]    [,2]   [,3]   [,4]  [,5]
[1,] 100.00 110.550  25.50 12.300 4.170
[2,] 110.55 225.000  30.75 14.940 6.255
[3,]  25.50  30.750 100.00  7.500 1.900
[4,]  12.30  14.940   7.50  4.000 0.744
[5,]   4.17   6.255   1.90  0.744 1.000

Adding in the row and column names:

Variance-covariance matrix for five attributes measured on n = 1,000 students.

                     Achievement  Ability  Motivation  Previous Coursework  Family Background
Achievement               100.00  110.550       25.50               12.300              4.170
Ability                   110.55  225.000       30.75               14.940              6.255
Motivation                 25.50   30.750      100.00                7.500              1.900
Previous Coursework        12.30   14.940        7.50                4.000              0.744
Family Background           4.17    6.255        1.90                0.744              1.000


Remember, the diagonal elements in the covariance matrix are variances of each variable and the off-diagonal elements are the covariances between two variables. For example, the first diagonal element is 100, which is the variance of the achievement variable. The remaining elements in the first column indicate the covariances between achievement and ability, achievement and motivation, achievement and previous coursework, and achievement and family background.
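For instance, the covariance between achievement and ability is just their correlation rescaled by the two standard deviations: 0.737 × 10 × 15 = 110.55. As a further check on the whole conversion, scaling the covariance matrix back into a correlation matrix with the base R cov2cor() function should recover the original correlations:

# Each covariance is the correlation times the two standard deviations
0.737 * 10 * 15   # Cov(Achievement, Ability)

# Scaling the covariance matrix back should reproduce the correlation matrix
cov2cor(covs)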


Finding the Predictor Coefficients

To compute the coefficients for the predictors, we need to obtain two things:

  • Cov(X, X): The variance-covariance matrix of the predictors, and
  • Cov(X, y): The vector that contains the covariances between the outcome and each predictor. (Note: This vector does not include the variance of the outcome, only the covariances with the predictors.)

In our example, the elements in the first column (or row) of the variance-covariance matrix other than the variance of the outcome are the values in the vector Cov(X, y). The elements in the other rows/columns make up the elements of the Cov(X, X) matrix.

We can create this vector and matrix from the covariance matrix using indexing.

# Create Cov(X, y)
cov_xy = covs[-1 , 1]
cov_xy
[1] 110.55  25.50  12.30   4.17
# Create Cov(X, X)
cov_xx = covs[-1 , -1]
cov_xx
        [,1]   [,2]   [,3]  [,4]
[1,] 225.000  30.75 14.940 6.255
[2,]  30.750 100.00  7.500 1.900
[3,]  14.940   7.50  4.000 0.744
[4,]   6.255   1.90  0.744 1.000

Then we can use these to compute the predictor coefficients.

# Compute predictor coefficients
b = solve(cov_xx) %*% cov_xy
b
           [,1]
[1,] 0.36737330
[2,] 0.01257721
[3,] 1.55001342
[4,] 0.69497334

Thus the coefficient estimates are:

  • \(\hat\beta_{\mathrm{Ability}} = 0.367\)
  • \(\hat\beta_{\mathrm{Motivation}} = 0.013\)
  • \(\hat\beta_{\mathrm{Previous~Coursework}} = 1.550\)
  • \(\hat\beta_{\mathrm{Family~Background}} = 0.695\)


Computing the Intercept

To compute the intercept, we can use the following:

\[ \hat\beta_0 = \bar{y} - \bar{\mathbf{x}}^\intercal\mathbf{b} \]

where \(\bar{y}\) is the mean of the outcome, \(\bar{\mathbf{x}}\) is the vector of predictor means, and b is the associated vector of predictor coefficients.

In our example,

# Compute intercept
b_0 = means[1] - t(means[2:5]) %*% b
b_0
         [,1]
[1,] 6.433756

Thus the fitted equation for the unstandardized model is:

\[ \begin{split} \widehat{\mathrm{Achievement}_i} = &6.434 + 0.367(\mathrm{Ability}_i) + 0.013(\mathrm{Motivation}_i) + \\ &1.550(\mathrm{Coursework~}_i) + 0.695(\mathrm{Family~Background}_i) \end{split} \]
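As a quick plausibility check, a student at the mean of every predictor (an ability of 100, motivation of 50, 4 previous courses, and a family background of 0) is predicted to score at the mean of achievement:

\[ 6.434 + 0.367(100) + 0.013(50) + 1.550(4) + 0.695(0) \approx 50 \]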


Standard Errors for the Coefficients

Using the raw data, we estimated the standard errors for the coefficients by taking the square roots of the diagonal elements of the variance-covariance matrix of the coefficients, which was computed as:

\[ \hat\sigma^2_{e} (\mathbf{X}^\intercal\mathbf{X})^{-1} \]

We can use the summary values to obtain the \(\hat\sigma^2_{e}\) estimate and to compute the elements of the \(\mathbf{X}^\intercal\mathbf{X}\) matrix.


The estimate of the residual variance is found using:

\[ \hat\sigma^2_{e} = \hat\sigma^2_y (1 - R^2_{\mathrm{Adj}}) \]

where \(\hat\sigma^2_y\) is the variance of the outcome and \(R^2_{\mathrm{Adj}}\) is the Adjusted \(R^2\) value which is found using:

\[ R^2_{\mathrm{Adj}} = 1 - \frac{\mathrm{df}_{\mathrm{Total}}}{\mathrm{df}_{\mathrm{Residual}}}(1 -R^2) \]

To find the \(R^2\) value, we can invert the correlation matrix and compute one minus the reciprocals of the diagonal elements of this inverted matrix. The results are the \(R^2\) values from regressing each variable in the correlation matrix on all of the other variables.

# Invert correlation matrix
corr_inv = solve(corrs)

# Compute the R2 values
r2 = 1 - 1/diag(corr_inv)
r2
[1] 0.6289704 0.5591756 0.1439290 0.4421816 0.2201437

Here the \(R^2\) value from regressing achievement on all of the other variables in the correlation matrix (i.e., ability, motivation, previous coursework, and family background) is 0.629. We can then use this to compute the adjusted \(R^2\) and subsequently use that to compute \(\hat\sigma^2_{e}\).

# Compute adjusted R2
r2_adj = 1 - (999 / 995) * (1 - r2[1])
r2_adj
[1] 0.6274788
# Compute estimated residual variance
s2_e = sds[1]^2 * (1 - r2_adj)
s2_e
[1] 37.25212

Our estimate of the residual variance \(\hat\sigma^2_{e}\) is 37.25. (Its square root, about 6.10, is the residual standard error.)


Computing the \(\mathbf{X}^\intercal\mathbf{X}\) Matrix

The elements in the first row and column of the \(\mathbf{X}^\intercal\mathbf{X}\) matrix are functions of the sample size and the means of the predictor variables. Namely, the first element in the first row and column is n, and the remaining elements in that row and column are created as \(n\mathbf{M_x}\), where n is the sample size and \(\mathbf{M_x}\) is the vector of predictor means. The remaining elements constitute a submatrix defined as:

\[ (\mathbf{X}^\intercal\mathbf{X})_{\mathrm{Sub}} = (n-1)\mathrm{Cov}(X,X) + n(\mathbf{M_x})(\mathbf{M_x}^\intercal) \]

where n is the sample size, Cov(X,X) is the variance-covariance matrix of the predictors, and \(\mathbf{M_x}\) is again the vector of predictor means. We can then create the \(\mathbf{X}^\intercal\mathbf{X}\) matrix as:

\[ \begin{split} \mathbf{X}^\intercal\mathbf{X} &= \begin{bmatrix} n & n\mathbf{M_x}^\intercal \\ n\mathbf{M_x} & (\mathbf{X}^\intercal\mathbf{X})_{\mathrm{Sub}} \end{bmatrix} \\[2ex] &= \begin{bmatrix} n & n\bar{X}_1 & n\bar{X}_2 & n\bar{X}_3 & \ldots & n\bar{X}_k \\ n\bar{X}_1 & & & & & \\ n\bar{X}_2 & & & & & \\ n\bar{X}_3 & & & (\mathbf{X}^\intercal\mathbf{X})_{\mathrm{Sub}} & & \\ \vdots & & & & & \\ n\bar{X}_k & & & & & \end{bmatrix} \end{split} \]

To create this matrix using R, we will:

  1. Compute the submatrix \((\mathbf{X}^\intercal\mathbf{X})_{\mathrm{Sub}}\).
  2. Bind the \(n\mathbf{M_x}\) vector to the top of the submatrix.
  3. Bind the vector that contains n and \(n\mathbf{M_x}\) to the left of the resulting matrix from Step 2.

The resulting matrix is the \(\mathbf{X}^\intercal\mathbf{X}\) matrix. For our example,

# Step 1: Create submatrix
sub_mat = 999 * cov_xx + 1000 * means[2:5] %*% t(means[2:5])
sub_mat
             [,1]      [,2]       [,3]     [,4]
[1,] 10224775.000 5030719.2 414925.060 6248.745
[2,]  5030719.250 2599900.0 207492.500 1898.100
[3,]   414925.060  207492.5  19996.000  743.256
[4,]     6248.745    1898.1    743.256  999.000
# Step 2: Bind n(M_x) to top of submatrix
mat_2 = rbind(1000 * means[2:5], sub_mat)
mat_2
             [,1]      [,2]       [,3]     [,4]
[1,]   100000.000   50000.0   4000.000    0.000
[2,] 10224775.000 5030719.2 414925.060 6248.745
[3,]  5030719.250 2599900.0 207492.500 1898.100
[4,]   414925.060  207492.5  19996.000  743.256
[5,]     6248.745    1898.1    743.256  999.000
# Step 3: Bind vector to left of Step 2 matrix
XtX = cbind(c(1000, 1000 * means[2:5]), mat_2)
XtX
       [,1]         [,2]      [,3]       [,4]     [,5]
[1,]   1000   100000.000   50000.0   4000.000    0.000
[2,] 100000 10224775.000 5030719.2 414925.060 6248.745
[3,]  50000  5030719.250 2599900.0 207492.500 1898.100
[4,]   4000   414925.060  207492.5  19996.000  743.256
[5,]      0     6248.745    1898.1    743.256  999.000

We can then invert this matrix and scale the inverse by our previous estimate of the residual variance to obtain the variance-covariance matrix of the coefficients, and then compute the coefficient standard errors from its diagonal.

# Compute var-cov matrix of b
cov_b = s2_e * solve(XtX)
cov_b
            [,1]            [,2]            [,3]         [,4]          [,5]
[1,]  2.86183238 -0.021078176856 -0.018522600965  0.052341868  0.1280945885
[2,] -0.02107818  0.000240314903 -0.000001964506 -0.000713772 -0.0009683908
[3,] -0.01852260 -0.000001964506  0.000435428791 -0.000763097 -0.0002472826
[4,]  0.05234187 -0.000713772031 -0.000763096999  0.014297546 -0.0047228461
[5,]  0.12809459 -0.000968390764 -0.000247282552 -0.004722846  0.0473303248
# Compute SEs
se = sqrt(diag(cov_b))
se
[1] 1.69169512 0.01550209 0.02086693 0.11957235 0.21755534

These are the standard errors for the intercept and each predictor. Here I add the coefficients and standard errors to a data frame and use them to compute t-values, associated p-values, and confidence intervals.

# Load library
library(tidyverse)

# Create regression table
data.frame(
  Predictor = c("Intercept", "Ability", "Motivation", "Previous Coursework", "Family Background"),
  B = c(b_0, b),
  SE = se
) |>
  mutate(
    t = round(B /SE, 3),
    p = round(2 * pt(-abs(t), df = 995), 5),
    CI = paste0("(", round(B + qt(.025, df = 995)*SE, 3), ", ", round(B + qt(.975, df = 995)*SE, 3), ")")
  )


Standardized Regression from Summaries

We can also compute the standardized regression coefficients and related values in a similar manner; in fact, it is easier since there is no intercept to compute in a standardized regression. In a standardized regression, the coefficients are based on data that have been both centered and scaled. Similarly, the correlation matrix is a scaled version of the covariance matrix (which represents the centered data). This allows us to use the correlation matrix directly in the normal equations:

\[ \mathrm{Cor}(X,X)\mathbf{b^*} = \mathrm{Cor}(X,y) \]

Here we use the notation \(\mathbf{b^*}\) to differentiate these predictor estimates from the unstandardized estimates. We can solve this for \(\mathbf{b^*}\):

\[ \mathbf{b^*} = \mathrm{Cor}(X,X)^{-1}\mathrm{Cor}(X,y) \]

This reminds us that in a standardized simple regression the coefficient is the correlation between the predictor and the outcome (and in a multiple regression the standardized coefficients are closely related to the partial correlations). Here we use the original correlation matrix to compute the standardized regression coefficients:

# Compute standardized coefficients
b_star = solve(corrs[-1 , -1]) %*% corrs[1 , 2:5]
b_star
           [,1]
[1,] 0.55105995
[2,] 0.01257721
[3,] 0.31000268
[4,] 0.06949733

Writing the fitted equation:

\[ \begin{split} \hat{z}_{\mathrm{Achievement}_i} = &0.551(z_{\mathrm{Ability}_i}) + 0.013(z_{\mathrm{Motivation}_i}) + \\ &0.310(z_{\mathrm{Coursework~}_i}) + 0.069(z_{\mathrm{Family~Background}_i}) \end{split} \]

We can also compute the associated standard errors using the following:

\[ \mathrm{SE}_{\mathrm{Std}} = \mathrm{SE}_{\mathrm{Unstd}} \bigg(\frac{b_{\mathrm{Std}}}{b_{\mathrm{Unstd}}}\bigg) \]

In our example,

# Compute SEs
se[-1] * b_star / b
           [,1]
[1,] 0.02325314
[2,] 0.02086693
[3,] 0.02391447
[4,] 0.02175553

Here we had to remove the first element of the se object since it was the standard error for the intercept. Using the estimates from the standardized regression, the t-values and associated p-values are identical to those from the unstandardized model. In practice, if you are reporting standardized coefficients, it is typical to report the unstandardized estimates and standard errors, the standardized estimates, and the t- and p-values.
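We can verify that claim directly (a small sketch reusing the objects computed above; se_std is just a name for the standardized SEs we computed a moment ago):

# Standardized SEs from above
se_std = se[-1] * b_star / b

# t-values from the unstandardized and standardized estimates are identical
b / se[-1]
b_star / se_std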

Unstandardized and standardized regression coefficients for fitting a model to predict variation in student achievement for n = 1,000 students.

Predictor                B    SE  Std. B      t        p
Ability               0.37  0.02   0.551  18.50  0.00000
Motivation            0.01  0.02   0.013   0.50  0.61719
Previous Coursework   1.55  0.12   0.310  12.92  0.00000
Family Background     0.69  0.22   0.069   3.14  0.00174
Intercept             6.43  1.69           3.80  0.00015

Note: R-squared = 0.629; Residual Standard Error = 6.10


Simulating Data Using Summary Values

We can also simulate a set of data that mimics the correlation matrix, means, and standard deviations that were reported in the summaries. To do this, we need to assume that the variables follow a multivariate normal distribution. If we are willing to make this strong assumption, we can simulate the data using the mvrnorm() function from the {MASS} package. In this function we need to provide the sample size, the vector of means (for the outcome and predictors), and the variance-covariance matrix of all the variables.

MASS::mvrnorm(n = 1000, mu = means, Sigma = covs)

This will draw a sample from a population where the variables have those means and that variance-covariance matrix, which means the data we draw will likely not have exactly the same means, SDs, or correlations. If we want the simulated data to reproduce the summary values exactly, we add the argument empirical = TRUE. Here we do this and assign the data to the sim_dat object.

# Set seed for reproducibility
set.seed(1)

# Simulate the data
sim_dat = MASS::mvrnorm(n = 1000, mu = means, Sigma = covs, empirical = TRUE)

This is a matrix object, so we convert it to a data frame and also assign column names so the data are easier to use in our functions.

# Convert to data frame
sim_dat = data.frame(sim_dat)

# Change column names
names(sim_dat) = c("achieve", "ability", "motivation", "coursework", "fam_back")

# View data
head(sim_dat)

Now we can check that the summaries are being reproduced.

# Compute means and SDs
sim_dat |>
  summarize(
    across(.cols = everything(), .fns = list(Mean = mean, SD = sd))
  ) |>
  round(3)
# Compute correlation matrix
sim_dat |>
  corrr::correlate()

Finally, we should be able to reproduce the regression estimates we have computed.

# Fit unstandardized model
lm_unstd = lm(achieve ~ 1 + ability + motivation + coursework + fam_back, data = sim_dat)

# Load broom library
library(broom)

# Model-level output
glance(lm_unstd) |>
  print(width = Inf)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.629         0.627  6.10      422. 1.89e-212     4 -3225. 6463. 6492.
  deviance df.residual  nobs
     <dbl>       <int> <int>
1   37066.         995  1000
# Coefficient-level output
tidy(lm_unstd)
# Standardized model
sim_dat |>
  scale() |>
  data.frame() |>
  lm(formula = achieve ~ 1 + ability + motivation + coursework + fam_back) |>
  tidy() |>
  filter(term != "(Intercept)")