10  Standardized Regression

In this chapter, you will learn about standardized regression. You will also learn how the regression coefficients from a simple regression can be computed from other summary measures, which will help you see how those measures impact the coefficients. We will use the riverview.csv data to examine whether education level is related to income (see the data codebook). To begin, we will load several libraries and import the data into an object called city. We will also fit a model by regressing income on education level and store those results in an object called lm.a.

# Load libraries
library(corrr)
library(dplyr)
library(ggplot2)
library(readr)

# Import data
city = read_csv(file = "https://raw.githubusercontent.com/zief0002/modeling/main/data/riverview.csv")

# View data
city

# Fit regression model
lm.a = lm(income ~ 1 + education, data = city)
lm.a

Call:
lm(formula = income ~ 1 + education, data = city)

Coefficients:
(Intercept)    education  
     11.321        2.651  

The fitted regression equation is

\[ \hat{\mathrm{Income}_i} = 11.321 + 2.651(\mathrm{Education~Level}_i) \]
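For example, we can use the fitted equation (or the predict() function) to find the predicted income for an employee with 16 years of education; the value 16 is chosen here only for illustration.

# Predicted income for an education level of 16 (hypothetical value)
11.321 + 2.651 * 16

# Equivalently, using predict() on the fitted model
predict(lm.a, newdata = data.frame(education = 16))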

Recall also that we had previously computed summary measures for the outcome and predictor variables, as well as the correlation between them.

Table 10.1: Correlation between income and education level for the Riverview employees. The means and standard deviations for each attribute are displayed on the main diagonal.

Measure               1.              2.
1. Income             53.74 (14.55)
2. Education level    .795            16.00 (4.36)


10.1 Correlation’s Relationship to Regression

The correlation coefficient and the slope of the regression line are directly related to one another. Mathematically, the estimated slope of the simple regression line can be computed as:

\[ \hat\beta_1 = r_{xy} \times \frac{s_y}{s_x} \]

where \(s_x\) and \(s_y\) are the standard deviations for the predictor x and the outcome y, respectively, and \(r_{xy}\) is the correlation between x and y. If we are carrying out a regression analysis, there must be variation in both x and y, which implies that both \(s_x\) and \(s_y\) are greater than 0. This in turn implies that the ratio of the standard deviations (the second factor on the right-hand side of the equation) is also a positive number. Consequently, the sign of the slope depends entirely on the sign of the correlation coefficient: if \(r_{xy}>0\) then \(\hat\beta_1>0\), and if \(r_{xy}<0\) then \(\hat\beta_1<0\).

The magnitude of the regression slope (sometimes referred to as the effect of x on y) is impacted by three factors: (1) the magnitude of the correlation between x and y; (2) the amount of variation in y; and (3) the amount of variation in x. In general, there is a larger effect of x on y when:

  • There is a stronger relationship (higher correlation; positive or negative) between x and y;
  • There is more variation in the outcome; or
  • There is less variation in the predictor.
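
One quick way to see these properties in action is to rescale the variables and re-fit the model. In the sketch below, the doublings are arbitrary choices used only for illustration:

# Doubling the outcome doubles its SD, which doubles the slope
coef(lm(I(income * 2) ~ 1 + education, data = city))

# Doubling the predictor doubles its SD, which halves the slope
coef(lm(income ~ 1 + I(education * 2), data = city))

In both cases the correlation between the two variables is unchanged; only the standard deviations (and therefore the slope) are affected.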


10.1.1 Checking the Formula on Our Data

Let’s use the summary measures from the Riverview data to confirm the formula for the slope.

\[ \begin{split} \hat\beta_1 &= r_{xy} \times \frac{s_y}{s_x} \\[2ex] &= 0.795 \times \frac{14.55}{4.36} \\[2ex] &= 2.65 \end{split} \]

This is the same value for the slope that we obtained from the lm() output.
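We can also carry out the same computation in R, using values computed directly from the data rather than the rounded summary measures:

# Compute the slope from the correlation and the standard deviations
cor(city$income, city$education) * sd(city$income) / sd(city$education)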


10.2 Standardized Regression

In standardized regression, the correlation plays a more obvious role. Standardized regression is simply regression performed on the standardized variables (z-scores) rather than on the unstandardized variables. To carry out a standardized regression:

  1. Standardize the outcome and predictor(s) by turning all the observations into z-scores
  2. Fit a model by regressing \(z_y\) on \(z_x\)

Here we will perform a standardized regression on the Riverview data.


10.2.1 Step 1: Standardizing the Variables

Remember that a z-score is computed as:

\[ z_x = \frac{x - \bar{x}}{s_x} \]

That is, we subtract the mean from each observation and then divide by the standard deviation. When we standardize a variable, we turn each of the original observations into a z-score.

A standardized variable will always have a mean of 0 and a standard deviation of 1. By subtracting the mean from each observation (a process called mean centering), we make the mean of the newly transformed observations 0; that is, we re-center the distribution at zero. Then, by dividing each observation by the standard deviation (called scaling), we change the SD of the transformed observations to 1. Anytime you add or subtract a value from each observation in a variable, you shift the mean of that distribution. Anytime you multiply or divide each observation in a variable by some number, you change the SD.
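
To see those last two facts concretely, consider what happens to the mean and SD of income when we shift or scale each observation (the constants 10 and 2 below are arbitrary, chosen only for illustration):

# Adding a constant shifts the mean but leaves the SD unchanged
mean(city$income + 10)
sd(city$income + 10)

# Multiplying by a constant changes the SD
mean(city$income * 2)
sd(city$income * 2)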

# Standardize the outcome and predictor
city = city |>
  mutate(
    z_income = (income - mean(income)) / sd(income),
    z_education = (education - mean(education)) / sd(education)
  )

# View updated data
head(city)
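
As an aside, base R's scale() function carries out the same centering and scaling in one step. Here is a minimal sketch; the z_income2 and z_education2 columns are hypothetical names used only to check the equivalence:

# Standardize using scale(); as.numeric() converts the one-column
# matrix that scale() returns back to a plain vector
city |>
  mutate(
    z_income2 = as.numeric(scale(income)),
    z_education2 = as.numeric(scale(education))
  ) |>
  head()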
# Marginal distribution of the standardized incomes
ggplot(data = city, aes(x = z_income)) +
  stat_density(geom = "line") +
  theme_bw() +
  xlab("Standardized incomes") +
  ylab("Probability density")

# Marginal distribution of the standardized education levels
ggplot(data = city, aes(x = z_education)) +
  stat_density(geom = "line") +
  theme_bw() +
  xlab("Standardized education level") +
  ylab("Probability density")

Density plot of the standardized employee incomes and standardized education levels.

# Compute summaries
city |>
  summarize(
    M_y = mean(z_income),
    SD_y = sd(z_income),
    M_x = mean(z_education),
    SD_x = sd(z_education)
  )

Note that the shapes of the distributions for the standardized variables are identical to the shapes of the distributions of the unstandardized variables. Unless the distribution of a variable is normal to begin with, computing z-scores DOES NOT make the standardized distribution normal. The means for both standardized variables are 0 (because of rounding in the computation they are not exactly 0, but quite close) and the standard deviations are both 1. We can also compute the correlation between the standardized variables:

# Compute correlation
city |>
  select(z_income, z_education) |>
  correlate()

The correlation (\(r=0.795\)) between the standardized variables is exactly the same as the correlation between the unstandardized variables. Centering and scaling do not impact the relationship between variables! This means that, in a regression analysis, the underlying relationship is the same whether we use the unstandardized variables or center or scale them. The choice to center or scale has more to do with making the interpretation of the results more meaningful.
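
One way to convince yourself of this is to compute the correlations among the unstandardized and standardized variables in a single matrix; each variable correlates perfectly with its standardized counterpart, and the income-education correlation is .795 in both metrics:

# Correlations among the unstandardized and standardized variables
city |>
  select(income, education, z_income, z_education) |>
  correlate()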


10.2.2 Step 2: Fit a Regression Model Using the Standardized Variables

Now that we have standardized the variables being used in the regression, we can fit a model by regressing \(z_y\) (the standardized outcome) on \(z_x\) (the standardized predictor).

# Fit standardized regression
lm.z = lm(z_income ~ 1 + z_education, data = city)
lm.z

Call:
lm(formula = z_income ~ 1 + z_education, data = city)

Coefficients:
(Intercept)  z_education  
 -7.883e-17    7.948e-01  

The fitted regression equation is:

\[ \hat{z}_{\mathrm{Income}_i} = 0 + 0.795(z_{\mathrm{Education}_i}) \]

The intercept in a standardized regression is always 0.¹ Notice that the slope of the standardized regression is the correlation between the predictor and the outcome.

10.2.3 Your Turn

Use the formula for the slope to understand why the slope from a standardized regression will always be equal to the value of the correlation coefficient.

If we interpret these coefficients:

  • The predicted mean standardized income for all employees who have a standardized education level of 0 is 0.
  • Each one-unit difference in the standardized education level is associated with a 0.795-unit difference in standardized income, on average.

Remember that standardized variables have a mean equal to 0 and a standard deviation equal to 1. Using that, these interpretations can be revised to:

  • Employees who have the mean education level are predicted to have the mean income.
  • Each one-standard deviation difference in education level is associated with a 0.795-standard deviation difference in income, on average.

Here is a scatterplot of the standardized variables along with the fitted standardized regression line, which helps you visualize the results of the standardized regression analysis.

ggplot(data = city, aes(x = z_education, y = z_income)) +
  geom_point() +
  theme_bw() +
  xlab("Standardized education level") +
  ylab("Standardized income") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_abline(intercept = 0, slope = 0.795)

Plot of the standardized income versus the standardized education level for the Riverview employees. The mean values are also displayed (dashed lines) along with the fitted regression line (solid line).

Using standardized regression results allows us to talk about the effect of x on y in a standard metric (standard deviation differences). This can be quite helpful when the unstandardized metric is less meaningful. It is also why some researchers refer to the correlation as an effect, even though the \(R^2\) value is more useful for summarizing the usefulness of the model. Standardized regression also makes the intercept interpretable, since the intercept now corresponds to the mean value of x, which is never an extrapolation.

The choice to standardize, center, or scale the outcome or predictors only impacts the interpretations of the intercept and slope; it does not change the underlying relationships between the variables in the model. When it is helpful to have these different interpretations, transform the variables. Otherwise, don't worry about it.


10.2.4 A Slick Property of the Regression Line

Notice from the previous scatterplot of the standardized regression results that the standardized regression line goes through the point \((0,0)\). Since the variables are standardized, this is the point \((\bar{x}, \bar{y})\). The regression line will always go through the point \((\bar{x}, \bar{y})\) even if the variables are unstandardized. This is an important property of the regression line.

We can show this property mathematically by predicting \(y\) when \(x\) is at its mean. The predicted value when \(x=\bar{x}\) is then

\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\bar{x}) \]

Using a common formula for the regression intercept,

\[ \hat\beta_0 = \bar{y} - \hat\beta_1(\bar{x}), \]

and substituting this into the prediction equation:

\[ \begin{split} \hat{Y}_i &= \hat\beta_0 + \hat\beta_1(\bar{x}) \\ &= \bar{y} - \hat\beta_1(\bar{x}) + \hat\beta_1(\bar{x}) \\ &= \bar{y} \end{split} \]

This implies that \((\bar{x}, \bar{y})\) is always on the regression line and that the predicted value of y for x-values at the mean is always the mean of y.
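
We can also verify this property empirically using the unstandardized model we fitted earlier; both computations below should return the mean income of 53.74:

# Predicted income at the mean education level...
predict(lm.a, newdata = data.frame(education = mean(city$education)))

# ...equals the mean income
mean(city$income)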


10.2.5 Variance Accounted For in a Standardized Regression

The \(R^2\) values for the standardized and unstandardized regression models are identical. That is because the correlation between x and y and the correlation between \(z_x\) and \(z_y\) are identical (as we saw earlier). Thus the squared correlation will also be the same; in this case \(R^2 = 0.795^2 = 0.632\).

We can also compute \(R^2\) as the proportion reduction in error variation (PRE) from the intercept-only model. To do so we again compute the sum of squared error (SSE) for the standardized models (intercept-only and intercept-slope) and determine how much variation was explained by including the standardized education level as a predictor.

Remember that the intercept-only model is referred to as the marginal mean model—it predicts the marginal mean of y regardless of the value of x. Since the variables are standardized, the marginal mean of y is 0. Thus the equation for the intercept-only model when the variables are standardized is:

\[ \hat{z}_{\mathrm{Income}} = 0 \]

You could also fit the intercept-only model, lm(z_income ~ 1, data = city), to obtain this result. We can now compute the SSE based on the intercept-only model.

# Compute the SSE for the standardized intercept-only model
city |>
  mutate(
    y_hat = 0,
    errors = z_income - y_hat,
    sq_errors = errors ^ 2
  ) |>
  summarize(
    SSE = sum(sq_errors)
  )
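
This SSE turns out to be 31, which is no accident. Because the standardized outcome has a mean of 0 and a standard deviation of 1, the SSE for the intercept-only model reduces to \(n-1\); here, with the \(n=32\) employees in the Riverview data,

\[ \mathrm{SSE}_0 = \sum_{i=1}^{n} \big(z_{\mathrm{Income}_i} - 0\big)^2 = (n-1) \times s^2_{z_{\mathrm{Income}}} = (32 - 1) \times 1 = 31 \]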

We also compute the SSE for the standardized model that includes standardized education level as a predictor.

# Compute the SSE for the standardized slope-intercept model
city |>
  mutate(
    y_hat = 0 + 0.795 * z_education,
    errors = z_income - y_hat,
    sq_errors = errors ^ 2
  ) |>
  summarize(
    SSE = sum(sq_errors)
  )

The proportion reduction in SSE is:

\[ \text{PRE} = \frac{31 - 11.4}{31} = 0.632 \]

We can say that differences in education level explain 63.2% of the variation in employee incomes, and that 36.8% of the variation in incomes remains unexplained. Note that if we compute the SSEs for the unstandardized models, they will be different from the SSEs for the standardized models (after all, they are in a different metric), but they will be in the same proportion, which produces the same \(R^2\) value.
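
As a quick check, we can extract the \(R^2\) values from both fitted models; summary() stores this value in its r.squared element, and the two values should be identical:

# R-squared from the unstandardized model
summary(lm.a)$r.squared

# R-squared from the standardized model
summary(lm.z)$r.squared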



  1. R and other statistical software will often report the intercept from a standardized regression as a very small number (e.g., -7.883e-17) rather than exactly zero because of rounding in the computational algorithm. The intercept should be reported as zero, or dropped from the fitted equation.