In this set of notes you will learn about power transformations, how to use power transformations to re-express data so that the re-expressed data meet the assumption of ā€œlinearityā€ (i.e., ā€œstraightenā€ curvilinear data), and see this in an empirical example.


Power Transformations

All of the transformations, or re-expressions, of data we have seen in this course are power transformations. Power transformations essentially transform some variable X using the function:

\[ X \rightarrow X^{(p)} \]

where p is some power. Here are some important ideas about power transformations: \(p=1\) leaves the data unchanged; powers greater than 1 (e.g., squaring, cubing) stretch large values relative to small ones; powers less than 1 (e.g., square root, reciprocal) compress them; and, by convention, \(p=0\) is replaced with the logarithm (since \(X^0\) would be constant for every case).

This is called the ladder of transformations, since we can think about these different power transformations as rungs on a ladder going up or down from the \(p=1\) (no transformation) starting point.
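To make the ladder concrete, here is a minimal sketch (using a small, made-up variable; not part of the original notes) showing the re-expression at a few rungs of the ladder:

# A small, hypothetical right-skewed variable
x = c(1, 2, 4, 8, 16, 32)

# Up-the-ladder re-expressions (p > 1)
x^2       # p = 2
x^3       # p = 3

# No transformation (p = 1)
x

# Down-the-ladder re-expressions (p < 1)
sqrt(x)   # p = 1/2
log(x)    # p = 0 (by convention, the logarithm is used)
-1 / x    # p = -1 (the negative reciprocal keeps the original ordering)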

Ladder of transformations indicating upward (up-the-ladder) power transformations and downward (down-the-ladder) power transformations.

Rule of the Bulge

The Rule of the Bulge is a technique introduced by John Tukey and Frederick Mosteller for ā€œstraighteningā€ data to better meet the assumption of linearity. Note that the following figure shows the four monotonic curves, one in each of the four quadrants:

The four monotonic curves of the Rule of the Bulge, one in each of the four quadrants.

The Rule of the Bulge tells us to transform in the direction the bulge points: if the bulge points toward smaller values of X, re-express X using a downward transformation (if it points toward larger values of X, use an upward transformation); similarly, if the bulge points toward smaller values of Y, re-express Y using a downward transformation (if toward larger values of Y, an upward transformation).

To illustrate, consider the non-linear relationships depicted in the following two scatterplots.

# Load libraries
library(patchwork)
library(tidyverse)

# Import data
mn = readr::read_csv(file = "https://raw.githubusercontent.com/zief0002/epsy-8252/master/data/mn-schools.csv")
fert = readr::read_csv(file = "https://raw.githubusercontent.com/zief0002/epsy-8252/master/data/fertility.csv")

# Create scatterplot of graduation data
p1 = ggplot(data = mn, aes(x = sat, y = grad)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  theme_light() +
  xlab("Estimated median SAT score (in hundreds)") +
  ylab("Six-year graduation rate")

# Create scatterplot of fertility data
p2 = ggplot(data = fert, aes(x = educ_female, y = infant_mortality)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  xlab("Average female education level") +
  ylab("Infant mortality rate (per 1,000 live births)") +
  theme_light()

# Layout plots
p1 | p2

The scatterplot of the relationship between median SAT scores and graduation rate indicates a relationship similar to that in Quadrant 2. This suggests that we could try to: (1) re-express X using a downward transformation; (2) re-express Y using an upward transformation; or (3) both. In the notes, we fitted a model in which we log-transformed the median SAT scores (a downward transformation of X) to ā€œstraightenā€ the relationship:

Y ~ 1 + ln(X)
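As a quick check of that re-expression (this plot is a sketch, not one of the original figures), we can plot graduation rate against ln(median SAT score) using the mn data imported above:

# Scatterplot of graduation rate versus ln(median SAT score)
ggplot(data = mn, aes(x = log(sat), y = grad)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  theme_light() +
  xlab("ln(Estimated median SAT score)") +
  ylab("Six-year graduation rate")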

The scatterplot of the relationship between female education level and infant mortality rate indicates a relationship similar to that in Quadrant 3. This suggests that we could try to: (1) re-express X using a downward transformation; (2) re-express Y using a downward transformation; or (3) both. Below is the plot in which we re-expressed infant mortality rate using a log-transform (a downward transformation).

ln(Y) ~ 1 + X

# Plot ln(infant mortality rate) versus female education level
ggplot(data = fert, aes(x = educ_female, y = log(infant_mortality))) +
    geom_point() +
    geom_smooth(se = FALSE) +
    xlab("Average female education level") +
    ylab("ln(Infant mortality rate)") +
    theme_light()


Caution ⚠️

Sometimes these re-expressions will not be adequate. In some cases, you might not be able to ā€œstraightenā€ the data enough to meet the assumption. This is because these transformations ā€œdeteriorateā€ or ā€œspuriously increaseā€ the information contained in the data. As you use re-expressions further down-the-ladder, the variation in the re-expressed data decreases (less variation = less information). Eventually, the variation in the re-expressed data will be so small that the values become indistinguishable (no information).

In the other direction, as you use re-expressions further up-the-ladder, the variation in the re-expressed data increases (more variation = more information), albeit spuriously. Essentially, we are adding information that is not truly in the data. This might lead us to find results that aren’t really there, or to over-emphasize relationships.

Re-expressions that only go a little way up- or down-the-ladder are fine. Just beware if you need to go too far up- or down-the-ladder to straighten your data. In those cases you may want to use a different method of estimating the model than OLS (e.g., non-linear least squares).
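To illustrate this shrinking variation with a made-up example (not data from these notes), compare the standard deviation of a variable as it is re-expressed further down the ladder:

# A hypothetical variable spanning several orders of magnitude
x = c(1, 10, 100, 1000, 10000)

# Variation shrinks as we move further down the ladder
sd(x)          # p = 1 (no transformation)
sd(sqrt(x))    # p = 1/2
sd(log(x))     # p = 0
sd(-1 / x)     # p = -1; most re-expressed values are now nearly indistinguishable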


Empirical Example

We will use the mammal.csv data to predict variation in body weight for mammals using their brain weight as a predictor.

# Load libraries
library(broom)
library(tidyverse)

# Import data
mammal = read_csv("https://raw.githubusercontent.com/zief0002/epsy-8252/master/data/mammal.csv")
head(mammal)

# Examine relationship
ggplot(data = mammal, aes(x = brain_weight, y = body_weight)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  theme_light() +
  xlab("Brain weight (in g)") +
  ylab("Body weight (in kg)")

The relationship is non-linear, and shows an exponential growth curve. Using the Rule of the Bulge mnemonic, we identify this curve as the one in the lower right-hand quadrant. To help straighten this curve we can either: (1) re-express Y using a downward transformation, or (2) re-express X using an upward transformation.

Since there is only a single predictor, transforming Y is low-cost (there are no other predictors whose relationships with Y would be affected), whereas transforming X with an upward transformation means we would have to include more than one effect in the model (e.g., \(X\) and \(X^2\)).
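For contrast, here is a sketch of what the up-the-ladder alternative would involve; it requires both a linear and a quadratic effect of brain weight (this model is not pursued in these notes):

# Up-the-ladder alternative: include linear and quadratic effects of X
lm(body_weight ~ 1 + brain_weight + I(brain_weight^2), data = mammal)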

Because of this, I will transform Y using the log-transformation. We can then look at the relationship between ln(body weight) and brain weight to see whether this ā€œstraightenedā€ the relationship.

# Plot ln(body weight) versus brain weight
ggplot(data = mammal, aes(x = brain_weight, y = log(body_weight))) +
  geom_point() +
  geom_smooth(se = FALSE) +
  theme_light() +
  xlab("Brain weight (in g)") +
  ylab("ln(Body weight)")

The relationship is still non-linear, and shows a decay version of the exponential growth curve. Using the Rule of the Bulge mnemonic, we identify this curve as the one in the upper left-hand quadrant. To help straighten this curve we can either: (1) re-express X using a downward transformation, or (2) re-express Y using an upward transformation.

Since we just used a downward transformation on Y to fix the previous relationship, using an upward transformation on Y now would just re-introduce the initial problem. Because of this, I will transform X using the log-transformation. We can then look at the relationship between ln(body weight) and ln(brain weight) to see whether this ā€œstraightenedā€ the relationship.

# Plot ln(body weight) versus ln(brain weight)
ggplot(data = mammal, aes(x = log(brain_weight), y = log(body_weight))) +
  geom_point() +
  geom_smooth(se = FALSE) +
  theme_light() +
  xlab("ln(Brain weight)") +
  ylab("ln(Body weight)")

This relationship looks linear! So we can fit a linear model that uses ln(brain weight) to predict variation in ln(body weight). We can then use back-transformations and a plot of the fitted equation to interpret the coefficients in the model.


Fit Linear Model

Fitting the linear model and looking at its output:

# Fit model
lm.1 = lm(log(body_weight) ~ 1 + log(brain_weight), data = mammal)

# Model-level output
glance(lm.1)
# Coefficient-level output
tidy(lm.1)

Interpreting this output:

The fitted equation is:

\[ \hat{\ln(\mathrm{Body~Weight})}_i = -2.51 + 1.22\bigg[\ln(\mathrm{Brain~Weight}_i)\bigg] \]

We can also back-transform these log entities to get a better interpretation of the coefficients. For the intercept, when log(brain weight) is 0, the actual brain weight is 1 gram. Thus, mammals with a 1-gram brain weight have a predicted log(body weight) of \(-2.51\), on average. Exponentiating this value (\(e^{-2.51}=0.081\)), we can interpret the intercept as: mammals with a brain weight of 1 gram are predicted to have a body weight of 0.081 kg, on average.
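A quick way to compute this back-transformed value directly from the fitted model (a sketch, not part of the original output):

# Back-transform the intercept
exp(coef(lm.1)[["(Intercept)"]])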

To consider the interpretation of the slope, we utilize the fact that log-transforming X (using the natural logarithm) allows the slope to be interpreted in terms of a 1% change in X. As such, we choose a series of brain weights that differ by 1% and plug them into our fitted equation to get predicted log(body weights):

# Choose brain weights that differ by 1%
brain = c(100, 101, 102.01)

# Get predicted ln(body weight) values
-2.51 + 1.22 * log(brain)
[1] 3.108308 3.120447 3.132586

The interpretation is: each 1% difference in brain weight is associated with a 0.012-unit difference in predicted log(body weight).

Here 0.012 is the slope coefficient divided by 100. Now let’s transform the log(body weight) values to raw body weights. To do this, we exponentiate these predicted values:

# Exponentiate the predicted values
exp(-2.51 + 1.22 * log(brain))
[1] 22.38313 22.65651 22.93322

This results in a constant multiplicative difference of 1.0122. Namely, each successive predicted body weight is 1.0122 times the previous one.
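We can verify this constant multiplicative difference directly; note that it is also equal to 1.01 raised to the (rounded) slope coefficient:

# Ratio of successive predicted body weights
22.65651 / 22.38313
22.93322 / 22.65651

# Equivalent to 1.01 raised to the slope coefficient
1.01^1.22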

Or, interpreting this as a percent change: mammals whose brain weights differ by 1% are predicted to differ in body weight by 1.22%.

This 1.22% change is essentially the slope coefficient from the fitted equation. Thus, when we log-transform both X and Y using the natural logarithm, we can interpret both the change in X and the change in Y as percent changes. In general: the slope coefficient gives the percent difference in Y associated with a 1% difference in X.
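One way to see why this works: for two brain weights that differ by 1%, say \(x\) and \(1.01x\), the fitted equation gives

\[ \hat{\ln(\mathrm{Body~Weight})}_2 - \hat{\ln(\mathrm{Body~Weight})}_1 = 1.22\bigg[\ln(1.01x) - \ln(x)\bigg] = 1.22\ln(1.01) \]

Exponentiating both sides,

\[ \frac{\widehat{\mathrm{Body~Weight}}_2}{\widehat{\mathrm{Body~Weight}}_1} = 1.01^{1.22} \approx 1.0122 \]

That is, a 1% difference in brain weight corresponds to roughly a 1.22% difference in predicted body weight.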

We can also plot the fitted curve to facilitate a graphical interpretation.

# Plot fitted curve
ggplot(data = mammal, aes(x = brain_weight, y = body_weight)) +
  geom_point(alpha = 0.2) +
  geom_function(fun = function(x){exp(-2.509 + 1.225*log(x))}) +
  theme_light() +
  xlab("Brain weight (in g)") +
  ylab("Body weight (in kg)")