In this set of notes you will learn about power transformations, how to use power transformations to re-express data so that the re-expressed data meet the assumption of linearity (i.e., "straighten" curvilinear data), and see this applied in an empirical example.
All of the transformations, or re-expressions, of data we have seen in this course are power transformations. Power transformations essentially transform some variable X using the function:
\[ X \rightarrow X^{(p)} \]
where p is some power. Here are some important ideas about power transformations:
This is called the ladder of transformations, since we can think about these different power transformations as ladders going up or down from the \(p=1\) (no transformation) starting point.
Ladder of transformations indicating upward (up-the-ladder) power transformations and downward (down-the-ladder) power transformations.
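To make the ladder concrete, here is a small sketch (using made-up values purely for illustration) of moving a variable up and down the ladder. Note that, by convention, the \(p=0\) rung is the logarithm rather than \(X^0\), and negative powers are negated to preserve the ordering of the values:

```r
# A toy variable (illustrative values only)
x = c(1, 4, 9, 16, 25)

# Up-the-ladder (p > 1): spreads out the larger values
x^2         # p = 2
x^3         # p = 3

# No transformation (p = 1)
x^1

# Down-the-ladder (p < 1): pulls in the larger values
sqrt(x)     # p = 1/2
log(x)      # p = 0 (by convention, the log replaces X^0)
-1 / x      # p = -1 (negated so the order of the values is preserved)
```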
The Rule of the Bulge is a technique introduced by John Tukey and Frederick Mosteller for "straightening" data to better meet the assumption of linearity. Note that the following figure shows the four monotonic curves, one in each of the four quadrants:
Rule of the Bulge figure showing the four monotonic curves, one in each of the four quadrants.
The Rule of the Bulge tells us: to straighten a curve, move on the ladder in the direction the bulge points. If the bulge points toward lower values of X, re-express X using a downward transformation; if it points toward higher values of X, use an upward transformation. The same logic applies to Y.
To illustrate, consider the non-linear relationships depicted in the following two scatterplots.
# Import data
mn = readr::read_csv(file = "https://raw.githubusercontent.com/zief0002/epsy-8252/master/data/mn-schools.csv")
fert = readr::read_csv(file = "https://raw.githubusercontent.com/zief0002/epsy-8252/master/data/fertility.csv")
# Create scatterplot of graduation data
p1 = ggplot(data = mn, aes(x = sat, y = grad)) +
geom_point() +
geom_smooth(method = "loess", se = FALSE) +
theme_light() +
xlab("Estimated median SAT score (in hundreds)") +
ylab("Six-year graduation rate")
# Create scatterplot of fertility data
p2 = ggplot(data = fert, aes(x = educ_female, y = infant_mortality)) +
geom_point() +
geom_smooth(se = FALSE) +
xlab("Average female education level") +
ylab("Infant mortality rate (per 1,000 live births)") +
theme_light()
# Layout plots
p1 | p2

The scatterplot of the relationship between median SAT scores and graduation rate indicates a relationship similar to that in Quadrant 2. This suggests that we could try to: (1) re-express X using a downward transformation; (2) re-express Y using an upward transformation; or (3) both. In the notes, we fitted a model in which we log-transformed the median SAT scores (a downward transformation of X) to "straighten" the relationship:
Y ~ 1 + ln(X)
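As a minimal sketch, this model can be fitted with `lm()` using the data imported earlier (the object name `lm.sat` is my choice; the coefficient values will depend on the data):

```r
library(broom)
library(readr)

# Import the Minnesota schools data (same URL as above)
mn = read_csv("https://raw.githubusercontent.com/zief0002/epsy-8252/master/data/mn-schools.csv")

# Fit Y ~ 1 + ln(X): graduation rate regressed on log of median SAT score
lm.sat = lm(grad ~ 1 + log(sat), data = mn)

# Coefficient-level output
tidy(lm.sat)
```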
The scatterplot of the relationship between female education level and infant mortality rate indicates a relationship similar to that in Quadrant 3. This suggests that we could try to: (1) re-express X using a downward transformation; (2) re-express Y using a downward transformation; or (3) both. Below is the plot in which we re-expressed infant mortality rate using a log-transform (a downward transformation).
ln(Y) ~ 1 + X
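This model can be fitted the same way; a sketch (the object name `lm.fert` is my choice):

```r
library(broom)
library(readr)

# Import the fertility data (same URL as above)
fert = read_csv("https://raw.githubusercontent.com/zief0002/epsy-8252/master/data/fertility.csv")

# Fit ln(Y) ~ 1 + X: log of infant mortality regressed on female education level
lm.fert = lm(log(infant_mortality) ~ 1 + educ_female, data = fert)

# Coefficient-level output
tidy(lm.fert)
```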
ggplot(data = fert, aes(x = educ_female, y = log(infant_mortality))) +
geom_point() +
geom_smooth(se = FALSE) +
xlab("Average female education level") +
ylab("ln(Infant mortality rate)") +
theme_light()

Sometimes these re-expressions will not be adequate. In some cases, you might not be able to "straighten" the data enough to meet the assumption. This is because these transformations "deteriorate" or "spuriously increase" the information contained in the data. As you use re-expressions further down-the-ladder, the variation in the re-expressed data decreases (less variation = less information). Eventually, the variation in the re-expressed data will be so small that the values become indistinguishable (no information).
In the other direction, as you use re-expressions further up-the-ladder, the variation in the re-expressed data increases (more variation = more information), albeit spuriously. Essentially, we are adding information that is not truly in the data. This might lead us to find results that aren't really there, or to over-emphasize relationships.
Re-expressions that only go a little way up- or down-the-ladder are fine. Just beware if you need to go too far up- or down-the-ladder to straighten your data. In those cases you may want to use a different method of estimating the model than OLS (e.g., non-linear least squares).
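As a hedged illustration of the non-linear least squares alternative, here is a sketch using base R's `nls()`. The data are simulated purely for illustration, and the model form, starting values, and object names are all my choices:

```r
set.seed(42)

# Simulate data from an exponential curve (purely illustrative)
x = seq(0, 5, by = 0.1)
y = 2 * exp(0.8 * x) + rnorm(length(x), sd = 2)
sim = data.frame(x = x, y = y)

# Fit y = a * exp(b * x) directly with non-linear least squares,
# rather than log-transforming y and using OLS
nls.1 = nls(y ~ a * exp(b * x), data = sim, start = list(a = 1, b = 1))

coef(nls.1)  # estimates should land near the true values a = 2, b = 0.8
```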
We will use the mammal.csv data to predict variation in body weight for mammals using their brain weight as a predictor.
# Load libraries
library(broom)
library(tidyverse)
# Import data
mammal = read_csv("https://raw.githubusercontent.com/zief0002/epsy-8252/master/data/mammal.csv")
head(mammal)

# Examine relationship
ggplot(data = mammal, aes(x = brain_weight, y = body_weight)) +
geom_point() +
geom_smooth(se = FALSE) +
theme_light() +
xlab("Brain weight (in g)") +
ylab("Body weight (in kg)")

The relationship is non-linear, and shows an exponential growth curve. Using the Rule of the Bulge mnemonic, we identify this curve with the lower right-hand quadrant. To help straighten this curve we can either: (1) re-express Y using a downward transformation, or (2) re-express X using an upward transformation.
Since there is only a single predictor, transforming Y is low-cost (it doesn't affect the relationship between Y and other predictors), whereas transforming X with an upward transformation means we would have to include more than one effect in the model (e.g., \(X\) and \(X^2\)).
Because of this I will transform Y using the log-transformation. Looking at the relationship between ln(body weight) and brain weight, we will see if this "straightened" the relationship.
ggplot(data = mammal, aes(x = brain_weight, y = log(body_weight))) +
geom_point() +
geom_smooth(se = FALSE) +
theme_light() +
xlab("Brain weight (in g)") +
ylab("ln(Body weight)")

The relationship is non-linear, and shows a decay version of the exponential growth curve. Using the Rule of the Bulge mnemonic, we identify this curve with the upper left-hand quadrant. To help straighten this curve we can either: (1) re-express Y using an upward transformation, or (2) re-express X using a downward transformation.
Since we just used a downward transformation on Y to fix the last relationship, using an upward transformation now would just re-introduce the initial problem. Because of this I will transform X using the log-transformation. Looking at the relationship between ln(body weight) and ln(brain weight), we will see if this "straightened" the relationship.
ggplot(data = mammal, aes(x = log(brain_weight), y = log(body_weight))) +
geom_point() +
geom_smooth(se = FALSE) +
theme_light() +
xlab("ln(Brain weight)") +
ylab("ln(Body weight)")

This relationship looks linear! So we can fit a linear model that uses ln(brain weight) to predict variation in ln(body weight). We can then use back-transformations and a plot of the fitted equation to interpret the coefficients in the model.
Fitting the linear model and looking at its output:
# Fit model
lm.1 = lm(log(body_weight) ~ 1 + log(brain_weight), data = mammal)
# Model-level output
glance(lm.1)

# Coefficient-level output
tidy(lm.1)

Interpreting this output:
The fitted equation is:
\[ \hat{\ln(\mathrm{Body~Weight})}_i = -2.51 + 1.22\bigg[\ln(\mathrm{Brain~Weight}_i)\bigg] \]
We can also back-transform these log entities to get a better interpretation of the coefficients. For the intercept, when log(brain weight) is 0, actual brain weight = 1 gram. Thus, mammals with a 1-gram brain weight have a predicted log(body weight) of \(-2.51\), on average. Exponentiating this value (\(e^{-2.51}=0.081\)), we can interpret the intercept as:
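The back-transformation of the intercept can be verified directly in R:

```r
# Back-transform the intercept from the log scale to the raw (kg) scale
exp(-2.51)   # ~0.081
```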
To consider the interpretation of the slope, we utilize the fact that log-transforming X (using the natural logarithm) leads to an interpretation in terms of percent change in X. As such, we choose a series of brain weights that differ by 1% and plug them into our fitted equation to get predicted log(body weights):
# Choose brain weights that differ by 1%
brain = c(100, 101, 102.01)

# Get predicted ln(body weight) values
-2.51 + 1.22 * log(brain)

[1] 3.108308 3.120447 3.132586

The interpretation is:
Here 0.012 is the slope coefficient divided by 100. Now let's transform the predicted log(body weight) values to raw body weights. To do this, we exponentiate these predicted values:

# Exponentiate the predicted values
exp(-2.51 + 1.22 * log(brain))

[1] 22.38313 22.65651 22.93322
This results in a constant multiplicative difference of 1.0122. Namely,
Or, interpreting this as a percent change:
This 1.22% change is essentially the slope coefficient from the fitted equation. Thus when we log-transform both X and Y using the natural logarithm, we can interpret both the change in X and change in Y as a percent change. In general:
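Under this rule, multiplying X by 1.01 multiplies the predicted Y by \(1.01^{b_1}\), which for small slopes is approximately \(1 + b_1/100\). A quick arithmetic check with the fitted slope:

```r
b1 = 1.22  # fitted slope from the log-log model

# Multiplying X by 1.01 multiplies the predicted Y by 1.01^b1
1.01^b1        # ~1.0122, i.e., roughly a 1.22% increase in Y

# Compare with the small-slope approximation
1 + b1 / 100   # 1.0122
```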
We can also plot the fitted curve to facilitate a graphical interpretation.
# Plot fitted curve
ggplot(data = mammal, aes(x = brain_weight, y = body_weight)) +
geom_point(alpha = 0.2) +
geom_function(fun = function(x){exp(-2.509 + 1.225*log(x))}) +
theme_light() +
xlab("Brain weight (in g)") +
ylab("Body weight (in kg)")