A brief introduction to cross-validation for model selection.
In the previous set of notes, we examined predictors of average life expectancy in a state, using data from states-2019.csv. We used several model building strategies and performance metrics to select and evaluate predictors. Unfortunately, we were evaluating the model/predictors using the same data that we used to build the model. This has several problems in practice, especially if you are using inferential metrics (p-values) to select the predictors: performance estimates computed on the same data used to build the model are overly optimistic (the model is overfit to that data), and p-values computed after data-driven selection no longer carry their nominal meaning.
It is better to use one set of data to fit your model(s) and a completely separate set of data to evaluate the models’ performance. Evaluating your models on a separate set of data from the data you initially fit the models on is referred to as cross-validation.
We will again use the states-2019.csv data to introduce methods of cross-validation. After importing the data, we will also standardize all the numeric variables.
# Load libraries
library(modelr)
library(patchwork)
library(tidyverse)
library(tidymodels) # Loads broom, rsample, parsnip, recipes, workflow, tune, yardstick, and dials
# Import and view data
usa = read_csv("https://raw.githubusercontent.com/zief0002/epsy-8264/master/data/states-2019.csv")
head(usa)
# A tibble: 6 × 9
state life_expectancy population income illiteracy murder hs_grad
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alabama 75.4 4.90 23.6 14.8 7.8 85.3
2 Alaska 78.8 0.732 33.1 9.2 8.2 92.4
3 Arizona 79.9 7.28 25.7 13.1 5.1 86.5
4 Arkansas 75.9 3.02 22.9 13.7 7.2 85.6
5 Califor… 81.6 39.5 30.4 23.1 4.4 82.5
6 Colorado 80.5 5.76 32.4 9.9 3.7 91.1
# … with 2 more variables: frost <dbl>, area <dbl>
# Create standardized variables after removing state names
z_usa = usa %>%
select(-state) %>%
scale(center = TRUE, scale = TRUE) %>%
data.frame()
From our previous notes, there were several candidate models that we may have been considering based on their AICc values. In this set of notes we will consider the best k-predictor models that had an AICc value within four of the minimum AICc value. These models include the following predictors: (1) income; (2) income and population; (3) income, population, and illiteracy; and (4) income, population, illiteracy, and murder.
In practice, we do not typically have two data sets at our disposal. Instead, we randomly split our sample data into two parts: a training set (used to fit the candidate models), and a validation set (used to evaluate the candidate models). Because we use a data set composed of different observations to evaluate the candidate models than we did to fit the candidate models, the resulting evaluation is not biased toward the data we used to fit the model (i.e., less overfit). Because of this, we get better indications about the generalizability of the models to out-of-sample data.
To do this we need to fit our candidate models using the training data. Then, we use the coefficients we obtained from the training data to “fit” the same models to the validation data to obtain fitted values or residuals. Using these fitted values/residuals (which are from the validation data), we can compute one or more measures of model performance for each candidate model, and use those to select a “best” candidate model. In cross-validation, the model performance metrics used are simply compared to determine the “best” model; there are no significance tests performed.
There are several metrics of model performance we can use to compare the candidate models. One very common metric is the residual variance (a.k.a., the mean square error). Because this is computed on the validation data rather than on the data used to fit the model, we refer to it as the Cross-Validated Mean Square Error (CV-MSE). This, again, is a measure of out-of-sample performance.
Summarizing this process for our example, the algorithm for performing the cross-validation is:

1. Randomly divide the z_usa data into two sets of observations: training and validation data sets.
2. Fit the candidate models to the training data.
3. Use the coefficients from the models fitted to the training data to obtain fitted values and residuals for the observations in the validation data.
4. Use the validation residuals to compute the CV-MSE for each candidate model.
5. Select the candidate model with the smallest CV-MSE.

Next, we will explore and carry out each of the steps in this algorithm to select a “best” model from our candidate models for the z_usa data.
There are many ways to do this, but here we will use the sample() function to randomly sample 35 cases (rows) from the z_usa data (about 2/3 of the data). These 35 cases will make up the training data. Then we will use the filter() function to select those cases. The remaining cases (those 17 observations not sampled) are put into a validation set of data.
# Make the random sampling replicable
set.seed(42)
# Select the cases to be in the training set
training_cases = sample(1:nrow(z_usa), size = 35, replace = FALSE)
# Create training data from the sampled cases
train = z_usa %>%
filter(row_number() %in% training_cases)
# Create validation data from the remaining cases
validate = z_usa %>%
filter(!row_number() %in% training_cases)
We now fit the four candidate models to the observations in the training data.
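Mirroring the model formulas used later in these notes, the fits look like the following (the -1 drops the intercept, which is reasonable here because all of the variables are standardized):
# Fit the four candidate models to the training data
lm.1 = lm(life_expectancy ~ -1 + income, data = train)
lm.2 = lm(life_expectancy ~ -1 + income + population, data = train)
lm.3 = lm(life_expectancy ~ -1 + income + population + illiteracy, data = train)
lm.4 = lm(life_expectancy ~ -1 + income + population + illiteracy + murder, data = train)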
Now we can obtain the predicted life expectancy for the observations in the validation data based on the coefficients from the models fitted to the training data. These vectors of fitted values can be used to compute the residuals for the validation observations, which in turn can be used to compute the CV-MSE, which is:
\[ \mathrm{CV\mbox{-}MSE} = \frac{1}{n} \sum_{i=1}^n e_i^2 \]
where n is the number of observations in the validation set. We can compute the CV-MSE for each of the candidate models.
# Get the predicted values for the validation data
yhat_1 = predict(lm.1, newdata = validate)
yhat_2 = predict(lm.2, newdata = validate)
yhat_3 = predict(lm.3, newdata = validate)
yhat_4 = predict(lm.4, newdata = validate)
# Compute the CV-MSE for each candidate model
sum((validate$life_expectancy - yhat_1) ^ 2) / nrow(validate)
[1] 0.7349516
sum((validate$life_expectancy - yhat_2) ^ 2) / nrow(validate)
[1] 0.7483797
sum((validate$life_expectancy - yhat_3) ^ 2) / nrow(validate)
[1] 0.8547222
sum((validate$life_expectancy - yhat_4) ^ 2) / nrow(validate)
[1] 0.8968336
Since the CV-MSE is a measure of model misfit, the candidate model having the smallest CV-MSE is the model that should be adopted. In this case, the values for the CV-MSE suggest adopting the single predictor model.
There are two major problems with the cross-validation we just performed.

1. The results depend on the random split. A different random division of the data into training and validation sets can produce different CV-MSE values, and potentially a different “best” model (see Figure 1).
2. Only a subset of the observations (the training set) is used to fit the candidate models. Because the models are fitted to fewer observations, the estimates of model performance tend to be more biased than if all of the data had been used.
Figure 1: Model that has the lowest CV-MSE for 10 random splits of the data.
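To see this variability for yourself, here is one possible sketch (not the code used to create Figure 1) that repeats the simple cross-validation for 10 different random splits and records which candidate model has the lowest CV-MSE each time:
# Vector to store the "best" model (number of predictors) for each split
best_model = rep(NA, 10)

for(j in 1:10){
  # Create a new random split into training and validation data
  training_cases = sample(1:nrow(z_usa), size = 35, replace = FALSE)
  train = z_usa %>% filter(row_number() %in% training_cases)
  validate = z_usa %>% filter(!row_number() %in% training_cases)

  # Fit the candidate models to the training data
  lm.1 = lm(life_expectancy ~ -1 + income, data = train)
  lm.2 = lm(life_expectancy ~ -1 + income + population, data = train)
  lm.3 = lm(life_expectancy ~ -1 + income + population + illiteracy, data = train)
  lm.4 = lm(life_expectancy ~ -1 + income + population + illiteracy + murder, data = train)

  # Compute the CV-MSE for each candidate model on the validation data
  cv_mse = c(
    sum((validate$life_expectancy - predict(lm.1, newdata = validate))^2) / nrow(validate),
    sum((validate$life_expectancy - predict(lm.2, newdata = validate))^2) / nrow(validate),
    sum((validate$life_expectancy - predict(lm.3, newdata = validate))^2) / nrow(validate),
    sum((validate$life_expectancy - predict(lm.4, newdata = validate))^2) / nrow(validate)
  )

  # Record which candidate model "won" for this split
  best_model[j] = which.min(cv_mse)
}

# The selected model often changes from split to split
best_model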
Leave-one-out cross-validation (LOOCV) is one method that can be used to overcome these issues. The algorithm for performing LOOCV is:

1. Hold out the ith observation as the validation data; the remaining n - 1 observations make up the training data.
2. Fit the candidate models to the training data.
3. Use the coefficients from these fits to compute the fitted value, residual, and \(\mathrm{CV\mbox{-}MSE}_i\) for the held-out observation.
4. Repeat Steps 1–3 for each of the n observations in the data.
We then have n estimates of the CV-MSE that can be averaged to get an overall estimate for the CV-MSE. That is,
\[ \mathrm{CV\mbox{-}MSE} = \frac{1}{n} \sum_{i=1}^n \mathrm{CV\mbox{-}MSE}_i \]
Since the validation dataset is composed of a single observation (\(n=1\)), \(\mathrm{CV\mbox{-}MSE}_i\) is simply the squared residual value for the validation observation.
\[ \begin{split} \mathrm{CV\mbox{-}MSE}_i &= \frac{1}{n} \sum_{i=1}^n e_i^2 \\[0.5em] &= \frac{1}{1} \sum_{i=1}^1 e_i^2 \\[0.5em] &= e_i^2 \end{split} \]

To carry out LOOCV, we will write a function that computes \(\mathrm{CV\mbox{-}MSE}_i\) for each model given a particular case.
# Function to compute CV-MSE for LOOCV
cv_mse_i = function(case_index){
# Create training and validation data sets
train = z_usa %>% filter(row_number() != case_index)
validate = z_usa %>% filter(row_number() == case_index)
# Fit models to training data
lm.1 = lm(life_expectancy ~ -1 + income, data = train)
lm.2 = lm(life_expectancy ~ -1 + income + population, data = train)
lm.3 = lm(life_expectancy ~ -1 + income + population + illiteracy, data = train)
lm.4 = lm(life_expectancy ~ -1 + income + population + illiteracy + murder, data = train)
# Compute fitted value for validation data
yhat_1 = predict(lm.1, newdata = validate)
yhat_2 = predict(lm.2, newdata = validate)
yhat_3 = predict(lm.3, newdata = validate)
yhat_4 = predict(lm.4, newdata = validate)
# Compute CV-MSE_i for each model
cv_mse_1 = (validate$life_expectancy - yhat_1) ^ 2
cv_mse_2 = (validate$life_expectancy - yhat_2) ^ 2
cv_mse_3 = (validate$life_expectancy - yhat_3) ^ 2
cv_mse_4 = (validate$life_expectancy - yhat_4) ^ 2
# Output a data frame
return(data.frame(cv_mse_1, cv_mse_2, cv_mse_3, cv_mse_4))
}
# Test function on Case 1
cv_mse_i(1)
cv_mse_1 cv_mse_2 cv_mse_3 cv_mse_4
1 2.08481 1.988258 1.383972 1.297868
Then we can use this function to compute the four \(\mathrm{CV\mbox{-}MSE}_i\) values for each case in the data. We set up a data frame that includes a column called case that contains each case number in the data. Then we use rowwise() to operate on each row separately and mutate() a new column by applying our cv_mse_i() function using map(). This creates a list column, in which each cell of the column contains a list; in our case, a data frame with four columns. Finally, we use unnest() to take the contents of the list column and spread them out into separate columns.
# Apply cv_mse_i() function to all cases
my_cv_mse = data.frame(case = 1:52) %>%
rowwise() %>%
mutate(
cv_mse = map(case, cv_mse_i) #New list column that includes the data frame of output
) %>%
unnest(cols = cv_mse) #Turn list column into multiple columns
# View output
my_cv_mse
# A tibble: 52 × 5
case cv_mse_1 cv_mse_2 cv_mse_3 cv_mse_4
<int> <dbl> <dbl> <dbl> <dbl>
1 1 2.08 1.99 1.38 1.30
2 2 0.151 0.0504 0.0681 0.00113
3 3 0.870 0.816 0.956 0.889
4 4 1.20 1.02 0.629 0.616
5 5 2.10 0.773 0.657 0.562
6 6 0.425 0.456 0.315 0.331
7 7 0.0544 0.114 0.0310 0.0537
8 8 0.136 0.0471 0.0230 0.0143
9 9 4.62 3.83 1.80 1.24
10 10 0.820 0.275 0.531 0.450
# … with 42 more rows
Alternatively, we could also have used a for() loop to carry out the LOOCV. Here is some example syntax (not run):
# Set up empty vectors to store the results
mse_1 = rep(NA, 52)
mse_2 = rep(NA, 52)
mse_3 = rep(NA, 52)
mse_4 = rep(NA, 52)
# Loop through the cross-validation
for(i in 1:nrow(z_usa)){
train = z_usa %>% filter(row_number() != i)
validate = z_usa %>% filter(row_number() == i)
lm.1 = lm(life_expectancy ~ -1 + income, data = train)
lm.2 = lm(life_expectancy ~ -1 + income + population, data = train)
lm.3 = lm(life_expectancy ~ -1 + income + population + illiteracy, data = train)
lm.4 = lm(life_expectancy ~ -1 + income + population + illiteracy + murder, data = train)
yhat_1 = predict(lm.1, newdata = validate)
yhat_2 = predict(lm.2, newdata = validate)
yhat_3 = predict(lm.3, newdata = validate)
yhat_4 = predict(lm.4, newdata = validate)
mse_1[i] = (validate$life_expectancy - yhat_1) ^ 2
mse_2[i] = (validate$life_expectancy - yhat_2) ^ 2
mse_3[i] = (validate$life_expectancy - yhat_3) ^ 2
mse_4[i] = (validate$life_expectancy - yhat_4) ^ 2
}
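If the loop approach were used instead, the average CV-MSE for each candidate model could then be computed directly from these vectors (also not run):
# Average CV-MSE for each candidate model from the loop results
c(mean(mse_1), mean(mse_2), mean(mse_3), mean(mse_4))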
We can now compute the average CV-MSE for each of the candidate models and select the candidate model with the lowest average CV-MSE.
# Compute average CV-MSE
my_cv_mse %>%
select(-case) %>%
summarize_all(mean)
# A tibble: 1 × 4
cv_mse_1 cv_mse_2 cv_mse_3 cv_mse_4
<dbl> <dbl> <dbl> <dbl>
1 0.832 0.807 0.769 1.12
The LOOCV results suggest that we adopt the three-predictor model; it has the smallest average CV-MSE. Since the LOOCV method is less biased than simple cross-validation (it trains the models on a much larger set of observations), its results are more believable than those from the simple cross-validation. Another advantage of LOOCV is that it will always produce the same results (as opposed to simple cross-validation), since there is no randomness in how the training and validation data sets are produced.
LOOCV can be computationally expensive when your sample is very large; we have to fit the candidate models n times. It turns out, however, that for models fit with OLS we can obtain the LOOCV results from a single fit to the full data using the following formula:
\[ \mathrm{CV\mbox{-}MSE} = \frac{1}{n} \sum_{i=1}^n \bigg(\frac{e_i}{1-h_{ii}}\bigg)^2 \]
where \(e_i = y_i - \hat{y}_i\) is the ith residual from the original least squares fit (the model fitted to all n observations), and \(h_{ii}\) is the leverage value for the ith observation. This is similar to the model (biased) MSE, except that the ith residual is divided by \(1 - h_{ii}\).
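As a quick sketch of this shortcut, the base R functions resid() and hatvalues() give the residuals and leverage values from a single fit, so the LOOCV estimate for, say, the best three-predictor model can be computed without re-fitting the model n times; this should reproduce (up to rounding) the average CV-MSE of 0.769 obtained earlier.
# LOOCV CV-MSE via the shortcut formula (no re-fitting required)
lm.3 = lm(life_expectancy ~ -1 + income + population + illiteracy, data = z_usa)
mean((resid(lm.3) / (1 - hatvalues(lm.3))) ^ 2)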
LOOCV is a very general method and can be used with any kind of modeling. For example, we could use it with logistic regression, mixed-effects models, or any of the methods you have encountered in your statistics courses to date. That being said, the shortcut formula does not hold for all these methods, so outside of OLS regression the candidate models will actually need to be fitted n times. This can, as pointed out earlier, be quite computationally expensive.
One alternative to LOOCV is k-fold cross-validation. The algorithm for k-fold cross-validation is:

1. Randomly divide the data into k groups, or folds, of roughly equal size.
2. Hold out one fold as the validation data and fit the candidate models to the observations in the remaining k - 1 folds.
3. Compute \(\mathrm{CV\mbox{-}MSE}_i\) for the held-out fold.
4. Repeat Steps 2–3 until each of the k folds has served as the validation data once.
We then have k estimates of the MSE that can be averaged to get the overall CV-MSE.
\[ \overline{\mathrm{CV\mbox{-}MSE}} = \frac{1}{k} \sum_{i=1}^k \mathrm{CV\mbox{-}MSE}_i \]
From this algorithm, it is clear that LOOCV is a special case of k-fold cross-validation in which k is set to equal n. In practice we typically use \(k=5\) or \(k=10\).
We can carry out \(k\)-fold cross-validation by using the crossv_kfold() function from the {modelr} package. This function takes the argument k= to indicate the number of folds. Here is some example syntax to carry out a 10-fold cross-validation:
# Set seed for reproducible results
set.seed(100)
# Divide data into 10 folds
my_cv = z_usa %>%
crossv_kfold(k = 10)
# View output
my_cv
# A tibble: 10 × 3
train test .id
<named list> <named list> <chr>
1 <resample [46 x 8]> <resample [6 x 8]> 01
2 <resample [46 x 8]> <resample [6 x 8]> 02
3 <resample [47 x 8]> <resample [5 x 8]> 03
4 <resample [47 x 8]> <resample [5 x 8]> 04
5 <resample [47 x 8]> <resample [5 x 8]> 05
6 <resample [47 x 8]> <resample [5 x 8]> 06
7 <resample [47 x 8]> <resample [5 x 8]> 07
8 <resample [47 x 8]> <resample [5 x 8]> 08
9 <resample [47 x 8]> <resample [5 x 8]> 09
10 <resample [47 x 8]> <resample [5 x 8]> 10
The output of crossv_kfold() is a tibble/data frame with three columns: train, test, and .id. The first two columns are named lists of resample objects. These objects contain references to the row numbers used in the training and test data sets.
# View rows used in first fold (train set)
my_cv$train[[1]]
<resample [46 x 8]> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...
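If you need to see the actual observations in a fold, a resample object can be coerced to a regular data frame; for example:
# Coerce the first training resample into a data frame
my_cv$train[[1]] %>% as.data.frame() %>% head()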
Then we will use the map() and map2_dbl() functions from the {purrr} package to fit a model to the training (train) data and find the MSE on the validation (test) data created from the crossv_kfold() function. We will have to carry this out for each of the candidate models.
For more detailed information about using the purrr functions, see Jenny Bryan’s fantastic purrr tutorial.
# Best 1-predictor model
cv_1 = my_cv %>%
mutate(
model = map(train, ~lm(life_expectancy ~ 1 + income, data = .)),
MSE = map2_dbl(model, test, modelr::mse),
k = 1
)
# Best 2-predictor model
cv_2 = my_cv %>%
mutate(
model = map(train, ~lm(life_expectancy ~ 1 + income + population, data = .)),
MSE = map2_dbl(model, test, modelr::mse),
k = 2
)
# Best 3-predictor model
cv_3 = my_cv %>%
mutate(
model = map(train, ~lm(life_expectancy ~ 1 + income + population + illiteracy, data = .)),
MSE = map2_dbl(model, test, modelr::mse),
k = 3
)
# Best 4-predictor model
cv_4 = my_cv %>%
mutate(
model = map(train, ~lm(life_expectancy ~ 1 + income + population + illiteracy + murder, data = .)),
MSE = map2_dbl(model, test, modelr::mse),
k = 4
)
Once we have the results, we can stack these into a single data frame and then use group_by() and summarize() to obtain the CV-MSE estimates.
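One way to carry this out is sketched below (selecting only the k and MSE columns before stacking):
# Stack the fold-level results and compute the CV-MSE for each candidate model
rbind(
  cv_1 %>% select(k, MSE),
  cv_2 %>% select(k, MSE),
  cv_3 %>% select(k, MSE),
  cv_4 %>% select(k, MSE)
) %>%
  group_by(k) %>%
  summarize(cv_mse = mean(MSE))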
# A tibble: 4 × 2
k cv_mse
<dbl> <dbl>
1 1 0.885
2 2 0.859
3 3 0.830
4 4 1.22
The results of carrying out the 10-fold cross-validation suggest that we adopt the best two- or three-predictor model.
Using k-fold cross-validation is computationally less expensive so long as \(k<n\). But this computational gain has a cost in that the results are again dependent on the k random splits. This variation, however, is less than that in the single-split simple cross-validation. We can also alleviate some of this by carrying out the k-fold cross-validation several times and averaging across the results to get the CV-MSE estimate.
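As a sketch of this idea, the following repeats the 10-fold cross-validation five times (an arbitrary choice) for the best three-predictor model, using the same {modelr}/{purrr} approach as above, and then averages the resulting CV-MSE estimates:
# Repeat the 10-fold cross-validation five times for the 3-predictor model
cv_mse_reps = rep(NA, 5)

for(r in 1:5){
  cv_mse_reps[r] = z_usa %>%
    crossv_kfold(k = 10) %>%
    mutate(
      model = map(train, ~lm(life_expectancy ~ 1 + income + population + illiteracy, data = .)),
      MSE = map2_dbl(model, test, modelr::mse)
    ) %>%
    pull(MSE) %>%
    mean()
}

# Average across the five repetitions
mean(cv_mse_reps)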
There are several other R packages that will also fit a k-fold cross-validation (e.g., {DAAG}).
The different cross-validation methods suggest we adopt the following candidate models:
| Method | Model |
|---|---|
| FS (t-value) | Income, Population, Illiteracy, Murder |
| FS (AIC) | Income, Population, Illiteracy |
| BE (R2) | Income, Population, Illiteracy |
| All Subsets (AICc) | Income, Population, Illiteracy |
| Simple CV | Income |
| Leave-One-Out CV | Income, Population, Illiteracy |
| 10-Fold CV | Income, Population, Illiteracy |
While the different CV methods can suggest different models, this is usually less problematic when the sample size is larger; we only have 52 observations in the z_usa data.
The LOOCV and k-fold CV are better suited for considering how the model will perform on future data sets, so, here, I would adopt the model that includes income, population, and illiteracy rate. To get the coefficient estimates, we fit whichever model we adopt to the FULL data set (with all the observations) as this will give us the “best” estimates.
# Fit model
lm.3 = lm(life_expectancy ~ -1 + income + population + illiteracy, data = z_usa)
# Get model-level output
glance(lm.3)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.349 0.310 0.823 8.77 0.0000924 3 -62.1 132.
# … with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
# Get coefficient-level output
tidy(lm.3)
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 income 0.488 0.115 4.23 0.000102
2 population 0.392 0.145 2.70 0.00944
3 illiteracy -0.310 0.145 -2.13 0.0379
When we report the results of a regression model evaluated using cross-validation, there are some subtle differences in what is reported. At the model-level, we report the \(R^2\) from the adopted model fitted to the full data. We also report the CV-MSE as a measure of the model error for future observations (the generalization error).
At the coefficient-level we typically report the coefficient estimates and standard errors based on fitting the adopted model to the full data. However, we DO NOT REPORT OR INTERPRET P-VALUES. The p-values do not take into account that the model was selected using cross-validation, and because of this they are incredibly misleading. As with any model, interpretations of the coefficients should also be offered, typically facilitated by a plot of the fitted model. For example:
The regression model using income, population, and illiteracy rate to predict variation in life expectancy was selected using 10-fold cross-validation (CV-MSE = 0.830). The model explains 34.9% of the variation in life expectancies. The fitted equation is:
\[ \begin{split} \hat{\mathrm{Life~Expectancy}}_i = &0.488(\mathrm{Income}_i) + 0.392(\mathrm{Population}_i) \\ &- 0.310(\mathrm{Illiteracy~Rate}_i) \end{split} \]
where the outcome and all predictors are standardized. This model suggests moderate, positive, partial effects of both income and population, and a moderate negative effect of illiteracy rate on life expectancy.
In EPsy 8252, we learned about using information criteria for model selection. In particular, we used AIC, BIC, and AICc for model selection. It turns out that AIC-based model selection and cross-validation are asymptotically equivalent (Stone, 1977). Shao (1997) also proved that, for linear models, using BIC for model selection is asymptotically equivalent to leave-v-out cross-validation when
\[ v = n\bigg(1 - \frac{1}{\ln (n) - 1}\bigg) \]
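For example, with the n = 52 observations in the z_usa data, this works out to leaving out roughly 34 observations in each validation set:
# Leave-v-out value for the z_usa data (n = 52)
n = 52
n * (1 - 1 / (log(n) - 1))   # roughly 34.4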
As a practical note, computer simulations have suggested that the results from using AICc for model selection will, on average, be quite similar to those from cross-validation techniques. As such, AICc can be a useful alternative to the computationally expensive cross-validation methods. However, using AICc does not provide a measure of the model performance (MSE) in new datasets like cross-validation does.
# Fit models to all data
lm.1 = lm(life_expectancy ~ -1 + income, data = z_usa)
lm.2 = lm(life_expectancy ~ -1 + income + population, data = z_usa)
lm.3 = lm(life_expectancy ~ -1 + income + population + illiteracy, data = z_usa)
lm.4 = lm(life_expectancy ~ -1 + income + population + illiteracy + murder, data = z_usa)
# Load library
library(AICcmodavg)
# Get AICc for all models
aictab(
cand.set = list(lm.1, lm.2, lm.3, lm.4),
modnames = c("Best 1-Predictor", "Best 2-Predictor", "Best 3-Predictor", "Best 4-Predictor")
)
Model selection based on AICc:
K AICc Delta_AICc AICcWt Cum.Wt LL
Best 3-Predictor 4 133.05 0.00 0.48 0.48 -62.10
Best 4-Predictor 5 134.25 1.20 0.26 0.74 -61.47
Best 2-Predictor 3 135.32 2.27 0.15 0.89 -64.41
Best 1-Predictor 2 136.02 2.97 0.11 1.00 -65.89
The model with the lowest AICc is, indeed, the three-predictor model. There is also some evidence for the four-predictor model given the AICc weight.