9 Information Criteria and Model Selection

In this chapter, you will use information theoretic approaches (e.g., information criteria) to select one (or more) empirically supported model from a set of candidate models. To do so, we will use the mn-schools.csv dataset to examine if (and how) academic “quality” of the student-body (measured by SAT score) is related to institutional graduation rate.

# Load libraries
library(AICcmodavg)
library(broom)
library(educate)
library(patchwork)
library(tidyverse)
library(tidyr)

# Import data
mn = read_csv(file = "https://raw.githubusercontent.com/zief0002/adv-modeling/main/data/mn-schools.csv")

# View data
mn

# A tibble: 33 × 6
      id name                               grad public   sat tuition
   <dbl> <chr>                             <dbl>  <dbl> <dbl>   <dbl>
 1     1 Augsburg College                   65.2      0  10.3    39.3
 2     3 Bethany Lutheran College           52.6      0  10.6    30.5
 3     4 Bethel University, Saint Paul, MN  73.3      0  11.4    39.4
 4     5 Carleton College                   92.6      0  14      54.3
 5     6 College of Saint Benedict          81.1      0  11.8    43.2
 6     7 Concordia College at Moorhead      69.4      0  11.4    36.6
 7     8 Concordia University-Saint Paul    47.9      0   9.9    37.8
 8     9 Crossroads College                 26.9      0   9.7    25.3
 9    10 Crown College                      51.3      0  10.3    33.2
10    11 Gustavus Adolphus College          81.7      0  12.2    43.8
# ℹ 23 more rows

We will re-visit the following research questions:

Is there an effect of student-body quality (median SAT score) on graduation rates?
Is the effect of student-body quality on graduation rates non-linear?
Does the effect of student-body quality on graduation rates persist after controlling for differences in institutional sector?
Is the effect of student-body quality on graduation rates different for public and private institutions?

9.1 Working Hypotheses and Candidate Models

Let’s begin by considering three potential models that explain variation in graduation rates:

\[ \begin{split} \mathbf{Model~0:~} \quad \mathrm{Graduation~Rate}_i = &\beta_0 + \epsilon_i \\[2ex] \mathbf{Model~1:~} \quad \mathrm{Graduation~Rate}_i = &\beta_0 + \beta_1(\mathrm{Median~SAT}_i) + \epsilon_i \\[2ex] \mathbf{Model~2:~} \quad \mathrm{Graduation~Rate}_i = &\beta_0 + \beta_1(\mathrm{Median~SAT}_i) + \beta_2(\mathrm{Median~SAT}^2_i) + \epsilon_i \\ \end{split} \]

where all three models have \(\epsilon_i\overset{i.i.d.}{\sim}\mathcal{N}(0,\sigma^2_{\epsilon})\).

# Fit candidate models
lm.0 = lm(grad ~ 1, data = mn)
lm.1 = lm(grad ~ 1 + sat, data = mn)
lm.2 = lm(grad ~ 1 + sat + I(sat^2), data = mn)

Each of these models correspond to a different scientific working hypothesis about what explains institutional differences in graduation rates:

Model 0 represents the working hypothesis (H0:) that graduation rates are not a function of anything systematic and are just a fucntion of the individual institution.
Model 1 represents the working hypothesis (H1:) that graduation rates are a linear function of median SAT scores.
Model 2 represents the working hypothesis (H2:) that graduation rates are a non-linear function of median SAT scores.

These working hypotheses are typically created from the theory and previous empirical work in a substantive area. They need to be translated into a statistical models, which can be quite difficult, especially if there is not a lot of theory to guide this translation.

One method for choosing among a set of candidate models is to pick based on the residuals. In this method, we adopt the model for which the data best meet the assumptions, and employing the principle of parsimony—adopting the more parsimonious model when there are multiple models that produce equally “good” residuals. Additionally, we can use the likelihood ratio test (LRT) to help make this decision. This method provides additional empirical evidence about which model is supported when the candidate models are nested.

Unfortunately, we cannot always use the LRT to select models, since there are many times when the set of candidate models are not all nested. Information criteria give us metrics to compare the empirical support for models whether they are nested or not. In the remainder of these notes, we will examine several of the more popular information criteria metrics used for model selection.

9.2 Akiake’s Information Criteria (AIC)

The AIC metric is calculated by adding a penalty term to the model’s deviance (\(-2\ln(\mathrm{Likelihood})\)).

\[ \begin{split} AIC &= \mathrm{Deviance} + 2k \\[1ex] &= -2\ln(\mathcal{L}) + 2k \end{split} \]

where k is the number of parameters (including the residual standard error) being estimated in the model. (Recall that the value for k is given as df in the logLik() output.)

Remember that deviance is similar to error (it measures model-data misfit), and so models that have lower values of deviance are more empirically supported. The problem with deviance, however, is that more complex models tend to have smaller deviance values and are, therefore, more supported than their simpler counterparts. Unfortunately, these complex models tend not to generalize as well as simpler models (this is called overfitting). The AIC metric’s penalty term penalizes the deviance more when a more complex model is fitted; it is offsetting the lower degree of model-data misfit in complex models by increasing the ‘misfit’ based on the number of parameters.

This penalty-adjusted measure of ‘misfit’ is called Akiake’s Information Criteria (AIC). We compute the AIC for each of the candidate models and the model with the lowest AIC is selected. This model, we say, is the candidate model with the most empirical support given the data and the set of candidate models evaluated. Below we compute the AIC for the first candidate model.

# Compute AIC for Model 1
# logLik(lm.1)
-2*-113.5472 + 2*3

[1] 233.0944

We could also use the AIC() function to compute the AIC value directly. Here we compute the AIC for each of the candidate models.

# Model 0
AIC(lm.0)

[1] 282.5998

# Model 1
AIC(lm.1)

[1] 233.0944

# Model 2
AIC(lm.2)

[1] 227.1166

Based on the AIC values, the candidate model with the most empirical evidence, given the data and the set of candidate models evaluated, is Model 2; it has the lowest AIC.

Lastly, we note that the AIC value is produced as a column in the model-level output from the glance() function. (Note that the df column from glance() does NOT give the number of model parameters.)

# Model-level output for H2
glance(lm.2) |>
  print(width = Inf)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.835         0.824  7.02      76.0 1.81e-12     2  -110.  227.  233.
  deviance df.residual  nobs
     <dbl>       <int> <int>
1    1478.          30    33

9.2.1 Empirical Support is for the Working Hypotheses

Because the models are proxies for the scientific working hypotheses, the AIC ends up being a measure of empirical support for any particular working hypothesis. Using the AIC, we can rank order the models (and subsequently the working hypotheses) based on their level of empirical support. Ranked in order of empirical support, the three scientific working hypotheses are:

H2: Graduation rates are a non-linear function of median SAT scores.
H1: Graduation rates are a linear function of median SAT scores.
H0: Graduation rates are not a function of anything systematic and are just a fucntion of the individual institution.

It is important to remember that the phrase given the data and the set of candidate models evaluated is highly important. The ranking of models/working hypotheses is a relative ranking of the models’ level of empirical support contingent on the candidate models we included in the comparison and the data we used to compute the AIC.

As such, this method is not able to rank any hypotheses that you did not consider as part of the candidate set of scientific working hypotheses. Moreover, the AIC is a direct function of the likelihood which is based on the actual model fitted as a proxy for the scientific working hypothesis. If the predictors used in any of the models had been different, it would lead to different likelihood and AIC values, and potentially a different rank ordering of the hypotheses.

The ranking of models is also based on the data we have. If we had different ranges of the data or a different set of variables, the evidence might support a different model. This is very important. Model 2 is the most empirically supported candidate model GIVEN the three candidate models we compared and the data we used to compute the AIC metric.

The model selection and ranking is contingent on both the set of candidate models you are evaluating, and the data you have.

Based on the AIC values for the three candidate models we ranked the hypotheses based on the amount of empirical support:

Code

cand_models = data.frame(
  Model = c("Model 2", "Model 1", "Model 0"),
  k = c(4, 3, 2),
  AIC = c(AIC(lm.2), AIC(lm.1), AIC(lm.0))
) 

cand_models |>
  gt() |>
  cols_align(
    columns = c(Model),
    align = "left"
  ) |>
  cols_align(
    columns = c(k, AIC),
    align = "center"
  ) |>
  cols_label(
    Model = md("*Model*"),
    k = md("*k*"),
    AIC = md("*AIC*")
  ) |>
  tab_options(
    table.width = pct(40)
  )

Table 9.1: Models rank-ordered by the amount of empirical support as measured by the AIC.

Model	k	AIC
Model 2	4	227.1166
Model 1	3	233.0944
Model 0	2	282.5998

9.3 Corrected AIC (AICc): Adjusting for Model Complexity and Sample Size

Although AIC has a penalty correction that should account for model complexity, it turns out that when the number of parameters is large relative to the sample size, AIC is still biased in favor of models that have more parameters. This led Hurvich & Tsai (1989) to propose a second-order bias corrected AIC measure (AICc) computed as:

\[ \mathrm{AIC_c} = \mathrm{Deviance} + 2k\left( \frac{n}{n - k - 1} \right) \]

where k is, again, the number of estimated parameters (including residual standard error), and n is the sample size used to fit the model. Note that when n is very large (especially relative to k) that the last term is essentially 1 and the AICc value would basically reduce to the AIC value. When n is small relative to k this will add more of a penalty to the deviance.

The statistical recommendation is to pretty much always use AICc rather than AIC when selecting models.

Below, we will compute the AICc for the Model 2. (Note that we use \(n=33\) cases for the computation for all the models in this data.)

n = 33
k = 4

# Compute AICc for linear hypothesis
-2 * logLik(lm.2)[[1]] + 2 * k * (n / (n - k - 1))

[1] 228.5452

In practice, we will use the AICc() function from the {AICcmodavg} package to compute the AICc value directly. Here we compute the AICc for the four candidate models.

# Model 0
AICc(lm.0)

[1] 282.9998

# Model 1
AICc(lm.1)

[1] 233.922

# Model 2
AICc(lm.2)

[1] 228.5452

Based on the AICc values, the model with the most empirical support given the data and three candidate models is, again, Model 2. Using these results, we can, again, rank order the models (which correspond to working hypotheses) based on the empirical support for each.

Code

cand_models = data.frame(
  Model = c("Model 2", "Model 1", "Model 0"),
  k = c(4, 3, 2),
  AIC = c(AIC(lm.2), AIC(lm.1), AIC(lm.0)),
  AICc = c(AICc(lm.2), AICc(lm.1), AICc(lm.0))
)


cand_models |>
  gt() |>
  cols_align(
    columns = c(Model),
    align = "left"
  ) |>
  cols_align(
    columns = c(k, AICc),
    align = "center"
  ) |>
  cols_label(
    Model = md("*Model*"),
    k = md("*k*"),
    AICc = md("*AICc*")
  ) |>
  tab_options(
    table.width = pct(40)
  )

Table 9.2: Models rank-ordered by the amount of empirical support as measured by the AICc.

Model	k	AIC	AICc
Model 2	4	227.1166	228.5452
Model 1	3	233.0944	233.9220
Model 0	2	282.5998	282.9998

9.4 Quantifying Model-Selection Uncertainty

When we adopt one model over another, we are introducing some degree of uncertainty into the scientific process, in this case model selection uncertainty. This is true whether you are selecting models based on p-values or using information criteria. It would be nice if we can quantify and report this uncertainty, and this is the real advantage of using information criteria for model selection; it allows us to quantify the uncertainty we have when we select any particular candidate model.

The amount of model selection uncertainty we have depends on the amount of empirical support each of the candidate models has. For example, if one particular candidate model has a lot of empirical support and the rest have very little empirical support we would have less model selection uncertainty than if all of the candidate models had about the same amount of empirical support.

Since we quantify the empirical support for each model/working hypothesis by computing the AICc, we can also quantify how much more empirical support the most supported hypothesis has relative to each of the other working hypotheses by computing the difference in AICc values between the model with the most empirical support and each of the other candidate models. This measure is referred to as \(\Delta\)AICc.

In our example, the hypothesis with the most empirical support was Model 2. The \(\Delta\)AICc values can then be computed for each candidate model by subtracting the AICc for Model 2 from the AICc for that candidate model.

# Compute delta AICc value for Model 0
AICc(lm.0) - AICc(lm.2)

[1] 54.45462

# Compute delta AICc value for Model 1
AICc(lm.1) - AICc(lm.2)

[1] 5.37686

# Compute delta AICc value for Model 2
AICc(lm.2) - AICc(lm.2)

[1] 0

Code

cand_models = cand_models |>
  mutate(
    delta_a = AICc - AICc(lm.2)
  )

cand_models |>
  gt() |>
  cols_align(
    columns = c(Model),
    align = "left"
  ) |>
  cols_align(
    columns = c(k, AICc, delta_a),
    align = "center"
  ) |>
  cols_label(
    Model = md("*Model*"),
    k = md("*k*"),
    AICc = md("*AICc*"),
    delta_a = html("&#916;AICc")
  ) |>
  tab_options(
    table.width = pct(50)
  )

Table 9.3: Models rank-ordered by the amount of empirical support as measured by the AICc. Other evidence includes the ΔAICc for each model.

Model	k	AIC	AICc	ΔAICc
Model 2	4	227.1166	228.5452	0.00000
Model 1	3	233.0944	233.9220	5.37686
Model 0	2	282.5998	282.9998	54.45462

Burnham et al. (2011, p. 25) give rough guidelines for interpreting \(\Delta\)AICc values. They suggest that hypotheses with \(\Delta\)AICc values less than 2 are plausible, those in the range of 4–7 have some empirical support, those in the range of 9–11 have relatively little support, and those greater than 13 have essentially no empirical support. Using these criteria:

Model 2 is plausible.
Model 1 has some empirical support.
Model 0 has virtually no empirical support.

This implies that we would have a fair amount of certainty actually selecting Model 1 and 2 over the Model 0. There is a little uncertainty adopting Model 2 over Model 1. When multiple models are supported by the empirical evidence it suggests that more than one model would need to be adopted since the empirical evidence wouldn’t really differentiate which model is “better”.

In our example, we can safely say that there does seem to be an effect of median SAT score on graduation rates. (This answers our first research question!) We are still a little uncertain whether that effect is linear or non-linear, but the evidence is pointing in the direction of non-linearity.

9.5 Relative Likelihood and Evidence Ratios

One way we can mathematically formalize the strength of evidence for each model is to compute the relative likelihood. The relative likelihood provides the likelihood of each of the candidate models, given the set of candidate models and the data. To compute the relative likelihood, we use:

\[ \mathrm{Relative~Likelihood} = e ^ {−\frac{1}{2} (\Delta AICc)} \]

# Relative likelihood for Model 0
exp(-1/2 * 54.45462)

[1] 1.497371e-12

# Relative likelihood for Model 1
exp(-1/2 * 5.37686)

[1] 0.0679876

# Relative likelihood for Model 2
exp(-1/2 * 0)

[1] 1

Code

cand_models = cand_models |>
  mutate(
    rel_lik = exp(-1/2 * delta_a)
  )

cand_models |>
  gt() |>
  cols_align(
    columns = c(Model),
    align = "left"
  ) |>
  cols_align(
    columns = c(k, AICc, delta_a, rel_lik),
    align = "center"
  ) |>
  cols_label(
    Model = md("*Model*"),
    k = md("*k*"),
    AICc = md("*AICc*"),
    delta_a = html("&#916;AICc"),
    rel_lik = html("Rel(&#8466;)")
  ) |>
  tab_options(
    table.width = pct(70)
  ) |>
  tab_footnote(
    footnote = html("Rel(&#8466;) = Relative likelihood. Values are displayed using scientific notation."),
    locations = cells_column_labels(columns = rel_lik)
  )

Table 9.4: Models rank-ordered by the amount of empirical support as measured by the AICc. Other evidence includes the ΔAICc, and relative likelihood for each model.

Model	k	AIC	AICc	ΔAICc	Rel(ℒ)¹
Model 2	4	227.1166	228.5452	0.00000	1.000000e+00
Model 1	3	233.0944	233.9220	5.37686	6.798761e-02
Model 0	2	282.5998	282.9998	54.45462	1.497373e-12
¹ Rel(ℒ) = Relative likelihood. Values are displayed using scientific notation.

The relative likelihood values allow us to compute evidence ratios, which are evidentiary statements for comparing any two model or scientific hypotheses. Evidence ratios quantify how much more empirical support one model/hypothesis has versus another. To obtain an evidence ratio, we divide the relative likelihood for any two models/hypotheses. As an example, the evidence ratio comparing Model 2 to Model 0 is computed as:

# ER comparing Model 2 to Model 0
1 / 1.497373e-12

[1] 667836270589

That is,

The empirical support for Model 2 is 667,836,270,589 (667 billion) times that of the empirical support for Model 0.

We can similarly compute the evidence ratio to make an evidentiary statement comparing the empirical support for any two models. For example to compare the empirical support for Model 2 versus Model 1:

# ER comparing Model 2 to Model 1
1 / 6.798761e-02

[1] 14.70856

The empirical support for Model 2 is 14.71 times that of the empirical support for Model 1. This gives us a more quantitative lens to view the the empirical evidence for the non-linear vs the linear effect of median SAT score on graduation rates. While there is unceratinty in the model selection, knowing that the empirical support for the non-linear effect is almost 15 times that of the empirical support for the linear effect helps us think about that model selection uncertainty a bit differently than just saying there is some support for Model 1.

9.6 Model Probabilities

Another method of quantifying the relative empirical support for each working hypothesis is to compute the model probability for each of the candidate models. Also referred to as an Akaike Weight (\(w_i\)), a model probability provides a numerical measure of the probability of each model given the data and the candidate set of models. It can be computed as:

\[ w_i = \frac{\mathrm{Relative~Likelihood~for~Model~J}}{\sum_j \mathrm{All~Relative~Likelihoods}} \]

Here we compute the model probability for all three candidate models.

# Compute sum of relative likelihoods
sum_rel = 1.497373e-12 + 6.798761e-02 + 1

# Model probability for Model 0
1.497373e-12 / sum_rel

[1] 1.402051e-12

# Model probability for Model 1
6.798761e-02 / sum_rel

[1] 0.06365955

# Model probability for Model 2
1 / sum_rel

[1] 0.9363405

We also add these to our table of model evidence.

Code

cand_models = cand_models |>
  mutate(
    model_prob = rel_lik / sum(rel_lik)
  )

cand_models |>
  gt() |>
  cols_align(
    columns = c(Model),
    align = "left"
  ) |>
  cols_align(
    columns = c(k, AICc, delta_a, rel_lik, model_prob),
    align = "center"
  ) |>
  cols_label(
    Model = md("*Model*"),
    k = md("*k*"),
    AICc = md("*AICc*"),
    delta_a = html("&#916;AICc"),
    rel_lik = html("Rel(&#8466;)"),
    model_prob = md("*AICc Wt.*")
  ) |>
  tab_options(
    table.width = pct(75)
  ) |>
  tab_footnote(
    footnote = html("Rel(&#8466;) = Relative likelihood. Values are displayed using scientific notation."),
    locations = cells_column_labels(columns = rel_lik)
  )

Table 9.5: Models rank-ordered by the amount of empirical support as measured by the AICc. Other evidence includes the ΔAICc, relative likelihood, and model probability (AICc weight) for each model.

Model	k	AIC	AICc	ΔAICc	Rel(ℒ)¹	AICc Wt.
Model 2	4	227.1166	228.5452	0.00000	1.000000e+00	9.363405e-01
Model 1	3	233.0944	233.9220	5.37686	6.798761e-02	6.365955e-02
Model 0	2	282.5998	282.9998	54.45462	1.497373e-12	1.402051e-12
¹ Rel(ℒ) = Relative likelihood. Values are displayed using scientific notation.

Since the models are proxies for the working hypotheses, the model probabilities can be used to provide probabilities of each working hypothesis as a function of the empirical support. Given the data and the candidate set of working hypotheses:

The probability of the zeroth working scientific hypothesis is 0.0000000000014.
The probability of the first working scientific hypothesis is 0.064.
The probability of the second working scientific hypothesis is 0.936.

This suggests that the second working scientific hypothesis is the most probable. There is a tiny bit of e,mpirical support for Hypothesis 1, and essentially no support for Hypotheses 0 (it has some non-negligible probability). Again, this confirms that we are very certain that there is an effect of median SAT score on graduation rates. And we are pretty convinced that the effect is non-linear.

9.7 Answering RQ 3

The third research question asked about whether the effect of student-body quality on graduation rates persists after controlling for differences in institutional sector. This represents a new scientific hypothesis:

H3: There is an effect of median SAT scores on graduation rates even after accounting for differences in educational sector.

To evaluate this hypothesis, we need to compare a main-effects model that includes the effect of educational sector AND the effect of median SAT scores to a model that only includes educational sector. Since we decided that the effect of median SAT scores was non-linear in the previous analysis, we need to evaluate the empirical support for the following models:

\[ \begin{split} \mathbf{Model~3:~} \quad \mathrm{Graduation~Rate}_i = &\beta_0 + \beta_1(\mathrm{Public}_i) + \epsilon_i \\[2ex] \mathbf{Model~5:~} \quad \mathrm{Graduation~Rate}_i = &\beta_0 + \beta_1(\mathrm{Public}_i) + \\ &\beta_2(\mathrm{Median~SAT}_i) + \beta_3(\mathrm{Median~SAT}^2_i) + \epsilon_i \\ \end{split} \]

If there was still uncertainty about whether the effect of median SAT scores was non-linear or linear, you could also include a model that included the the effect of educational sector and the linear effect of median SAT scores.

\[ \mathbf{Model~4:~} \quad \mathrm{Graduation~Rate}_i = \beta_0 + \beta_1(\mathrm{Public}_i) + \beta_2(\mathrm{Median~SAT}_i) + \epsilon_i \]

We fit these models and evaluate the empirical evidence by computing the AICc, relative likelihood, and model probability for each of the candidate models.

# Fit models
lm.3 = lm(grad ~ 1 + public, data = mn)
lm.4 = lm(grad ~ 1 + public + sat, data = mn)
lm.5 = lm(grad ~ 1 + public + sat + I(sat^2), data = mn)

The table of model evidence is presented below.

Code

cand_models = data.frame(
  Model = c("Model 5", "Model 4", "Model 3"),
  k = c(5, 4, 3),
  AICc = c(AICc(lm.5), AICc(lm.4), AICc(lm.3))
) |>
  mutate(
    delta_a = AICc - AICc(lm.5),
    rel_lik = exp(-1/2 * delta_a),
    model_prob = rel_lik / sum(rel_lik)
  )

cand_models |>
  gt() |>
  cols_align(
    columns = c(Model),
    align = "left"
  ) |>
  cols_align(
    columns = c(k, AICc, delta_a, rel_lik, model_prob),
    align = "center"
  ) |>
  cols_label(
    Model = md("*Model*"),
    k = md("*k*"),
    AICc = md("*AICc*"),
    delta_a = html("&#916;AICc"),
    rel_lik = html("Rel(&#8466;)"),
    model_prob = md("*AICc Wt.*")
  ) |>
  tab_options(
    table.width = pct(70)
  ) |>
  tab_footnote(
    footnote = html("Rel(&#8466;) = Relative likelihood. Values are displayed using scientific notation."),
    locations = cells_column_labels(columns = rel_lik)
  )

Table 9.6: Models rank-ordered by the amount of empirical support as measured by the AICc. Other evidence includes the ΔAICc, relative likelihood, and model probability (AICc weight) for each model.

Model	k	AICc	ΔAICc	Rel(ℒ)¹	AICc Wt.
Model 5	5	215.8278	0.00000	1.000000e+00	9.963040e-01
Model 4	4	227.0215	11.19362	3.709671e-03	3.695960e-03
Model 3	3	279.7700	63.94213	1.303595e-14	1.298777e-14
¹ Rel(ℒ) = Relative likelihood. Values are displayed using scientific notation.

Given the data and the candidate models evaluated, Models 4 (model prob. = 0.996) and 5 (model prob. = 0.003) have more empirical support than Model 4 (model prob. = 0.00000000000001). The empirical evidence supports the persistance of an effect of student-body quality on graduation rates, even after accounting for differences in educational sector. Moreover, the \(\Delta\)AICc values, relative likelihoods, and model probabilities are suggesting overwhelming empirical support for the non-linear effect of student-body quality on graduation rates.

9.8 Answering RQ 4

The fourth research questions asked whether the effect of student-body quality on graduation rates was different for public and private institutions. This again represents a new scientific hypothesis:

H4: There is an effect of median SAT scores on graduation rates, and that effect varies for different educational sectors.

To evaluate this hypothesis, we need to compare a main-effects model that includes the effect of educational sector AND the effect of median SAT scores to an interaction model that includes the main-effects educational sector and median SAT scores, and the interaction effect between them. Again, for completeness, I will explore a set of models that allow for the linear and non-linear effects of median SAT score. Since we have already fitted the main-effects models (Models 4 and 5), here we only add the interaction models.

\[ \begin{split} \mathbf{Model~6:~} \quad \mathrm{Graduation~Rate}_i = &\beta_0 + \beta_1(\mathrm{Public}_i) + \beta_2(\mathrm{Median~SAT}_i) + \\ &\beta_3(\mathrm{Public}_i)(\mathrm{Median~SAT}_i) + \epsilon_i \\[2ex] \mathbf{Model~7:~} \quad \mathrm{Graduation~Rate}_i = &\beta_0 + \beta_1(\mathrm{Public}_i) + \beta_2(\mathrm{Median~SAT}_i) + \\ &\beta_3(\mathrm{Median~SAT}^2_i) +\\ &\beta_4(\mathrm{Public}_i)(\mathrm{Median~SAT}_i) + \epsilon_i \\[2ex] \mathbf{Model~8:~} \quad \mathrm{Graduation~Rate}_i = &\beta_0 + \beta_1(\mathrm{Public}_i) + \beta_2(\mathrm{Median~SAT}_i) + \\ &\beta_3(\mathrm{Median~SAT}^2_i) + \beta_4(\mathrm{Public}_i)(\mathrm{Median~SAT}_i) + \\ &\beta_5(\mathrm{Public}_i)(\mathrm{Median~SAT}^2_i) + \epsilon_i \end{split} \]

We fit these models and evaluate the empirical evidence by computing the AICc, relative likelihood, and model probability for each of the candidate models.

# Fit models
lm.6 = lm(grad ~ 1 + public + sat + public:sat, data = mn)
lm.7 = lm(grad ~ 1 + public + sat + I(sat^2) + public:sat, data = mn)
lm.8 = lm(grad ~ 1 + public + sat + I(sat^2) + public:sat + public:I(sat^2), data = mn)

The table of model evidence is presented below.

Code

cand_models = data.frame(
  Model = c("Model 7", "Model 5", "Model 8", "Model 4", "Model 6"),
  k = c(6, 5, 7, 4, 5),
  AICc = c(AICc(lm.7), AICc(lm.5), AICc(lm.8), AICc(lm.4), AICc(lm.6))
) |>
  mutate(
    delta_a = AICc - AICc(lm.5),
    rel_lik = exp(-1/2 * delta_a),
    model_prob = rel_lik / sum(rel_lik)
  )

cand_models |>
  gt() |>
  cols_align(
    columns = c(Model),
    align = "left"
  ) |>
  cols_align(
    columns = c(k, AICc, delta_a, rel_lik, model_prob),
    align = "center"
  ) |>
  cols_label(
    Model = md("*Model*"),
    k = md("*k*"),
    AICc = md("*AICc*"),
    delta_a = html("&#916;AICc"),
    rel_lik = html("Rel(&#8466;)"),
    model_prob = md("*AICc Wt.*")
  ) |>
  tab_options(
    table.width = pct(70)
  ) |>
  tab_footnote(
    footnote = html("Rel(&#8466;) = Relative likelihood. Values are displayed using scientific notation."),
    locations = cells_column_labels(columns = rel_lik)
  )

Table 9.7: Models evaqluating the interaction effects rank-ordered by the amount of empirical support as measured by the AICc. Other evidence includes the ΔAICc, relative likelihood, and model probability (AICc weight) for each model.

Model	k	AICc	ΔAICc	Rel(ℒ)¹	AICc Wt.
Model 7	6	215.7859	-0.04193024	1.021186434	0.4483754119
Model 5	5	215.8278	0.00000000	1.000000000	0.4390730203
Model 8	7	218.5893	2.76143809	0.251397722	0.1103819569
Model 4	4	227.0215	11.19362410	0.003709671	0.0016288165
Model 6	5	229.2266	13.39876393	0.001231673	0.0005407943
¹ Rel(ℒ) = Relative likelihood. Values are displayed using scientific notation.

Given the data and the candidate models evaluated, Models 7, 5, and 8 are all empirically supported. There is virtually no empirical support for Models 4 and 6. While we can rule out the main-effects and interaction models that are based on the linear effect of median SAT scores on graduation rates, it is unclear that in the models that include the quadratic effect of median SAT score whether or not there is an interaction effect between median SAT scores and educational sector nor whether the interaction is between(i.e., there is a fair amount of model selection uncertainty).

In practice I would adopt and report all three quadratic models (Models 7, 5, and 8) and mention the model selection uncertainty that makes it hard to rule any of these out. This uncertainty almost certainly is a function of the small sample size.

9.9 Using R to Create a Table of Model Evidence

We can use the aictab() function from the {AICcmodavg} package to compute and create a table of model evidence values directly from the lm() fitted models. This function takes a list of models in the candidate set (it actually has to be an R list). The argument modnames= is an optional argument to provide model names that will be used in the output. –>

#Create table of model evidence
model_evidence = aictab(
  cand.set = list(lm.4, lm.5, lm.6, lm.7, lm.8),
  modnames = c("Model 4", "Model 5", "Model 6", "Model 7", "Model 8")
  )

# View output
model_evidence


Model selection based on AICc:

        K   AICc Delta_AICc AICcWt Cum.Wt      LL
Model 7 6 215.79       0.00   0.45   0.45 -100.28
Model 5 5 215.83       0.04   0.44   0.89 -101.80
Model 8 7 218.59       2.80   0.11   1.00 -100.05
Model 4 4 227.02      11.24   0.00   1.00 -108.80
Model 6 5 229.23      13.44   0.00   1.00 -108.50

The model evidence provided for each model includes the number of parameters (K), AICc value (AICc), \(\Delta\)AICc value (Delta_AICc), the Akiake weight/model probability (AICcWt), cumulative weight (Cum.Wt), and log-likelihood (LL). The output also rank orders the models based on the AICc criterion. The models are printed in order from the model with the most empirical evidence (Model 7) to the models with the least amount of empirical evidence (Model 6) based on the AICc.

9.10 Evidence Based on All Nine Candidate Models

Another approach to the analysis is to include all nine models in the analysis from the start.

#Create table of model evidence
model_evidence = aictab(
  cand.set = list(lm.0, lm.1, lm.2, lm.3, lm.4, lm.5, lm.6, lm.7, lm.8),
  modnames = c("Model 0", "Model 1", "Model 2", "Model 3", "Model 4", "Model 5", "Model 6", "Model 7", "Model 8")
  )

# View output
model_evidence


Model selection based on AICc:

        K   AICc Delta_AICc AICcWt Cum.Wt      LL
Model 7 6 215.79       0.00   0.45   0.45 -100.28
Model 5 5 215.83       0.04   0.44   0.89 -101.80
Model 8 7 218.59       2.80   0.11   1.00 -100.05
Model 4 4 227.02      11.24   0.00   1.00 -108.80
Model 2 4 228.55      12.76   0.00   1.00 -109.56
Model 6 5 229.23      13.44   0.00   1.00 -108.50
Model 1 3 233.92      18.14   0.00   1.00 -113.55
Model 3 3 279.77      63.98   0.00   1.00 -136.47
Model 0 2 283.00      67.21   0.00   1.00 -139.30

Based on this table of model evidence we see:

Model 0 and Model 3 have the least amount of empirical support and their model probabilities are both essentially 0. These are the only models that do not include the effect of median SAT scores. This implies that there is likely an effect of median SAT score on graduation rates.
The models that include the quadratic effect of median SAT score on graduation rates all have more empirical evidence than their counterpart models (with the same covariates or interactions) that only include the linear effect—Model 2 has more empirical evidence than Model 1; Model 5 has more empirical evidence than Model 4; and Models 7 and 8 have more empirical evidence than Model 6. This implies that the effect of median SAT score on graduation rates is likely non-linear.
Both models that include educational sector and the effect of median SAT score on graduation rates (Models 4 and 5) have more empirical evidence than the model that only includes educational sector (Model 3). This implies that the effect of median SAT score on graduation rates persists after accounting for differences in educational sector.
Based on the empirical evidence there is model selection uncertainty when it comes to choosing between Models 5, 7, and 8. All are plausible given the data and candidate models being evaluated. This implies that we do not have enough empirical evidence to say whether or not there is an interaction effect between educational sector and median SAT scores on graduation rates.

9.11 Some Final Thoughts

Recall that the information criteria are a function of the log-likelihood. Log-likelihood values, and thus information criteria, from different models can be compared, so long as:

The exact same data is used to fit the models;
The exact same outcome is used to fit the models; and
The assumptions underlying the likelihood (independence, distributional assumptions) are met for each candidate model.

In all nine models we are using the same data set and outcome. However, the assumptions do not seem tenable for Models 1, 4, and 6¹. That means that we should not really have included Model 1, 4, nor 6 in our candidate set of models/working hypotheses.

Removing models from the candidate set will change some of the metrics in the table of model evidence (e.g., the model probabilities). This is especially true if we remove one of the more empirically supported models. To see this, here we remove Model 7, and the models that don’t meet the assumptions.

#Create table of model evidence
model_evidence = aictab(
  cand.set = list(lm.0, lm.2, lm.3, lm.5, lm.8),
  modnames = c("Model 0", "Model 2", "Model 3", "Model 5", "Model 8")
  )

# View output
model_evidence


Model selection based on AICc:

        K   AICc Delta_AICc AICcWt Cum.Wt      LL
Model 5 5 215.83       0.00    0.8    0.8 -101.80
Model 8 7 218.59       2.76    0.2    1.0 -100.05
Model 2 4 228.55      12.72    0.0    1.0 -109.56
Model 3 3 279.77      63.94    0.0    1.0 -136.47
Model 0 2 283.00      67.17    0.0    1.0 -139.30

In practice, you wouldn’t remove a model that meets the assumptions and has a lot of empirical support, but pedagogically you need to understand that changing which models are included in the candidate set change your results! Similarly, changing the predictors in the candidate models, or adding other models to the candidate set can impact the amount of empirical support and the rank-ordering of hypotheses.

If we had a different set of data, we may also have a whole new ranking of models or interpretation of empirical support. The empirical support is linked to the data. The empirical support is very much relative to the candidate set of models and the data we have.

Lastly, it is important to note that although information criteria can tell you about the empirical support among a candidate set of models, it cannot say whether that is actually a “good” model. For that you need to look at the assumptions and other measures (e.g., \(R^2\)). You still need to do all of the work associated with model-building (e.g., selecting predictors from the substantive literature, exploring functional forms to meet the assumptions).

9.12 Statistical Inference and Information Criteria

Finally, it is important to mention that philosophically, information-criteria and statistical inference are two very different ways of evaluating statistical evidence. When we use statistical inference for variable selection, the evidence, the p-values, is a measure of how rare an observed statistic (e.g., \(\hat\beta_k\), t-value) is under the null hypothesis. The AIC, on the other hand, is a measure of the model-data compatibility accounting for the complexity of the model.

In general, the use of p-values is not compatible with the use of information criteria-based model selection methods; see Anderson (2008) for more detail. Because of this, it is typical to not report p-values when using information criteria for model selection. We should, however, reprt the standard errors or confidence intervals for the coefficient estimates, especially for any “best” model(s). This gives information about the statistical uncertainty that arises because of sampling error.

IMPORTANT

It is important that you decide how you will be evaluating evidence and making decisions about variable and model selection prior to actually examining the data. Mixing and matching is not cool!

9.13 Pretty Printing Tables of Model Evidence for Quarto Documents

We can format the output from aictab() to be used in the gt() function. Because there are multiple classes associated with the output from the aictab() function, we first pipe model_evidence object into the data.frame() function. Viewing this, we see that the data frame, also includes an additional column that gives the relative likelihoods (ModelLik).

#Create table of model evidence
model_evidence = aictab(
  cand.set = list(lm.0, lm.1, lm.2, lm.3, lm.4, lm.5, lm.6, lm.7, lm.8),
  modnames = c("Model 0", "Model 1", "Model 2", "Model 3", "Model 4", "Model 5", "Model 6", "Model 7", "Model 8")
  )


# Create data frame to format into table
tab_01 = model_evidence |>
  data.frame()

# View table
tab_01

  Modnames K     AICc  Delta_AICc     ModelLik       AICcWt        LL    Cum.Wt
8  Model 7 6 215.7859  0.00000000 1.000000e+00 4.480116e-01 -100.2776 0.4480116
6  Model 5 5 215.8278  0.04193024 9.792531e-01 4.387168e-01 -101.8028 0.8867284
9  Model 8 7 218.5893  2.80336833 2.461820e-01 1.102924e-01 -100.0546 0.9970208
5  Model 4 4 227.0215 11.23555435 3.632707e-03 1.627495e-03 -108.7964 0.9986483
3  Model 2 4 228.5452 12.75925104 1.695758e-03 7.597192e-04 -109.5583 0.9994080
7  Model 6 5 229.2266 13.44069417 1.206120e-03 5.403555e-04 -108.5022 0.9999483
2  Model 1 3 233.9220 18.13611066 1.152905e-04 5.165149e-05 -113.5472 1.0000000
4  Model 3 3 279.7700 63.98406188 1.276549e-14 5.719088e-15 -136.4712 1.0000000
1  Model 0 2 282.9998 67.21386820 2.539183e-15 1.137583e-15 -139.2999 1.0000000

Then we can use the select() function to drop the LL and Cum.Wt columns from the data frame. The log-likelihood is redundant to the information in the AICc column, since AICc is a function of log-likelihood and the other information in the table. The cumulative weight can also easily be computed from the information in the AICcWt column.

# Drop columns
tab_01 = tab_01 |>
  select(-LL, -Cum.Wt)

# View table
tab_01

  Modnames K     AICc  Delta_AICc     ModelLik       AICcWt
8  Model 7 6 215.7859  0.00000000 1.000000e+00 4.480116e-01
6  Model 5 5 215.8278  0.04193024 9.792531e-01 4.387168e-01
9  Model 8 7 218.5893  2.80336833 2.461820e-01 1.102924e-01
5  Model 4 4 227.0215 11.23555435 3.632707e-03 1.627495e-03
3  Model 2 4 228.5452 12.75925104 1.695758e-03 7.597192e-04
7  Model 6 5 229.2266 13.44069417 1.206120e-03 5.403555e-04
2  Model 1 3 233.9220 18.13611066 1.152905e-04 5.165149e-05
4  Model 3 3 279.7700 63.98406188 1.276549e-14 5.719088e-15
1  Model 0 2 282.9998 67.21386820 2.539183e-15 1.137583e-15

We can then pipe the tab_01 data frame into the gt() function to format the table for pretty-printing in Quarto. For some column labels, I use the html() function in order to use HTML symbols to create the Greek letter Delta and the scripted “L”.

# Create knitted table
tab_01 |>
  gt() |>
  cols_align(
    columns = c(Modnames),
    align = "left"
  ) |>
  cols_align(
    columns = c(K, AICc, Delta_AICc, ModelLik, AICcWt),
    align = "center"
  ) |>
  cols_label(
    Modnames = md("*Model*"),
    K = md("*k*"),
    AICc = md("*AICc*"),
    Delta_AICc = html("&#916;AICc"),
    ModelLik = html("Rel(&#8466;)"),
    AICcWt = md("*AICc Wt.*")
  ) |>
  tab_options(
    table.width = pct(90)
  ) |>
  tab_footnote(
    footnote = html("Rel(&#8466;) = Relative likelihood. Values are displayed using scientific notation."),
    locations = cells_column_labels(columns = ModelLik)
  )

Table 9.8: Models rank-ordered by the amount of empirical support as measured by the AICc after removing Model 1 from the candidate set. Other evidence includes the ΔAICc, relative likelihood, and model probability (AICc weight) for each model.

Model	k	AICc	ΔAICc	Rel(ℒ)¹	AICc Wt.
Model 7	6	215.7859	0.00000000	1.000000e+00	4.480116e-01
Model 5	5	215.8278	0.04193024	9.792531e-01	4.387168e-01
Model 8	7	218.5893	2.80336833	2.461820e-01	1.102924e-01
Model 4	4	227.0215	11.23555435	3.632707e-03	1.627495e-03
Model 2	4	228.5452	12.75925104	1.695758e-03	7.597192e-04
Model 6	5	229.2266	13.44069417	1.206120e-03	5.403555e-04
Model 1	3	233.9220	18.13611066	1.152905e-04	5.165149e-05
Model 3	3	279.7700	63.98406188	1.276549e-14	5.719088e-15
Model 0	2	282.9998	67.21386820	2.539183e-15	1.137583e-15
¹ Rel(ℒ) = Relative likelihood. Values are displayed using scientific notation.

References

Anderson, D. R. (2008). Model based inference in the life sciences: A primer on evidence. Springer.

Burnham, K. P., Anderson, D. R., & Huyvaert, K. P. (2011). AIC model selection and multimodel inference in behavioral ecology: Some background, observations, and comparisons. Behavioral Ecology and Sociobiology, 65(1), 23–35.

Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples,. Biometrika, 76, 297–307.

Analyses not shown.↩︎