11 Coefficient-Level Inference

In this chapter, you will learn about statistical inference at the coefficient level for regression models. To do so, we will use the keith-gpa.csv data to examine whether time spent on homework is related to GPA. The data contain three attributes collected from a random sample of \(n=100\) 8th-grade students (see the data codebook). To begin, we will load several libraries and import the data into an object called keith.

# Load libraries
library(broom)
library(corrr)
library(dplyr)
library(ggplot2)
library(readr)

# Import data
keith = read_csv(file = "https://raw.githubusercontent.com/zief0002/modeling/main/data/keith-gpa.csv")

# View data
keith
11.1 Data Exploration
We begin by examining the marginal distributions of both time spent on homework and GPA, along with summary statistics for these variables. We then examine a scatterplot of GPA versus time spent on homework.
# Marginal distribution of GPA
ggplot(data = keith, aes(x = gpa)) +
  stat_density(geom = "line") +
  theme_bw() +
  xlab("GPA (on a 100-pt. scale)") +
  ylab("Probability density") +
  ggtitle("Outcome: GPA")

# Marginal distribution of homework
ggplot(data = keith, aes(x = homework)) +
  stat_density(geom = "line") +
  theme_bw() +
  xlab("Time spent on homework per week (in hours)") +
  ylab("Probability density") +
  ggtitle("Predictor: Homework")
# Scatterplot of GPA versus homework
ggplot(data = keith, aes(x = homework, y = gpa)) +
  geom_point() +
  theme_bw() +
  xlab("Time spent on homework per week (in hours)") +
  ylab("GPA (on a 100-pt. scale)")
# Summary statistics
keith |>
  summarize(
    M_gpa = mean(gpa),
    SD_gpa = sd(gpa),
    M_hw = mean(homework),
    SD_hw = sd(homework)
  )
# Compute correlation
keith |>
  select(gpa, homework) |>
  correlate()
We might describe the results of this analysis as follows:
The marginal distributions of GPA and time spent on homework are both unimodal. The average amount of time these 8th-grade students spend on homework each week is 5.09 hours (SD = 2.06). These 8th-grade students have a mean GPA of 80.47 (SD = 7.62) on a 100-pt scale. There is a moderate, positive, linear relationship between time spent on homework and GPA for these students. This suggests that 8th-grade students who spend less time on homework tend to have lower GPAs, on average, than students who spend more time on homework.
We could also present this information in a table. This table could be extended to include additional variables (which we will do in later chapters). We include the means and standard deviations, reported as M (SD), in the same table as the correlations by placing them on the main diagonal.

| | 1. | 2. |
|:---|:---:|:---:|
| 1. GPA | 80.47 (7.62) | |
| 2. Time spent on homework | 0.33 | 5.09 (2.06) |
We also fit a model by regressing GPA on time spent on homework, storing those results in an object called lm.a.
# Fit regression model
lm.a = lm(gpa ~ 1 + homework, data = keith)

# View output
lm.a
Call:
lm(formula = gpa ~ 1 + homework, data = keith)
Coefficients:
(Intercept) homework
74.290 1.214
The fitted equation is:
\[ \hat{\mathrm{GPA}_i} = 74.29 + 1.21(\mathrm{Time~spent~on~homework}_i) \]
Summarizing this:
The model-estimated mean GPA for all 8th-grade students who spend 0 hours a week on homework is 74.29. Each additional hour 8th-grade students spend per week on homework is associated with a difference in GPA of 1.21, on average. Differences in time spent on homework explain 10.7% of the variation in students’ GPAs. All of this suggests that time spent on homework is related to GPA for the \(n=100\) 8th-graders in the sample.
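As an aside, the 10.7% value can be obtained from the model-level output, for example via the glance() function from the {broom} package (a quick sketch):

# Model-level output; the r.squared column gives the proportion of variation explained
glance(lm.a)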
11.2 Statistical Inference
What if we want to understand the relationship between time spent on homework and GPA for a larger population of 8th-grade students, say all of them in the district? The problem is that if we had drawn a different sample of \(n=100\) 8th-grade students, all the regression estimates (\(\hat\beta_0, \hat\beta_1,\) and \(R^2\)) would be different than the ones we obtained from our sample. This makes it difficult to say, for example, how the conditional mean GPA differs for students with differing amounts of time spent on homework. In our observed sample, \(\hat\beta_1\) was 1.21. But, had we sampled different students, we might have found that \(\hat\beta_1\) was 2.03. And yet another random sample of students might have produced a \(\hat\beta_1\) of 0.96.
This variation in the estimates arises because of the random nature of the sampling. One of the key findings in statistical theory is that the amount of variation in estimates under random sampling is completely predictable; this variation is called sampling error. Being able to quantify the sampling error allows us to provide a more informative answer to the research question. For example, it turns out that based on the quantification of sampling error in our example, we believe that the actual \(\beta_1\) is between 0.51 and 1.92.
Statistical inference allows us to learn from incomplete or imperfect data (Gelman & Hill, 2007). In many studies, the primary interest is to learn about one or more characteristics of a population. These characteristics must be estimated from sample data. This is the situation in our example, where we have only a sample of 8th-grade students and we want to understand the relationship between time spent on homework and GPA for ALL 8th-grade students in the district.
In the example, the variation in estimates arises because of sampling variation. It is also possible to have variation because of imperfect measurement. This is called measurement error. Despite these being very different sources of variation, in practice they are often combined (e.g., we measure imperfectly and we want to make generalizations).
Regardless of the sources of variation, the goals in most regression analyses are two-fold:
- Estimate the parameters from the observed data; and
- Summarize the amount of uncertainty (e.g., quantify the sampling error) in those estimates.
The first goal we addressed in the Simple Linear Regression—Description chapter. It is the second goal that we will explore in this chapter.
11.3 Quantification of Uncertainty
Before we talk about estimating uncertainty in regression, let me bring you back in time to your Stat I course. In that course, you probably spent a lot of time talking about sampling variation for the mean. The idea went something like this: Imagine you have a population that is infinitely large. The observations in this population follow some probability distribution. (This distribution is typically unknown in practice, but for now, let’s pretend we know what that distribution is.) For our purposes, let’s assume the population is normally distributed with a mean of \(\mu\) and a standard deviation of \(\sigma\).
Sample n observations from that population. Based on the n sampled observations, find the mean. We will call this \(\hat\mu_1\) since it is an estimate for the population mean (the subscript just says it is the first sample). In all likelihood, \(\hat\mu_1\) is not the exact same value as \(\mu\). It varies from the population mean because of sampling error.
Now, sample another n observations from the population. Again, find the mean. We will call this estimate \(\hat\mu_2\). Again, it probably varies from \(\mu\), and may be different than \(\hat\mu_1\) as well. Continue to repeat this process: randomly sample n observations from the population; and find the mean.
The distribution of the sample means, it turns out, is quite predictable using statistical theory. Theory predicts that the distribution of the sample means will be normally distributed. It also predicts that the mean, or expected value, of all the sample means will be equal to the population mean, \(\mu\). Finally, theory predicts that the standard deviation of this distribution, called the standard error, will be equal to the population standard deviation divided by the square root of the sample size. Mathematically, we would write all this as,
\[ \hat\mu_n\sim\mathcal{N}\left(\mu, \dfrac{\sigma}{\sqrt{n}}\right). \]
The important thing is not that you memorize this result, but that you understand that the process of randomly sampling from a known population can lead to predictable results in the distribution of statistical summaries (e.g., the distribution of sample means). The other crucial thing is that the sampling variation can be quantified. The standard error is the quantification of that sampling error. In this case, it gives a numerical answer to the question of how variable the sample mean will be because of random sampling.
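To make this concrete, here is a minimal simulation sketch of the thought experiment. (The population values \(\mu = 80\) and \(\sigma = 7\) are illustrative assumptions, not estimates from the keith data.)

# Simulate the sampling distribution of the mean
set.seed(42)
sample_means = replicate(10000, mean(rnorm(n = 100, mean = 80, sd = 7)))

mean(sample_means)  # Expected value; should be close to mu = 80
sd(sample_means)    # Standard error; should be close to 7 / sqrt(100) = 0.7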
When we report information about theoretical distributions, we parameterize the distribution by reporting all the parameter values needed to reproduce it. For example, to parameterize a normal distribution we need to report not only that it is normal, but also the mean and standard deviation. Notice that these two parameters define the center (mean) and variability (standard deviation) of the distribution. Together with the shape (normal), they give a complete description of the distribution. Mathematically, we write that description as:
\[ \sim\mathcal{N}\left(\text{Mean}, ~\text{SD}\right) \]
11.3.1 Quantification of Uncertainty in Regression
We can extend these ideas to regression. Now the thought experiment goes something like this: Imagine you have a population that is infinitely large. The observations in this population have two attributes, call them X and Y. The relationship between these two attributes can be expressed via a regression equation as: \(Y_i=\beta_0 + \beta_1(X_i) + \epsilon_i\). Randomly sample n observations from the population. This time, rather than computing a mean, regress the sample Y values on the sample X values. Since the sample regression coefficients are estimates of the population parameters, we will write this as: \(\hat{Y}=\hat{\beta}_{0,1} + \hat{\beta}_{1,1}(X)\). Repeat the process. This time the regression equation is: \(\hat{Y}=\hat{\beta}_{0,2} + \hat{\beta}_{1,2}(X)\). Continue this process an infinite number of times.
Statistical theory again predicts the characteristics of the two distributions, that of \(\hat{\beta}_0\) and that of \(\hat{\beta}_1\). The distribution of \(\hat{\beta}_0\) can be expressed as,
\[ \hat\beta_0\sim\mathcal{N}\left(\beta_0,~ \sigma_\epsilon\sqrt{\dfrac{1}{n} + \dfrac{\mu_X^2}{\sum(X_i-\mu_X)^2}}\right). \]
Similarly, the distribution of \(\hat{\beta}_1\) can be expressed as,
\[ \hat\beta_1\sim\mathcal{N}\left(\beta_1,~ \dfrac{\sigma_\epsilon}{\sigma_x\sqrt{n-1}}\right). \]
Again, don’t panic over the formulae. What is important is that theory allows us to quantify the variation in both \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that is due to sampling error. In practice, our statistical software will give us the numerical estimates of the two standard errors.
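If you want to see this in action, here is a simulation sketch of the thought experiment for the slope. (The population values used below, \(\beta_0 = 74\), \(\beta_1 = 1.2\), and \(\sigma_\epsilon = 7\), are illustrative assumptions.)

# Simulate the sampling distribution of the slope
set.seed(42)
sample_slopes = replicate(10000, {
  x = runif(n = 100, min = 1, max = 11)
  y = 74 + 1.2 * x + rnorm(n = 100, mean = 0, sd = 7)
  coef(lm(y ~ 1 + x))[["x"]]
})

mean(sample_slopes)  # Should be close to beta_1 = 1.2
sd(sample_slopes)    # Empirical SE; compare to the theoretical formula above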
11.4 Hypothesis Testing
Some research questions point to examining whether the value of some regression parameter differs from a specific value. For example, it may be of interest whether a particular population model (e.g., one where \(\beta_1=0\)) could produce the sample result of a particular \(\hat\beta_1\). To test something like this, we state the value we want to test in a statement called a null hypothesis. For example,
\[ H_0: \beta_1 = 0 \]
The hypothesis is a statement about the population. Here we hypothesize \(\beta_1=0\). It would seem logical that one could just examine the estimate of the parameter from the observed sample to answer this question, but we also have to account for sampling uncertainty. The key is to quantify the sampling variation, and then see if the sample result is unlikely given the stated hypothesis.
One question of interest may be: Is there evidence that the average GPA differs for different amounts of time spent on homework? In our example, we have a \(\hat\beta_1=1.21\). This is sample evidence, but does 1.21 differ from 0 more than we would expect because of random sampling? If it doesn’t, we cannot really say that the average GPA differs for different amounts of time spent on homework. To test this, we make an assumption that there is no relationship between time spent on homework and GPA, in other words, the slope of the line under this assumption would be 0. The thought experiment underlying this would be to randomly sample n observations from the population where \(\beta_1 =0\). Regress the sample Y values on the sample X values and obtain the sample slope.
Theory would suggest that the distribution of \(\hat{\beta}_1\), assuming the null hypothesis is true, can be expressed as,
\[ \hat\beta_1\sim\mathcal{N}\left(0,~ \dfrac{\sigma_\epsilon}{\sigma_x\sqrt{n-1}}\right). \]
Note that the only thing that changed is that the mean of the sampling distribution is now 0, reflecting the null hypothesis being true. The key is to determine the standard error (quantify the uncertainty) for the distribution of sample slopes.
11.4.1 Obtaining SEs for the Regression Coefficients
To obtain the estimated standard errors for the regression coefficients, we will use the tidy() function from the {broom} package to display the fitted regression output. We provide the fitted regression object as the input to this function.
# Display the coefficient-level output
tidy(lm.a)
In the displayed output, we now obtain the estimates for the standard errors in addition to the coefficient estimates. We can use these values to quantify the amount of uncertainty due to sampling error. For example, the distribution for the slope has a standard error of 0.35. So now the distribution of \(\hat{\beta}_1\), assuming the null hypothesis is true, can be expressed as,
\[ \hat\beta_1\sim\mathcal{N}\left(0,~ 0.35\right) \]
One way to envision this is as a distribution.
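For example, we could draw this distribution ourselves (a sketch using the rounded SE of 0.35):

# Sampling distribution of the slope under H0: beta_1 = 0
ggplot(data = data.frame(x = c(-1.5, 1.5)), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 0.35)) +
  theme_bw() +
  xlab("Slope estimate") +
  ylab("Probability density")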
Before we talk about how to use this sampling distribution to conduct our hypothesis test, we need to introduce one wrinkle into the procedure.
11.5 Estimating Variation from Sample Data: No Longer Normal
In theory, the sampling distributions for the two regression coefficients were both normally distributed. This is the case when we know the variation parameters in the population. For example, for the sampling distribution of the slope to be normally distributed, we would need to know \(\sigma_\epsilon\) and \(\sigma_x\).
In practice these values are typically unknown and are estimated from the sample data. Anytime we are estimating things we introduce additional uncertainty. In this case, the uncertainty affects the shape of the sampling distribution.
Compare the normal distribution (solid, blue) to the distribution with additional uncertainty (dashed, orange). From the figure you can see that the additional uncertainty slightly changed the shape of the distribution from normal.
- It is still symmetric and unimodal (like the normal distribution).
- The additional uncertainty makes more extreme values more likely than they are in the normal distribution.
- The additional uncertainty makes values in the middle less likely than they are in the normal distribution.
It is important to note that the amount of uncertainty affects how closely the shape of the distribution matches the normal distribution. And, that the sample size directly affects the amount of uncertainty we have. All things being equal, we have less uncertainty when we have larger samples. The following figure illustrates this idea.
11.5.1 The t-Distribution
As pointed out, the distributions with uncertainty introduced from using a sample of data are not normally distributed. Thus, it doesn’t make sense to use a normal distribution as a model for describing the sampling variation. Instead, we will use a t-distribution, a family of distributions that has several advantageous properties:
- They are unimodal and symmetric.
- They have more variation (uncertainty) than the normal distribution resulting in a distribution that has thicker tails and is shorter in the middle than a normal distribution.
- How thick the tails are and how short the middle of the distribution is are both related to the sample size.
Specifically, the t-distribution is unimodal and symmetric with a mean of 0. The variance of the distribution (which also specifies the exact shape) is
\[ \mathrm{Var} = \frac{\mathit{df}}{\mathit{df} - 2} \]
for \(\mathit{df}>2\) where df is referred to as the degrees of freedom.
To parameterize a t-distribution, we need to report the degrees of freedom, which dictate the variation. Since the mean of a t-distribution is always 0, it isn’t necessary to report the center. So by knowing that a distribution is t-distributed and also knowing its df, we again have a complete description of the distribution. Mathematically, we could write:
\[ \sim t(\mathit{df}) \]
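A quick numerical illustration of these properties (the df values here are arbitrary choices):

# The t-distribution has thicker tails than the standard normal
dnorm(3)        # Normal density at 3: about 0.0044
dt(3, df = 10)  # t(10) density at 3: about 0.0114

# Variance of a t-distribution is df / (df - 2); it approaches 1 as df grows
98 / (98 - 2)   # For df = 98, variance is about 1.02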
11.5.2 Back to the Hypothesis Test
Recall that we are interested in testing the following hypothesis,
\[ H_0: \beta_1 = 0 \]
To test this we compute the number of standard errors that our observed slope (\(\hat\beta_1=1.21\)) is from the hypothesized value of zero (stated in the null hypothesis). Since we already obtained the standard error for the slope (\(SE=0.354\)), we can use some straightforward algebra to compute this:
\[ \frac{1.21 - 0}{0.354} = 3.42 \]
Interpreting this, we can say,
The observed slope of 1.21 is 3.42 standard errors from the expected value of 0.
This value is referred to as the observed t-value. (It is similar to a z-value in the way it is computed; it is standardizing the distance from the observed slope to the hypothesized value of zero. But, since we had to estimate the SE using the data, we introduced additional uncertainty; hence a t-value.)
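We can also compute the observed t-value directly from the tidy() output (a sketch using the estimates stored there):

# Observed t-value: number of SEs the observed slope is from 0
coef_out = tidy(lm.a)
(coef_out$estimate[2] - 0) / coef_out$std.error[2]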
We can evaluate this t-value within the appropriate t-distribution. For regression coefficients, the t-distribution we will use for evaluation has degrees of freedom that are a function of the sample size and the number of coefficients being estimated in the regression model, namely,
\[ \mathit{df} = n - (\textrm{number of coefficients}). \]
In our example the sample size (n) is 100, and the number of coefficients being estimated in the regression model is two (\(\hat\beta_0\) and \(\hat\beta_1\)). Thus,
\[ \mathit{df} = 100 - 2 = 98 \]
Based on this, we will evaluate our observed t-value of 3.42 using a t-distribution with 98 degrees of freedom. Using this distribution, we can compute the probability of obtaining a t-value (under random sampling) at least as extreme as the one in the data under the assumed model. This is equivalent to finding the area under the probability curve for the t-distribution that is greater than or equal to 3.42.[1] This is called the p-value.
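This area can be computed with R’s pt() function, the cumulative distribution function of the t-distribution. Since values in both tails count as “at least as extreme,” we double the one-tailed area (a sketch):

# Two-tailed p-value for the observed t-value of 3.42
2 * pt(q = -abs(3.42), df = 98)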
The p-value is computed for us and displayed in the tidy() output, along with the t-value (provided in the statistic column). In our example, \(p=0.000885\). (Note that the p-value might be printed in scientific notation. For example, it may be printed as 8.85e-04, which is equivalent to \(8.85 \times 10^{-4}\).) To interpret this we would say,
The probability of observing a t-value of 3.42, or a t-value that is more extreme, under the assumption that \(\beta_1=0\) is 0.000885.
This is equivalent to saying:
The probability of observing a sample slope of 1.21, or a slope that is more extreme, under the assumption that \(\beta_1=0\) is 0.000885.
This is quite unlikely, and indicates that the empirical data are inconsistent with the hypothesis that \(\beta_1=0\). As such, it serves as evidence against the hypothesized model. In other words, it is likely that \(\beta_1\neq0\).
11.5.3 Testing the Intercept
The hypothesis being tested for the intercept is \(H_0:\beta_0=0\). The tidy() output also provides information about this test:
# Coefficient-level output
tidy(lm.a)
The results indicate that the observed intercept of 74.28 is 38.26 standard errors from the hypothesized value of 0;
\[ t = \frac{74.28 - 0}{1.94} = 38.26 \]
Assuming the null hypothesis that \(\beta_0=0\) is true, the probability of observing a sample intercept of 74.28 or one that is more extreme, is \(1.01 \times 10^{-60}\). (Any p-value less than .001 is typically reported as \(p<.001\).) This is evidence against the hypothesized model. Because of this, we would say the empirical data are inconsistent with the hypothesis that \(\beta_0=0\); it is unlikely that the intercept in the population is zero.
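As with the slope, we can recover this test with a quick computation (a sketch using the rounded values; small discrepancies with the displayed output are due to rounding):

# Observed t-value and two-tailed p-value for the intercept
t_0 = (74.28 - 0) / 1.94
2 * pt(q = -abs(t_0), df = 98)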
How small does our p-value have to be before we say the empirical data are inconsistent with the null hypothesis? That is something decided by the researcher prior to doing any data analysis. Typically in the social and educational sciences, we say the empirical data are inconsistent with the null hypothesis when the p-value is less than 0.05. The value we choose is referred to as the alpha value (\(\alpha = .05\)).
11.6 ‘Statistical Significance’: An Outdated Idea for Research
You may have read papers or taken statistics courses that emphasized the language “statistically significant”. This adjective was typically used when the empirical evidence was inconsistent with a hypothesized model (when the p-value was less than or equal to the specified alpha level), and the researcher subsequently “rejected the null hypothesis”. In the social sciences this occurred when the p-value was less than .05.
In 2019, the American Statistical Association put out a special issue in one of their premier journals, stating,
…it is time to stop using the term ‘statistically significant’ entirely. Nor should variants such as ‘significantly different,’ ‘p < 0.05,’ and ‘nonsignificant’ survive, whether expressed in words, by asterisks in a table, or in some other way. Regardless of whether it was ever useful, a declaration of ‘statistical significance’ has today become meaningless. (Wasserstein & Schirm, 2019, p. 2)
They went on to say,
…no p-value can reveal the plausibility, presence, truth, or importance of an association or effect. Therefore, a label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical nonsignificance lead to the association or effect being improbable, absent, false, or unimportant. (Wasserstein & Schirm, 2019, p. 2)
This is not to say that p-values should not be reported; they should. But rather that we should not arbitrarily dichotomize a continuous measure into two categories whose labels are at best meaningless and at worst misleading. The goal of scientific inference (which is much broader than statistical inference for a single study) is replicability and empirically generalizable results and findings. And, as Hubbard et al. (2019) point out, declaring findings as ‘significant’ or ‘not significant’ works in direct opposition to the broader culmination of knowledge and evidence in a field.
Instead, we want to begin to see the p-value as a measure of incompatibility between the empirical data and a very specific model, one in which a certain set of assumptions are true. Both the empirical data (which are unique to the specific study) and the model’s set of assumptions often make the p-value unique to the specific study carried out and less useful in the broader goal of scientific inference. As such we need to come to view the p-value for what it is, one measure of evidence, for one very particular model, in one very localized study. As Ron Wasserstein reminds us,
Small p-values are like a right-swipe in Tinder. It means you have an interest. It doesn’t mean you’re ready to book the wedding venue.
11.6.1 What Language Should We Use Instead of “Significance”?
The word “significant” in the phrases “statistically significant” and “significantly different” conveys the finding as important. As Wasserstein & Schirm (2019) remind us, findings are not important just because they have a small p-value, nor are findings with a large p-value unimportant. Instead, what a small p-value indicates is that you can statistically differentiate between your data’s summary measure and the value you are testing it against, after accounting for some level of uncertainty. For example, in testing whether the slope is different from 0 (\(H_0: \beta_1 = 0\)), a small p-value means you have statistically detected a difference between your observed slope and 0 after accounting for uncertainty at a particular \(\alpha\) level.
Because of this, one suggestion in writing about the results is to use the words “statistically discernible” rather than “statistically significant”. For example, in writing up results from our example:
The hypothesis test suggests that the effect of homework (\(B = 1.21\)) is statistically discernible from 0 (\(p=0.000885\)).
When reporting p-values for the social sciences, we typically round to three decimal places. So our p-value of 0.000885 would be rounded to 0.000. However, since the p-value is not actually 0, we would instead report this as \(p < .001\).
11.7 Confidence Intervals
Rather than test whether the data are consistent with a particular parameter value (e.g., is \(\beta_1=0\)), you could provide a range of parameter values that the data are consistent with. This range of compatible parameter values is referred to as a confidence interval or a compatibility interval; typically abbreviated CI. To compute a confidence interval we use:
\[ \text{CI} = \text{Estimate} \pm t^*(\text{SE}) \]
where the estimate is the observed value from the data, SE is the estimate of the standard error, and \(t^*\) is a multiplier that depends on the t-distribution and the confidence level. In the social and educational sciences, it is typical to have a multiplier around two.[2] The plus/minus means you carry out the computation twice, once subtracting and once adding, resulting in two different values. So, to compute the CI for our slope estimate:
\[ \begin{split} &\text{CI} = 1.21 \pm 2(0.35) \\[2ex] &1.21 - 2(0.35) = 0.51\\ &1.21 + 2(0.35) = 1.91 \end{split} \]
The two values form the endpoints of our confidence interval, so our \(\text{CI} = \big[0.51, 1.91\big]\). The CI accounts for the uncertainty due to sampling variation by providing a range of values for the slope that are compatible with the observed data. Interpreting this, we might say,
For all 8th-graders in the district, each one-hour difference in time spent on homework per week is associated with a difference in overall GPA between 0.51 and 1.91, on average.
We can use the conf.int=TRUE argument in the tidy() function to obtain these limits directly. By default this will compute a 95% CI. This can be changed using the conf.level= argument.[3]
# Include CIs in the coefficient-level output
tidy(lm.a, conf.int = TRUE, conf.level = 0.95)
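To see where these limits come from, we can reproduce them using the exact multiplier from the t-distribution (a sketch based on the rounded estimates reported earlier):

# Exact multiplier for a 95% CI with df = 98 (about 1.98, not exactly 2)
t_star = qt(p = 0.975, df = 98)

# Lower and upper limits for the slope
1.21 + c(-1, 1) * t_star * 0.354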
We could similarly express the uncertainty in the intercept via a CI as,
\[ \text{CI for }\beta_0 = \left[70.4,~78.1\right] \]
Interpreting this, we might say,
The average GPA for all 8th-grade students in the district who spend zero hours per week on homework is between 70.4 and 78.1.
When reporting results from hypothesis tests, in addition to the p-value, always include either the standard error or the confidence interval. This allows readers to see how different from 0 the observed result is after accounting for uncertainty, and lets them judge the scientific importance using their domain knowledge. For example,
The hypothesis test suggests that the effect of homework (\(B = 1.21\)) is statistically discernible from 0 (\(p<0.001\)). The 95% CI suggests that the effect of homework for 8th-graders is likely between 0.51 and 1.91, indicating that each 1-hour of homework is associated with a difference of between 0.51 and 1.91 GPA points.
11.7.1 Confidence Intervals as Compatibility Intervals
One way of interpreting this interval is that every value in the interval is a parameter value that is reasonably compatible with the empirical data. For example, in considering the CI for the slope parameter, population slope \((\beta_1)\) values between 0.51 and 1.92 are all reasonably compatible with the empirical data (with the caveat that, again, all the assumptions used to create the interval are satisfied). As applied researchers, we should describe the practical implications of all values inside the interval, especially the observed effect (or point estimate) and the limits.
For us this means describing the practical implications of the true slope being 1.21, as low as 0.51, and as high as 1.92. Are these meaningful differences in GPA (measured on a 100-pt. scale)? Given that the SD for GPA was 7.62, a one-hour difference in time spent on homework is associated with at most a 0.25 SD difference in GPA, or as little as a 0.07 SD difference. This is not a large difference; however, whether it is meaningful depends on previous research about GPA.[4]
Confidence intervals help us keep an open mind about uncertainty; after all, they suggest several values that are compatible with the empirical data. However, they can also be misleading. Amrhein et al. (2019) point out four key points for us to remember as we use CIs:
- Just because the interval gives the values most compatible with the data, given the assumptions, it does not mean values outside it are incompatible; they are just less compatible.
- Not all values inside are equally compatible with the data, given the assumptions. The point estimate is the most compatible, and values near it are more compatible than those near the limits.
- Like the 0.05 alpha threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention.
- Last, and most important of all, be humble: compatibility assessments hinge on the correctness of the statistical assumptions used to compute the interval. In practice, these assumptions are at best subject to considerable uncertainty.
11.8 References
[1] We actually compute the area under the probability curve that is greater than or equal to 3.42 AND that is less than or equal to \(-3.42\).

[2] A multiplier around two is associated with a 95% CI. A 99% CI would use a multiplier of about 2.6.

[3] The actual limits of the 95% CI are computed using a multiplier that is slightly different from two; hence the discrepancy between our off-the-cuff computation earlier and the result from tidy(). Using a multiplier of two is often close enough for practical purposes, especially when the sample size is large.

[4] It turns out this is quite a complicated question, and the effects of homework depend on a variety of student factors, including age, culture, household income, etc. Many studies have also found a non-linear effect of homework, indicating there may be an optimum amount for some groups of students.