In this chapter, you will learn how to include categorical predictors with more than two categories in a regression model. To do so, we will use the pew.csv data to examine whether Americans’ news knowledge differs based on the source of their news (see the data codebook). To begin, we will load several libraries and import the data into an object called pew.
The tidyverse library is a meta-package that includes dplyr, forcats, ggplot2, purrr, readr, stringr, tibble, and tidyr. Loading tidyverse allows you to use all of the functionality in these other packages without having to load eight different packages.
# Load libraries
library(broom)
library(corrr)
library(ggridges)
library(tidyverse)

# Read in data
pew = read_csv(file = "https://raw.githubusercontent.com/zief0002/modeling/main/data/pew.csv")

# View data
pew
| id | knowledge | news_source | news | ideology | engagement | age | education | female |
|---|---|---|---|---|---|---|---|---|
| 1 | 50 | Conservative | 63 | 39.3 | 79.5 | 44.4 | 14 | Yes |
| 2 | 32 | All | 59 | 35.6 | 69.2 | 67.9 | 9 | Yes |
| 3 | 45 | None | 49 | 75.7 | 61.0 | 45.7 | 12 | Yes |
| 4 | 26 | None | 22 | 23.5 | 66.8 | 39.3 | 15 | Yes |
| 5 | 79 | None | 54 | 69.5 | 80.8 | 59.5 | 15 | No |
| 6 | 31 | Conservative | 65 | 18.0 | 99.6 | 43.8 | 12 | Yes |
| 7 | 22 | Conservative | 63 | 33.2 | 37.4 | 29.6 | 10 | No |
| 8 | 93 | Conservative | 50 | 29.7 | 88.6 | 29.6 | 16 | No |
| 9 | 23 | None | 12 | 52.5 | 82.4 | 45.1 | 12 | Yes |
| 10 | 41 | Liberal | 16 | 14.0 | 87.3 | 18.7 | 11 | No |
19.1 Data Exploration
To begin, as always, we will plot the marginal distributions of news knowledge (knowledge) and news source (news_source).
# Load patchwork, which provides the | operator used to lay out the plots
library(patchwork)

# Density plot of news knowledge
p1 = ggplot(data = pew, aes(x = knowledge)) +
  stat_density(geom = "line", color = "#c62f4b") +
  theme_bw() +
  xlab("News knowledge") +
  ylab("Probability density")

# Bar plot of news source
p2 = ggplot(data = pew, aes(x = news_source)) +
  geom_bar(fill = "#c62f4b") +
  theme_bw() +
  xlab("News source") +
  ylab("Frequency")

# Layout plots
p1 | p2
Figure 19.1: LEFT: Density plot of news knowledge. RIGHT: Bar plot of news source.
The distribution of news knowledge is slightly left-skewed, with the majority of respondents scoring around 50-75. The distribution of news source indicates that the sample is quite unbalanced among the different news sources. The majority of the people in the sample either do not watch/listen to any of the three news sources or get their news from a conservative source.
What does the distribution of news knowledge look like once we condition on news source? We will explore this by creating a scatterplot of news knowledge versus news source. We will also use the geom_density_ridges() layer from the {ggridges} package to create conditional density plots of news knowledge. It is easier to compare the shape of distributions using conditional density plots. We will also compute the summary measures of news knowledge conditioned on news source.
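The code for the conditional plot and summaries is not shown here, so the following is only a minimal sketch of how they might be produced. It uses a small simulated data frame (`toy`, a hypothetical stand-in for `pew` with three made-up groups) so that it runs on its own; the object names `p_ridges` and `toy_summary` are also assumptions for illustration.

```r
library(dplyr)
library(ggplot2)
library(ggridges)

# Simulated stand-in for the pew data (values are made up)
set.seed(2)
toy = tibble(
  news_source = rep(c("None", "Conservative", "Liberal"), each = 50),
  knowledge = c(rnorm(50, 49, 20), rnorm(50, 56, 20), rnorm(50, 66, 18))
)

# Conditional density plots: one ridge per news source
p_ridges = ggplot(data = toy, aes(x = knowledge, y = news_source)) +
  geom_density_ridges() +
  theme_bw() +
  xlab("News knowledge") +
  ylab("News source")

# Summary measures of news knowledge conditioned on news source
toy_summary = toy |>
  group_by(news_source) |>
  summarize(
    M = mean(knowledge),
    SD = sd(knowledge),
    N = n()
  )

toy_summary
```

Replacing `toy` with `pew` should give the analogous plot and summary table for the real data, assuming the column names are the same as those shown in the data view above.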
Figure 19.2: LEFT: Scatterplot of the news knowledge versus news source. RIGHT: Density plots of news knowledge conditioned on news source.
Figure 19.3: LEFT: Scatterplot of the news knowledge versus news source. RIGHT: Density plots of news knowledge conditioned on news source.
After conditioning on news source, the data suggest that there are differences in Americans’ knowledge about current affairs based on their source of news. In the sample, those who get their news from liberal and comedy sources have the highest news knowledge scores, on average. Not surprisingly, those who do not get their news from any of the sources have the lowest news knowledge scores, on average. Also, those who get their news from liberal or comedy sources tend to score higher (on average) than those who get their news from conservative sources.
There is, however, a great deal of variation in all the distributions—the SDs range between 16 and 23. This variation makes it difficult to be certain about the trends/differences we saw between the groups without carrying out any inferential analyses (e.g., CIs or hypothesis tests).
19.2 Does News Source Predict Variation in Americans’ News Knowledge?
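To examine whether the sample differences we observed in news knowledge are more than we would expect because of chance, we can fit a regression model using news source to predict variation in news knowledge. Before fitting this model, however, we need to create a set of dummy variables, one for EACH category of the news_source variable. For our analysis, we will need to create EIGHT dummy variables. The mapping for these eight indicators is:
all = 1 if news_source is "All"; 0 otherwise
con = 1 if news_source is "Conservative"; 0 otherwise
com = 1 if news_source is "Comedy"; 0 otherwise
lib = 1 if news_source is "Liberal"; 0 otherwise
con_com = 1 if news_source is "Conservative_Comedy"; 0 otherwise
con_lib = 1 if news_source is "Conservative_Liberal"; 0 otherwise
lib_com = 1 if news_source is "Liberal_Comedy"; 0 otherwise
none = 1 if news_source is "None"; 0 otherwise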
Below we write the syntax to create all eight dummy variables and save the new columns in the pew data frame.
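The syntax itself is not reproduced here, so the following is one plausible way to create the indicators, sketched with `mutate()` and `if_else()` from {dplyr}. It runs on a tiny hypothetical data frame (`pew_toy`) that has one row per category; the indicator names match those used in the models later in the chapter.

```r
library(dplyr)

# Hypothetical stand-in for pew: one row per news source category
pew_toy = tibble(
  news_source = c("Conservative", "All", "None", "Liberal",
                  "Conservative_Liberal", "Conservative_Comedy",
                  "Liberal_Comedy", "Comedy")
)

# Create one dummy-coded indicator for each category
pew_toy = pew_toy |>
  mutate(
    all     = if_else(news_source == "All", 1, 0),
    con     = if_else(news_source == "Conservative", 1, 0),
    com     = if_else(news_source == "Comedy", 1, 0),
    lib     = if_else(news_source == "Liberal", 1, 0),
    con_com = if_else(news_source == "Conservative_Comedy", 1, 0),
    con_lib = if_else(news_source == "Conservative_Liberal", 1, 0),
    lib_com = if_else(news_source == "Liberal_Comedy", 1, 0),
    none    = if_else(news_source == "None", 1, 0)
  )

pew_toy
```

Applying the same `mutate()` call to `pew` would add the eight columns to the real data. Each row should have exactly one indicator equal to 1, which is a quick check that the category names were spelled correctly.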
If you do not know the actual names of the categories (or you want to check capitalization, etc.) use the unique() function to obtain the unique category names.
# Get the categories
pew |>
  select(news_source) |>
  unique()
| news_source |
|---|
| Conservative |
| All |
| None |
| Liberal |
| Conservative_Liberal |
| Conservative_Comedy |
| Liberal_Comedy |
| Comedy |
It turns out that all eight categories of the predictor are completely identified using any seven of the eight indicator variables. For example, consider the news source for the sample of six people below.
Table 19.1: News source and values on seven of the eight dummy-coded indicator variables for a sample of six people.
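| News Source | all | con | com | lib | con_com | con_lib | lib_com |
|---|---|---|---|---|---|---|---|
| Liberal | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| None | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Conservative | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Liberal | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Conservative_Liberal | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| None | 0 | 0 | 0 | 0 | 0 | 0 | 0 |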
By using any seven of the indicators, we can identify the news source for everyone in the sample. For example, in the data shown in Table 19.1, we used all of the indicators except for none. Americans whose news source is “None” would have a 0 for all seven of the indicators used. In other words, we don’t need the information in the none indicator to identify people whose news source is “None”, so long as we have the other seven indicators.
To examine the effect of news source, we will fit the regression using any seven of the eight dummy-coded indicator variables you created. The indicator you leave out of the model will correspond to the reference category. For example, in the model fitted below, we include all the indicators except none as predictors in the model. As such, people whose news source is “None” are our reference group.
Ultimately we will need to fit several models, so I often name my regression objects using the reference group. For example, in the model where “None” is the reference group, I will name my regression object lm.none.
# News source = None is reference group
lm.none = lm(knowledge ~ 1 + all + con + com + lib + con_com + con_lib + lib_com, data = pew)

# Model-level info
glance(lm.none) |>
  print(width = Inf)
At the model level, differences in news source explain 10.3% of the variation in Americans’ news knowledge. This is statistically discernible from 0 (i.e., the empirical data are not consistent with the hypothesis that news source does not explain any variation in Americans’ news knowledge).
In other words, the data suggest there is an effect of news source on news knowledge. Recall that an effect of a categorical predictor means that there are differences in the average value of the outcome between different levels of the predictor. In our example, there are differences in the average news knowledge based on news source. The key question in exploratory research is then: Which news sources show differences in the average news knowledge? In order to answer this, we need to look at the coefficient-level output.
# Coefficient-level info
tidy(lm.none)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 48.661972 | 0.976286 | 49.843971 | 4.706469e-320 |
| all | 16.885973 | 2.728055 | 6.189749 | 7.768816e-10 |
| con | 7.802314 | 1.501509 | 5.196315 | 2.313551e-07 |
| com | 14.972175 | 3.536521 | 4.233588 | 2.440132e-05 |
| lib | 16.947403 | 1.849422 | 9.163620 | 1.608391e-19 |
| con_com | 11.879012 | 2.952768 | 4.023010 | 6.033242e-05 |
| con_lib | 10.821749 | 1.776633 | 6.091155 | 1.423257e-09 |
| lib_com | 28.134638 | 2.997012 | 9.387563 | 2.206575e-20 |
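The fitted regression equation is

$$
\widehat{\text{Knowledge}}_i = 48.66 + 16.89(\text{All}_i) + 7.80(\text{Con}_i) + 14.97(\text{Com}_i) + 16.95(\text{Lib}_i) + 11.88(\text{Con\_Com}_i) + 10.82(\text{Con\_Lib}_i) + 28.13(\text{Lib\_Com}_i)
$$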
Recall from the previous chapter, the intercept coefficient is the average Y-value for the reference group. Each partial slope is the difference in average Y-value between the reference group and the group represented by the dummy variable. In our example,
The average news knowledge score for Americans who have no news source is 48.7.
Americans who get their news from all three sources (conservative, comedy, and liberal) have a news knowledge score that is 16.9 points higher, on average, than Americans who have no news source.
Americans who get their news from a conservative source have a news knowledge score that is 7.8 points higher, on average, than Americans who have no news source.
Americans who get their news from a comedy source have a news knowledge score that is 15.0 points higher, on average, than Americans who have no news source.
Americans who get their news from a liberal source have a news knowledge score that is 16.9 points higher, on average, than Americans who have no news source.
Americans who get their news from conservative and comedy sources have a news knowledge score that is 11.9 points higher, on average, than Americans who have no news source.
Americans who get their news from conservative and liberal sources have a news knowledge score that is 10.8 points higher, on average, than Americans who have no news source.
Americans who get their news from liberal and comedy sources have a news knowledge score that is 28.1 points higher, on average, than Americans who have no news source.
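To see why the coefficients have these interpretations, it can help to check them against group means directly. The sketch below uses simulated data with only three groups (the data frame `toy` and its values are made up for illustration, not taken from the pew data) and confirms that the intercept equals the reference-group mean and each slope equals a difference in means.

```r
# Simulated data: three news source groups with made-up scores
set.seed(42)
toy = data.frame(
  source = rep(c("None", "Conservative", "Liberal"), each = 20),
  knowledge = c(rnorm(20, 50, 10), rnorm(20, 57, 10), rnorm(20, 66, 10))
)

# Dummy-coded indicators; "None" is left out as the reference group
toy$con = ifelse(toy$source == "Conservative", 1, 0)
toy$lib = ifelse(toy$source == "Liberal", 1, 0)

# Fit the regression using the two indicators
lm.toy = lm(knowledge ~ 1 + con + lib, data = toy)

# Sample mean of knowledge in each group
group_means = sapply(split(toy$knowledge, toy$source), mean)

# Intercept vs. mean of the reference group
coef(lm.toy)[["(Intercept)"]]
group_means[["None"]]

# Slopes vs. differences from the reference group mean
coef(lm.toy)[["con"]]
group_means[["Conservative"]] - group_means[["None"]]

coef(lm.toy)[["lib"]]
group_means[["Liberal"]] - group_means[["None"]]
```

Each pair of printed values should agree (up to floating-point error), which is the same relationship that holds among the eight groups in the fitted model above.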
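The statistical hypotheses associated with each of the parameters in the model are:

$$H_0: \beta_0 = 0$$
$$H_0: \beta_{\text{All}} = 0$$
$$H_0: \beta_{\text{Con}} = 0$$
$$H_0: \beta_{\text{Com}} = 0$$
$$H_0: \beta_{\text{Lib}} = 0$$
$$H_0: \beta_{\text{Con\_Com}} = 0$$
$$H_0: \beta_{\text{Con\_Lib}} = 0$$
$$H_0: \beta_{\text{Lib\_Com}} = 0$$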
These relate to the following scientific hypotheses, respectively:
The average news knowledge score for Americans who have no news source (reference group) is 0.
The average news knowledge score for Americans who get their news from all three sources (conservative, comedy, and liberal) is not different than the average news knowledge score for Americans who have no news source.
The average news knowledge score for Americans who get their news from conservative sources is not different than the average news knowledge score for Americans who have no news source.
The average news knowledge score for Americans who get their news from comedy sources is not different than the average news knowledge score for Americans who have no news source.
The average news knowledge score for Americans who get their news from liberal sources is not different than the average news knowledge score for Americans who have no news source.
The average news knowledge score for Americans who get their news from conservative and comedy sources is not different than the average news knowledge score for Americans who have no news source.
The average news knowledge score for Americans who get their news from conservative and liberal sources is not different than the average news knowledge score for Americans who have no news source.
The average news knowledge score for Americans who get their news from liberal and comedy sources is not different than the average news knowledge score for Americans who have no news source.
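Because the scientific hypotheses are really about comparisons of conditional means, the statistical hypotheses can also be written to reflect this as:

$$H_0: \mu_{\text{None}} = 0$$
$$H_0: \mu_{\text{All}} = \mu_{\text{None}} \quad \text{or equivalently} \quad H_0: \mu_{\text{All}} - \mu_{\text{None}} = 0$$
$$H_0: \mu_{\text{Con}} = \mu_{\text{None}} \quad \text{or equivalently} \quad H_0: \mu_{\text{Con}} - \mu_{\text{None}} = 0$$
$$H_0: \mu_{\text{Com}} = \mu_{\text{None}} \quad \text{or equivalently} \quad H_0: \mu_{\text{Com}} - \mu_{\text{None}} = 0$$
$$H_0: \mu_{\text{Lib}} = \mu_{\text{None}} \quad \text{or equivalently} \quad H_0: \mu_{\text{Lib}} - \mu_{\text{None}} = 0$$
$$H_0: \mu_{\text{Con\_Com}} = \mu_{\text{None}} \quad \text{or equivalently} \quad H_0: \mu_{\text{Con\_Com}} - \mu_{\text{None}} = 0$$
$$H_0: \mu_{\text{Con\_Lib}} = \mu_{\text{None}} \quad \text{or equivalently} \quad H_0: \mu_{\text{Con\_Lib}} - \mu_{\text{None}} = 0$$
$$H_0: \mu_{\text{Lib\_Com}} = \mu_{\text{None}} \quad \text{or equivalently} \quad H_0: \mu_{\text{Lib\_Com}} - \mu_{\text{None}} = 0$$

where $\mu_j$ represents the average news knowledge for Americans who get their news from source j (e.g., $\mu_{\text{Con}}$ indicates the average news knowledge score for Americans who get their news from conservative sources).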
It is the evaluation of the hypotheses associated with the partial slopes in the model that allows us to answer our question about which news sources show differences in the average news knowledge. The p-values associated with the partial slope coefficients indicate whether the observed differences in news scores (relative to the reference group) are simply due to sampling error, or whether there is a statistically discernible difference in the news knowledge between the groups. Based on the p-values, all seven groups have an average news knowledge score that is statistically discernible from the average news knowledge score for Americans who have no news source. Moreover, based on the positive coefficients, we can assume that those scores are higher than the average news knowledge score for Americans who have no news source.
19.2.1 Alternative Expression of the Model-Level Null Hypothesis
Recall that one expression of the null hypothesis associated with the model-level test in multiple regression is that all the partial slopes are zero. In general,

$$H_0: \beta_1 = \beta_2 = \beta_3 = \ldots = \beta_k = 0$$
When we use multiple dummy-coded indicator variables to represent a categorical predictor, each partial slope represents the mean difference between two groups, and the effect of that categorical predictor is composed of ALL sets of differences between two groups (pairwise differences). In our example, there are 28 pairwise differences! (Think of all the ways we can compare two of the eight different groups.)
All sources vs. conservative sources
All sources vs. comedy sources
All sources vs. liberal sources
All sources vs. conservative/comedy sources
All sources vs. conservative/liberal sources
All sources vs. liberal/comedy sources
All sources vs. no sources
Conservative sources vs. comedy sources
Conservative sources vs. liberal sources
Conservative sources vs. conservative/comedy sources
Conservative sources vs. conservative/liberal sources
Conservative sources vs. liberal/comedy sources
Conservative sources vs. no sources
Comedy sources vs. liberal sources
Comedy sources vs. conservative/comedy sources
Comedy sources vs. conservative/liberal sources
Comedy sources vs. liberal/comedy sources
Comedy sources vs. no sources
Liberal sources vs. conservative/comedy sources
Liberal sources vs. conservative/liberal sources
Liberal sources vs. liberal/comedy sources
Liberal sources vs. no sources
Conservative/comedy sources vs. conservative/liberal sources
Conservative/comedy sources vs. liberal/comedy sources
Conservative/comedy sources vs. no sources
Conservative/liberal sources vs. liberal/comedy sources
Conservative/liberal sources vs. no sources
Liberal/comedy sources vs. no sources
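The count of 28 comes from choosing 2 groups out of 8, which is 8 × 7 / 2. As a quick check, the sketch below counts and enumerates the pairs with base R's `choose()` and `combn()`; the vector name `sources` and the labels in it are just illustrative.

```r
# Labels for the eight news source groups (illustrative names)
sources = c("All", "Conservative", "Comedy", "Liberal",
            "Conservative/Comedy", "Conservative/Liberal",
            "Liberal/Comedy", "None")

# Number of ways to choose 2 groups from 8
n_pairs = choose(length(sources), 2)
n_pairs

# Enumerate every pair; each row is one pairwise comparison
pairs = t(combn(sources, 2))
nrow(pairs)
head(pairs)
```

With the groups in this order, the rows of `pairs` follow the same order as the list of comparisons above.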
The model-level null hypothesis can be expressed as:

$$H_0: \mu_j - \mu_{j'} = 0 \quad \text{for every pair of news sources } j \neq j'$$
The test at the model-level is considering all 28 pairwise differences simultaneously. If the model-level test is statistically discernible from 0, any one (or more than one) of the 28 differences may not be zero.
In our example, we found that the model-level test had a small p-value, so at least one of the 28 comparisons would have results that are statistically discernible from 0.
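Note that in our coefficient-level output, we found that 7 of the potential comparisons (those that compare to the “None” group) yielded results that were statistically discernible from zero. That is, based on those results we believe:

$$\mu_{\text{All}} \neq \mu_{\text{None}}$$
$$\mu_{\text{Con}} \neq \mu_{\text{None}}$$
$$\mu_{\text{Com}} \neq \mu_{\text{None}}$$
$$\mu_{\text{Lib}} \neq \mu_{\text{None}}$$
$$\mu_{\text{Con\_Com}} \neq \mu_{\text{None}}$$
$$\mu_{\text{Con\_Lib}} \neq \mu_{\text{None}}$$
$$\mu_{\text{Lib\_Com}} \neq \mu_{\text{None}}$$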
Unfortunately, the coefficient-level output from the regression we fitted does not give us any information about the other 21 comparisons. In order to get information about the other pairwise comparisons, we need to fit additional regression models using a different reference group.
When you have more than two levels in your categorical predictor, you will need to fit several regression models using different reference groups in order to examine ALL potential coefficient-level differences.
19.3 Pairwise Comparisons for Americans who get their News from Conservative Sources
Consider the pairwise comparisons between Americans who get their news from conservative sources and those who get their news from other sources. There are 7 such comparisons. We already evaluated one of those, namely the comparison between Americans who get their news from conservative sources and those who have no news source. To evaluate the other 6 differences, we need to fit a model in which con is the reference group. By doing this, the partial slopes will indicate the difference in average news knowledge between Americans who get their news from each of the other sources and those who get their news from conservative sources. Below, we fit this model (using con as the reference group) to predict variation in news knowledge.
# Conservative news sources is reference group
lm.con = lm(knowledge ~ 1 + all + com + lib + con_com + con_lib + lib_com + none, data = pew)

# Model-level info
glance(lm.con) |>
  print(width = Inf)
Note that the model-level output for this fitted model is exactly the same as that for the model in which none was the reference group. That is because at the model-level, this model is testing the exact same hypothesis as the previous model (to examine whether the 28 sets of pairwise differences explain variation in news knowledge). The results suggest that at least one of the pairwise differences is statistically discernible from 0; that is there are differences in the average amount of news knowledge between at least two of the groups.
# Coefficient-level info
tidy(lm.con)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 56.464286 | 1.140787 | 49.495905 | 3.204196e-317 |
| all | 9.083659 | 2.791154 | 3.254445 | 1.161555e-03 |
| com | 7.169861 | 3.585421 | 1.999726 | 4.571061e-02 |
| lib | 9.145089 | 1.941294 | 4.710821 | 2.696460e-06 |
| con_com | 4.076698 | 3.011162 | 1.353862 | 1.759852e-01 |
| con_lib | 3.019435 | 1.872081 | 1.612876 | 1.069826e-01 |
| lib_com | 20.332324 | 3.054561 | 6.656383 | 3.929678e-11 |
| none | -7.802314 | 1.501509 | -5.196315 | 2.313551e-07 |
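The fitted regression equation, which is different than the previous fitted equation, is:

$$
\widehat{\text{Knowledge}}_i = 56.46 + 9.08(\text{All}_i) + 7.17(\text{Com}_i) + 9.15(\text{Lib}_i) + 4.08(\text{Con\_Com}_i) + 3.02(\text{Con\_Lib}_i) + 20.33(\text{Lib\_Com}_i) - 7.80(\text{None}_i)
$$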
Interpreting these values,
The average news knowledge score for Americans who get their news from conservative sources is 56.5. The empirical evidence suggests that this is statistically discernible from 0; $t = 49.50$, $p < .001$.
Americans who get their news from all three sources (conservative, comedy, and liberal) have a news knowledge score that is 9.1 points higher, on average, than Americans who get their news from conservative sources. The empirical evidence suggests that this is statistically discernible from 0; $t = 3.25$, $p = .001$.
Americans who get their news from a comedy source have a news knowledge score that is 7.2 points higher, on average, than Americans who get their news from conservative sources. The empirical evidence suggests that this is statistically discernible from 0; $t = 2.00$, $p = .046$.
Americans who get their news from a liberal source have a news knowledge score that is 9.1 points higher, on average, than Americans who get their news from conservative sources. The empirical evidence suggests that this is statistically discernible from 0; $t = 4.71$, $p < .001$.
Americans who get their news from conservative and comedy sources have a news knowledge score that is 4.1 points higher, on average, than Americans who get their news from conservative sources. The empirical evidence suggests that this is NOT statistically discernible from 0; $t = 1.35$, $p = .176$.
Americans who get their news from conservative and liberal sources have a news knowledge score that is 3.0 points higher, on average, than Americans who get their news from conservative sources. The empirical evidence suggests that this is NOT statistically discernible from 0; $t = 1.61$, $p = .107$.
Americans who get their news from liberal and comedy sources have a news knowledge score that is 20.3 points higher, on average, than Americans who get their news from conservative sources. The empirical evidence suggests that this is statistically discernible from 0; $t = 6.66$, $p < .001$.
Americans who don’t get their news from any source have a news knowledge score that is 7.8 points lower, on average, than Americans who get their news from conservative sources. The empirical evidence suggests that this is statistically discernible from 0; $t = -5.20$, $p < .001$.
Note that the last comparison (with the none group) is the comparison we already evaluated in the previous model. The coefficient-level output for this comparison is redundant with that from the previous model; the only difference is that the signs on the coefficient and t-value are reversed because we changed the reference group. (The standard errors and p-values are identical.)
19.4 Presenting Results of the Pairwise Comparison Tests
Now we have statistical results from 13 of the 28 comparisons. Before we go on and fit other regression models to evaluate our other pairwise comparisons, let’s consider how we might want to present these results to a reader. As always, we can do this by presenting the results in the text of the manuscript, in a table, or in a visualization. If the number of pairwise comparisons is small (say three or fewer), presenting them in the text is reasonable as doing so would take less space than using a table or plot. When you have a larger number of comparisons to present, doing so in text is likely not a good choice due to the cognitive burden this would place on the reader.
A table is a good choice for a small to moderate number of comparisons (e.g., roughly 4 to 10 comparisons). For example, Table 19.2 presents the results for three pairwise comparisons carried out as part of a research study examining the effects of family structure on teen substance use.
Table 19.2: Pairwise comparisons of adolescent substance use between three family structures.
| Comparison | Mean Difference | p |
|---|---|---|
| Two-parent vs. Parent/guardian | 0.164 | 0.079 |
| Two-parent vs. Single-parent | 0.221 | 0.005 |
| Parent/guardian vs. Single-parent | 0.057 | 0.609 |
Once you have more than 10 comparisons, presenting this information in a table can be overwhelming. (If you really need to do this, consider putting the table in an appendix or in online resources rather than in the manuscript itself.) In our case, presenting the results from 28 comparisons would not only make for a long table, but also probably result in readers just skipping over the table. Instead, we will consider presenting our results in a visualization.
There are several options for such a visualization. One that I like, which shows the same pairwise comparisons presented in Table 19.2, is the following:
Figure 19.4: Pairwise comparisons of average teen substance use for different family structures.
Instructions: Read across the row for a family structure to compare substance use with the family structures listed along the top of the chart. The symbols indicate whether the average substance use of the family structure in the row is significantly lower than that of the comparison family structure, significantly higher than that of the comparison family structure, or if there is no statistically discernible difference between the average substance use of the two family structures.
The column of mean substance use is handy for readers to see the sample differences. (Including the SD here is not necessary, especially if those are included in a different place in the manuscript.) Note, I listed the family structures from largest mean to smallest mean in the figure (in both the rows and columns). This will put all of the down-pointing triangles in the upper right and the up-pointing triangles in the lower left of the figure. This makes it easier to quickly determine which comparison groups a particular group is higher or lower than. If you have many, many groups, listing them alphabetically makes it easier for readers to find the group they are interested in making comparisons for.
Here we create a similar type of chart to visualize the pairwise comparisons for our news knowledge example. For now, we will fill in only the 13 pairwise comparisons that we have already made. (We will add to this as we evaluate additional pairwise differences.)
Figure 19.5: Pairwise comparisons of average news knowledge for different news sources.
Instructions: Read across the row for a news source to compare news knowledge with the news sources listed along the top of the chart. The symbols indicate whether the average news knowledge of the news source in the row is significantly lower than that of the comparison news source, significantly higher than that of the comparison news source, or if there is no statistically discernible difference between the average news knowledge of the two news sources.
Table-like graphics, such as the pairwise comparison visualizations, are often easier to create in Excel (or some other spreadsheet program) than in R. I created these in Excel and then, after turning off the grid lines, took a screenshot and inserted that into this chapter.
19.5 Pairwise Comparisons for Americans who get their News from the Other Sources
Here we fit the additional regression models needed to get the remaining pairwise comparisons.
# Reference group = All News
lm.all = lm(knowledge ~ 1 + con + com + lib + con_com + con_lib + lib_com + none, data = pew)

# Reference group = Comedy News
lm.com = lm(knowledge ~ 1 + all + con + lib + con_com + con_lib + lib_com + none, data = pew)

# Reference group = Liberal News
lm.lib = lm(knowledge ~ 1 + all + con + com + con_com + con_lib + lib_com + none, data = pew)

# Reference group = Conservative and Comedy News
lm.con_com = lm(knowledge ~ 1 + all + con + com + lib + con_lib + lib_com + none, data = pew)

# Reference group = Conservative and Liberal News
lm.con_lib = lm(knowledge ~ 1 + all + con + com + lib + con_com + lib_com + none, data = pew)
Below we display the coefficient-level output for each of the models. We do not write out the fitted equations, but this should be old hat for you by now. 🤗
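Writing out one model per reference group by hand works, but it is repetitive. As an alternative to the hand-built indicators used in this chapter (not the approach taken above), the sketch below stores the predictor as a factor, changes the reference level with `relevel()`, and lets `lm()` create the dummy variables itself. The data are simulated (`toy`, with four made-up groups), and the object name `fits` is an assumption for illustration.

```r
# Simulated data: four news source groups with made-up scores
set.seed(1)
toy = data.frame(
  source = factor(rep(c("None", "Conservative", "Liberal", "Comedy"), each = 25)),
  knowledge = rnorm(100, mean = rep(c(49, 56, 66, 64), each = 25), sd = 15)
)

# Fit one model per reference group by re-leveling the factor
fits = lapply(levels(toy$source), function(ref) {
  d = toy
  d$source = relevel(d$source, ref = ref)  # make `ref` the reference group
  lm(knowledge ~ 1 + source, data = d)     # lm() dummy-codes the factor
})
names(fits) = levels(toy$source)

# Coefficient-level output when "None" is the reference group
coef(summary(fits[["None"]]))

# Same comparison from two models: only the sign differs
coef(fits[["None"]])[["sourceConservative"]]
coef(fits[["Conservative"]])[["sourceNone"]]

# Model-level R-squared is the same regardless of reference group
sapply(fits, function(m) summary(m)$r.squared)
```

This relies on R's default treatment contrasts, under which the first level of a factor is the reference group. The coefficient names combine the factor name and the level (e.g., `sourceConservative`).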
Based on these results, we can now update our visualization of the pairwise comparisons.
Figure 19.6: Pairwise comparisons of average news knowledge for different news sources.
Instructions: Read across the row for a news source to compare news knowledge with the news sources listed along the top of the chart. The symbols indicate whether the average news knowledge of the news source in the row is significantly lower than that of the comparison news source, significantly higher than that of the comparison news source, or if there is no statistically discernible difference between the average news knowledge of the two news sources.
19.5.1 Link to the Analysis of Variance Methodology for Testing Mean Differences
In your statistics journey, you may have encountered the one-factor analysis of variance (ANOVA). This method is often introduced as a way of testing mean differences when you have more than two groups. The null hypothesis for this method (referred to as the omnibus null hypothesis) is that all group means are equal.
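Recall that the model-level hypothesis could be written such that the difference in each pair of group means is 0, namely:

$$H_0: \mu_j - \mu_{j'} = 0 \quad \text{for every pair of news sources } j \neq j'$$

Note that the only way for each of these differences to be 0 is if all the means for every news source are equal. This implies that we could also write the model-level null hypothesis as:

$$H_0: \mu_{\text{All}} = \mu_{\text{Con}} = \mu_{\text{Com}} = \mu_{\text{Lib}} = \mu_{\text{Con\_Com}} = \mu_{\text{Con\_Lib}} = \mu_{\text{Lib\_Com}} = \mu_{\text{None}}$$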
This is the same null hypothesis that is associated with the one-factor analysis of variance.
Fitting a regression model with dummy-coded indicator variables gives the exact same results as carrying out a one-factor ANOVA. The difference is that the output from the multiple regression gives $\beta$-terms associated with mean differences (to the reference group), and ANOVA is concerned more directly with the group means. But the model-level regression results are identical to those from the ANOVA. Asking whether the model explains variation in the outcome ($H_0: \beta_1 = \beta_2 = \ldots = \beta_7 = 0$) is the same as asking whether there are mean differences ($H_0: \mu_{\text{All}} = \mu_{\text{Con}} = \ldots = \mu_{\text{None}}$); these are just different ways of writing the model-level null hypothesis!
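The equivalence between the two approaches is easy to check numerically. The sketch below fits the same simulated data (a made-up data frame `toy` with three groups) with `lm()` and with base R's `aov()`, then compares the model-level F-statistics.

```r
# Simulated data: three news source groups with made-up scores
set.seed(7)
toy = data.frame(
  source = factor(rep(c("None", "Conservative", "Liberal"), each = 30)),
  knowledge = rnorm(90, mean = rep(c(49, 56, 66), each = 30), sd = 15)
)

# Regression with a dummy-coded categorical predictor
lm.toy = lm(knowledge ~ 1 + source, data = toy)

# One-factor analysis of variance on the same data
aov.toy = aov(knowledge ~ source, data = toy)

# Model-level F-statistic from the regression
f_lm = summary(lm.toy)$fstatistic[["value"]]

# F-statistic for the source effect from the ANOVA table
f_aov = summary(aov.toy)[[1]][["F value"]][1]

f_lm
f_aov
```

The two F-statistics should be identical (up to floating-point error), as are their degrees of freedom and p-values.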
19.6 Does News Source Predict Variation in News Knowledge After Accounting for Other Covariates?
One question we may have is whether the differences we saw in Americans’ average news knowledge persist after we account for other covariates that also explain differences in news knowledge (e.g., age, education, amount of news consumed, and political engagement). To evaluate this, we will fit a regression model that includes these covariates, along with seven of the eight dummy-coded news source predictors to explain variability in news knowledge.
# News source = None is reference group
lm.none.2 = lm(knowledge ~ 1 + age + education + news + engagement + all + con + com + lib + con_com + con_lib + lib_com, data = pew)

# Model-level info
glance(lm.none.2) |>
  print(width = Inf)
At the model level, differences in news source, age, education level, amount of news consumption, and political engagement explain 35.9% of the variation in Americans’ news knowledge. This is statistically discernible from 0 (i.e., the empirical data are not consistent with the hypothesis that this set of predictors does not explain any variation in Americans’ news knowledge).
tidy(lm.none.2)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -29.0291254 | 3.39691924 | -8.5457214 | 3.111825e-17 |
| age | 0.2225307 | 0.02722194 | 8.1746836 | 6.276044e-16 |
| education | 3.3341011 | 0.20989679 | 15.8844787 | 1.309937e-52 |
| news | 0.1731191 | 0.02604181 | 6.6477381 | 4.164092e-11 |
| engagement | 0.2306146 | 0.03003191 | 7.6789846 | 2.886167e-14 |
| all | 4.1908556 | 2.66342729 | 1.5734823 | 1.158195e-01 |
| con | 1.3846847 | 1.42293557 | 0.9731183 | 3.306525e-01 |
| com | 8.8069808 | 3.02919350 | 2.9073682 | 3.698775e-03 |
| lib | 6.7157173 | 1.64661795 | 4.0784915 | 4.772547e-05 |
| con_com | 8.1553389 | 2.65671829 | 3.0697040 | 2.181411e-03 |
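The fitted regression equation (with coefficients rounded to one decimal place) is

$$
\widehat{\text{Knowledge}}_i = -29.0 + 0.2(\text{Age}_i) + 3.3(\text{Education}_i) + 0.2(\text{News}_i) + 0.2(\text{Engagement}_i) + 4.2(\text{All}_i) + 1.4(\text{Con}_i) + 8.8(\text{Com}_i) + 6.7(\text{Lib}_i) + 8.2(\text{Con\_Com}_i) + 0.2(\text{Con\_Lib}_i) + 16.1(\text{Lib\_Com}_i)
$$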
The interpretations are provided below. While we offer them for the four covariates, we note that the only interpretations that we care about are those for our focal predictors of news source.
Intercept and Covariates
The average news knowledge score for Americans who have no news source, are 0 years old, have 0 years of education, have no news exposure, and are not politically engaged is -29.0 (extrapolation).
Each one-year difference in age is associated with a 0.2-point difference in news knowledge score, on average, after controlling for the other predictors in the model.
Each one-year difference in education is associated with a 3.3-point difference in news knowledge score, on average, after controlling for the other predictors in the model.
Each one-unit difference in news consumption is associated with a 0.2-point difference in news knowledge score, on average, after controlling for the other predictors in the model.
Each one-unit difference in political engagement is associated with a 0.2-point difference in news knowledge score, on average, after controlling for the other predictors in the model.
Focal Predictors
Americans who get their news from all three sources (conservative, comedy, and liberal) have a news knowledge score that is 4.2 points higher than Americans who have no news source, on average, after controlling for the other predictors in the model.
Americans who get their news from conservative sources have a news knowledge score that is 1.4 points higher than Americans who have no news source, on average, after controlling for the other predictors in the model.
Americans who get their news from comedy sources have a news knowledge score that is 8.8 points higher than Americans who have no news source, on average, after controlling for the other predictors in the model.
Americans who get their news from liberal sources have a news knowledge score that is 6.7 points higher than Americans who have no news source, on average, after controlling for the other predictors in the model.
Americans who get their news from conservative and comedy sources have a news knowledge score that is 8.2 points higher than Americans who have no news source, on average, after controlling for the other predictors in the model.
Americans who get their news from conservative and liberal sources have a news knowledge score that is 0.2 points higher than Americans who have no news source, on average, after controlling for the other predictors in the model.
Americans who get their news from liberal and comedy sources have a news knowledge score that is 16.1 points higher than Americans who have no news source, on average, after controlling for the other predictors in the model.
Another methodology you may have encountered in your statistics journey is analysis of covariance (ANCOVA). This method is often introduced as a way of testing mean differences while controlling for other covariates. Again, this is exactly what we are evaluating in the regression model when we included the covariates of age, education level, amount of news consumption, and political engagement. Just as you can carry out ANOVA using regression, you can also carry out an ANCOVA using regression.
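To make the connection concrete, here is a minimal sketch using simulated data (not the pew data). The variable names y, x, and group, and all of the numeric values, are made up for illustration. We fit a covariate-only model and a model that adds the group dummies, and then compare the two with anova(). This tests for group differences after controlling for the covariate, which is the question ANCOVA asks.

```r
# Simulated data: three groups and one covariate (all values are hypothetical)
set.seed(42)
n = 300
group = factor(sample(c("a", "b", "c"), size = n, replace = TRUE))
x = rnorm(n, mean = 50, sd = 10)
y = 10 + 0.5 * x + 3 * (group == "b") + 6 * (group == "c") + rnorm(n, sd = 5)
sim = data.frame(y, x, group)

# Covariate-only model and model that adds the group dummies
lm.cov  = lm(y ~ 1 + x, data = sim)
lm.full = lm(y ~ 1 + x + group, data = sim)

# Coefficients on groupb and groupc are adjusted mean differences from group "a"
coef(lm.full)

# Nested F-test: do the groups differ after controlling for x?
ancova_test = anova(lm.cov, lm.full)
ancova_test
```

The coefficients on the group terms in lm.full play the same role as the news source coefficients in our model: each is a difference from the reference group, holding the covariate constant.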
19.6.1 Adjusted Mean Differences and Adjusted Means
The mean differences we obtained from the regression model that included our covariates are referred to as controlled mean differences. In the language of ANCOVA, the controlled mean differences are referred to as Adjusted Mean Differences. To separate these from the mean differences we obtained from the model with no covariates, we refer to the latter as Unadjusted Mean Differences.
For example, in the model that did not include any covariates (lm.none), the difference between the news knowledge scores for Americans who get their news from liberal and comedy sources and those that have no news source is 28.1 points. This is the unadjusted mean difference between these two groups of Americans. Once we account for differences explained by age, education level, amount of news consumption, and political engagement, the difference in news knowledge is 16.1 points. This is the controlled (or adjusted) mean difference between these two groups of Americans.
The unadjusted mean differences are based on the difference between the unadjusted means. That is, we computed the means for the two groups and then computed the difference between them. Earlier we found that the mean news knowledge score for Americans who get their news from liberal and comedy sources was 76.8 and that for Americans who have no news source was 48.7. The difference (unadjusted mean difference) is $76.8 - 48.7 = 28.1$. This is exactly the value that the unadjusted regression model (i.e., the model with no covariates) produces.
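The equivalence between a difference in group means and a dummy-variable coefficient can be checked on any two-group data set. Below is a small sketch with simulated scores. The group labels mirror ours, but the data and the object names (sim, lm.sim) are hypothetical, not taken from pew.

```r
# Simulated two-group data (values are made up; only the structure matters)
set.seed(1)
sim = data.frame(
  group = rep(c("none", "lib_com"), times = c(60, 40)),
  score = c(rnorm(60, mean = 49, sd = 10), rnorm(40, mean = 77, sd = 10))
)

# Dummy variable: 1 = liberal and comedy, 0 = no news source
sim$lib_com = ifelse(sim$group == "lib_com", 1, 0)

# Difference between the two unadjusted group means
group_means = tapply(sim$score, sim$group, mean)
mean_diff = group_means[["lib_com"]] - group_means[["none"]]

# Slope from the regression with no covariates
lm.sim = lm(score ~ 1 + lib_com, data = sim)
slope = coef(lm.sim)[["lib_com"]]

# The two quantities agree (up to floating-point error)
all.equal(mean_diff, slope)
```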
Similarly, the adjusted mean differences are based on the difference between the adjusted means. So how do we compute the adjusted means? Remember that the predicted values from regression models are means. The predicted values from the unadjusted regression model are unadjusted means, and the predicted values from the adjusted regression model (i.e., the model with covariates) are adjusted means. So we need to substitute values into our fitted equation to produce the predicted news knowledge values for each of the different news sources.
For the news source predictors we will substitute 0 or 1 into each of the predictors based on the news source we are computing the adjusted mean for. For the covariates, we can use any value we want, but it is typical to substitute the mean values of the covariates when computing adjusted means. Below we compute the adjusted means for each news source.
# Mean age = 50.94
# Mean education = 13.95
# Mean news consumption = 50.27
# Mean political engagement = 73.41

# Compute adjusted mean for none source
-29.0 + 0.2*50.94 + 3.3*13.95 + 0.2*50.27 + 0.2*73.41 + 4.2*0 + 1.4*0 + 8.8*0 + 6.7*0 + 8.2*0 + 0.2*0 + 16.1*0
[1] 51.959

# Mean age = 50.94
# Mean education = 13.95
# Mean news consumption = 50.27
# Mean political engagement = 73.41

# Compute adjusted mean for all source
-29.0 + 0.2*50.94 + 3.3*13.95 + 0.2*50.27 + 0.2*73.41 + 4.2*1 + 1.4*0 + 8.8*0 + 6.7*0 + 8.2*0 + 0.2*0 + 16.1*0
[1] 56.159

# Mean age = 50.94
# Mean education = 13.95
# Mean news consumption = 50.27
# Mean political engagement = 73.41

# Compute adjusted mean for conservative source
-29.0 + 0.2*50.94 + 3.3*13.95 + 0.2*50.27 + 0.2*73.41 + 4.2*0 + 1.4*1 + 8.8*0 + 6.7*0 + 8.2*0 + 0.2*0 + 16.1*0
[1] 53.359

# Mean age = 50.94
# Mean education = 13.95
# Mean news consumption = 50.27
# Mean political engagement = 73.41

# Compute adjusted mean for comedy source
-29.0 + 0.2*50.94 + 3.3*13.95 + 0.2*50.27 + 0.2*73.41 + 4.2*0 + 1.4*0 + 8.8*1 + 6.7*0 + 8.2*0 + 0.2*0 + 16.1*0
[1] 60.759

# Mean age = 50.94
# Mean education = 13.95
# Mean news consumption = 50.27
# Mean political engagement = 73.41

# Compute adjusted mean for liberal source
-29.0 + 0.2*50.94 + 3.3*13.95 + 0.2*50.27 + 0.2*73.41 + 4.2*0 + 1.4*0 + 8.8*0 + 6.7*1 + 8.2*0 + 0.2*0 + 16.1*0
[1] 58.659

# Mean age = 50.94
# Mean education = 13.95
# Mean news consumption = 50.27
# Mean political engagement = 73.41

# Compute adjusted mean for conservative and comedy source
-29.0 + 0.2*50.94 + 3.3*13.95 + 0.2*50.27 + 0.2*73.41 + 4.2*0 + 1.4*0 + 8.8*0 + 6.7*0 + 8.2*1 + 0.2*0 + 16.1*0
[1] 60.159

# Mean age = 50.94
# Mean education = 13.95
# Mean news consumption = 50.27
# Mean political engagement = 73.41

# Compute adjusted mean for conservative and liberal source
-29.0 + 0.2*50.94 + 3.3*13.95 + 0.2*50.27 + 0.2*73.41 + 4.2*0 + 1.4*0 + 8.8*0 + 6.7*0 + 8.2*0 + 0.2*1 + 16.1*0
[1] 52.159

# Mean age = 50.94
# Mean education = 13.95
# Mean news consumption = 50.27
# Mean political engagement = 73.41

# Compute adjusted mean for liberal and comedy source
-29.0 + 0.2*50.94 + 3.3*13.95 + 0.2*50.27 + 0.2*73.41 + 4.2*0 + 1.4*0 + 8.8*0 + 6.7*0 + 8.2*0 + 0.2*0 + 16.1*1
[1] 68.059
Here we present the adjusted mean news knowledge scores in descending order. (In a manuscript, these are often presented, along with the unadjusted means, in a table.)
Liberal and Comedy: 68.06
Comedy: 60.76
Conservative and Comedy: 60.16
Liberal: 58.66
All: 56.16
Conservative: 53.36
Conservative and Liberal: 52.16
None: 51.96
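The eight hand calculations can also be carried out in one step with named vectors. The sketch below uses the same rounded coefficients and covariate means as the calculations above. The object names (covariate_slopes, source_effects, and so on) are introduced here only for illustration.

```r
# Rounded coefficients from the fitted equation used in the hand calculations
intercept = -29.0
covariate_slopes = c(age = 0.2, education = 3.3, news = 0.2, engagement = 0.2)
covariate_means  = c(age = 50.94, education = 13.95, news = 50.27, engagement = 73.41)

# Coefficient for each news source dummy (reference group 'None' gets 0)
source_effects = c(
  "None" = 0,
  "All" = 4.2,
  "Conservative" = 1.4,
  "Comedy" = 8.8,
  "Liberal" = 6.7,
  "Conservative and Comedy" = 8.2,
  "Conservative and Liberal" = 0.2,
  "Liberal and Comedy" = 16.1
)

# Part of the prediction shared by every group (covariates held at their means)
baseline = intercept + sum(covariate_slopes * covariate_means)

# Adjusted mean for each group, sorted from largest to smallest
adjusted_means = baseline + source_effects
sort(adjusted_means, decreasing = TRUE)
```

Because every group shares the same baseline, the adjusted means differ only by the news source coefficients.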
Using these, we can compute the adjusted mean differences. For example, the adjusted mean difference between Americans who get their news from liberal and comedy sources and Americans who have no news source is $68.06 - 51.96 = 16.1$. This is, again, exactly the value that the adjusted regression model (i.e., the model with covariates) produces as a coefficient.
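All of the adjusted mean differences can be computed at once by subtracting every adjusted mean from every other one. The sketch below does this with outer(), using the rounded adjusted means listed above. The object names are hypothetical.

```r
# Adjusted means (rounded to two decimals) from the list above
adjusted_means = c(
  "Liberal and Comedy" = 68.06,
  "Comedy" = 60.76,
  "Conservative and Comedy" = 60.16,
  "Liberal" = 58.66,
  "All" = 56.16,
  "Conservative" = 53.36,
  "Conservative and Liberal" = 52.16,
  "None" = 51.96
)

# Row group minus column group for every pair
diffs = outer(adjusted_means, adjusted_means, FUN = "-")
round(diffs, 1)

# Liberal and Comedy versus None
diffs["Liberal and Comedy", "None"]

# Number of unique pairwise comparisons among 8 groups
sum(upper.tri(diffs))
```

These are point estimates only. To judge whether a difference is statistically discernible we still need the standard errors, which is why we refit the model with different reference groups in the next section.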
19.7 Obtaining the Other Adjusted Pairwise Comparisons
Here we will fit several other models (using the same set of covariates) to obtain the remaining adjusted pairwise comparisons. We will use these results to build another visualization presenting the adjusted mean values and the adjusted comparisons.
I will enter the news source predictors in the model in order from largest to smallest adjusted mean values. While the order is irrelevant to obtaining the results, it will be easier to construct our visualization since the predictor order will match the order in which the groups appear in the visualization.
# News source = Liberal and Comedy is reference group
lm.lib_com.2 = lm(knowledge ~ 1 + age + education + news + engagement + com + con_com + lib + all + con + con_lib + none, data = pew)

# Coefficient-level output
tidy(lm.lib_com.2)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -12.9130715 | 4.40945812 | -2.928494 | 3.457838e-03 |
| age | 0.2225307 | 0.02722194 | 8.174684 | 6.276044e-16 |
| education | 3.3341011 | 0.20989679 | 15.884479 | 1.309937e-52 |
| news | 0.1731191 | 0.02604181 | 6.647738 | 4.164092e-11 |
| engagement | 0.2306146 | 0.03003191 | 7.678985 | 2.886167e-14 |
| com | -7.3090731 | 3.78453199 | -1.931302 | 5.363523e-02 |
| con_com | -7.9607151 | 3.38955978 | -2.348599 | 1.897429e-02 |
| lib | -9.4003366 | 2.78566060 | -3.374545 | 7.583678e-04 |
| all | -11.9251983 | 3.26823544 | -3.648819 | 2.725359e-04 |
| con | -14.7313693 | 2.61423730 | -5.635054 | 2.089358e-08 |
# News source = Comedy is reference group
lm.com.2 = lm(knowledge ~ 1 + age + education + news + engagement + lib_com + con_com + lib + all + con + con_lib + none, data = pew)

# Coefficient-level output
tidy(lm.com.2)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -20.2221446 | 4.53608811 | -4.4580582 | 8.889728e-06 |
| age | 0.2225307 | 0.02722194 | 8.1746836 | 6.276044e-16 |
| education | 3.3341011 | 0.20989679 | 15.8844787 | 1.309937e-52 |
| news | 0.1731191 | 0.02604181 | 6.6477381 | 4.164092e-11 |
| engagement | 0.2306146 | 0.03003191 | 7.6789846 | 2.886167e-14 |
| lib_com | 7.3090731 | 3.78453199 | 1.9313017 | 5.363523e-02 |
| con_com | -0.6516419 | 3.77620038 | -0.1725655 | 8.630164e-01 |
| lib | -2.0912635 | 3.18022261 | -0.6575840 | 5.109070e-01 |
| all | -4.6161252 | 3.73717480 | -1.2351911 | 2.169541e-01 |
| con | -7.4222962 | 3.06380740 | -2.4225727 | 1.552963e-02 |
# News source = Conservative and Comedy is reference group
lm.con_com.2 = lm(knowledge ~ 1 + age + education + news + engagement + lib_com + com + lib + all + con + con_lib + none, data = pew)

# Coefficient-level output
tidy(lm.con_com.2)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -20.8737865 | 4.09574489 | -5.0964567 | 3.903523e-07 |
| age | 0.2225307 | 0.02722194 | 8.1746836 | 6.276044e-16 |
| education | 3.3341011 | 0.20989679 | 15.8844787 | 1.309937e-52 |
| news | 0.1731191 | 0.02604181 | 6.6477381 | 4.164092e-11 |
| engagement | 0.2306146 | 0.03003191 | 7.6789846 | 2.886167e-14 |
| lib_com | 7.9607151 | 3.38955978 | 2.3485985 | 1.897429e-02 |
| com | 0.6516419 | 3.77620038 | 0.1725655 | 8.630164e-01 |
| lib | -1.4396216 | 2.79025547 | -0.5159461 | 6.059686e-01 |
| all | -3.9644832 | 3.23744722 | -1.2245708 | 2.209306e-01 |
| con | -6.7706542 | 2.57165245 | -2.6328030 | 8.555893e-03 |
# News source = Liberal is reference group
lm.lib.2 = lm(knowledge ~ 1 + age + education + news + engagement + lib_com + com + con_com + all + con + con_lib + none, data = pew)

# Coefficient-level output
tidy(lm.lib.2)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -22.3134081 | 3.89149357 | -5.7338931 | 1.186321e-08 |
| age | 0.2225307 | 0.02722194 | 8.1746836 | 6.276044e-16 |
| education | 3.3341011 | 0.20989679 | 15.8844787 | 1.309937e-52 |
| news | 0.1731191 | 0.02604181 | 6.6477381 | 4.164092e-11 |
| engagement | 0.2306146 | 0.03003191 | 7.6789846 | 2.886167e-14 |
| lib_com | 9.4003366 | 2.78566060 | 3.3745448 | 7.583678e-04 |
| com | 2.0912635 | 3.18022261 | 0.6575840 | 5.109070e-01 |
| con_com | 1.4396216 | 2.79025547 | 0.5159461 | 6.059686e-01 |
| all | -2.5248617 | 2.70841685 | -0.9322279 | 3.513698e-01 |
| con | -5.3310327 | 1.68593832 | -3.1620568 | 1.598163e-03 |
# News source = All is reference group
lm.all.2 = lm(knowledge ~ 1 + age + education + news + engagement + lib_com + com + con_com + lib + con + con_lib + none, data = pew)

# Coefficient-level output
tidy(lm.all.2)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -24.8382698 | 4.31200319 | -5.7602624 | 1.018521e-08 |
| age | 0.2225307 | 0.02722194 | 8.1746836 | 6.276044e-16 |
| education | 3.3341011 | 0.20989679 | 15.8844787 | 1.309937e-52 |
| news | 0.1731191 | 0.02604181 | 6.6477381 | 4.164092e-11 |
| engagement | 0.2306146 | 0.03003191 | 7.6789846 | 2.886167e-14 |
| lib_com | 11.9251983 | 3.26823544 | 3.6488186 | 2.725359e-04 |
| com | 4.6161252 | 3.73717480 | 1.2351911 | 2.169541e-01 |
| con_com | 3.9644832 | 3.23744722 | 1.2245708 | 2.209306e-01 |
| lib | 2.5248617 | 2.70841685 | 0.9322279 | 3.513698e-01 |
| con | -2.8061710 | 2.46426770 | -1.1387444 | 2.549928e-01 |
# News source = Conservative is reference group
lm.con.2 = lm(knowledge ~ 1 + age + education + news + engagement + lib_com + com + con_com + lib + all + con_lib + none, data = pew)

# Coefficient-level output
tidy(lm.con.2)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -27.6444408 | 3.59359370 | -7.6927007 | 2.603531e-14 |
| age | 0.2225307 | 0.02722194 | 8.1746836 | 6.276044e-16 |
| education | 3.3341011 | 0.20989679 | 15.8844787 | 1.309937e-52 |
| news | 0.1731191 | 0.02604181 | 6.6477381 | 4.164092e-11 |
| engagement | 0.2306146 | 0.03003191 | 7.6789846 | 2.886167e-14 |
| lib_com | 14.7313693 | 2.61423730 | 5.6350543 | 2.089358e-08 |
| com | 7.4222962 | 3.06380740 | 2.4225727 | 1.552963e-02 |
| con_com | 6.7706542 | 2.57165245 | 2.6328030 | 8.555893e-03 |
| lib | 5.3310327 | 1.68593832 | 3.1620568 | 1.598163e-03 |
| all | 2.8061710 | 2.46426770 | 1.1387444 | 2.549928e-01 |
# News source = Conservative and Liberal is reference group
lm.con_lib.2 = lm(knowledge ~ 1 + age + education + news + engagement + lib_com + com + con_com + lib + all + con + none, data = pew)

# Coefficient-level output
tidy(lm.con_lib.2)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -28.8022907 | 3.84255779 | -7.4956038 | 1.126868e-13 |
| age | 0.2225307 | 0.02722194 | 8.1746836 | 6.276044e-16 |
| education | 3.3341011 | 0.20989679 | 15.8844787 | 1.309937e-52 |
| news | 0.1731191 | 0.02604181 | 6.6477381 | 4.164092e-11 |
| engagement | 0.2306146 | 0.03003191 | 7.6789846 | 2.886167e-14 |
| lib_com | 15.8892192 | 2.74123626 | 5.7963698 | 8.257194e-09 |
| com | 8.5801461 | 3.25122172 | 2.6390529 | 8.400483e-03 |
| con_com | 7.9285041 | 2.69549904 | 2.9413864 | 3.317918e-03 |
| lib | 6.4888826 | 1.99424162 | 3.2538096 | 1.164207e-03 |
| all | 3.9640209 | 2.50798552 | 1.5805597 | 1.141910e-01 |
# News source = None is reference group
lm.none.2 = lm(knowledge ~ 1 + age + education + news + engagement + lib_com + com + con_com + lib + all + con + con_lib, data = pew)

# Coefficient-level output
tidy(lm.none.2)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -29.0291254 | 3.39691924 | -8.5457214 | 3.111825e-17 |
| age | 0.2225307 | 0.02722194 | 8.1746836 | 6.276044e-16 |
| education | 3.3341011 | 0.20989679 | 15.8844787 | 1.309937e-52 |
| news | 0.1731191 | 0.02604181 | 6.6477381 | 4.164092e-11 |
| engagement | 0.2306146 | 0.03003191 | 7.6789846 | 2.886167e-14 |
| lib_com | 16.1160539 | 2.69644781 | 5.9767721 | 2.843054e-09 |
| com | 8.8069808 | 3.02919350 | 2.9073682 | 3.698775e-03 |
| con_com | 8.1553389 | 2.65671829 | 3.0697040 | 2.181411e-03 |
| lib | 6.7157173 | 1.64661795 | 4.0784915 | 4.772547e-05 |
| all | 4.1908556 | 2.66342729 | 1.5734823 | 1.158195e-01 |
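Two patterns in the output above are worth noting. The covariate coefficients are identical in all eight fits, and swapping which of two groups is the reference only flips the sign of their comparison (e.g., lib_com is 7.31 in lm.com.2 while com is -7.31 in lm.lib_com.2). The sketch below reproduces both patterns on simulated data. It also uses relevel() on a factor instead of hand-written dummy variables. That is an alternative approach, not the one used in this chapter, and the data and object names are hypothetical.

```r
# Simulated data: three news sources and one covariate (values are made up)
set.seed(7)
sim = data.frame(
  source = factor(rep(c("none", "com", "lib"), each = 50)),
  age = rnorm(150, mean = 50, sd = 12)
)
sim$knowledge = 40 + 0.2 * sim$age + 8 * (sim$source == "com") +
  6 * (sim$source == "lib") + rnorm(150, sd = 10)

# Fit with 'none' as the reference group
sim$source = relevel(sim$source, ref = "none")
lm.ref_none = lm(knowledge ~ 1 + age + source, data = sim)

# Refit with 'com' as the reference group
sim$source_com = relevel(sim$source, ref = "com")
lm.ref_com = lm(knowledge ~ 1 + age + source_com, data = sim)

b_none = coef(lm.ref_none)
b_com  = coef(lm.ref_com)

# com vs. none and none vs. com: same magnitude, opposite sign
b_none[["sourcecom"]]
b_com[["source_comnone"]]

# The covariate slope does not depend on the reference group
b_none[["age"]]
b_com[["age"]]
```

Only the intercept and the group coefficients change with the reference group. The model itself, and therefore its fitted values, stays the same.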
Figure 19.7: Pairwise comparisons of adjusted mean news knowledge for different news sources. Means were adjusted for differences in age, education level, amount of news consumed, and political engagement.
Instructions: Read across the row for a news source to compare its adjusted mean news knowledge with the news sources listed along the top of the chart. The symbols indicate whether the adjusted mean news knowledge of the news source in the row is significantly lower than that of the comparison news source, significantly higher than that of the comparison news source, or if there is no statistically discernible difference between the adjusted mean news knowledge of the two news sources.
Comparing the adjusted pairwise comparisons presented in Figure 19.7 and the unadjusted pairwise comparisons presented in Figure 19.6,³ we observe several things:
The level of news knowledge (i.e., the mean) for the groups changed once we adjusted for our set of covariates.
Based on the adjusted means, the order of the groups (when we rank them from largest to smallest amount of news knowledge) also changed.
Using the unadjusted means, we observed statistically discernible differences in 18 of the 28 paired comparisons. But, after controlling for our set of covariates, we now only observe statistically discernible differences in 15 of the 28 paired comparisons. (This suggests that in those 3 comparisons the reason we were observing differences in news knowledge was not really the source of news, but because those groups differed in age, education level, amount of news consumption, or political engagement.)
19.8 Optional: Another Visualization—Confidence Intervals for Adjusted Means
One other common method for visually presenting the results from group comparisons is to produce confidence intervals for the adjusted means and then plot those for each group. To do this we are going to create a data frame that we can use (along with a fitted model) to produce adjusted means and standard errors for those means. We will use the fitted model lm.none.2 to produce the adjusted means and standard errors. As a reminder, the syntax for producing lm.none.2 was:
# News source = None is reference group
lm.none.2 = lm(knowledge ~ 1 + age + education + news + engagement + all + con + com + lib + con_com + con_lib + lib_com, data = pew)
The data frame we will create needs to have the values that we will be substituting into the fitted equation to get the adjusted mean values. Each row in the data frame will correspond to a person from a particular group (American who gets their news from a particular news source). Because we have 8 news sources, there will be 8 rows in the data frame. Each column will correspond to a different predictor from the fitted lm(). In our case, since we will be using lm.none.2 to produce the adjusted means, we would need 11 columns. Here is the syntax to create our data frame.
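A data frame matching this description can be sketched as follows. Two things here are assumptions. The covariate means are the rounded values used in the earlier hand calculations. The row order (None, All, Conservative, Comedy, Liberal, Conservative and Comedy, Conservative and Liberal, Liberal and Comedy) is taken from the group order used later in plot_data.

```r
# One row per news source; covariates are held at their (rounded) means.
# Row order assumed: None, All, Conservative, Comedy, Liberal,
# Conservative and Comedy, Conservative and Liberal, Liberal and Comedy
d = data.frame(
  age        = 50.94,
  education  = 13.95,
  news       = 50.27,
  engagement = 73.41,
  all     = c(0, 1, 0, 0, 0, 0, 0, 0),
  con     = c(0, 0, 1, 0, 0, 0, 0, 0),
  com     = c(0, 0, 0, 1, 0, 0, 0, 0),
  lib     = c(0, 0, 0, 0, 1, 0, 0, 0),
  con_com = c(0, 0, 0, 0, 0, 1, 0, 0),
  con_lib = c(0, 0, 0, 0, 0, 0, 1, 0),
  lib_com = c(0, 0, 0, 0, 0, 0, 0, 1)
)

# View data
d
```

Reading down the dummy columns, the 1s form a diagonal that starts in the second row. The first row is all 0s because 'None' is the reference group.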
And, here are some tips for creating the data frame:
Remember that the data.frame() function defines columns in the data frame.
The column names in the data frame have to match the predictor names in the lm() exactly.
The column names should be given in the same order as the predictors in the lm(). (While this isn’t strictly necessary, it will help to keep things organized.)
Each column will include 8 values (one value for each row).
The columns that are covariates will be set equal to a single value, namely the mean of that covariate. You could also give each of these a vector that repeats the same value 8 times, but R will do this automatically.
The column names corresponding to each of the groups (i.e., our focal dummy variables) will be given a vector of eight values (one for each group we have).
These eight values will all be 0s except for one value which will be 1.
The key is that each value from the vector will be put in a different row, and each row needs to correspond to the dummy coding associated with one of the groups.
Each row in d represents an American who gets their news from one of the eight news sources. Notice that the dummy coding in each row corresponds to a particular group. For example, the dummy coding in Row 1 corresponds to the ‘None’ group, while that in Row 2 corresponds to the ‘All’ group.
When you are creating your data frame, the vector values are filling a column, not a row. But, you need the dummy coding to correspond to rows. It can be helpful to sketch out the data frame on a piece of paper first, and then write the code to create that data frame. Always look at the resulting data you create to be sure it has the correct structure!
Once we have the data, we can use the predict() function to get predicted values (i.e., adjusted means) and standard errors. To do this, we give the function the name of the regression object we are using (lm.none.2). We also include the newdata= argument, which takes the name of the data frame we want to use for the predictions. Finally, we include se = TRUE (which R matches to the full argument name, se.fit) to output the standard errors.
# Obtain adjusted means and standard errors for each row in d
predict(lm.none.2, newdata = d, se = TRUE)
The adjusted mean values are provided in the $fit part of the output. There should be 8 values, one for each row in d. So the first fitted value (54.44942) is the adjusted mean for the first row in d, which corresponds to the ‘None’ group.⁴ Similarly, there are 8 standard errors (in the $se.fit part of the output), each corresponding to a row in d. So the SE for the ‘None’ group is 0.9739073.
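To see the structure of what predict() returns without needing the pew data, here is a small sketch using R's built-in mtcars data. The model and the new_cars data frame are purely illustrative. The point is that the output is a list whose $fit and $se.fit elements each have one value per row of the new data, and whose $df element gives the residual degrees of freedom.

```r
# Illustrative model on a built-in data set (not the pew data)
lm.demo = lm(mpg ~ 1 + wt + am, data = mtcars)

# Two rows: weight held at its mean, dummy variable am set to 0 and 1
new_cars = data.frame(
  wt = mean(mtcars$wt),
  am = c(0, 1)
)

# Predicted means and their standard errors
pred = predict(lm.demo, newdata = new_cars, se.fit = TRUE)

pred$fit     # one adjusted mean per row of new_cars
pred$se.fit  # one standard error per row of new_cars
pred$df      # residual degrees of freedom
```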
We are going to create a second data frame that includes the 8 group names (based on the order in d), the adjusted mean values, and the SEs. This will be the data that we use to create our plot of the CIs.
# Create data frame
plot_data = data.frame(
  source = c("None", "All", "Conservative", "Comedy", "Liberal", "Conservative and Comedy", "Conservative and Liberal", "Liberal and Comedy"),
  m = c(54.44942, 58.64027, 55.83410, 63.25640, 61.16513, 62.60476, 54.67625, 70.56547),
  se = c(0.9739073, 2.3062504, 0.9757177, 2.8935646, 1.3579693, 2.3997123, 1.4044781, 2.4375348)
)

# View data
plot_data

| source | m | se |
|---|---|---|
| None | 54.44942 | 0.9739073 |
| All | 58.64027 | 2.3062504 |
| Conservative | 55.83410 | 0.9757177 |
| Comedy | 63.25640 | 2.8935646 |
| Liberal | 61.16513 | 1.3579693 |
| Conservative and Comedy | 62.60476 | 2.3997123 |
| Conservative and Liberal | 54.67625 | 1.4044781 |
| Liberal and Comedy | 70.56547 | 2.4375348 |
Now, we need to create the lower and upper limits of the CI as new columns in the data frame. To do this, recall that the formula for creating a CI is:

$$
\text{CI} = \text{Estimate} \pm t^* \times \text{SE}
$$

where the estimate in our example is the adjusted mean, SE is the standard error, and $t^*$ is a multiplier value based on the residual degrees of freedom for the regression model.⁵ To determine $t^*$, we are going to use the qt() function. This function takes two arguments:
The first argument is a value between 0 and 1 that corresponds to the confidence level you want. For a 95% CI this value will be 0.975.⁶
The second argument, df=, provides the residual degrees-of-freedom for the regression model. (Note this is also given in the predict() output.)
# Compute t-star
qt(.975, df = 1490)
[1] 1.961557
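The value 0.975 comes from splitting the 5% that falls outside a 95% interval equally between the two tails, so the upper limit sits at the 97.5th percentile of the t-distribution. A short sketch of the same computation for an arbitrary confidence level follows. The names conf_level and alpha are introduced here only for illustration.

```r
# Confidence level and the total probability left in the two tails
conf_level = 0.95
alpha = 1 - conf_level

# Multiplier: the quantile that leaves alpha/2 in the upper tail
t_star = qt(1 - alpha / 2, df = 1490)
t_star

# By symmetry, the lower-tail quantile is the negative of t_star
qt(alpha / 2, df = 1490)
```

Changing conf_level to, say, 0.99 gives the multiplier for a 99% CI with the same degrees of freedom.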
Now we can use mutate() to add the lower and upper limits of the CI for each adjusted mean as new columns.
# Compute CI limits
plot_data = plot_data |>
  mutate(
    lower = m - 1.961557*se,
    upper = m + 1.961557*se
  )

# View data
plot_data

| source | m | se | lower | upper |
|---|---|---|---|---|
| None | 54.44942 | 0.9739073 | 52.53905 | 56.35979 |
| All | 58.64027 | 2.3062504 | 54.11643 | 63.16411 |
| Conservative | 55.83410 | 0.9757177 | 53.92017 | 57.74803 |
| Comedy | 63.25640 | 2.8935646 | 57.58051 | 68.93229 |
| Liberal | 61.16513 | 1.3579693 | 58.50140 | 63.82886 |
| Conservative and Comedy | 62.60476 | 2.3997123 | 57.89759 | 67.31193 |
| Conservative and Liberal | 54.67625 | 1.4044781 | 51.92129 | 57.43121 |
| Liberal and Comedy | 70.56547 | 2.4375348 | 65.78411 | 75.34683 |
Lastly, we can create our visualization of the CIs.
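One way to draw such a plot is sketched below. This is an assumption about how the figure could be produced, not necessarily the code used for Figure 19.8. The layout choices (groups ordered by adjusted mean, error bars plus points, flipped axes) are ours. The block rebuilds plot_data so that it can be run on its own.

```r
library(ggplot2)

# Adjusted means and standard errors from the predict() output
plot_data = data.frame(
  source = c("None", "All", "Conservative", "Comedy", "Liberal",
             "Conservative and Comedy", "Conservative and Liberal", "Liberal and Comedy"),
  m  = c(54.44942, 58.64027, 55.83410, 63.25640, 61.16513, 62.60476, 54.67625, 70.56547),
  se = c(0.9739073, 2.3062504, 0.9757177, 2.8935646, 1.3579693, 2.3997123, 1.4044781, 2.4375348)
)

# Limits of the 95% CIs
t_star = qt(0.975, df = 1490)
plot_data$lower = plot_data$m - t_star * plot_data$se
plot_data$upper = plot_data$m + t_star * plot_data$se

# Error bars for the CIs with a point at each adjusted mean
p = ggplot(data = plot_data, aes(x = reorder(source, m), y = m)) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2, color = "#c62f4b") +
  geom_point(size = 3) +
  coord_flip() +
  theme_bw() +
  xlab("News source") +
  ylab("Adjusted mean news knowledge")

p
```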
Figure 19.8: 95% confidence intervals for the adjusted mean news knowledge scores for Americans who get their news from eight different sources. The means are adjusted for differences in age, education level, amount of news consumed, and political engagement.
Intervals that overlap suggest that those groups may not differ in their adjusted mean scores. For example, the intervals for the ‘None’ group and the ‘Conservative and Liberal’ group overlap each other. This suggests that their adjusted mean news knowledge scores might be the same (no statistical difference is discernible).
1. In a balanced sample, the sample size would be equal across categories. This typically happens only when participants are randomly assigned to levels of the categorical predictor. Almost all observational data are unbalanced.
2. Since the columns conservative_news, liberal_news and comedy_news already exist in the data, and they are dummy variables, you would only need to create the five other dummy variables. Here we create all eight for completeness.
3. Be careful, as the order of the groups is different in the two visualizations!
4. Note that these values match the adjusted means we computed by hand only up to rounding error. The hand calculations used coefficients rounded to one decimal place, which is why they are about 2.5 points lower than the values from predict().