Regression Recap

Preparation

In this set of notes, we will review ideas from regression learned in EPsy 8251. To do this, we will use the data in broadband.csv to examine variation in the over-/under-estimates of the true percentage of the population in the United States with access to high-speed broadband.

In particular, we will evaluate whether this variability is explainable by differences in the percentage of people living in poverty. In this analysis, we will also control for region of the country and whether or not the county is a metropolitan area (two variables that are related to education, which impacts poverty). Although not supported by any prior research, we will also evaluate potential interactions of these two control covariates with poverty level based on an exploration of the data.

Packages

You will need to make sure all of these packages are installed. You can install most of the packages from CRAN. Use Packages > Install in RStudio to do this.

Recall that the you will need to install the {educate} package from GitHub. To do this, see the *Installing Packages from GitHub” section of the resource Computational Toolkit for Educational Scientists

# Load libraries
library(tidyverse)    #Plotting, wrangling, basically everything
library(broom)        #Fitted regression results, creating residuals
library(corrr)        #Correlations
library(educate)      #Normal confidence envelope
library(patchwork)    #For layout with more than one plot
library(modelsummary) #Summary of variables
library(texreg)       #Creating table of regression results

Data Import

First we will import the broadband data.

broadband = read_csv("https://raw.githubusercontent.com/zief0002/epsy-8252/master/data/broadband.csv")

# View data
broadband

Finally, we will summarize the variables in the data set using the datasummary_skim() function from the {modelsummary()} package.

datasummary_skim(broadband)

	Unique (#)	Mean	SD	Min	Median	Max
fips	3103	30427.6	15059.5	1001.0	29177.0	56045.0
fcc_availability	101	0.8	0.2	0.0	0.8	1.0
microsoft_useage	94	0.3	0.2	0.0	0.2	1.0
pct_poverty	301	14.4	5.8	2.7	13.4	47.7
fcc_diff	113	0.5	0.2	−0.8	0.5	1.0

From these results we see that there is no missing data. We also see from the FIPS variable that we have a sample size of 3103 counties (since the FIPS value is unique to each county). The outcome (fcc_diff) is slightly left skewed with a potential outlier. Our focal predictor (pct_poverty) is right-skewed.

The mean for each of the dummy variables, recall, gives the proportion of 1s. Thus, since we coded metropolitan counties as 1, 40% of the counties in our sample are metropolitan. Note that there are some counties in the New England and Mid-Atlantic regions, so the means of 0 are due to rounding.¹

Plotting Relationships

To further the analysis, we next explore the relationship between the outcome versus various predictors. Initially, we will plot the outcome versus our focal predictor (pct_poverty).

ggplot(data = broadband, aes(x = pct_poverty, y = fcc_diff)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_light() +
  xlab("Perecntage in Poverty") +
  ylab("Over/Under Estimate of High Speed Broadband Access")

Over-/under-estimate of highspeed broadband access (in percentage points) as a function of the percentage of people in poverty in the county.

From this plot, we can see that most of the counties access is being over-estimated, regardless of the degree of poverty (i.e., most fcc_diff values are above 0). We also see that there is a positive relationship between these variables. In general, counties with a higher percentage of residents living in poverty tend to have their access to highspeed broadband over-estimated. Finally, although there is one county that seems to be a potential outlier, Loving County in Texas (\(7.1,~-0.81\)), the overall relationship, nontheless, seems linear.

Next we will look at the same relationship, except this time we will condition on whether the county is considered a metropolitan area.

ggplot(data = broadband, aes(x = pct_poverty, y = fcc_diff)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_light() +
  xlab("Percentage in Poverty") +
  ylab("Over/Under Estimate of High Speed Broadband Access") +
  facet_wrap(~county_type, nrow = 1)

Over-/under-estimate of highspeed broadband access (in percentage points) as a function of the percentage of people in poverty in the county conditioned on whether or not the county is a metropolitan area.

From this plot, we see that the positive relationship persists for both metropolitan and non-metropolitan counties. Although, the magnitude seems much larger for metropolitan counties than for non-metropolitan counties, the latter of which seems quite close to 0. That is, for metropolitan counties, higher degrees of poverty are associated with bigger over-estimates of access to highspeed broadband. For non-metropolitan counties, while the access to highspeed broadband is being over-estimated, this does not seem to grow nearly as much for counties with higher percentages of people living in poverty. This suggests that we might want to fit a model that includes an interaction between pct_poverty and metro.

Next we will look at the same relationship by region of the country.

ggplot(data = broadband, aes(x = pct_poverty, y = fcc_diff)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_light() +
  xlab("Percentage in Poverty") +
  ylab("Over/Under Estimate of High Speed Broadband Access") +
  facet_wrap(~region)

Over/under-estimate of highspeed broadband access (in percentage points) as a function of the percentage of people in poverty in the county conditioned on region.

From this plot, we see that there is variability in both the direction and magnitude of the relationship across regions. For many regions (e.g., East North Central, Pacific, West South Central), the slope is close to 0, indicating again that while the access to highspeed broadband is generally being over-estimated, this does not seem to change much for counties with higher percentages of people living in poverty. Other regions (e.g., New England, Mid-Atlantic) show the earlier positive relationship we observed. The Mountain region, in contrast, shows a negative relationship between degree of poverty and the amount of estimation error. This suggests that we might want to fit a model that includes an interaction between pct_poverty and region.

Finally, for completeness, we will look at the same relationship by both metropolitan area and region of the country.

ggplot(data = broadband, aes(x = pct_poverty, y = fcc_diff, color = county_type)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_light() +
  scale_color_manual(
    name = "",
    values = c("#2C6DAC", "#C60F7B")
    ) +
  xlab("Percentage in Poverty") +
  ylab("Over/Under Estimate of High Speed Broadband Access") +
  facet_wrap(~region)

From this plot, we see that there is variability in both the direction and magnitude of the relationship between metropolitan and non-metroplitan counties with each region. Moreover, these are different across regions. This might suggest that we might want to fit a model that includes a second-order interaction between pct_poverty, metro, and region.

Preparing for the Regression Analyses

From the plots we have several models that we might be interested in fitting to explain variation in our outcome:

Model A: A simple regression model that uses our focal predictor (always a good model to start with);
Model B: A main-effects model that includes our focal predictor and the two covariates (a pre-cursor to the interaction models);
Model C: An interaction model that includes the interaction between metro and our focal predictor, and also controls for region;
Model D: An interaction model that includes the interaction between region and our focal predictor, and also controls for metro;
Model E: An interaction model that includes the second-order interaction between metro, region, and our focal predictor.;

Because both region and metro are categorical predictors, we need to create dummy variables to use them in the analyses. Here, we create the metro dummy variable to encode metropolitan area and nine region dummy variables to encode region of the country.

broadband = broadband %>%
  mutate(
    # Create metro dummy
    metro = if_else(county_type == "Metro", 1, 0),
    # Create region dummies
    ne = if_else(region == "New England", 1, 0),
    ma = if_else(region == "Mid-Atlantic", 1, 0),
    enc = if_else(region == "East North Central", 1, 0),
    wnc = if_else(region == "West North Central", 1, 0),
    mtn = if_else(region == "Mountain", 1, 0),
    pac = if_else(region == "Pacific", 1, 0),
    sa = if_else(region == "South Atlantic", 1, 0),
    esc = if_else(region == "East South Central", 1, 0),
    wsc = if_else(region == "West South Central", 1, 0)
  )

# View data
broadband

Model A

Model A will only include our focal predictor (pct_poverty). The model equation is:

\[ \mathrm{Estimate~Difference}_i = \beta_0 + \beta_1(\mathrm{Poverty~Percentage}_i) + \epsilon_i \]

where, \(\epsilon_i \sim \mathcal{N}(0, \sigma^2_{\epsilon})\). Note that the model equation describes both the relationships and the assumptions. Next we will fit the model using the lm() function and then use the glance() and tidy() functions from the {broom} package to peruse the model-level and coefficient-level output, respectively.

# Fit model
model_a = lm(fcc_diff ~ 1 + pct_poverty, data = broadband)

# Model-level output
glance(model_a)

# Coefficient-level output
tidy(model_a)

Interpreting this output:

Difference in poverty explain 1.39% of the variation in the over-/under-estimates of high speed broadband access; \(F(1,~3101)=43.8,~p<.001\). Each one-percentage point difference in poverty is associated with a 0.004 percentage-point change in the over-/under-estimates of high speed broadband access, on average; \(t(3101)=6.62,~p<.001\). This suggests that counties with higher levels of poverty are more likely to have their estimate of high speed broadband access over estimated.

Model B

Model B includes our focal predictor (pct_poverty) and the covariates (metro and eight of the nine region dummies). The model equation is:

\[ \begin{split} \mathrm{Estimate~Difference}_i = &\beta_0 + \beta_1(\mathrm{Poverty~Percentage}_i) + \beta_2(\mathrm{Metro}_i) + \beta_3(\mathrm{ENC}_i) + \beta_4(\mathrm{ESC}_i) + \beta_5(\mathrm{MA}_i) + \\ &\beta_6(\mathrm{Mtn}_i) + \beta_7(\mathrm{NE}_i) + \beta_8(\mathrm{Pac}_i) + \beta_9(\mathrm{SA}_i) + \beta_{10}(\mathrm{WNC}_i) + \epsilon_i \end{split} \]

where, \(\epsilon_i \sim \mathcal{N}(0, \sigma^2_{\epsilon})\). In this model, the reference group is non-metropolitan counties in the West South Central region of the country.

# Fit model
model_b = lm(fcc_diff ~ 1 + pct_poverty + metro + enc + esc + 
               ma + mtn + ne + pac + sa + wnc, data = broadband)

# Model-level output
glance(model_b)

# Coefficient-level output
tidy(model_b)

Interpreting this output:

Difference in poverty, whether a county is a metropolitan area, and region of the country explain 5.19% of the variation in the over-/under-estimates of high speed broadband access; \(F(10,~3092)=16.9,~p<.001\). Each one-percentage point difference in poverty is associated with a 0.004 percentage-point change in the over-/under-estimates of high speed broadband access, on average—\(t(3092)=6.62,~p<.001\)—controlling for differences in whether a county is a metropolitan area and region of the country. This suggests that, even after controlling for important covariates, counties with higher levels of poverty are more likely to have their estimate of high speed broadband access over estimated.

While we focus on the focal varaible of poverty in the interpretation, we could also interpret the covariates:

The over-/under-estimates of high speed broadband access for metropolitan counties is 0.008 percentage points lower, on average, than that for non-metropolitan counties controlling for differences in the degree of poverty in the county and region of the country. This is not more than we expect because of sampling error—\(t(3092)=-1.08,~p=.280\)—indicating that there may not be a difference between metropolitan and non-metropolitan counties degree of estimation error once we account for region and poverty level.

Each of the region effects should is a comparison to the reference region of West South Central. For example, to interpret the enc effect:

The over-/under-estimates of high speed broadband access for counties in the East North Central region is 0.034 percentage points higher, on average, than that for counties in the West South Central region, controlling for differences in the degree of poverty in the county and whether the county is a metropolitan area. This is more than we expect because of sampling error—\(t(3092)=2.69,~p=.007\).

Mathematical and Computational Convenience

Let us return to the equation we wrote for Model B. Note that we can use mathematical notation to re-write the set of effects for the dummy variables encoding region:

\[ \begin{split} \sum_{j=3}^{10}\beta_j(\mathrm{Region}_{ij}) = &\beta_3(\mathrm{ENC}_i) + \beta_4(\mathrm{ESC}_i) + \beta_5(\mathrm{MA}_i) + \beta_6(\mathrm{Mtn}_i) + \beta_7(\mathrm{NE}_i) +\\ & \beta_8(\mathrm{Pac}_i) + \beta_9(\mathrm{SA}_i) + \beta_{10}(\mathrm{WNC}_i) \end{split} \]

Here the j subscript denotes Region j, that is, we can have a different beta-value and region variable depending on the region. Using this, we can re-write the model for Model B more succinctly as:

\[ \mathrm{Estimate~Difference}_i = \beta_0 + \beta_1(\mathrm{Poverty~Percentage}_i) + \beta_2(\mathrm{Metro}_i) + \sum_{j=3}^{10}\beta_j(\mathrm{Region}_{ij}) + \epsilon_i \]

where, \(\epsilon_i \sim \mathcal{N}(0, \sigma^2_{\epsilon})\).

We can also use the categorical variable directly in the lm() syntax rather than creating the dummy variables.

model_b = lm(fcc_diff ~ 1 + pct_poverty + metro + region, data = broadband)

When we do this, R will choose the reference category for us (it is chosen alphabetically). In our case, this is the East North Central region. We can verify this by examining the tidy() output to see which region was omitted.

tidy(model_b)

In general, if the categorical variable is a covariate, it is fine to just use it in the lm() without creating the dummy variables. This controls for that covariate. If, however, you are interested in comparisons between different levels of the categorical variable, you will need to create the dummy variables because not all of the comparisons are shown in the output.

Model C

Model C includes the interaction between pct_poverty and metro and region as a set of covariates. The model equation is:

where, \(\epsilon_i \sim \mathcal{N}(0, \sigma^2_{\epsilon})\).

# Fit model
model_c = lm(fcc_diff ~ 1 + pct_poverty + metro + region + 
               pct_poverty:metro, data = broadband)

# Model-level output
glance(model_c)

# Coefficient-level output
tidy(model_c)

Interpreting this output:

Model C explains 6.57% of the variation in the over-/under-estimates of high speed broadband access; \(F(11,~3091)=19.8,~p<.001\). The interaction effect between poverty level and metropolitan area is more than we expect because of chance, \(t(3091)=6.78,~p<.001\). This implies that, after controlling for regional differences, the effect of poverty level on the amount of error in the estimates of high speed broadband access depends on whether the county is a metropolitan area.

We can compute the effect of poverty as:

Non-Metropolitan Counties (reference group): \(\beta_{\mathrm{Poverty}} = 0.0009\)
Metropolitan Counties: \(\beta_{\mathrm{Poverty}} = 0.0009 + .00934 = 0.010\)

For counties considered metropolitan areas, each one-percentage point difference in poverty is associated with a 0.010 percentage-point change in the over-/under-estimates of high speed broadband access, on average. While for non-metropolitan counties, each one-percentage point difference in poverty is only associated with a 0.0009 percentage-point change in the over-/under-estimates of high speed broadband access, on average.

Model D

Model D includes the interaction between pct_poverty and region, and metro as a covariate. The model equation is:

where, \(\epsilon_i \sim \mathcal{N}(0, \sigma^2_{\epsilon})\).

# Fit model
model_d = lm(fcc_diff ~ 1 + pct_poverty + metro + region + 
               pct_poverty:region, data = broadband)

# Model-level output
glance(model_d)

# Coefficient-level output
tidy(model_d)

Interpreting this output:

Model D explains 6.28% of the variation in the over-/under-estimates of high speed broadband access; \(F(18,~3084)=11.5,~p<.001\). At least one interaction effects between poverty level and region is more than we expect because of chance, indicating that after controlling for whether the county is a metropolitan area, the effect of poverty level on the amount of error in the estimates of high speed broadband access depends on region.

There are many interaction effects in this model, and not all of them are displayed in the tidy() output. The interactions outputted are slope comparisons to the reference group. For example, the pct_poverty:regionEast South Central effect is a comparison of the poverty level effect between the East South Central region and the reference group of East North Central. It suggests that the effect of level of poverty in the East South Central region is 0.002 percentage points higher than it is in the East North Central region. If you wanted to know whether, say, the effect of level of poverty in the Pacific region was different from that in the Mid-Atlantic region you would need to use a different reference group.

Model E

Model E includes the second-order interaction between pct_poverty, region, and metro. The model equation is:

\[ \begin{split} \mathrm{Estimate~Difference}_i = &\beta_0 + \beta_1(\mathrm{Poverty~Percentage}_i) + \beta_2(\mathrm{Metro}_i) + \sum_{j=3}^{10}\beta_j(\mathrm{Region}_{ij}) + \\ &\beta_{11}(\mathrm{Metro}_i)(\mathrm{Poverty~Percentage}_i) + \\ &\sum_{j=12}^{19}\beta_j(\mathrm{Region}_{ij})(\mathrm{Poverty~Percentage}_i) +\\ &\sum_{j=20}^{27}\beta_j(\mathrm{Region}_{ij})(\mathrm{Metro}_i) + \\ &\sum_{j=28}^{35}\beta_j(\mathrm{Region}_{ij})(\mathrm{Metro}_i)(\mathrm{Poverty~Percentage}_i) + \epsilon_i \end{split} \]

where, \(\epsilon_i \sim \mathcal{N}(0, \sigma^2_{\epsilon})\).

# Fit model
model_e = lm(fcc_diff ~ 1 + pct_poverty + metro + region + 
               pct_poverty:metro + pct_poverty:region + metro:region +
               region:metro:pct_poverty, data = broadband)

# Model-level output
glance(model_e)

# Coefficient-level output
tidy(model_d)

Interpreting this output:

Model E explains 8.05% of the variation in the over-/under-estimates of high speed broadband access; \(F(35,~3067)=7.67,~p<.001\). At least one of the second-order interaction effects between poverty level, metropolitan area, and region is more than we expect because of chance, indicating that the effect of poverty level on the amount of error in the estimates of high speed broadband access depends on region and whether or not the county is considered a metropolitan area.

Adopting a Model

From a statistical inference (p-value) point-of-view, we would adopt Model E. Some of the second-order interactions are statistically relevant. However, this model was not under consideration until AFTER we looked at the data. Similarly Models C and D were also proposed after looking at the data. This makes them suspect. A priori, before looking at the data, we only considered Models A and B.

We can safely put aside Model A—it was really just a starting point to include the focal predictor, but wasn’t really under consideration since the RQ suggested we had to control for other covariates. To decide between Models B, C, D, and E, we will examine the residuals to see which of these models best meets the assumptions.

# Get the residuals
out_b = augment(model_b)
out_c = augment(model_c)
out_d = augment(model_d)
out_e = augment(model_e)

# Plots
plot_b = ggplot(data = out_b, aes(x = .std.resid)) +
  stat_density_confidence(model = "normal") +
  geom_density() +
  theme_light() +
  xlab("Standardized residuals") +
  ylab("Density") +
  ggtitle("Model B")

plot_c = ggplot(data = out_c, aes(x = .std.resid)) +
  stat_density_confidence(model = "normal") +
  geom_density() +
  theme_light() +
  xlab("Standardized residuals") +
  ylab("Density") +
  ggtitle("Model C")

plot_d = ggplot(data = out_d, aes(x = .std.resid)) +
  stat_density_confidence(model = "normal") +
  geom_density() +
  theme_light() +
  xlab("Standardized residuals") +
  ylab("Density") +
  ggtitle("Model D")

plot_e = ggplot(data = out_e, aes(x = .std.resid)) +
  stat_density_confidence(model = "normal") +
  geom_density() +
  theme_light() +
  xlab("Standardized residuals") +
  ylab("Density") +
  ggtitle("Model E")

# Layout the plots using patchwork
(plot_b | plot_c) / (plot_d | plot_e)

Marginal density plots of the standardized residuals for each of four candidate models. A confidence envelope for the normal distribution (blue shaded area) is also displayed in each plot.

The density plots indicate that the normality assumption may be violated for all four candidate models. The density curve deviates from what is expected if the population of standardized residuals was normal (i.e., the curve falls outside the shaded area).

# Plots
plot_b = ggplot(data = out_b, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0) +
  geom_smooth() +
  theme_light() +
  xlab("Fitted values") +
  ylab("Standardized residuals") +
  ggtitle("Model B")

plot_c = ggplot(data = out_c, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0) +
  geom_smooth() +
  theme_light() +
  xlab("Fitted values") +
  ylab("Standardized residuals") +
  ggtitle("Model C")

plot_d = ggplot(data = out_d, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0) +
  geom_smooth() +
  theme_light() +
  xlab("Fitted values") +
  ylab("Standardized residuals") +
  ggtitle("Model D")

plot_e = ggplot(data = out_e, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0) +
  geom_smooth() +
  theme_light() +
  xlab("Fitted values") +
  ylab("Standardized residuals") +
  ggtitle("Model E")

# Layout the plots
(plot_b | plot_c) / (plot_d | plot_e)

Scatterplots of the standardized residuals versus the fitted values for each of four candidate models. A horizontal line at \(Y=0\) has been added to the plot to aid interpretation. A loess smoother (blue line) with 95% confidence bands has also been included on each plot.

Models D and E show some deviation from the assumption that the mean residuals is 0 for all fitted values. All models suggest compatibility with the homoskedasticity assumption. Based on these findings I would adopt both Model B and Model C.

Presenting the Regression Results

In a presentation of the results, I would create a regression table reporting the results of Model A, B, and C. I would write about the effect of level of poverty based on these models’ results, and also probably show these effects for Model B and C in a picture. Finally, I would emphasize that Model C is exploratory since it was suggested by the data.

Table of Regression Results

In a table of regression results, we want to show the estimated parameters for each of the models we are presenting. These parameters include the estimated coefficients and the estimate of the residual standard error; a.k.a., the residual mean square error (RMSE). We should also present measures of uncertainty for the estimated coefficients (e.g., standard errors, confidence intervals). Finally, it is also common to present the R-squared values.

Coefficients (standard errors) for the predictors from a taxonomy of regression models fitted to explain variation in the error in highspeed broadband estimates. Each model was fitted with 3103 cases.
	Model A	Model B	Model C
Percentage in poverty	0.004	0.004	0.001
	(0.001)	(0.001)	(0.001)
Metropolitan county\(^\dagger\)		-0.008	-0.134
		(0.008)	(0.020)
Region\(^\dagger\dagger\)

East South Central		0.060	0.067
		(0.015)	(0.015)
Mid-Atlantic		-0.016	-0.011
		(0.018)	(0.018)
Mountain		-0.064	-0.064
		(0.015)	(0.015)
New England		-0.036	-0.030
		(0.025)	(0.025)
Pacific		-0.060	-0.060
		(0.016)	(0.016)
South Atlantic		-0.002	0.002
		(0.012)	(0.012)
West North Central		0.036	0.033
		(0.012)	(0.012)
West South Central		-0.035	-0.035
		(0.013)	(0.013)
Percentage in poverty x Metropolitan county			0.009
			(0.001)
Intercept	0.428	0.441	0.482
	(0.009)	(0.013)	(0.015)
R²	0.014	0.052	0.066
RMSE	0.194	0.190	0.189
\(^\dagger\)Metropolitan county is a dummy-coded indicator (0 = Non-metroplitan; 1 = Metropolitan county). \(^\dagger\dagger\)Region is a set of eight dummy-coded indicators (reference group = East North Central).

Plot of the Fitted Lines

We will create a plot of the fitted lines for Models B and C. In both models, region is a covariate, so we can control for it by just picking a single region. Any region will work, but it is easiest if we choose the reference group, that way all the region coefficients drop out of the fitted equation. Because Model C includes an interaction between metropolitan area and percentage in poverty, we should show the metropolitan effect for both fitted models.

First we write the fitted equation for Model B (assuming the reference region):

\[ \widehat{\mathrm{Estimate~Difference}_i} = 0.441 + 0.004(\mathrm{Poverty~Percentage}_i) - 0.008(\mathrm{Metro}_i) \]

Then we substitute 0 (non-metro) and 1 (metro) in for metropolitan area to get the two fitted equations we will use in the plot.

\[ \begin{split} \mathbf{Non\mbox{-}Metro:~} \widehat{\mathrm{Estimate~Difference}_i} &= 0.441 + 0.004(\mathrm{Poverty~Percentage}_i) - 0.008(0) \\ &= 0.441 + 0.004(\mathrm{Poverty~Percentage}_i) \\[2ex] \mathbf{Metro:~} \widehat{\mathrm{Estimate~Difference}_i} &= 0.441 + 0.004(\mathrm{Poverty~Percentage}_i) - 0.008(1) \\ &= 0.433 + 0.004(\mathrm{Poverty~Percentage}_i) \end{split} \]

We carry out a similar process for Model C.

\[ \widehat{\mathrm{Estimate~Difference}_i} = 0.482 + 0.001(\mathrm{Poverty~Percentage}_i) - 0.134(\mathrm{Metro}_i) + 0.009(\mathrm{Poverty~Percentage}_i)(\mathrm{Metro}_i) \]

\[ \begin{split} \mathbf{Non\mbox{-}Metro:~} \widehat{\mathrm{Estimate~Difference}_i} &= 0.482 + 0.001(\mathrm{Poverty~Percentage}_i) - 0.134(0) + 0.009(\mathrm{Poverty~Percentage}_i)(0) \\ &= 0.482 + 0.001(\mathrm{Poverty~Percentage}_i) \\[2ex] \mathbf{Metro:~} \widehat{\mathrm{Estimate~Difference}_i} &= 0.482 + 0.001(\mathrm{Poverty~Percentage}_i) - 0.134(1) + 0.009(\mathrm{Poverty~Percentage}_i)(1) \\ &= 0.348 + 0.002(\mathrm{Poverty~Percentage}_i) \end{split} \]

# Plots
plot_b = ggplot(data = broadband, aes(x = pct_poverty, y = fcc_diff)) +
  geom_point(alpha = 0) +
  geom_abline(intercept = 0.441, slope = 0.00358, linetype = "dashed") +  #Non-metro
  geom_abline(intercept = 0.441 - 0.00838, slope = 0.00358, linetype = "solid") +  #Metro
  theme_light() +
  xlab("Percentage in poverty") +
  ylab("Over/under-estimate of highspeed broadband access") +
  ggtitle("Model B")

plot_c = ggplot(data = broadband, aes(x = pct_poverty, y = fcc_diff)) +
  geom_point(alpha = 0) +
  geom_abline(intercept = 0.482, slope = 0.000863, linetype = "dashed") +  #Non-metro
  geom_abline(intercept = 0.482 - 0.134, slope = 0.000863 + 0.00934, linetype = "solid") +  #Metro
  theme_light() +
  xlab("Percentage in poverty") +
  ylab("Over/under-estimate of highspeed broadband access") +
  ggtitle("Model C")


# Layout the plots
plot_b | plot_c

Fitted lines showing the Over/under-estimate of highspeed broadband access (in percentage points) as a function of percentage of county residents living in poverty for metropolitan (solid) and non-metropolitan (dashed) areas based on Model B (LEFT) and Model C (RIGHT). In both plots, the fitted equations are based on using the East North Central region.

These plots show the fitted lines for Models B and C. The interaction effect seen in Model C’s fitted lines was discovered during the exploratory phase of the analysis. While it suggests that the effect of poverty level is more pronounced for metropolitan counties, this needs to be established further in future work.

You could use the mean() function to compute the actual proportion of counties if you desire.↩︎

Regression Recap

Andrew Zieffler

January 07, 2022

Preparation

Packages

Data Import

Plotting Relationships

Preparing for the Regression Analyses

Model A

Model B

Mathematical and Computational Convenience

Model C

Model D

Model E

Adopting a Model

Presenting the Regression Results

Table of Regression Results

Plot of the Fitted Lines