Stat Chat Book Club

“The Book of Why”

Danny Kaplan & Milo Schield, discussion leaders

2/13/2019

GAISE and multiple variables

GAISE examples

GAISE SAT example (pp 38-39)

Consider an example where statewide data from the mid-1990s are used to assess the association between average teacher salary in the state and average SAT (Scholastic Aptitude Test) scores for students (Guber 1999; Horton 2015). These high stakes high school exams are sometimes used as a proxy for educational quality.

The following figure displays the (unconditional) association between these variables. There is a statistically significant negative relationship (\(\hat{\beta}\) = -5.54 points (sic: should be points per thousand dollars), \(p < 0.001\)). The model predicts that a state with an average salary that is one thousand dollars higher than another would have SAT scores that are on average 5.54 points lower.

But the real story is hidden behind one of the “other factors” that we warn students about but do not generally teach how to address! The proportion of students taking the SAT varies dramatically between states, as do teacher salaries. In the Midwest and Plains states, where teacher salaries tend to be lower, relatively few high school students take the SAT. Those that do are typically the top students who are planning to attend college out of state, while many others take the alternative standardized ACT test that is required for their state. For each of the three groups of states defined by the fraction taking the SAT, the association is non-negative. The net result is that the fraction taking the SAT is a confounding factor.

This problem is a continuous example of Simpson’s paradox. Statistical thinking with an appreciation of Simpson’s paradox would alert a student to look for the hidden confounding variables. To tackle this problem, students need to know that multivariable modeling exists but not all aspects of how it can be utilized.

Within an introductory statistics course, the use of stratification by a potential confounder is easy to implement. By splitting states up into groups based on the fraction of students taking the SAT it is possible to account for this confounder and use bivariate methods to assess the relationship for each of the groups.
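The stratification idea in the passage above can be sketched in a few lines. The numbers below are synthetic, invented only to reproduce the pattern (they are not the Guber 1999 data), and the group labels `low`, `medium`, and `high` are hypothetical names for the three groups of states defined by the fraction taking the SAT.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# SYNTHETIC data: (average salary in $1000s, average SAT score).
# Groups are defined by the fraction of students taking the SAT.
groups = {
    "low":    [(26, 1040), (28, 1050), (30, 1060)],
    "medium": [(30, 960), (33, 975), (36, 990)],
    "high":   [(36, 880), (39, 895), (42, 910)],
}

# Unconditional fit: pool every state and ignore the grouping.
all_states = [p for pts in groups.values() for p in pts]
overall_slope = slope(all_states)

# Stratified fit: one bivariate slope per group.
group_slopes = {name: slope(pts) for name, pts in groups.items()}

print("overall:", round(overall_slope, 2))
for name, s in group_slopes.items():
    print(name, round(s, 2))
```

In this made-up example the pooled slope is negative while each within-group slope is positive, which is the sign reversal that the passage calls a continuous example of Simpson's paradox.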

The variables

What do you think causes what? Draw the arrows for your hypothesis.

Story 1

Funding is the main driver of test scores, but the parents’ educational level and the fraction of students taking the test also play a role.

Consequences:

The components of a causal diagram:

  1. What does a node stand for?
  2. What does the arrow in a DAG represent?
  3. What special structure do the diagrams have?
    • acyclic
    • causal and non-causal directed paths
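One way to make the questions above concrete is to write a diagram down as data. The sketch below encodes one reading of Story 1 (every arrow points into scores) as a dictionary from each node to its children. The node names and the helper functions `is_acyclic` and `has_directed_path` are illustrative choices, not something from the book.

```python
from collections import deque

# One reading of Story 1: each node is a variable, each arrow a direct
# causal influence. Node names are illustrative labels.
story1 = {
    "funding": ["scores"],
    "parent_education": ["scores"],
    "fraction_taking": ["scores"],
    "scores": [],
}

def is_acyclic(graph):
    """Kahn's algorithm: the graph is acyclic if every node can be removed
    in an order where it has no remaining incoming arrows."""
    indegree = {node: 0 for node in graph}
    for children in graph.values():
        for child in children:
            indegree[child] += 1
    queue = deque(node for node, d in indegree.items() if d == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in graph[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return removed == len(graph)

def has_directed_path(graph, start, end):
    """True if following arrows forward from start can reach end."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == end:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False

print(is_acyclic(story1))
print(has_directed_path(story1, "funding", "scores"))
print(has_directed_path(story1, "scores", "funding"))
```

Adding an arrow from scores back to funding would create a loop, and `is_acyclic` would then return `False`; that is the sense in which the "A" in DAG restricts the diagrams.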

Story 2

The success of the schools (measured by scores) sets the funding level directly. It also attracts wealthier people to the system, which raises the wealth of the community and, through that, also influences funding.

Consequences:

Story 3

The success of the schools (measured by scores) and the community’s wealth set the funding level. Parents’ education and the fraction taking the test are the main determinants of scores.

Consequences:

It’s all in the data

Quote from Pearl.

If Pearson were alive today, he would say exactly this: the answers are all in the data. - p. 88

What’s an experiment in terms of the diagram?

Story 2 experiment with do(funding)

Eliminate all arrows leading into funding.
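The arrow-deleting step can be sketched as a small graph operation. The encoding below is one reading of Story 2 (scores point to funding and to wealth, and wealth points to funding); the node names and the helpers `do` and `parents` are hypothetical, chosen only for this illustration.

```python
def do(graph, target):
    """Return a copy of the graph with every arrow INTO target removed.
    Arrows leaving target, and all other arrows, are kept."""
    return {
        node: [child for child in children if child != target]
        for node, children in graph.items()
    }

def parents(graph, node):
    """Nodes with an arrow pointing directly into node."""
    return sorted(p for p, children in graph.items() if node in children)

# One reading of Story 2, as a mapping from each node to its children.
story2 = {
    "scores": ["funding", "wealth"],
    "wealth": ["funding"],
    "funding": [],
}

experiment = do(story2, "funding")

print(parents(story2, "funding"))      # parents before the intervention
print(parents(experiment, "funding"))  # parents after do(funding)
print(experiment)
```

Under this reading, funding has no arrows leaving it, so after the surgery it is disconnected from scores. That suggests that, if Story 2 were the true diagram, an experiment that sets funding would show no effect on scores.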

Consequences:

What does Pearl see as the equivalent of an experiment?