Imagine that your school district is about to adopt a new ed tech product. You and the district's other senior decisionmakers see yourselves as data-driven leaders, so you consult research on the impact of the new product. Maybe you commission a pilot test of your own. But the research does not provide the clarity you hoped for. The outcomes for students and teachers who used the product look fine, but the analysts in the research office tell you that the impact estimates are not statistically significant: the p-values are too large, the sample is too small, the test is inconclusive. So you go with your gut, or just ask a few teachers whether they enjoyed using the product, and let that guide your decision. And you come away discouraged, turned off from running pilots or using others' quantitative research to decide what works in the future.
We might take away from this story that the research failed, but what actually failed was the interpretation of the research. The classical approach that every evaluator learned in graduate school, commonly known as the frequentist approach, involves formally testing whether a null hypothesis can be rejected with overwhelming evidence. Equivalently, it involves constructing an oddly mysterious thing called a confidence interval, which can be a large and unhelpful range of numbers that almost certainly contains the right answer somewhere.
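To make that frustration concrete, here is a minimal sketch in Python of the kind of analysis the district's research office might run. The numbers are entirely hypothetical assumptions for illustration (30 students per group, a true average gain of 2 points against a standard deviation of 10): a genuine but modest effect that a classical two-sample t-test on a small pilot will often label "not statistically significant," with a wide confidence interval to match.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Hypothetical pilot: 30 students per group. The product has a real but
# modest effect: +2 points on average, against a standard deviation of 10.
control = rng.normal(loc=0.0, scale=10.0, size=30)
treatment = rng.normal(loc=2.0, scale=10.0, size=30)

# Classical two-sample t-test of the null hypothesis "the product has no effect".
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=True)

# 95% confidence interval for the difference in means (pooled variance).
n1, n2 = len(treatment), len(control)
pooled_var = ((n1 - 1) * treatment.var(ddof=1) +
              (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
margin = stats.t.ppf(0.975, df=n1 + n2 - 2) * se
diff = treatment.mean() - control.mean()

print(f"estimated impact:   {diff:.2f} points")
print(f"p-value:            {p_value:.3f}")  # with a sample this small, typically > 0.05
print(f"95% conf. interval: ({diff - margin:.2f}, {diff + margin:.2f})")  # a wide range
```

With samples this small, the estimate will miss the conventional significance threshold more often than not even though the simulated product genuinely works, which is exactly the trap the district walks into above.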
Read more on the Brookings Chalkboard blog.