When the results from the 2022 National Assessment of Educational Progress (or NAEP, also known as The Nation’s Report Card) were released last Monday, the buzz was immediate. Educators, policymakers, and researchers (ourselves included) had been anxiously awaiting these results because they provide the most comprehensive nationwide picture of reading and math performance in the wake of the pandemic. And the results are undeniably concerning, with declines in achievement in both tested grades (4th and 8th) and both subjects (reading and math). Nationwide, the drop in math scores was the largest ever recorded. There is no real dispute about the negative post-pandemic trend across the country.
Things get more complicated when we start digging into the data in depth. At a smaller grain size, the data can become more actionable for states and districts: educators, policymakers, and journalists need to understand differences in results across the country, explore patterns of variation, and identify places that may have bucked the trend and done better. But understanding that variation is not as straightforward as understanding the overall national story, and it is easy to be misled, which could lead to chasing the wrong solutions.
In particular, a common misunderstanding of the meaning and implications of “statistical significance” can lead us to see patterns that are not meaningful and to miss patterns that are. Below, we’ll describe how misinterpreting statistical significance (and particularly the implications of non-significant individual results across a large set of findings) can lead us astray. We’ll then propose another way to make sense of the variation in the data, one that avoids these common misunderstandings and provides better information about the finer-grained results in states and districts.
A non-significant difference doesn’t mean there was no change
In reporting NAEP results, the National Center for Education Statistics (NCES) describes a change in scores as a decline or increase only if the change is statistically significant. When the data were released, many commentators and journalists immediately tried to identify patterns and differences by comparing the number of statistically significant and non-significant results for states and participating districts. For example, some news reports have suggested that the NAEP reading results for the sample of urban districts were a bright spot because most of these districts did not have statistically significant declines in reading.
Unfortunately, as the American Statistical Association has warned, statistical significance is difficult to define clearly and is ripe for misinterpretation. The absence of statistical significance does not mean that a change isn’t real or meaningful: failing to find a significant change is not evidence that there was no change, and a non-significant result can obscure substantial changes.
Consider the suggestion that NAEP’s urban districts outperformed the national trend in reading because, compared to states, cities had a lower proportion of statistically significant declines. But statistical significance is affected by the number of students tested as well as the size of the effect. Fewer students take NAEP tests in cities than statewide, which makes it inevitable that cities will be less likely than states to show statistically significant changes, even if their actual score declines are the same.
We can see how this is misleading by looking at the overall average results for the urban districts versus the country as a whole. In grade 4 reading, 58 percent of states had statistically significant declines, compared to only 35 percent of NAEP’s urban districts. But this does not mean that more of the urban districts “held steady.” The average grade 4 reading scores declined by 3 points nationwide and 3 points in the urban districts—a wash. The fact that fewer districts had statistically significant declines is a result of smaller sample sizes in districts versus states, not better performance.
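To make the sample-size point concrete, here is a minimal sketch in Python. The standard errors are made-up illustrations, not NAEP’s published values; the only thing the example is meant to show is that an identical 3-point decline can register as statistically significant with a state-sized sample but not with a smaller district-sized one.

```python
# Illustrative only: the standard errors below are assumptions, not NAEP's
# published values. The point is that the same 3-point decline can be
# "statistically significant" for a state sample but not for a smaller
# district sample, purely because larger samples yield smaller standard errors.
import math

def significance_of_change(change, se_change, alpha=0.05):
    """Two-sided z-test for a change in average scores."""
    z = change / se_change
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return p_value, p_value < alpha

decline = -3.0  # same 3-point drop in both cases

# Hypothetical standard errors for the change in the average score:
# a state-sized sample gives a smaller standard error than a district-sized one.
cases = [("state", 1.0), ("district", 2.0)]

for label, se in cases:
    p, significant = significance_of_change(decline, se)
    print(f"{label}: decline = {decline}, SE = {se}, "
          f"p = {p:.3f}, significant = {significant}")

# With these assumed inputs, the state's 3-point decline is flagged as
# significant (p ~ 0.003) while the identical district decline is not (p ~ 0.13).
```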
An individual city could experience real and substantial declines that nonetheless are not statistically significant. Careful calculations by the Stanford Education Data Archive indicate that one year of typical learning is equivalent to about 10 points on the NAEP scales for reading and math (see Table 6 of this paper). The city-level NAEP results include multiple cases, in different grades and subjects, where measured declines of 4 or 5 points are not statistically significant. In other words, district-level declines of almost half a year of learning could be dismissed as non-significant simply because of the limited number of students tested. The point is not that we should ignore the possibility that some differences are random rather than real; it is that a non-significant finding should not be interpreted as an indication that there was no change.
Let’s take a look at the results for Detroit. The city showed statistically significant declines in grade 4 math (12 points), grade 8 math (6 points), and grade 4 reading (6 points). Detroit’s grade 8 reading scores declined by 5 points, or approximately half a year of learning—but that decline did not achieve statistical significance. Yet it would be a major mistake to conclude that Detroit held steady in grade 8 reading.
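Another way to see the problem is to treat a non-significant result as an uncertainty interval rather than as “no change.” The sketch below uses Detroit’s 5-point grade 8 reading decline with an assumed standard error (the published value isn’t reproduced here); the resulting interval is consistent with no change, but also with a decline of roughly a full year of learning.

```python
# Illustrative only: the 5-point decline matches the Detroit grade 8 reading
# figure cited above, but the standard error is an assumption chosen so that
# the change is non-significant, as it was in the actual reporting.
POINTS_PER_YEAR = 10.0   # rough rule of thumb cited above (~1 year of learning)

change = -5.0            # observed change in the average scale score
se = 2.8                 # assumed standard error of that change

# 95% confidence interval for the true change
half_width = 1.96 * se
ci = (change - half_width, change + half_width)
significant = not (ci[0] <= 0.0 <= ci[1])

print(f"95% CI for the change: ({ci[0]:.1f}, {ci[1]:.1f}) points")
print(f"Statistically significant: {significant}")
print(f"Point estimate in years of learning: {change / POINTS_PER_YEAR:.2f}")

# The interval runs from roughly -10.5 to +0.5 points: consistent with no
# change, but also with the loss of a full year of learning. Calling this
# "held steady" reads far more into a non-significant result than it can bear.
```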
The solution: Bayes!
Happily, there is a solution to this problem: Bayesian analysis, which makes better use of all available data to produce more informative results while sidestepping the misinterpretation of statistical significance. For state-level and district-level NAEP results, Bayesian methods can draw on historical data, on data from other grades and subjects, or on data from other states and districts. Incorporating this additional information can produce more accurate results for each state, district, grade, and subject, regardless of whether the changes are statistically significant.
Indeed, most of us are implicit Bayesians when we look at data. If you found the Detroit example persuasive, it is probably because your interpretation of the non-significant 5-point decline in grade 8 reading scores is—appropriately—informed by the significant declines that are at least as large on the other three tests. Your interpretation could be informed by other relevant data as well: the (significant) declines in grade 8 reading scores for the nation (3 points) and for Michigan (4 points). Interpreting the results for an individual city or state in this larger context is what Bayesian analysis does, in a more formal way.
In addition, Bayesian analysis can provide much clearer information than a statistical significance test about how much confidence we should have in any particular result. It can tell us how likely it is that Detroit’s grade 8 reading scores actually declined, and how likely it is that they declined by an educationally meaningful (rather than merely statistically significant) amount. Bayesian analysis can also help us decide how much stock to place in outliers, providing better information about whether to trust results for any particular district or state that seemed to do extraordinarily well or extraordinarily badly. That information is critical for understanding local trends and for the search for effective solutions.
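To show what this looks like in practice, here is a minimal Python sketch of a normal-normal Bayesian update for the Detroit grade 8 reading example. The standard error (the same assumed value as in the earlier sketch), the prior spread, and the 2.5-point threshold for an “educationally meaningful” decline are all assumptions made for illustration; this is not the model behind the re-analysis we preview below.

```python
# A minimal sketch of the idea, not the model used for the full re-analysis.
# All inputs other than the cited point estimates are assumptions.
import math

def normal_cdf(x, mean, sd):
    """P(X <= x) for X ~ Normal(mean, sd)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# Observed (cited above): Detroit grade 8 reading fell 5 points, not significant.
observed_change = -5.0
se_observed = 2.8        # assumed standard error of Detroit's change

# Prior: centered on the broader context (Michigan's grade 8 reading decline of
# about 4 points), with an assumed spread for how much individual districts
# typically deviate from that trend.
prior_mean = -4.0
prior_sd = 3.0

# Conjugate normal-normal update: precision-weighted average of prior and data.
prior_prec = 1.0 / prior_sd**2
data_prec = 1.0 / se_observed**2
post_var = 1.0 / (prior_prec + data_prec)
post_mean = post_var * (prior_prec * prior_mean + data_prec * observed_change)
post_sd = math.sqrt(post_var)

print(f"Posterior for Detroit's true change: mean {post_mean:.1f}, sd {post_sd:.1f}")
print(f"P(scores actually declined)       : {normal_cdf(0.0, post_mean, post_sd):.2f}")
# "Educationally meaningful" is taken here (arbitrarily) as at least a quarter
# of a year of learning, i.e., a decline of 2.5+ points under the
# 10-points-per-year rule of thumb.
print(f"P(decline of at least 2.5 points) : {normal_cdf(-2.5, post_mean, post_sd):.2f}")
```

With these assumed inputs, the posterior puts roughly a 99 percent probability on a genuine decline and roughly an 84 percent probability on a decline of at least a quarter of a year of learning, which is a far more useful summary than “not statistically significant.”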
Fortunately, NCES provides enough data in its public reporting of NAEP scores to allow a formal Bayesian re-analysis. So consider this a teaser, and watch this space: Soon we will report Bayesian-adjusted results on the changes in every participating district and state. Next up: a conversation about local results, patterns, and outliers. We hope you’ll join us.