In a recent post on the Brookings Brown Center Chalkboard, Helen Ladd urges states to experiment with replacing test-based accountability with school inspections: visits by trained experts who rate the schools and then issue reports. It’s an appealing idea, and as a researcher, I would never argue against running a formal experiment to evaluate its worth. But why is it either/or? There is no reason not to include test-based measures of school performance (or of teacher or principal performance) as a part of the inspector’s report. Maybe a big part.
Ladd discusses the benefits of inspections and the drawbacks of test-based accountability. But both approaches have pros and cons. Others have already provided a balanced assessment of test-based accountability (see here and here, for example). Ladd identifies many potential benefits of inspections, but there are several drawbacks to consider as well:
- Inspections can be just as “top-down” as test-based accountability, if not more so.
- Even if you follow a rubric, inspections are subjective. Like classroom observations, which can say more about the students than about the educators being observed, inspections are either “high inference,” which means they depend on the skills and biases of the inspector, or formulaic and simplistic, which means they are easily gamed and don’t serve their purpose well.
- Inspections focus on inputs, not outputs, an approach that could stifle innovation.
The inspections idea in Ladd’s blog post leaves many questions unanswered. How frequent are the visits? How good are the artifacts collected? Are the surveys scientifically valid? How much nonresponse is acceptable? Would the surveys be validated and tested for reliability and biases, such as socially desirable response bias and conflicts of interest (self-evaluation)? Depending on the answers to these questions, the inspection approach could be very expensive. What will inspections cost? Are they scalable? A journal article about inspection systems in New Zealand and the Netherlands raises some doubts about whether the systems can be implemented well, and whether they can properly focus on outcomes instead of just process.
If the implementation challenges of an inspection system can be worked out, Ladd’s idea of rigorously testing its effectiveness is a good one. After all, the evidence from Europe is conflicting. One study found that inspections in the Netherlands did not improve performance, but two other studies based on English data (here and here) show that actually giving a school a failing rating may have a positive impact.
If we want to use the flexibility in the new national education law, the Every Student Succeeds Act, to evaluate an inspection system, we should be clear about the policy choice. Is it inspections versus no inspections, or inspections versus test-based performance measures? Or do states need to figure out how to reconcile different types of evidence in an accountability system, whether objective or subjective; survey-based or observation-based; highly differentiating or coarse-grained; quantitative or qualitative?
Maybe the tradeoff isn’t between test-based accountability and inspections after all. Maybe we should be using ESSA flexibility to get the right balance of subjective versus objective measures, input-based versus outcome-based measures, and summative (reach a judgment) versus formative (aid improvement) purposes of a performance measurement system. The rigorous experiments we should be running have to do with the different decisions we make – like whether and how to reward and remediate – and how they are tied to each of these performance measurement systems. It may be possible to fit any of these measures within the structure of a scalable, cost-effective inspection system. But as long as we have the capacity to measure the impact of schools, or better yet teachers, on outcomes, we should make these measures one of the criteria that get explained clearly in the inspector’s written reports, or at least test the value of using different criteria with different levels of emphasis.