The timing of R. Barker Bausell’s “Probing the Science of Value-Added Evaluations” could not have been better. Writing on the heels of the Gates Foundation’s Measures of Effective Teaching (MET) final report, Bausell, a biostatistics professor emeritus at the University of Maryland, critiques the scientific methodology of value-added research. While not mentioning the MET specifically, his commentary explains why the Gates methodology fails to measure up.
Bausell does not argue the pros and cons of value-added teacher evaluations, but he explains what it would take for research to address the policy issues surrounding the high-stakes use of test-score growth for individual teachers.
“From a methodological perspective,” Bausell writes, “both medical and teacher-evaluation trials are designed to generate causal conclusions.” He further observes that it shouldn’t take a statistics degree to understand how value-added advocates flunk that test. Random assignment is the gold standard in medical research, but public school students cannot be randomly assigned to teachers between schools and are rarely assigned randomly within schools. He writes, “suffice it to say that no medical trial would ever be published in any reputable journal (or reputable newspaper) which assigned its patients in the haphazard manner in which students are assigned to teachers at the beginning of a school year.”
The MET at least tried to address the least important aspect of student self-sorting by randomly assigning some of its relatively affluent sample of students within their schools. But even then, the powerful Gates Foundation made only a feeble effort to meet scientific protocol. It did a better job than most high-profile value-added researchers of following up on circumstances that undercut the randomness, but Bausell writes:
Explicit guidelines exist for the reporting of medical experiments, such as the (a) specification of how many observations were lost between the beginning and the end of the experiment (which is seldom done in value-added experiments, but would entail reporting student transfers, dropouts, missing test data, scoring errors, improperly marked test sheets, clerical errors resulting in incorrect class lists, and so forth for each teacher); and (b) whether statistical significance was obtained—which is impractical for each teacher in a value-added experiment since the reporting of so many individual results would violate multiple statistical principles.
The MET published a separate document that admits that as few as 27% of students remained in their assigned classes.
Bausell then contrasts this with medical experiments, which are purposefully designed to minimize the occurrence of extraneous events that change outcomes. No comparable procedures are attempted in most (or all?) value-added teacher-evaluation experiments. Bausell lists a number of those factors, including auxiliary tutoring, being helped at home (as is more likely for students in the Gates Foundation’s relatively affluent sample), or “any number of naturally occurring positive or disruptive learning experiences.”
As an inner-city teacher, I am particularly shocked that the MET and others ignore the effect of school-wide disruptions, as well as the damage done by top-down, cover-your-ass, blame-the-teacher policies (such as more test prep and scripted instruction) that are common in the toughest schools. Did it never occur to the Gates scholars that they needed to control for funerals? In schools where almost none of the students are being raised in traditional families, the percentage of kids who, each year, bury the grandparents who raised them is not predictable, but it is huge. How could they not understand the effects of gang wars, as well as the routine ebb and flow of violence in high-poverty neighborhood schools? And how would they account for policies in the toughest schools that make it impossible to enforce academic, behavioral, and attendance standards?
Bausell also notes the effect that team teaching can have in introducing positive learning experiences. Perhaps that was not a factor in the Gates sample, but neither did the MET account for the uninvited member of the instructional team in many or most troubled schools. How do you distinguish between the growth – or lack of growth – produced by an individual teacher, as opposed to the mandatory remediation that is ubiquitous in troubled schools? The most influential, undoubtedly, are bogus “credit recovery” programs designed to “pass kids on.” I doubt the Gates researchers had any idea of the harm done by policies that tell students they will be awarded credit regardless of whether they come to class, or of how such policies would need to be controlled for.
And when the Gates people descend on a favored school, I doubt that principals are as brazen as they often are in “juking the stats” in ways that undercut instructional effectiveness.
Bausell concludes that even if value-added proponents upgraded the quality of their methodology, that would not change the fact “that a value-added analysis constitutes a series of personal, high-stakes experiments conducted under extremely uncontrolled conditions and reported quite cavalierly.” He calls for “experimentally oriented professionals” to “argue that experiments such as these (the results of which could potentially result in loss of individual livelihoods) should meet certain methodological standards and be reported with a scientifically acceptable degree of transparency.”
I would also add that those scholars, as well as the MET researchers, need to be willing to do so under oath as teachers fight this junk science in court.