As U.C.L.A. School of Management Professor Samuel Culbert explained, a performance evaluation is as much "an expression of the evaluator's self-interests as it is a subordinate's attributes." Culbert argued that the institution of professional performance reviews is "as destructive and fraudulent as it is ubiquitous." It is not just in educational evaluations that "almost every person and every person reviewing it knows it is bogus." Throughout the business world, there are plenty of people who believe that improved performance evaluations can improve productivity, but it is also hard to deny that evaluations are often about ego, control, intimidation, cronyism, and just enduring something that is a part of the job.
My experiences with performance evaluations are consistent with Culbert’s analysis. Even so, economists who have no teaching experience often assume that better evaluations can drive school improvement.
The National Bureau of Economic Research (NBER) working paper by James Wyckoff and Thomas Dee, “Incentives, Selection, and Teacher Performance,” has been spun as evidence supporting Washington D.C.’s teacher evaluation system, IMPACT. Even though the paper provides no evidence that IMPACT has benefited students, the authors’ narrative and statements to the press make it clear that they are hoping that IMPACT will succeed. They seem to argue that carrots and sticks are a good management tool and that IMPACT is doing what carrots and sticks, especially sticks, are supposed to do. (Given the quality of their work, it is no criticism of the authors to say that they would like IMPACT to be successful; clearly, they are objective scholars.)
I don’t deny that I also have a perspective, but here’s a “thought experiment” about Wyckoff and Dee’s study. What if an objective reader read the evidence presented by Wyckoff and Dee? Wouldn’t an impartial reader conclude that they buried the lede? Does their evidence not indicate that IMPACT is failing even on its own terms?
Two things jump out of their Table Three. Firstly, among “Minimally Effective” teachers, 19% were Group 1 teachers, meaning those whose value-added counted as 50% of their evaluations. Only 15% of Highly Effective teachers were Group 1. In other words, if you teach a tested subject and are thus subjected to value-added, you are more likely to have your career threatened and less likely to gain rewards for being highly effective.
Secondly, the table reports the IMPACT results for two samples of teachers. The first includes the 14% of teachers who were categorized as Highly Effective and more than 1700 teachers who were deemed Effective. The other sample includes those same Effective teachers along with about 300 teachers (also about 14%) who were rated Minimally Effective. Since the whole point of value-added is that it is supposed to differentiate more precisely the effectiveness of teachers, shouldn't the sample that includes Highly Effective teachers produce a clearly higher value-added than the sample with about the same number of Minimally Effective teachers? In fact, the less effective sample had a lower value-added only by a statistically insignificant amount.
None of the differences in IMPACT’s quantitative results, to this historian at least, seems to be significant. Educators had to generate some evaluation metrics for the sake of having evaluation metrics. They can include homemade quantitative rubrics for the individual teacher (TAS) or for the school (CSC). From this outsider’s perspective, they sound like busy work, and I bet a lot of D.C. teachers feel the same way. The less effective sample received a mean score of 2.98 on the TAS, in contrast to 3.10 for the highly effective sample. On the CSC, the less effective sample supposedly earned a 3.25 in comparison to the highly effective mean score of 3.30. The metric that should be more reliable, the TLF teacher observation, determined that the less effective teachers had a mean score that was only 0.11 lower.
The only differences that seem to be significant were produced by the Core Professionalism segment, which holds teachers accountable for their attendance and their professional conduct. Common sense says that should be the most reliable metric and, sure enough, its mean difference was 0.37, or about three times as great as the observation’s mean difference. That raises the question of whether a more modest approach would have been preferable. Rather than harass all teachers in an effort to root out bad teachers, would it not have been smarter to hold teachers accountable for their behavior, and terminate those who were not doing their jobs? Would it not have been smarter to focus on what teachers actually do and hold teachers accountable for actually teaching, as opposed to creating such a divisive and stressful experiment?
The part of IMPACT that most impressed Wyckoff and Dee is that it imposes stress on teachers who are lower-rated. For argument's sake, however, let’s say that all of the 14% of the district’s teachers who were judged to be “Minimally Effective” were accurately placed in that category. Wyckoff and Dee proclaim IMPACT a success because about 20% of teachers just above the threshold for “Effective” left the school system at the end of a year, while about 30% of teachers just below that threshold quit. Was it a good bargain for the D.C. schools to impose all of the stress and the other negative byproducts of IMPACT in order to speed the exit of such a small number?
The economists brag that a teacher who received a low rating is more likely to leave the district. But wouldn’t a good teacher who received an unfair value-added rating be equally or more likely to leave? Since value-added models are systematically biased against teachers with classes of English language learners, special education students, and poor students, it seems likely that IMPACT will help drive teaching talent from schools where it is harder to raise test scores.
Common sense indicates that IMPACT and other value-added evaluations will soon produce a predictable behavior. Systems will have to play statistical games to inflate parts of teachers’ evaluations so they do not have to fire all the teachers who, fairly and unfairly, receive low ratings on test score growth. As the authors acknowledge, D.C. has had much more money and thus a larger pool of replacement teachers. Even D.C., however, may already see the need to start playing those games so that it does not lose irreplaceable teachers. The study includes a footnote which says:
In 2009-10 and 2010-11, the mean teacher value-added was equated to an IVA score of 2.5 with relatively few teachers receiving either a 1.0 or a 4.0. In 2011-12 the mean teacher value-added score was equated to an IVA score of 3.0 and the relatively more teachers were assigned to 1 and 4. This had the net effect of increasing average IVA scores by 0.25 in 2011-12. Because of these adjustments, we avoid any year-to-year comparisons for IMPACT scores or their components.
In other words, D.C. quietly reduced the short-term impact of the value-added component of IMPACT. In doing so, D.C. also talked tough about the rigor of future evaluations. But talk is just talk. Or, at least, economists haven't started to quantify reformers' spin as they back away, growling.
The bottom line for value-added evaluations is that they cannot determine whether a teacher’s failure to meet his growth target is due to his own ineffectiveness or to the school’s ineffectiveness, peer pressure, or other factors. The bottom line for IMPACT is that it imposes a great deal of anguish and is bound to increase teachers’ IMPACT scores, but there is no way of determining whether those numbers reflect any real improvements in instruction. Before long, I expect the overwhelming consensus of observers will recognize that the facts are the opposite of the reformers’ spin.