Posted earlier today at Substance News:
Evidence Presented in the Case Against Growth Models for High Stakes Purposes
Denise Wilburn and Jim Horn
The following article quotes liberally from The Mismeasure of Education, and it represents an overview of the research-based critiques of value-added modeling, or growth models. We offer it here to Chicago educators [and educators everywhere] with the hope that it may serve to inspire and inform the restoration of fairness, reliability, and validity to the teacher evaluation process and the assessment of children and schools.
In the fall of 2009, before the final guidance was issued to all the cash-strapped states lined up for a share of the $3.4 billion in Race to the Top grants, the Board on Testing and Assessment (BOTA) issued a 17-page letter to Arne Duncan conveying the National Research Council's response to the RTTT draft plan. BOTA cited reasons to applaud the Department of Education's efforts, but the main purpose of the letter was to voice, in unequivocal language, the NRC's concern regarding the use of value-added measures, or growth models, for high-stakes purposes, specifically the evaluation of teachers:
BOTA has significant concerns that the Department’s proposal places too much emphasis on measures of growth in student achievement (1) that have not yet been adequately studied for the purposes of evaluating teachers and principals and (2) that face substantial practical barriers to being successfully deployed in an operational personnel system that is fair, reliable, and valid (p. 8).
In 1992 when Dr. William Sanders sold value-added measurement to Tennessee politicians as the most reliable, valid, and fair way to measure student academic growth and the impact that teachers, schools, and districts have on student achievement, the idea seemed reasonable, more fair, and even scientific to some. But since 1992, leading statisticians and testing experts who have scrutinized value-added models have concluded that these assessment systems for measuring test score growth do not meet the reliability, validity, and fairness standards established by respected national organizations, such as the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (Amrein-Beardsley, 2008). Nonetheless, value-added modeling for high-stakes decision making now consumes significant portions of state education budgets, even though there is no oversight agency to make sure that students and teachers are protected:
Who protects [students and teachers in America’s schools] from assessment models that could do as much harm as good? Who protects their well-being and ensures that assessment models are safe, wholesome, and effective? Who guarantees that assessment models honestly and accurately inform the public about student progress and teacher effectiveness? Who regulates the assessment industry? (Amrein-Beardsley, 2008, p. 72)
If value-added measures do not meet the highest standards established for reliable, valid, and fair measurement, then the high-stakes decisions based on those measures are also unreliable, invalid, and unfair. Therefore, legislators, policymakers, and administrators who require high-stakes decisions based on value-added measures are equally culpable and liable for the wrongful termination of educators mismeasured with metrics derived from tests that were never intended to do anything but give a ballpark idea of how students are progressing on academic benchmarks for subject-matter concepts at each grade level.
Is value-added measurement reliable? Is it consistent in its measurement and free of measurement errors from year to year? As early as 1995, the Tennessee Office of Education Accountability (OEA) reported “unexplained variability” in Tennessee’s value-added scores and called for an outside evaluation of all components of the Tennessee Value-Added Assessment System (TVAAS), including the achievement tests used in calculating the value-added scores. The outside evaluators, Bock, Wolfe & Fisher (1996), questioned the reliability of the system for high-stakes decisions based on how the achievement tests were constructed. These experts recognized that test makers are engaged in a very difficult and imprecise science when they attempt to rank order learning concepts by difficulty.
This difficulty is compounded when they attempt to link those concepts across a grade-level continuum so that students successfully build subject matter knowledge and skill from grade to grade. An extra layer of content design imprecision is added when test makers create multiple forms of the test at each grade level to represent those rank-ordered test items. Bock, Wolfe, & Fisher found that variation in test construction was, in part, responsible for the “unexplained variability” in Tennessee’s state test results.
Other highly respected researchers (Ballou, 2002; Lockwood, 2006; McCaffrey & Lockwood, 2008; Briggs, Weeks & Wiley, 2008) have weighed in on the issue of reliability of value-added measures based on questionable achievement test construction. As an invited speaker to the National Research Council workshop on value-added methodology and accountability in 2010, Ballou pointedly went to the heart of the test quality matter when he acknowledged the “most neglected” question among economists concerned with accountability measures:
The question of what achievement tests measure and how they measure it is probably the [issue] most neglected by economists…. If tests do not cover enough of what teachers actually teach (a common complaint), the most sophisticated statistical analysis in the world still will not yield good estimates of value-added unless it is appropriate to attach zero weight to learning that is not covered by the test. (National Research Council and National Academy of Education, 2010, p. 27).
In addition to these test issues, the reliability of teacher effect estimates in high-stakes applications is compromised by a number of other recurring problems:
1) the timing of the test administration and summer learning loss (Papay, 2011);
2) missing student data (Bock & Wolfe, 1996; Fisher, 1996; McCaffrey et al., 2003; Braun, 2005; National Research Council, 2010);
3) student data poorly linked to teachers (Dunn, Kadane & Garrow, 2003; Baker et al., 2010);
4) inadequate sample size of students due to classroom arrangements or other school logistical and demographic issues (Ballou, 2005; McCaffrey, Sass, Lockwood, & Mihaly, 2009).
Growth models such as the Sanders VAM use multiple years of data in order to reduce the degree of potential error in gauging teacher effect. Sanders justifies this practice by claiming that a teacher's effect on her students' learning will persist into the future and, therefore, can be measured with consistency.
However, research conducted by McCaffrey, Lockwood, Koretz, Louis, and Hamilton (2004) and subsequently by Jacob, Lefgren, and Sims (2008) shatters this bedrock assumption. These researchers found that “only about one-fifth of the test score gain from a high value-added teacher remains after a single year…. After two years, about one-eighth of the original gain persists” (p. 33).
Too many uncontrolled factors impact the stability and sensitivity of value-added measurement for making high-stakes personnel decisions for teachers. In fact, Schochet and Chiang (2010) found that the error rate for distinguishing a teacher's performance from average performance using three years of data was about 26 percent. They concluded:
more than 1 in 4 teachers who are truly average in performance will be erroneously identified for special treatment, and more than 1 in 4 teachers who differ from average performance by 3 months of student learning in math or 4 months in reading will be overlooked (p. 35).
Schochet and Chiang also found that reducing the error rate in teachers' effect scores from 26 percent to 12 percent would require ten years of data for each teacher (p. 35). When we consider the stakes for altering school communities and students' and teachers' lives, the utter impracticality of reducing error even to an arguably acceptable level makes value-added modeling for high-stakes decisions simply unacceptable.
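The statistical logic behind these error rates can be sketched with a toy simulation. This is our illustration, not Schochet and Chiang's actual model, and every parameter in it (the noise level, the flagging threshold, the sample size) is hypothetical: a truly average teacher's yearly score is modeled as her true effect of zero plus random noise, and averaging more years of data shrinks the noise only slowly.

```python
import random
import statistics

def misclassification_rate(n_teachers, n_years, noise_sd, threshold, seed=0):
    """Fraction of truly average teachers (true effect = 0) who are flagged
    as above or below average when judged on the mean of noisy yearly scores.
    All parameters are illustrative, not drawn from any real evaluation system."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_teachers):
        # Each yearly score is pure measurement noise around the true effect of 0.
        scores = [rng.gauss(0.0, noise_sd) for _ in range(n_years)]
        if abs(statistics.mean(scores)) > threshold:
            flagged += 1
    return flagged / n_teachers

# With one year of data, roughly a third of truly average teachers are flagged;
# even averaging ten years leaves a nonzero error rate.
one_year = misclassification_rate(10_000, 1, noise_sd=1.0, threshold=1.0)
ten_years = misclassification_rate(10_000, 10, noise_sd=1.0, threshold=1.0)
print(one_year, ten_years)
```

The exact numbers depend entirely on the assumed noise level and threshold; the point the sketch shares with Schochet and Chiang's analysis is structural, namely that error shrinks only with the square root of the number of years of data.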
Beyond the reliability problems summarized above, growth models also have serious validity issues caused by the nonrandom placement of students from classroom to classroom, from school to school within districts, and from system to system. Non-random placement of students further erodes Sanders' causal claims for teacher effects on achievement, as well as his claim that the impact of student characteristics on student achievement is irrelevant. For a teacher's value-added scores to be valid, she must have “an equal chance of being assigned any of the students in the district of the appropriate grade and subject”; otherwise, “a teacher might be disadvantaged [her scores might be biased] by placement in a school serving a particular population” year after year (Ballou, 2005, p. 5). In Tennessee, no educational policy or administrative rule requires schools to randomly assign teachers or students to classrooms, so neither teachers nor students have an equal chance of random placement within a school, within a district, or across the state.
To underscore the effect of non-random placement of disadvantaged students on teacher effect estimates, Kupermintz (2003) found in reexamining Tennessee value-added data that “schools with more than 90% minority enrollment tend to exhibit lower cumulative average gains” and school systems’ data showed “even stronger relations between average gains and the percentage of students eligible for free or reduced-price lunch” (p. 295).
Value-added models like TVAAS that assume random assignment of students to teachers’ classrooms “yield misleading [teacher effect] estimates, and policies that use these estimates in hiring, firing, and compensation decisions may reward and punish teachers for the students they are assigned as much as for their actual effectiveness in the classroom” (Rothstein, 2010, p. 177).
In his 2005 value-added primer, Henry Braun stated clearly and boldly that the “fundamental concern is that, if making causal attributions (of teacher effect on student achievement performance) is the goal, then no statistical model, however complex, and no method of analysis, however sophisticated, can fully compensate for the lack of randomization” (p. 8).
Any system of assessment that claims to measure teacher and school effectiveness must be fair in its application to all teachers and to all schools. Because teaching is a contextually embedded, nonlinear activity that cannot be accurately assessed by a linear, context-independent value-added model, it is unfair to use such a model at this time. The consensus among VAM researchers is against the use of growth models for high-stakes purposes. Any assessment system that can misidentify 26 percent or more of teachers as above or below average when they are neither is unfair when used for decisions about dismissal, merit pay, granting or revoking tenure, closing a school, retaining students, or withholding resources for poor performance.
When the almost two-thirds of teachers who do not teach subjects with standardized tests are rated based on the test score gains of other teachers in their schools, the assessment system has produced unfair and unequal treatment (Gonzalez, 2012).
When the assessment system intensifies teaching to the test, narrowing of curriculum, avoidance of the neediest students, reduction of teacher collaboration, or the widespread demoralization of teachers (Baker, E. et al, 2010), then it has unfair and regressive effects.
Any assessment system whose proprietary status limits access by the scholarly community to validate its findings and interpretations is antithetical to the review process upon which knowledge claims are based. An unfair assessment system is unacceptable for high stakes decision-making.
In August 2013, the Tennessee State Board of Education adopted a new teacher licensing policy that ties license renewal to value-added scores. However, implementation of this policy was delayed by a very important presentation made public by the Tennessee Education Association. Presented by TEA attorney Rick Colbert and based on individual teachers' sharing their value-added data for additional analysis, the TEA demonstrated that 43 percent of the teachers who would have lost their licenses due to declining value-added scores in one year had higher scores the following year, with 20 percent of those teachers scoring high enough in the following year to retain their licenses. The presentation may be viewed on YouTube: http://www.youtube.com/watch?v=l1BWGiqhHac
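The TEA's finding is consistent with simple regression to the mean: when yearly scores are noisy, many teachers flagged for a low score in one year will post a higher score the next year even if nothing about their teaching changes. The toy simulation below is our illustration of that statistical effect, not the TEA's analysis, and its parameters (the spread of true effects, the noise level, the bottom-fifth cutoff) are made up for the sketch.

```python
import random

def fraction_improving_next_year(n, true_sd, noise_sd, seed=1):
    """Of teachers in the bottom fifth of year-1 scores, the fraction who
    post a higher score in year 2, even though every teacher's true effect
    is held constant. All parameters are illustrative assumptions."""
    rng = random.Random(seed)
    teachers = []
    for _ in range(n):
        true_effect = rng.gauss(0.0, true_sd)   # unchanging "real" quality
        year1 = true_effect + rng.gauss(0.0, noise_sd)
        year2 = true_effect + rng.gauss(0.0, noise_sd)
        teachers.append((year1, year2))
    teachers.sort(key=lambda t: t[0])           # rank by noisy year-1 score
    bottom_fifth = teachers[: n // 5]
    return sum(1 for y1, y2 in bottom_fifth if y2 > y1) / len(bottom_fifth)

share = fraction_improving_next_year(10_000, true_sd=0.5, noise_sd=1.0)
print(round(share, 2))
```

Under these assumed parameters, well over half of the bottom-fifth teachers “improve” the next year purely by chance, which is the pattern the TEA documented in teachers' actual value-added records.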
After 20 years of value-added assessment in Tennessee, educational achievement does not reflect any value added by the expensive investment in the assessment system. Despite $326,000,000 spent on assessment, the TVAAS, and other accountability-related costs since 1992, the State's student achievement levels remain in the bottom quarter nationally (Score Report, 2010, p. 7). Tennessee received a D on K–12 achievement when compared to other states based on NAEP achievement levels and gains, poverty gaps, graduation rates, and Advanced Placement test scores (Quality Counts 2011, p. 46). The Public Education Finances reports (U.S. Census Bureau) rank Tennessee's per-pupil spending 47th in both 1992 and 2009. When state legislators and policymakers were led to believe in 1992 that the teacher is the single most important factor in improving student academic performance, they found reason to justify lowering education spending as a priority while increasing accountability.
Finally, the evidence from twenty years of review and analysis by leading national experts in educational measurement and accountability leads to the same conclusion when trying to answer Dr. Sanders' original question: Can student test data be used to determine teacher effectiveness? The answer: no, not with enough certainty to make high-stakes personnel decisions. In turn, when we ask the larger social science question (Flyvbjerg, 2001): Is the use of value-added modeling and high-stakes testing a desirable social policy for improving learning conditions and learning for all students? The answer must be an unequivocal “no,” and it must remain so until assessments measure various levels of learning at the highest levels of reliability and validity, and with the conscious purpose of equality in educational opportunity for all students.
We have wasted much time, money, and effort to find out what we already knew: effective teachers and schools make a difference in student learning and students' lives. What the TVAAS and the EVAAS do not tell us, and what supporters of growth models seem oddly incurious to know, is what, how, or why teachers make a difference. While test data and value-added analysis may highlight strengths and areas of needed intervention in school programs or subgroups of the student population, we can only know the “what,” “how,” and “why” of effective teaching through careful observation by knowledgeable observers in classrooms where effective teachers engage students in varied levels of learning across multiple contexts. And while this kind of knowing may be too much to ask of any set of algorithms developed so far for deployment in schools, it is not at all alien to great educators, who have been asking these questions and doing this kind of knowledge sharing since Socrates, at least.
Amrein-Beardsley, A. (2008). Methodological concerns about the Education Value-Added Assessment System. Educational Researcher, 37(2), 65-75. doi: 10.3102/0013189X08316420
Baker, A., Xu, D., & Detch, E. (1995). The measure of education: A review of the Tennessee value added assessment system. Nashville, TN: Comptroller of the Treasury, Office of Education Accountability Report.
Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., Ravitch, D., Rothstein, R., Shavelson, R. J., & Shepard, L. A. (2010, August 29). Problems with the use of student test scores to evaluate teachers (Briefing Paper #278). Washington, DC: Economic Policy Institute.
Ballou, D. (2002). Sizing up test scores. Education Next. Retrieved from www.educationnext.org
Ballou, D. (2005). Value-added assessment: Lessons from Tennessee. Retrieved from http://dpi.state.nc.us/docs/superintendents/quarterly/2010-11/20100928/ballou-lessons.pdf
Bock, R., & Wolfe, R. (1996, Jan. 23). Audit and review of the Tennessee value-added assessment system (TVAAS): Preliminary report. Nashville, TN: Comptroller of the Treasury, Office of Education Accountability Report.
Braun, H. I. (2005). Using student progress to evaluate teachers (Policy Information Perspective). Retrieved from Educational Testing Service, Policy Information Center website: http://www.ets.org/Media/Research/pdf/PICVAM.pdf
Briggs, D. C., Weeks, J. P. & Wiley, E. (2008, April). The sensitivity of value-added modeling to the creation of a vertical scale score. Paper presented at the National Conference on Value-Added Modeling, Madison, WI. Retrieved from http://academiclanguag.wceruw.org/news/events/VAM%20Conference%20Final%20Papers/SensitivityOfVAM_BriggsWeeksWiley.pdf
Dunn, M., Kadane, J., & Garrow, J. (2003). Comparing harm done by mobility and class absence: Missing students and missing data. Journal of Educational and Behavioral Statistics, 28, 269–288.
Fisher, T. (1996, January). A review and analysis of the Tennessee value-added assessment system. Nashville, TN: Tennessee Comptroller of the Treasury, Office of Education Accountability Report.
Flyvbjerg, B. (2001). Making social science matter: Why social inquiry fails and how to make it succeed again. Cambridge: Cambridge University Press.
Gonzalez, T. (2012, July 17). TN education reform hits bump in teacher evaluation. The Tennessean.
Jacob, B. A., Lefgren, L., & Sims, D. P. (2008, June). The persistence of teacher-induced learning gains (Working Paper 14065). Retrieved from the National Bureau of Economic Research website: http://www.nber.org/papers/w14065
Kupermintz, H. (2003). Teacher effects and teacher effectiveness: A validity investigation of the Tennessee Value Added Assessment System. Educational Evaluation and Policy Analysis, 25(3), 287-298.
Lockwood, J. R., McCaffrey, D. F., Hamilton, L. S., Stecher, B., Le, V., & Martinez, F. (2006). The sensitivity of value-added teacher effect estimates to different mathematics achievement measures. Retrieved from The Rand Corporation website: http://www.rand.org/content/dam/rand/pubs/reports/2009/RAND_RP1269.pdf
McCaffrey, D. F. & Lockwood, J. R. (2008, November). Value-added models: Analytic Issues. Paper presented at the National Research Council and the National Academy of Education, Board of Testing and Accountability Workshop on Value-Added Modeling, Washington DC.
McCaffrey, D. F., Lockwood, J. R., Koretz, D. M. & Hamilton, L. S. (2003). Evaluating value-added models for teacher accountability. Retrieved from The Rand Corporation website: http://www.rand.org/pubs/monographs/MG158.html
McCaffrey, D. F., Lockwood, J. R., Koretz, D., Louis, T. A., & Hamilton, L. (2004). Models for value-added modeling of teacher effects. Journal of Educational and Behavioral Statistics, 29(1), 67-101.
McCaffrey, D. F., Sass, T. R., Lockwood, J. R. & Mihaly, K. (2009). The intertemporal variability of teacher effect estimates. Education Finance and Policy, 4(4), 572-606.
National Academy of Sciences. (2009). Letter report to the U. S. Department of Education on the Race to the Top fund. Washington, DC: The National Academies Press. Retrieved from http://www.nap.edu/catalog.php?record_id=12780
National Research Council and National Academy of Education. (2010). Getting Value Out of Value-Added: Report of a Workshop. Committee on Value-Added Methodology for Instructional Improvement, Program Evaluation, and Educational Accountability, Henry Braun, Naomi Chudowsky, and Judith Koenig, Editors. Center for Education, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press.
Papay, J. (2011). Different tests, different answers: The stability of teacher value-added estimates across outcome measures. American Educational Research Journal, 48(1),163-193.
Quality counts, 2011: Uncertain forecast. (2011, January 13). Education Week. Retrieved from http://www.edweek.org/ew/toc/2011/01/13/index.html
Rothstein, J. (2010). Teacher quality in educational production: Tracking, decay, and student achievement. The Quarterly Journal of Economics, 125(1), 175-214.
Schochet, P. Z., & Chiang, H. S. (2010). Error rates in measuring teacher and school performance based on student test score gains (NCEE 2010-4004). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.
State Collaborative on Reforming Education. (2010). The state of education in Tennessee (Annual Report). Retrieved from http://www.tnscore.org/wp-content/uploads/2010/06/Score-2010-Annual-Report-Full.pdf
U. S. Census Bureau. (2011). Public Education Finances: 2009 (G09-ASPEF). Washington, DC: U.S. Government Printing Office.