With numerous court cases filed to challenge the fairness, reliability, and/or validity of the various mis-applications of value-added modeling (VAM) to reward and punish teachers, it is appropriate to consider the huge body of evidence that may be shaped to drive a stake through the heart of the monstrous chimera that the tobacco-chewing Bill Sanders conjured just over 30 years ago from his ag desk at the University of Tennessee. The following excerpt from The Mismeasure of Education considers the reliability issue, alone. The other elements of the VAM critique comprise the remainder of Part 3 of the book.
. . . . In 1983, William Sanders was a station statistician at the UT Agricultural Campus and an adjunct professor in UT’s College of Business. Based on his own experimentation with modeling the growth of farm animals and crops, Sanders proposed a hypothetical and reductive question: Can student achievement growth data be used to determine teacher effectiveness? He then built a statistical model, ran the data of 6,890 Knox County students through his model and answered his own question with an unequivocal affirmative. Proceeding from this single study, Sanders’ claims went beyond the customary correlational relationship between or among variables that statisticians find as patterns or trends in data. He pronounced that teachers not only contributed to the rate of student growth, but that teacher effectiveness was in fact the most important variable in the rate of student growth:
If the purpose of educational evaluation is to improve the educational process, and if such improvement is characterized by improved academic growth of students, then the inclusion of measures of the effectiveness of schools, schools systems, and teachers in facilitating such growth is essential if the purpose is to be realized. Of these three, determining the effectiveness of individual teachers holds the most promise because, again and again, findings from TVAAS research show teacher effectiveness to be the most important factor in the academic growth of students. (Sanders & Horn, 1998, p. 3).
This provocative pronouncement sent other statisticians, psychometricians, mathematicians, economists, and educational researchers into a bustle of activity to examine Sanders’ statistical model and his various claims based on that model, which became the Tennessee Value-Added Assessment System (TVAAS), and which is now marketed by SAS as the Education Value-Added Assessment System (SAS® EVAAS®). In sharing the findings of that body of research, it becomes clear that using TVAAS to advance education policy is 1) undesirable when examined for high standards of reliability, validity and fairness, and 2) counterproductive in reaching high levels of academic achievement.
At first glance, Sanders’ question seems reasonable and his answer logical. Why entrust students to teachers for seven or more hours a day and expect children to grow academically, if teachers do not contribute significantly to student learning? In the Aristotelian tradition of logical conclusions “if a=b and b=c, then c=a,” Sanders’ argument goes something like this:
a) if student test scores are indicative of academic growth, and
b) academic growth is indicative of teacher effectiveness, then
c) test scores are indicative of teacher effectiveness; and therefore, they should be used in teacher evaluation.
If test scores are improving, then, academic growth has increased and teacher effectiveness is greater. On the contrary, if test scores are not improving, academic growth has decreased and teacher effectiveness is diminished.
When Sanders made his causal pronouncements, he made assumptions that, in turn, made the teaching and learning context largely irrelevant. Using complex statistical methods previously applied to business and agriculture to study inputs and outputs of systems, Sanders developed formulae that eliminated student background, educational resources, district curricula and adopted instructional practices, and the learning environment of the classroom as variables that impact student learning. Sanders expressed his rationale for attempting to isolate teacher effect from the myriad of effects on student learning in an online listserv discussion with Gene Glass and others in 1994:
The advantage of following growth over time is that the child serves as his or her own “control.” Ability, race, and many other factors that have been impossible to partition from educational effects in the past are stable throughout the life of the child. [http://gvglass.info/TVAAS/]
What we know, of course, is that a child’s economic, social, and familial conditions can and do change depending on the larger contexts of national recessions, mobility, divorce, crime, and changing education policies. It stands to reason, then, that claims made with statistical models such as TVAAS must be held to higher standards of proof when they assert causation for high-stakes purposes while rendering contexts irrelevant. These standards are shaped by the following questions: Are the value-added assessment findings reliable? Are the findings valid? Are the findings fair? Based on critical reviews of leading statisticians, mathematicians, psychometricians, economists, and education researchers, the Sanders Model does not meet these standards of proof when used for high-stakes purposes, with the most egregious shortcomings apparent when used for the dual purposes of diagnostics and evaluation. Briefly, there are three reasons the Sanders Model falls short: Sanders assumes 1) that tests and test scores are a reliable measure of student learning; 2) that characteristics of students, classrooms, schools, school systems, and neighborhoods can be made irrelevant by comparing a student’s test scores from year to year; 3) that value-added modeling can capture the expertise of teachers fairly, just as fairly as Harrington Emerson thought he could apply Frederick Taylor’s (1911) Principles of Scientific Management to the inefficiency of American high schools in 1912 (Callahan, 1962). As explained in Part II, the TVAAS is simply the latest iteration of business efficiency formulas misapplied to educational settings in hopes of producing a standard product in a cost-effective fashion. Sanders made this clear in 1994 while distinguishing TVAAS from other teacher evaluation formats:
TVAAS is product oriented. We look at whether the child learns—not at everything s/he learns, but at a portion that is assessed along the articulated curriculum, a portion each parent is entitled to expect an adequately instructed child will learn in the course of a year ([http://gvglass.info/TVAAS/]).
This quote is even more telling for what remains implicit, rather than expressed: “the portion that is assessed” is the portion that can be reduced to the standardized test format, which is required for Dr. Sanders to be able to perform his statistical alchemy to begin with. Not only does Dr. Sanders claim to speak here for the millions of parents who may have greater expectations for their children’s learning than those reflected in Dr. Sanders’ minimalist expectations, but he also tips his hand as to a deeper accountability and efficiency motive that is exposed by his concern for the “adequately instructed child.” Once the other contextual factors (resources, poverty levels, parenting, social and cultural capital, leadership, etc.) have been excised from Dr. Sanders’ formulae, the only remaining contextual factor (the teacher) must absorb the full weight of any causative change. Such contextual cleansing may make for beautiful statistical results, but it performs a devastating reduction to what is considered learning in schools, all the while professing not to care a whit about either what is taught or how it is taught. All these basic shortcomings are reinforced by assuming that learning is linear, a demonstrably false assumption that will be discussed later in this section.
While there were distinct statistical and psychometric challenges to the efficacy of TVAAS in the educational measurement and evaluation research literature from 1994 to 2012, one overarching theme ran through them all: value-added modeling at the teacher-effect level is not stable enough to determine individual teacher contributions to student academic performance, especially as it relates to personnel decisions, i.e., evaluation, performance pay, tenure, hiring, or dismissal decisions. As early as 1995, scholars (Baker, Xu, & Detch, 1995) offered a strong warning that the use of TVAAS for high-stakes purposes might create unintended consequences for both teachers and students, such as teaching to the test (thus narrowing the curriculum), teaching test skills instead of academic skills, over-enrolling students in special education since special education scores were not counted in TVAAS calculations, cheating to raise test scores, and using poor test performance to hurt teachers professionally. By 2011, researchers (Corcoran, Jennings, & Beveridge, 2011) offered empirical evidence that not all teachers teach to the test, but when they do, student learning depreciates more quickly than when teachers teach to general knowledge domains and expect students to master concepts and apply skills.
In the following examination of these three standards of proof (reliability, validity, and fairness), we summarize the findings of national experts in statistical modeling, value-added assessment, education policy, and accountability practices. Taken together, they provide irrefutable evidence that Sanders fails to meet these standards by using TVAAS for high-stakes decision-making such as reducing resources to schools, closing low-scoring schools, or sanctioning and/or rewarding teachers. Most important, however, is the Sanders-sanctioned myth that if students are making some yearly growth on tests that were constructed for diagnostic rather than evaluative purposes, then those students will have received an education sufficient for a successful life, economically, socially, personally.
The Tennessee Value-Added Assessment Model—Reliability Issues
The Tennessee Comprehensive Assessment Program (TCAP) achievement test is a standardized, multiple-choice test composed of criterion-referenced items administered in grades 3-8. It is purported to measure student mastery of general academic concepts and skills as well as specific Tennessee learning standard objectives. Controversy surrounding the use of achievement tests stems from the degree of test reliability needed for the high-stakes purposes for which they are used. To achieve reliability, achievement test scores must be consistent over repeated test measurements and free of errors of measurement (RAND, 2010). The degree of reliability is affected both by test construction, such as vertical scales or test equating, and by test use, such as diagnostic versus evaluative purposes.
As early as 1995, the Tennessee Office of Education Accountability (OEA) reported “unexplained variability” in the value-added scores and called for an outside evaluation of all components of the TVAAS that included the tests used in calculating the value-added scores (p. iv). A three-person outside evaluation team included R. Darrell Bock, a distinguished professor in design and analysis of educational assessment and professor at the University of Chicago; Richard Wolfe, head of computing for the Ontario Institute for Studies in Education; and Thomas H. Fisher, Director of Student Assessment Services for the Florida Department of Education. The outside evaluation team investigated the 1995 OEA concern over the achievement tests used by TVAAS. Of particular interest to TVAAS evaluators were the test constructions of equal interval and vertical scales and the process of test equating.
Bock and Wolfe found the scaling properties acceptable for the purpose of determining student academic gain scores from year to year, but unacceptable for determining district, school, and teacher effect scores (p. 32). All three evaluators had concerns about test equating. Fisher’s (1996) concerns focused on the testing contractor, CTB/McGraw-Hill, having the sole responsibility for developing multiple test forms of equal difficulty at each grade level, stating that “[t]est equating is a procedure in which there are many decisions not only about initial test content but also about the statistical procedure used. If care is not exercised, the content design will change over time and the equating linkages will drift” (p. 23). And indeed, Bock and Wolfe found in their examination of Tennessee’s equated test forms that test form difficulty (due to item selection) created unexpected variation in gain scores at some grade levels. Bock and Wolfe also emphasized the importance of how the scale scores, used in calculating value-added gain scores, are derived (pp. 12-13). Why are the scaling properties of tests important?
Achievement tests are measurement tools designed to determine where on a continuum of learning a student’s performance falls. The equal interval scale of a test is the continuum of knowledge and skills divided into equal units of “learning” value. If one thinks of measuring learning along a number line ranging from 1 to 100, the assumption of equal intervals of learning would be that the same “amount” of learning occurs whether the student’s scores move from 1 to 2 or from 50 to 51 or from 98 to 99 on the number line or measurement scale. The leap of faith here is that the student who scores 1 to 10 at the less difficult end of the testing continuum has learned a greater amount than the student who increases his or her score by fewer intervals, say 95 to 99, at the most difficult end of the continuum. Dale Ballou (2002), an economics professor at Vanderbilt University who collaborated with Sanders during the 1990s, has maintained that the equal intervals used to measure student ability are really measuring the ordered difficulty of test items, with the ordering of difficulty determined by the test constructor (p. 15). If, for example, a statistics and probability question requiring students to make a prediction based on various representations is placed on a third grade test, it is considered more difficult than an item requiring the student to simply add or subtract. It is difficult to say who has learned more, the third grade gifted student who answers the statistical question correctly but makes less progress than the student who answers all the calculation questions correctly and appears from the test score to make more progress. Therefore, student ability is inferred from test scores and not truly observed, thus making value-added teacher effect estimates better or worse depending on how these equal interval scales are designed for consistently measuring units of “learning” at every grade level over time.
Differences in units of measurement from scale to scale yield differences in teacher effect scores, even though the selection and use of the scales are beyond the control of any teacher. Ballou has concluded that a built-in imprecision in scales leads to quite arbitrary results, and that “our efforts to determine which students gain more than others—and thus which teachers and schools are more effective—turn out to depend on conventions (arbitrary choices) that make some educators look better than others” (p. 15).
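Ballou’s point about scale conventions can be illustrated with a small Python sketch. Everything in it is hypothetical: the two classes, their scores, and the `stretched_scale` function are invented for illustration and are not drawn from TCAP or TVAAS. Both scales keep every student in the same rank order, yet they disagree about which class gained more.

```python
def mean_gain(pre, post, scale=lambda score: score):
    """Average per-student gain after mapping raw scores through `scale`."""
    gains = [scale(after) - scale(before) for before, after in zip(pre, post)]
    return sum(gains) / len(gains)


def stretched_scale(score):
    """Hypothetical alternative scale that widens intervals at the top end."""
    return score ** 2 / 100


# Hypothetical classes: A starts near the bottom of the scale, B near the top.
class_a = ([10, 12, 14], [20, 22, 24])
class_b = ([80, 82, 84], [88, 90, 92])

for label, scale in [("original", lambda s: s), ("stretched", stretched_scale)]:
    gain_a = mean_gain(*class_a, scale=scale)
    gain_b = mean_gain(*class_b, scale=scale)
    leader = "A" if gain_a > gain_b else "B"
    print(f"{label} scale: A={gain_a:.2f}, B={gain_b:.2f}, larger gain: {leader}")
```

On the original scale class A shows the larger average gain; on the stretched scale class B does, although no student’s answers have changed.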
The vertical scaling of an achievement test is based on the measurement of increasingly difficult test items from year to year on the same academic content and skills. Vertical scaling is important to Sanders’ TVAAS model because he uses student test scores over multiple years in estimating teacher effectiveness. Therefore, the content and skills at one grade level must be linked to the content and skills at the next grade level in order to measure changes in student performance on increasingly difficult or more complex concepts and skills from third grade through eighth grade. Problems arise, however, when there is a shift in the learning progression of content and skills. For example, third grade reading may focus on types and characteristics of words and the retelling of narratives, fifth grade on types and characteristics of literary genres and interpretation of non-fiction texts, and eighth grade on evaluation of texts for symbolic meaning, bias, and connections to other academic subjects such as history or science. While the content and skills are related from grade to grade, there may not be sufficient linkage between content and skills or consistency in the degree of difficulty across grades and subjects to render accurate performance portraits of students and the resulting teacher effect estimates, even if one has faith that test results can mirror teacher efforts in the best of all possible worlds: “Shifts in the mix of constructs across grades can distort test score gains, invalidate assumptions of perfect persistence of teacher effects and the use of gain scores to measure growth, and bias VAM [value-added model] estimates” (McCaffrey & Lockwood, 2008, p. 9).
The same kinds of inconsistencies can occur when using the same tests for other high-stakes measures such as school effectiveness. Using the Sanders Model and eight different vertical scales for the same CTB/McGraw-Hill tests at consecutive grade levels, Briggs, Weeks, and Wiley (2008) found that
the numbers of schools that could be reliably classified as effective, average or ineffective was somewhat sensitive to the choice of the underlying vertical scale. When VAMs are being used for the purposes of high-stakes accountability decisions, this sensitivity is most likely to be problematic (p. 26).
Lockwood (2006) and his colleagues at RAND found that variation within teachers’ effect scores persisted, even when the internal consistency reliability between the procedures subtest and the problem-solving subtest from the same mathematics test was high. In fact, there was greater variation from one subtest to the next than there was in the overall variation among teachers (p. 14). The authors cited the source of this variation as “the content mix of the test” (p. 17), which simply means that test construction and scaling are imperfect enough to warrant great care and prudence when applying even the most perfect statistical treatments under the most controlled conditions.
For students to show progress in a specific academic subject (low-stakes) and for Sanders to isolate the teacher effect based on student progress (high-stakes), tests require higher degrees of reliability in equal interval, vertical scaling and test equating. Tests are designed and constructed to do a number of things, from linking concepts and skills for annual diagnostic purposes to determining student mastery of assigned standards of learning. They are not, however, designed or constructed to reliably fulfill the value-added modeling demands placed on them. Though teachers cannot control the reliability of test scaling or the test item selection that represents what they teach, they can control their teaching to the learning objectives of the standards most likely to be on tests at their grade levels, namely those learning objectives that lend themselves easily to multiple-choice tests. For example, an eighth grade teacher might have students identify bias in different reading selections, easily tested in a multiple-choice format, instead of studying the effect of reporting bias in the news and research literature on current political issues and policy decisions in the students’ community. Or if she does teach the latter lesson, she and her students get no credit on a multiple-choice test for their true level of expertise in teaching and understanding the concept of bias. In fact, if she spends the time to examine reporting bias as part of the student’s social and political environment at the expense of another test objective, student scores and her resulting value-added effect designation may suffer. This is an unintended consequence of high-stakes testing and a survival strategy for teachers whose position and salary are bound to policies and practices that focus on high test scores.
As an invited speaker to the National Research Council workshop on value-added methodology and accountability, Ballou pointedly went to the heart of the matter when he acknowledged the “most neglected” question among economists concerned with accountability measures:
The question of what achievement tests measure and how they measure it is probably the [issue] most neglected by economists…. If tests do not cover enough of what teachers actually teach (a common complaint), the most sophisticated statistical analysis in the world still will not yield good estimates of value-added unless it is appropriate to attach zero weight to learning that is not covered by the test. (Braun, Chudowsky, & Koenig, 2010, p. 27).
In addition to these scaling issues, the reliability of the teacher effect estimates is a problem in high-stakes applications when compromised by the timing of the test administration, summer learning loss, missing student data, and inadequate sample size of students due to classroom arrangements or other school logistical and demographic issues.
Achievement tests used for value-added modeling are generally administered once a year. Scores from these tests are then compared in one of three ways: (1) from spring to spring, (2) from spring to fall, or (3) from fall to fall. Spring to spring and fall to fall schedules introduce what has become known as summer learning loss: what students forget during summer vacation. This loss is different for different students depending on what learning opportunities they have or do not have during the summer, e.g., summer tutoring programs, camps, family vacations, access to books and computers. What John Papay (2011) found in comparing different test administration schedules was that “summer learning loss (or gain) may produce important differences in teacher effect” and that even “using the same test but varying the timing of the baseline and outcome measure introduces a great deal of instability to teacher rankings” (p. 187). Papay warned policymakers and practitioners wishing to use value-added estimates for high-stakes decision making that they “must think carefully about the consequences of these differences, recognizing that even decisions seemingly as arbitrary as when to schedule the test within the school year will likely produce variation in teacher effectiveness estimates” (p. 188).
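The arithmetic behind the timing problem can be sketched in a few lines of Python. The class names and point values are hypothetical, chosen only to show how a spring-to-spring comparison folds the summer into the gain credited to the school year; they are not Papay’s data.

```python
def spring_to_spring_gain(summer_change, school_year_gain):
    """Gain seen when both tests are given in spring.

    The baseline is taken before the summer, so whatever students lose or
    gain over the summer is folded into the result.
    """
    return summer_change + school_year_gain


def fall_to_spring_gain(school_year_gain):
    """Gain seen when the baseline test is given after the summer."""
    return school_year_gain


# Hypothetical: two classes learn exactly the same amount during the year,
# but their students had different summers.
SCHOOL_YEAR_GAIN = 12
summer_changes = {"class with summer loss": -6, "class with summer gain": 2}

for name, summer_change in summer_changes.items():
    s2s = spring_to_spring_gain(summer_change, SCHOOL_YEAR_GAIN)
    f2s = fall_to_spring_gain(SCHOOL_YEAR_GAIN)
    print(f"{name}: spring-to-spring={s2s}, fall-to-spring={f2s}")
```

Under these assumed numbers the two teachers look identical on a fall-to-spring schedule and very different on a spring-to-spring schedule.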
In addition to the test schedule problem for pre/post test administration, achievement tests are usually administered before an entire school year is completed, meaning the students’ achievement test scores impact two teachers’ effect scores each year instead of just one. Sanders has remained unconcerned with this issue because his model, which uses multiple years of student data to estimate teacher effect scores, assumes the persistence of teacher effect on student performance. Ballou (2005) described Sanders’ assumption in the following way: “. . . teacher effects are layered over time (the effect of the fourth grade teacher persists into fifth grade, the effects of the fourth and fifth grade teachers persist into sixth grade, etc.)” (p. 6). However, the possible “contamination” of other teachers’ influence on an individual teacher’s effect estimate was noted in the first outside evaluation of TVAAS by Bock and Wolfe (1996), who questioned the three years of data that Sanders used in his model. Bock and Wolfe agreed that three years of data would help stabilize the estimated gain scores, but they were concerned, nonetheless, that “the sensitivity of the estimate as an indicator of a specific teacher’s performance would be blunted” (p. 21). Fourteen years after Bock and Wolfe’s neglected warning, the empirical research presented in a study completed for the U.S. Department of Education’s Institute of Education Sciences (Schochet & Chiang, 2010) found that the sensitivity of the estimate of a specific teacher’s effect was, indeed, blunted. They found, in fact, that the error rate for distinguishing teachers from the average teaching performance using three years of data was about 26 percent. They concluded that
more than 1 in 4 teachers who are truly average in performance will be erroneously identified for special treatment, and more than 1 in 4 teachers who differ from average performance by 3 months of student learning in math or 4 months in reading will be overlooked (p. 35).
Schochet and Chiang (2010) also found that to reduce the effect of test measurement errors to 12 percent of the variance in teachers’ effect scores would take 10 years of data for each teacher (p. 35), an utter impracticality when using value-added modeling for high-stakes decisions that alter school communities and students’ and teachers’ lives.
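Why more years of data help, and why so many are needed, can be sketched with a standard result: if each year’s estimate carries independent measurement noise, the noise variance of an average over n years shrinks in proportion to 1/n. The Python sketch below applies that formula to hypothetical variances. The numbers are assumptions chosen for illustration, not Schochet and Chiang’s figures, and the independence assumption is itself a simplification.

```python
def noise_share(true_variance, yearly_noise_variance, years):
    """Share of the variance in a multi-year average that is due to noise.

    Assumes noise is independent from year to year, so averaging over
    `years` divides the noise variance by `years`.
    """
    averaged_noise = yearly_noise_variance / years
    return averaged_noise / (true_variance + averaged_noise)


# Hypothetical variances: yearly noise is twice the true teacher variance.
TRUE_VARIANCE = 1.0
YEARLY_NOISE_VARIANCE = 2.0

for years in (1, 3, 10):
    share = noise_share(TRUE_VARIANCE, YEARLY_NOISE_VARIANCE, years)
    print(f"{years:>2} year(s) of data: {share:.0%} of variance is noise")
```

Because the noise shrinks only in proportion to 1/n, each added year buys less than the one before, which is consistent with the finding that a low noise share requires far more years than most teacher evaluation systems have available.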
McCaffrey, Lockwood, Koretz, Louis, & Hamilton (2004) challenged Sanders’ assumption of the persistence of a teacher’s effect on future student performance. In noting the “decaying effects” that are common in social science research, they concluded the Sanders claim of teacher effect immutability over time “is not empirically or theoretically justified and seems on its face not to be entirely plausible” (p. 94). In fact, in earlier research, McCaffrey and his colleagues at RAND (2003) developed a value-added model that allowed for the “estimation of the strength of the persistence of teacher effects in later years” (p. 59) and found that “teacher effects dampen [decay] very quickly” (p. 81). As a result, they called for more research concerning the assumption of persistence. Mariano, McCaffrey, and Lockwood’s (2010) research concerning the persistence of teacher effect showed that “complete persistence of teacher effects across future years is not supported by data” (in Lipscomb et al., 2010, p. A14). Using statistical methods to measure teacher persistence effect on student performance across multiple years in math and reading, Jacob, Lefgren, and Sims (2010) determined that “only about one-fifth of the test score gain from a high value-added teacher remains after a single year…. After two years, about one-eighth of the original gain persists” (p. 33). They went on to say that “if value-added test score gains do not persist over time, adding up consecutive gains [over multiple years] does not correctly account for the benefits of higher value-added teachers” (p. 33). In light of these more recent research studies, Sanders’ unwavering claims have proven more persistent than the teacher effect persistence that he claims.
In light of the mounting body of research that, at a minimum, acknowledges deep uncertainty regarding the persistence of teacher effect, the claim by Sanders and Rivers (1996) that the “residual effects of both very effective and ineffective teachers were measurable two years later, regardless of the effectiveness of teachers in later grades,” (p. 6) clearly needs to be reexamined and explicated further.
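The difference between full persistence and rapid decay can be made concrete with a short Python sketch. The decay fractions of one-fifth after one year and one-eighth after two come from the Jacob et al. (2010) passage quoted above; the 10-point yearly gain and the function name `cumulative_gain` are hypothetical, and treating each gain as decaying independently is a simplifying assumption.

```python
from fractions import Fraction


def cumulative_gain(gains, persistence):
    """Total gain a student still carries at the end of the final year.

    gains[i] is the gain added by the teacher in year i.
    persistence[k] is the fraction of a gain that remains k years later.
    """
    last_year = len(gains) - 1
    return sum(gain * persistence[last_year - year]
               for year, gain in enumerate(gains))


# Layered model with complete persistence: every past gain counts in full.
FULL_PERSISTENCE = [Fraction(1), Fraction(1), Fraction(1)]
# Fractions reported in the quoted passage: 1/5 after one year, 1/8 after two.
DECAYING_PERSISTENCE = [Fraction(1), Fraction(1, 5), Fraction(1, 8)]

yearly_gains = [10, 10, 10]  # hypothetical: three teachers each add 10 points

print(float(cumulative_gain(yearly_gains, FULL_PERSISTENCE)))
print(float(cumulative_gain(yearly_gains, DECAYING_PERSISTENCE)))
```

Under these assumptions, simply adding up the three yearly gains gives 30 points, while the decaying version leaves 13.25, less than half of what the layered sum implies.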
By claiming the persistence of a teacher’s influence on student performance, Sanders is able to assume that access to three years of data lessens the statistical noise created by missing student test scores, socioeconomic status, or other factors that affect teacher effect scores (Sanders, Wright, Rivers, & Leandro, 2009). Missing data was a primary issue in the 1996 evaluation of TVAAS by Bock, Wolfe and Fisher. Their examination of the data quality showed that missing data could cause distortion to the TVAAS results and the “linkage from students to teachers is never higher than about 85 percent, and worse in grades 7-8, especially in reading” (p. 18). It is important, of course, that test scores of every student in every classroom in every school are accounted for and attached or linked to the correct teacher when computing teacher effect scores. Poor linking commonly occurs in schools, however, due to student absences, students being pulled out of class for special education, student mobility, and team teaching arrangements, just to name a few (Baker et al., 2010). Missing or faulty data contributes to teachers having incomplete sets of data points (student test scores) and “can have a negative impact on the precision and stability of value-added estimates and can contribute to bias” (Braun, Chudowsky & Koenig, 2010, p. 46). A small set of student test scores, for example, can be impacted by an overrepresentation of a subgroup of students (e.g., by socioeconomic status or disabilities), or that same set of scores may be significantly impacted by a single student with very different scores, high or low, from the other students in a class.
In addition to assuming multiple years of data will increase the number of data points per teacher sufficiently, Sanders assumes that missing data are random, but this is doubtful as students whose test data are missing are most often low-scoring students (McCaffrey et al., 2003, p. 83) who missed school or moved, entered personal data incorrectly, or took the test under irregular circumstances (e.g., special education modifications, make-up exams) and may have been improperly matched to teachers (Dunn, Kadane, & Garrow, 2003). Sanders attempts to capture all available data for teacher effect estimates by recalculating those estimates to include missing data from previous years that is eventually matched properly to the correct teacher (Eckert & Dabrowski, 2010).[i] This causes variability in past effect scores for the same teacher and increases skepticism about TVAAS accuracy, especially when such retrospective conversions come too late to alter teacher evaluation decisions based on the earlier version of scores. In his primer on value-added modeling, Braun (2005) pointed out that the Sanders claim that multiple years of data can resolve the impact of missing data “required empirical validation” (p. 13). No such validation has been forthcoming from the Sanders Team.
Ballou (2005) explained that imprecision arises when teacher effect scores are based on too few data points (student test scores linked to a particular teacher). The number of data points can be too few based on the number of years the data are collected, whether one, two, or three years. The number of data points can be reduced, too, by small class size, missing student data, classes with a large percentage of special education students whose scores do not count in teacher effect data, or students who have not attended a teacher’s class for at least 150 days of instruction, and by shifting teaching assignments for either grade level or subject area. Data points may be reduced, too, by team teaching situations whereby only one teacher on the team is linked to student data. Ballou’s (2005) research indicated further that the “imprecision in estimated effectiveness due to a changing mix of students would still produce considerable instability in the rank-ordering of teachers [from least effective to most effective]” (p. 18). Ballou recommended adjusting TVAAS to account for all of a teacher’s grade level and subject area data, as too few teachers teach the same grade level and same subject over a three-year period (p. 23). Ballou concluded, too, that only one year of student test data makes teacher effect data too imprecise to be meaningful or fair for its use in teacher evaluation.
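Why the randomness of missing data matters can be seen in a small Python sketch. The ten scores and the choice of which students go missing are hypothetical, and the sketch uses a simple class average rather than the TVAAS calculation; it only shows how an average built from the students who remain behaves under the two kinds of missingness.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def observed_mean(scores, missing_indexes):
    """Mean of the scores that remain after dropping the missing students."""
    kept = [score for index, score in enumerate(scores)
            if index not in missing_indexes]
    return mean(kept)


# Hypothetical class of ten students, sorted from lowest to highest score.
scores = [35, 40, 48, 55, 60, 62, 70, 75, 81, 90]

complete = mean(scores)
scattered = observed_mean(scores, {2, 7})    # two unrelated students missing
low_scorers = observed_mean(scores, {0, 1})  # the two lowest scorers missing

print(f"complete data:       {complete:.3f}")
print(f"scattered missing:   {scattered:.3f}")
print(f"low scorers missing: {low_scorers:.3f}")
```

With scattered missingness the class average barely moves; when the two lowest scorers are the ones missing, it shifts by about six points, a distortion that has nothing to do with the teacher.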
In 2009, McCaffrey, Sass, Lockwood and Mihaly published their research concerning the year-to-year variability in value-added measures applied to teachers assigned small numbers of students, such as special education teachers, since special education students are often exempted from taking the tests. A small number of student scores is easily swayed by extremely high or extremely low scores, resulting in extremes in teacher value-added scores, “so rewarding or penalizing the top or bottom performers would emphasize these teachers and limit the efficacy of policies designed to identify teachers whose performance is truly exceptional” (p. 601). Even though using multiple years of data helps reduce the variability in teacher effect scores, “one must recognize that even when multiyear estimates of teacher effectiveness are derived from samples of teachers with large numbers of students per year, there will still be considerable variability over time” (p. 601).
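The sensitivity of small classes to a single extreme score is simple arithmetic, sketched here in Python. The class sizes and gain values are hypothetical, and a plain average stands in for the far more elaborate value-added calculation.

```python
def mean_gain_with_outlier(class_size, typical_gain, outlier_gain):
    """Mean gain when every student but one shows `typical_gain`."""
    total = typical_gain * (class_size - 1) + outlier_gain
    return total / class_size


# Hypothetical: every student gains 5 points except one who drops 30.
TYPICAL_GAIN = 5
OUTLIER_GAIN = -30

for class_size in (5, 25):
    average = mean_gain_with_outlier(class_size, TYPICAL_GAIN, OUTLIER_GAIN)
    print(f"class of {class_size:>2}: mean gain = {average:.1f}")
```

The same student with the same bad test day turns a class of five into an apparent net loss (a mean gain of -2.0), while a class of twenty-five still shows a mean gain of 3.6.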
With these unresolved issues and deep skepticism related to test reliability, Sanders’ logic in justifying the use of value-added modeling for teacher accountability weakens, as in our slightly modified syllogism:
a) if student test scores are unreliable measures of student growth and
b) unreliable measures of student growth are the basis for calculating teacher effectiveness, then
c) test scores are unreliable measures for calculating teacher effectiveness, at least for high-stakes decisions concerning teachers’ livelihoods and schools’ existence.
[i] “[The current] year’s estimates of previous years’ gains may have changed as a result of incorporating the most recent student data. Re-estimating all years in the current year with the newest data available provides the most precise and reliable information for any year and subject/grade combination. Find district and school information at the following: TVAAS Public https://tvaas.sas.com/evaas/public_welcome.jsp, TVAAS Restricted: https://tvaas.sas.com/evaas/login.jsp” (Eckert & Dabrowski, 2010, p. 90).
The false assumption the 'reformers' make is that correlation is causation. VAM scores can vary wildly from one year to the next because the scores do not measure the value of the teachers. Instead, the scores more likely reflect the socio-economic levels of the students, which may vary from year to year. VAM is junk science based on wrong assumptions, and it has been debunked by the American Statistical Association.