Friday, October 23, 2009

National Academy of Sciences Releases Report: Will Duncan/Obama Listen?

Today, the National Academy of Sciences issued a report critiquing Obama/Duncan's "Race to the Top" agenda, affectionately known around here as the "Race off the Cliff," or "Duncan's Dumbo Circus." Below are some key snippets that, hopefully, Arne has time to read between his pick-up games and bogus "Listening Tour" appearances [all bolds mine; h/t to George Sheridan on ARN]:

The items on any test are a sample from some larger universe of knowledge and skills, and scores for individual students are affected by the particular questions included. A student may have done better or worse on a different sample of questions. In addition, guessing, motivation, momentary distractions, and other factors also introduce uncertainty into individual scores. When scores are averaged at the classroom, school, district, or state level, some of these sources of measurement error (e.g., guessing or momentary distractions) may average out, but other sources of error become much more salient. Average scores for groups of students are affected by exclusion and accommodation policies (e.g., for students with disabilities or English learners), retest policies for absentees, the timing of testing over the course of the school year, and by performance incentives that influence test takers’ effort and motivation. Pupil grade retention policies may influence year-to-year changes in average scores for grade-level cohorts. [p3]
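The item-sampling point in that passage is easy to demonstrate with a toy simulation. Everything below is invented purely for illustration: a hypothetical 500-item "universe" of knowledge, a student who truly knows 70% of it, and a 50-item test drawn from that universe.

```python
import random

random.seed(0)

# Hypothetical universe of 500 items; the student truly knows 70% of them.
universe = [1] * 350 + [0] * 150

def test_score(n_items=50):
    """Percent correct on one random sample of items from the universe."""
    return 100 * sum(random.sample(universe, n_items)) / n_items

# Give the same student 1,000 different 50-item tests.
scores = [test_score() for _ in range(1000)]
low, high = min(scores), max(scores)
print(f"True mastery: 70%; observed scores ranged from {low:.0f}% to {high:.0f}%")
```

Across repeated samples of questions, the same student's "percent correct" can swing by tens of points, which is precisely the measurement error the panel describes before motivation, guessing, and distraction are even considered.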

They even toss in a bit about the high-stakes testing regime and the narrowed curriculum:

Moreover, test results are affected by the degree to which curriculum and instruction are aligned with the knowledge and skills measured by the test. Educators understandably try to align curriculum and instruction with knowledge and skills measured on high-stakes tests, and they similarly focus curriculum and instruction on the kinds of tasks and formats used by the test itself. For these reasons, as research has conclusively demonstrated, gains on high-stakes tests are typically larger than corresponding gains on concurrently administered “audit” tests, and sometimes they are much larger. Improvements on the necessarily limited content of a high-stakes test may be offset by losses on other, equally valuable content that happens to be left untested.


We encourage the Department to pursue vigorously the use of multiple indicators of what students know and can do. A single test should not be relied on as the sole indicator of program effectiveness. This caveat applies as well to other targets of measurement, such as teacher quality and effectiveness and school progress in closing achievement gaps.

...regarding the use of NAEP as a measuring tool for RTTT:

Although BOTA is a strong advocate for the importance of NAEP in measuring U.S. educational progress, NAEP cannot play a primary role in evaluating RTT initiatives, a role that might be mistakenly inferred from the language in the Department’s proposal.

We propose using the NAEP to monitor overall increases in student achievement and decreases in the achievement gap over the course of this grant because the NAEP provides a way to report consistently across Race to the Top grantees as well as within a State over time. . . . (Section I, footnote 1, p. 37805)

It is necessary to be clear about the distinction between the requirements of an evaluation and the kind of “monitoring” that NAEP can provide. For the purposes of evaluating RTT initiatives, there are at least four critical limitations with regard to inferences that can be drawn from NAEP.

1. NAEP is intended to survey the knowledge of students across the nation with respect to a broad range of content and skills: it was not designed to be aligned to a specific curriculum. Because states differ in the extent to which their standards and curricula overlap with the standards assessed by NAEP, it is unlikely that NAEP will be able to fully reflect improvements taking place at the state level.

2. Although NAEP can provide reliable information for states and certain large school districts, it cannot, as presently designed, fully reflect the effects of any interventions targeted at the local level or on a small portion of a state’s students, such as are likely to be supported with RTT initiatives.

3. States are likely to undertake multiple initiatives under RTT, and NAEP results, even at the state level, cannot disaggregate the contributions of different RTT initiatives to state educational progress.

4. The specific grade levels included in NAEP (grades 4, 8, and 12) may not align with the targeted populations for some RTT interventions.

...and, hopefully, the authors of the recent piece of "crap" about value-added models that appeared in the LA Times are paying attention here:

The term “value-added model” (VAM) has been applied to a range of approaches, varying in their data requirements and statistical complexity. Although the idea has intuitive appeal, a great deal is unknown about the potential and the limitations of alternative statistical models for evaluating teachers’ value-added contributions to student learning. BOTA agrees with other experts who have urged the need for caution and for further research prior to any large-scale, high-stakes reliance on these approaches (e.g., Braun, 2005; McCaffrey and Lockwood, 2008; McCaffrey et al., 2003).

...and more concerns about value-added models:

Prominent testing expert Robert Linn concluded in his workshop paper: “As with any effort to isolate causal effects from observational data when random assignment is not feasible, there are reasons to question the ability of value-added methods to achieve the goal of determining the value added by a particular teacher, school, or educational program” (Linn, 2008, p. 3). Teachers are not assigned randomly to schools, and students are not assigned randomly to teachers. Without a way to account for important unobservable differences across students, VAM techniques fail to control fully for those differences and are therefore unable to provide objective comparisons between teachers who work with different populations. As a result, value-added scores that are attributed to a teacher or principal may be affected by other factors, such as student motivation and parental support.

...and more concerns:

In addition to these unresolved issues, there are a number of important practical difficulties in using value-added measures in an operational, high-stakes program to evaluate teachers and principals in a way that is fair, reliable, and valid. Those difficulties include the following:

1. Estimates of value added by a teacher can vary greatly from year to year, with many teachers moving between high and low performance categories in successive years (McCaffrey, Sass, and Lockwood, 2008).

2. Estimates of value added by a teacher may vary depending on the method used to calculate the value added, which may make it difficult to defend the choice of a particular method (e.g., Briggs, Weeks, and Wiley, 2008).

3. VAM cannot be used to evaluate educators for untested grades and subjects.

4. Most data bases used to support value-added analyses still face fundamental challenges related to their ability to correctly link students with teachers by subject.

5. Students often receive instruction from multiple teachers, making it difficult to attribute learning gains to a specific teacher, even if the data bases were to correctly record the contributions of all teachers.

6. There are considerable limitations to the transparency of VAM approaches for educators, parents and policy makers, among others, given the sophisticated statistical methods they employ.
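Point 1 in the list above, the year-to-year instability of value-added estimates, can be illustrated with a back-of-the-envelope simulation. All numbers here are hypothetical; in particular, the assumption that single-year noise is roughly as large as the spread in true teacher effects is only an illustrative stand-in for the reliabilities discussed in the studies the report cites.

```python
import random

random.seed(42)

def quintile(score, scores):
    """Return the performance quintile (0-4) of `score` within `scores`."""
    rank = sorted(scores).index(score)
    return min(4, rank * 5 // len(scores))

n_teachers = 100
# Hypothetical "true" teacher effects, in test-score points.
true_effect = [random.gauss(0, 1) for _ in range(n_teachers)]

def yearly_estimate(effects, noise_sd=1.0):
    """A one-year value-added estimate: true effect plus classroom-level noise."""
    return [e + random.gauss(0, noise_sd) for e in effects]

year1 = yearly_estimate(true_effect)
year2 = yearly_estimate(true_effect)

# Count teachers whose performance quintile changed between the two years.
moved = sum(
    1 for i in range(n_teachers)
    if quintile(year1[i], year1) != quintile(year2[i], year2)
)
# With 100 teachers, the count is also a percentage.
print(f"{moved}% of teachers changed quintile between years")
```

Even though every teacher's "true" effectiveness is held constant between the two simulated years, a large share of them jump performance categories purely because of noise, which is the pattern McCaffrey, Sass, and Lockwood document with real data.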

There's also this little line, which teachers should take very seriously. We know high-stakes, standardized tests do not measure what our students know:

The use of test data for teacher and educator evaluation requires the same types of cautions that are stressed when test data are used to evaluate students

But we all know reformers, philanthropists, and bureaucrats toss caution out the window when they get their dirty little hands on data, and they're far too willing to accept and use test scores as indications of real learning and teaching.
And here's another piece of wisdom that all teachers know, but few politicians or edupreneurs can grasp:

The choice of appropriate assessments for use in instructional improvement systems is critical. Because of the extensive focus on large-scale, high-stakes, summative tests, policy makers and educators sometimes mistakenly believe that such tests are appropriate to use to provide rapid feedback to guide instruction. This is not the case.

...and a bit about "rapid-time" turnaround systems:


In addition, BOTA urges a great deal of caution about the nature of assessments that could meet the Department’s definition for inclusion in a “rapid-time” turnaround system.

Rapid-time, in reference to reporting and availability of school- and LEA-level data, means that data is available quickly enough to inform current lessons, instruction, and related supports; in most cases, this will be within 72 hours of an assessment or data gathering in classrooms, schools, and LEAs. (Section IV, p. 37811)

If the Department is referring to informal, classroom assessment methods that can be scored and interpreted by the teacher, a 72-hour turnaround is a reasonable goal. It is important to provide teachers with a better understanding about the types of assessment they can use informally in their classrooms.

If we want to provide better tests than the fill-in-the-bubble type (or its click-the-bubble digital counterpart), teachers are the only ones who can perform this work.
The report also urges caution when making international comparisons (an ode to the late Gerald Bracey):

In addition to the aspiration of creating common assessments, the Department’s proposal also notes the objective of creating assessments that are “internationally benchmarked.” There are several different ways this phrase might be interpreted. However, for assessment results that could be directly compared to their international counterparts, we note that the difficulties that arise in comparing test results from different states apply even more strongly for comparing test results from different countries. For making comparisons internationally, the problems with differing standards, assessments, instruction, curricula, and testing regimes are magnified. In addition, international test comparisons raise difficult problems related to language translation. Because of these challenges, the Department should think carefully about the kind of “international benchmarking” that it wants to encourage states to pursue.
And the conclusion:

In closing, we return to the beginning of this letter, with the importance of rigorously evaluating the innovations supported by RTT funds. Careful evaluation of this spending should not be seen as optional; it is likely to be the only way that this substantial investment in educational innovation can have a lasting impact on the U.S. education system. BOTA urges the Department to carefully craft a set of requirements for rigorous evaluation of all initiatives supported by RTT funds.

There's a lot to chew on here.
Interestingly, the report neglects to mention charter schools, a major component of the RTTT guidelines. Strange.


  1. Hi, my name is Zenneia McLendon and I am writing from the National Academies. Thank you so much for posting on our recent report, "Letter Report to the U.S. Department of Education on the Race to the Top Funds". Just wanted to let your readers know that this report is available to download for free; we encourage you all to read it in hopes it will enhance the conversation.

  2. Anonymous 9:44 PM

    Since BOTA is focused exclusively on testing and assessment issues, you should not expect it to weigh in on issues related to charter schools. I was surprised that BOTA decided to weigh in on RTT at all; this is pretty unusual in my experience. I interpreted it as an indication of how alarmed prominent and thoughtful experts in educational measurement are by Duncan and company's ignorance about standards and assessment and their generally amateurish approach to reform. I share this alarm and am angry at how poorly the Duncan crew are serving President Obama.

  3. Anonymous - thank you for the charter clarification and additional comments.

  4. @ Anonymous 9.44....

    Duncan and crew are not serving President Obama... they are serving big business interests represented by the Broad, Gates, Milken and Stuart Foundations et al... there's a ton of money to be made in privatizing education, charter schools, standardized testing (and the remedial work that 'needs' to be done to improve those all important test scores).... and then there's also the philosophy behind these 'reforms', best and most openly expressed by Mike Milken in that they are interested only in turning out a workforce that is appropriately skilled for business needs and an expanded consumer base...

    There is absolutely no concern here for the welfare of individuals, this country or society, or the world globally...

    for more information, see this blog and others around the country, such as