Theory on testing fails the grade
October 28, 2005
Back in 2002, President Bush predicted "great progress" once schools began administering No Child Left Behind's mandated annual testing. Education Secretary Rod Paige equated opposition to NCLB testing with "dismissing certain children" as "unteachable."
That same week The New York Times documented "recent" scoring errors affecting "millions of students" in "at least twenty states," which amounted to several million pretty good alternate reasons for opposing NCLB testing.
There's nothing wrong with assessing what students have learned. Parents, colleges, and employers can track how kids are doing, and teachers can identify which areas need more teaching. That's why I give quizzes and tests and one reason students write essays.
Of course, everybody knows that some teachers are tougher graders than others. Standardized testing is supposed to help gauge one teacher's "A" compared to another's so we can compare students from different schools.
This works fine as long as we recognize that all tests have limitations. For example, for years my students took a nationwide social studies test that required them to identify the New Deal. The problem was the seventh graders who took the test hadn't studied U.S. history since the fifth grade, and FDR usually isn't the focus of American history for ten-year-olds. He also doesn't appear in eighth grade history class until May, about a month after eighth graders took the test.
Multiply our FDR glitch by the thousands of curricula assessed by nationwide testing. Then try pinpointing which schools are succeeding and failing based on the scores those tests produce. That's what No Child Left Behind pretends to do.
Test designers will tell you they've eliminated inconsistencies by "aligning" their tests with newly drafted grade level expectations. The trouble is these new objectives are often hopelessly vague, arbitrarily narrow, or so unrealistic that they're pretty meaningless. That's when they're not obvious and the same as they always were.
Even if we could perfectly match curricula and test questions, modern assessments would still have problems. That's because most are scored according to guidelines called rubrics. Rubric scoring requires hastily trained scorers, who commonly aren't teachers or even college graduates, to determine whether a student's essay "rambles" or "meanders." Believe it or not, that choice represents a 25 percent variation in the score. Or how about distinguishing between "appropriate sentence patterns" and "effective sentence structure," or language that's "precise and engaging" versus "fluent and original"?
These are the flip-a-coin judgments at the heart of most modern assessments. Remember that the next time you read about which schools passed and failed.
Unreliable scoring is one reason the Government Accountability Office condemned "comparisons between states" as "meaningless." It's why CTB/McGraw-Hill recalled and rescored 120,000 Connecticut tests after the scores were released. It's why ten percent of the candidates taking the 2003 Educational Testing Service Praxis teacher licensing exam incorrectly received failing scores. A Brookings Institution study found that "50 to 80 percent of the improvement in a school's average test scores from one year to the next was temporary" and "had nothing to do with long-term changes in learning or productivity." A senior RAND analyst warned that today's tests aren't identifying "good schools" and "bad schools." Instead, "we're picking out lucky and unlucky schools."
The New England Common Assessment Program, administered to all students in Vermont, Rhode Island, and New Hampshire, offers a representative glimpse of the cutting edge. NECAP is heir to all the standard problems with test design, rubrics, and dubiously qualified scorers.
NECAP security is tight. Tests are locked up, all scrap paper is returned to headquarters for shredding, and testing scripts and procedures are painstakingly uniform. Except that on the mathematics exam, each school gets to choose whether its students can use calculators.
Whether or not you approve of calculators, how can you talk with a straight face about a "standardized" math assessment if some students get to use them and others don't? Even more ridiculous, there's no box to check to show whether you used one, so the scoring results don't even differentiate between students and schools that did and didn't.
Finally, guess how NECAP officials are figuring out students' scores. They're asking classroom teachers. Five weeks into the year, before we've even handed out a report card to kids we've just met, we're supposed to rate each student's "level of proficiency." Our ratings, which rest on distinguishing with allegedly statistical accuracy between "extensive gaps," "gaps," and "minor gaps," are a "critical piece" and "key part of the NECAP standard setting process."
Let's review. Because classroom teachers' grading standards aren't consistent enough from one school to the next, we need a standardized testing program. To score the standardized testing program, every teacher has to estimate within eight percentage points what their students know so test officials can figure out what their scores are worth and who passed and who failed.
If that makes sense to you, you've got a promising future in education assessment. Sadly, our schools and students don't.
Peter Berger teaches English at Weathersfield Middle School. Poor Elijah would be pleased to answer letters addressed to him in care of the editor.
"A child's learning is the function more of the characteristics of his classmates than those of the teacher." James Coleman, 1972
". . . a pupil attitude factor, which appears to have a stronger relationship to achievement than do all the 'school' factors together, is the extent to which an individual feels that he has some control over his own destiny." James Coleman, 1966