[From the Oct. 27, 2003 issue of TIME magazine]
Inside the New SAT Test
America's college gatekeeper is changing dramatically. Get ready for advanced algebra, an essay—and, yes, the return of grammar. An exclusive look at the new exam—and how it may hurt some students' scores
By JOHN CLOUD
Sunday, Oct. 19, 2003
Three hours of misery are apparently not enough. Now the makers of the SAT want to shape what kids learn throughout four years of high school. True, students have always had to brush up on vocabulary and take practice tests before the SAT, but now the College Entrance Examination Board, which owns the test, is developing the "New SAT," an exhaustive revision largely intended to mold the U.S. secondary-school system to its liking.
The College Board wants schools to produce better writers, so the New SAT will require an essay. The board thinks grammar is important, so the new test will ask students to fix poorly deployed gerunds and such. To encourage earlier advanced-math instruction, the New SAT will go beyond basic algebra and geometry for the first time to include Algebra II class material (remember negative exponents, such as q⁻³?). The board, a powerful group of 4,300 educational institutions that includes most of America's leading universities, has undertaken an unprecedented effort to push local school districts to alter their curriculums accordingly.
In short, the dreaded SAT could actually help produce a national curriculum, a sweeping education reform enacted without the passage of a single law. In the process, the test itself will have to change to include questions more like classroom exercises and less like—well, less like SAT items. Two types of SAT questions are vanishing: those frustrating little analogies ("somnolent is to wakeful" as "graceful is to clumsy") and the quirky math items that ask you to compare two complex quantities. Instead of the venerable math and verbal sections, the test will have three segments that will be more familiar to Americans: the three Rs, reading, writing and arithmetic. (Hence a perfect score will go from 1600 to 2400.)
At first blush, the changes seem healthy enough. But inevitably, some students will do better, and some worse, on the new test. Girls tend to outperform boys on writing exams, so their overall scores could benefit from the addition of the new writing section. Boys usually score higher on the math section, but the new exam will contain fewer of the abstract-reasoning items at which they often excel. The elimination of analogies may exacerbate the black-white SAT score gap, since the gap is somewhat smaller on the analogy section than on the test as a whole, according to Jay Rosner, executive director of the Princeton Review Foundation.
More broadly, students who attend failing schools could suffer as the SAT morphs from a test of general-reasoning abilities into a test of what kids learn in school. "There's a danger that making it too curriculum-dependent will actually increase overall score gaps for some minority groups," says Rebecca Zwick, a former chair of the College Board's own SAT Committee. "Because we have such huge disparities in the quality of schooling in the country, kids who go to crummy schools may be disadvantaged."
The world of standardized testing has its own language and history; an entire branch of science, psychometrics, is devoted to test design and analysis. But the tiny discipline touches most Americans' lives at some point. Psychometricians help devise tests for firefighters, lawyers, architects, teachers and, of course, kids. Virtually every American child takes a standardized test at some point, and yet there is widespread confusion over what tests do and do not measure. Insta-experts from the media and from antitesting groups often repeat fallacies: blacks do better in college than their SAT scores predict (actually, for reasons that aren't well understood, blacks tend to do worse in college than matched groups of whites with the same scores); how well you do on the SAT will determine how well you do in life (SAT scores have little power to predict earnings).
Six months ago, TIME asked the College Board if we could sort out some of these conundrums by following the development of the New SAT from inside. To our surprise, board president Gaston Caperton III agreed. Renouncing his predecessors' often combative P.R. approach, Caperton allowed me to attend a series of meetings at which New SAT items were previewed and debated. An experienced politician—he was elected Governor of West Virginia in 1988 and '92—Caperton knows the old adage about making laws and sausage. Designing tests is also a messy process, and he deserves credit for laying it bare. But while the production of New SAT questions has entailed some expected debates—Is this item too hard? Is that one biased against women?—I saw something quite unexpected as well: Caperton is changing the very nature and purpose of the SAT.
At his insistence, the goal of influencing school curriculums has become the overriding preoccupation of the new test's developers. Caperton speaks with less enthusiasm about the traditional mission of the SAT: to help colleges predict how well applicants will do if they are admitted. To be sure, Caperton believes the notion (actually, he's staking his career on it) that the SAT can both improve high schools and still remain useful to colleges as a predictor. But the first goal is a political aim; the second, a psychometric one. And Caperton has surrounded the New SAT with dozens of educators who aren't schooled in psychometrics.
Which raises the possibility that Caperton may, in his well-intentioned effort to ameliorate schools, ruin his main instrument for doing so. "I'm worried they may be asking one test to carry too many buckets of water," says Fred Hargadon, a former College Board vice president. Caperton believes the SAT should be a tool of social change as well as of social measurement—that it should serve communitarian ends even as it tries to give reliable, valid scores to individual kids and colleges. "This (new) test is really going to create a revolution in the schools," he says. But can the SAT be engineered to fulfill all his ambitions?
So far, most Americans know little of the machinations under way at the College Board. Even test coaches have heard only the barest outlines of what the New SAT will look like. But this is how powerful the test has become: many schools are already worrying about how to change their curriculums to fit the new exam—even though the College Board has yet to finish a first draft of the first test booklet. The maiden administration isn't until March 2005. (The name of the test will be, simply, SAT. The letters now stand for nothing; the "New" is a temporary marketing term.)
In Georgia, Clarke County schools' director of assessment Ginger Davis-Beck says that in anticipation of the revamped test, her district might split Grade 10 into a semester of geometry and a semester of Algebra II for students who didn't get to Algebra I until Grade 9; it would be an unorthodox move that could require hiring more teachers. In Ohio, curriculum specialist Jennifer Manoukian of the Sycamore school system, outside Cincinnati, feels uneasy about the prospect of grammar questions. "Research shows that direct instruction of grammar is not beneficial," she says. "The correlation between that kind of grammar instruction and student performance in real life is very low. Yet here we have a test seeming to say that you have to go back to some old methods."
Some educators also fear that SAT prep will move from helping students improve their vocabulary to teaching them how to scribble hasty compositions in 25 or 30 minutes (the College Board hasn't settled on an essay time limit). "I'd hate to see all our English teachers begin to teach a formulaic style of writing to prepare students for this particular test," says college adviser Alice Kleeman of Menlo-Atherton High School in Atherton, Calif. "Adjustments might have to be made so that students can practice the type of essays expected on the SAT while reinforcing the idea that this isn't the only type of writing there is."
College Board vice president Chiara Coletti says the board has received "far more positive than negative" responses to the new test. But she adds that most people are just beginning to understand what will appear on it. Once they do, a much richer, knottier conversation about the New SAT will probably begin. For decades, the purpose of the test has been to measure students' general-reasoning abilities, not their specific knowledge of algebra or the extent to which they have written practice essays. Caperton's feat is actually twofold: not only has he begun to shape a U.S. curriculum, but he has also handed victory to the achievement side of a long, contentious argument about whether admissions tests should assess aptitudes or achievements. For most of its history, the SAT was, at its heart, an aptitude test; now it's becoming more like its competitor, the ACT, the nation's biggest achievement test.
What's the difference? Achievement tests gauge mastery of subject matter; your U.S. history final was an achievement test. The SAT IIs are a battery of achievement tests the College Board offers in 18 subjects, including physics and Korean. Aptitude tests are harder to define. Many people seem to think of aptitude exams in general—and the old (or current) SAT in particular—as IQ tests, a notion subtly promulgated by Nicholas Lemann, the new dean of Columbia's Graduate School of Journalism, in his influential anti-SAT book, The Big Test (1999). Writing about early versions of the SAT, Lemann points out that "the bulk of the test was devoted to word familiarity, the eternal staple of intelligence testing." He correctly notes that the exam directly descended from IQ tests given by the U.S. Army in the early part of the past century.
But he also says that the "SAT has changed remarkably little over the years," which is true only in the most basic sense: it still examines verbal and mathematical skills. Even so, the question types have changed dramatically. The first Scholastic Aptitude Test, which was given on June 23, 1926, included "Artificial Language" and logic sections that would seem bizarre to today's SAT takers. (A practice question asked students to translate a gibberish sentence—"OK entcola kon"—based on a given lexicon.) Similarly, IQ tests look quite different from the SAT. The Wechsler Adult Intelligence Scale, the most widely used IQ test, asks funny little questions like "In what two ways is a lamp better than a candle?"
If IQ tests try to probe innate abilities, and if achievement tests rate classroom learning, aptitude tests assay something in between—developed abilities. Developed abilities are those nurtured through schoolwork, reading, doing crosswords, soaking up the arts, debating politics, whatever. These aren't inborn traits but honed competencies. Whereas early psychometricians, many of them racist, propagated what Lemann calls the dipstick theory—the idea that a test score is like a mark on a dipstick showing the raw amount of intelligence in your mental oil tank—the field outgrew that simplistic notion at least a generation ago. "I don't think anyone believes the SAT or even pure (IQ) tests are—or have ever been—a pure measure of intelligence," says Zwick, the former SAT Committee chair and author of Fair Game? The Use of Standardized Admissions Tests in Higher Education (2002). "There is not a test that is completely independent of environmental experience or experience in the schools."
But in the psychometric world, developed abilities are distinct from school achievements. To understand the difference, consider a sports metaphor: if you learn to hit free throws 90% of the time, that's a stellar basketball achievement. But what if you're so out of shape that you can't run up and down the court? Then your achievement won't matter much in an actual game. Your level of physical fitness is a developed ability: it's both an innate skill that helps determine how well you play basketball and an outcome of playing lots of basketball. Similarly, the more you challenge yourself intellectually, the more you condition your brain; your academic achievements are less impressive if you don't have the conditioning to build upon them. As the SAT becomes more an assessment of one's achievements, it will less sensitively gauge these underlying skills.
To be sure, it was never a perfect measure of developed abilities, which by their nature are more difficult to appraise than, say, how many plant names you memorized in botany. But what happens when you move away from trying to assess aptitude? Consider the reading section of the New SAT. In May, the College Board's Reading Development Committee decided that SAT item writers should feel free to use literary terminology in their questions for the reading section. Words that one would typically use only in a literature class—simile, personification—had always been avoided on the SAT, on the theory that a student should get credit for being able to comprehend the phrase "Youth is wasted on the young" even if he doesn't know to call it a paradox. No more.
Although the committee decided that the most arcane lit terms (metonymy, for instance) won't appear on the SAT, terms like simile are now fair game. The use of technical language will also increase in math. For instance, in the past, an SAT item might have stipulated that group A has 10 members and group B has 10 + 5x members, where x = 3. What's the total number in both groups? Add the 10 from group A and the 10 + (5 × 3) = 25 from group B. You get 35. But on the New SAT, the question might read, "What is the union of sets A and B?" Union and set are terms of art for mathematicians; the union of two sets is everything that belongs to either one. The answer is still 35 (assuming no one belongs to both groups), but you must know the jargon to get it.
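The arithmetic behind the old-style item, and why the set-language version gives the same count, can be checked in a few lines of Python. The member IDs below are hypothetical, invented only so the groups can be modeled as sets, and the sketch assumes no one belongs to both groups; that assumption is what makes the size of the union equal the simple sum.

```python
# Figures from the example: group A has 10 members, group B has 10 + 5x, x = 3.
x = 3
size_a = 10
size_b = 10 + 5 * x  # 25

# Old-style wording: just add the two group sizes.
total = size_a + size_b

# Set-language wording: model members as two disjoint sets of
# hypothetical IDs and take their union.
set_a = set(range(size_a))                   # IDs 0..9
set_b = set(range(size_a, size_a + size_b))  # IDs 10..34
union = set_a | set_b

# In general |A ∪ B| = |A| + |B| - |A ∩ B|; here the intersection is empty.
assert len(union) == len(set_a) + len(set_b) - len(set_a & set_b)

print(total, len(union))  # 35 35
```

If the two groups shared members, the union would be smaller than 35, which is exactly the distinction the set vocabulary encodes.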
In May, I saw a statistical analysis of one math question that had been rewritten to include a specialized math term. We can't print the question because it may appear on a future SAT, but I can report this: when the specialized term was added, the percentage of students who got the right answer in a field trial fell from 68% to 21%, a staggering decline of 47 percentage points. Who are all those students who could do the math but didn't know the specialized language?
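A drop measured in percentage points is not the same as a relative drop. Using only the two field-trial figures reported above, this short Python sketch computes both; the variable names are illustrative.

```python
# Share of field-trial takers answering correctly, as reported above.
before = 68.0  # percent, original wording
after = 21.0   # percent, wording with the specialized term

point_drop = before - after                # difference in percentage points
relative_drop = point_drop / before * 100  # decline as a share of the original rate

print(f"{point_drop:.0f} percentage points")
print(f"{relative_drop:.0f}% relative decline")
```

Measured either way, the share answering correctly fell to less than a third of its original level.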
I put that question to David Lohman, a University of Iowa psychology professor who has studied the differences between achievement and aptitude tests. In a paper that will be published in the forthcoming book Rethinking the SAT, Lohman analyzed test scores for 6,300 11th-graders who in 2000 took two very different tests, the Iowa Tests of Educational Development (ITED) and the Cognitive Abilities Test (CogAT), a standardized exam first published in 1971 that Lohman helped revise two years ago. The ITED is your basic achievement test: it assesses how well kids have learned such class exercises as setting up science experiments, reading social studies passages, and spelling. The CogAT, by contrast, is a test that measures verbal, quantitative and figural reasoning abilities, irrespective of any one curriculum. (In the quantitative section, for instance, a question asked students to figure out the next number in the following series: 2, 7, 11, 14, 16. You can get the answer without knowing much math. Notice that the gaps between the numbers shrink by one each time: 5 (7 − 2), 4 (11 − 7), 3 (14 − 11), 2 (16 − 14). The next gap is 1, so the answer is 17.)
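The reasoning the CogAT item calls for can be written out as a tiny Python function. The name `next_term` and the rule it encodes (each gap between terms is one smaller than the gap before it) are assumptions drawn from this single example; the sketch shows the pattern, not how the test itself is built or scored.

```python
def next_term(series):
    """Extend a series whose successive differences shrink by one.

    Hypothetical helper: for 2, 7, 11, 14, 16 the differences are
    5, 4, 3, 2, so the next difference is taken to be 1.
    """
    diffs = [b - a for a, b in zip(series, series[1:])]
    # Check the assumed pattern: each difference is one less than the last.
    if any(d1 - d2 != 1 for d1, d2 in zip(diffs, diffs[1:])):
        raise ValueError("differences do not decrease by one each step")
    return series[-1] + diffs[-1] - 1


print(next_term([2, 7, 11, 14, 16]))  # 17
```

Nothing here requires algebra: the solver only has to notice how the gaps behave, which is the kind of curriculum-independent reasoning the item is meant to probe.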
When he compared ITED and CogAT scores by race, Lohman found something surprising to those outside his field: the gap between white and minority students was smaller on the reasoning test than on the achievement test. Whites did about the same on both exams, but the percentage of blacks who scored reasonably well (above the 70th percentile) was higher on the CogAT. Others have replicated such findings by comparing achievement and reasoning tests in earlier grades; one theory as to why minorities often score higher on the latter is that they attend poor schools that leave their potential untapped. "Indeed," writes Lohman in Rethinking the SAT, "the problem with the current version of the SAT" (which continues to show a racial score gap) "may not be that it is an aptitude test, but that it is not enough of an aptitude test."
So why is it becoming even less of one? Largely because Richard Atkinson, president of the University of California—the College Board's biggest client—wanted it to. Board president Caperton surely has his own ambitions, but it's unlikely he would have sought such radical changes if Atkinson hadn't spoken out against the SAT. In a February 2001 speech in Washington, Atkinson recommended that his university stop asking its 76,000 yearly applicants for SAT scores. It's hard to overstate the gravity of this moment for the College Board. If U.C. had followed through on the recommendation, the board could have lost a huge pool of students, who pay $28.50 each to take the SAT.
In his 2001 speech, Atkinson called for U.C. to "require only standardized tests that assess mastery of specific subject areas rather than undefined notions of 'aptitude.'" Why the switch? "Last year," he said, "I visited an upscale private school and observed a class of 12-year-old students studying verbal analogies in anticipation of the SAT. I learned that they spend hours each month, directly and indirectly, preparing for the SAT, studying long lists of analogies such as 'untruthful is to mendaciousness' as 'circumspect is to caution.' The time involved was not aimed at developing the students' reading and writing abilities but rather their test-taking skills."
Like so many disconcerted 11th-graders, Atkinson had been driven round the bend by analogies, which, not coincidentally, are banished from the New SAT. But some academics are now offering an elegy for the analogy: "Analogical thinking is at the very foundation of how we make use of old knowledge to understand new things," says Lohman. "It may take a long time to understand how our solar system is set up, but if someone could use that information to help you understand the structure of an atom, it speeds the process up ... When we learn, we have to do this again and again. Students listen to a lecture and say, 'How did that relate to what I know?'" O.K., but isn't Atkinson right that students should spend more time reading than studying word lists? Probably, though it depends on what they read and how well they study the word lists. To this day, I remember learning the word apathy on a list I studied while preparing for the SAT. Because I had gone to a rural Arkansas junior high that assigned a romance novel in eighth-grade lit, poring over the word list three years later was actually helpful.
A cognitive psychologist, Atkinson has no specialized training in psychometrics, though he has researched mathematical models of memory. He says he favors achievement over aptitude tests partly because his university's research shows that the SAT II subject-based tests are just as good at predicting success at U.C. as the regular SAT. "When I saw that data," he says, "that was the nail in the coffin." But according to an exhaustive 2002 College Board study, the most accurate predictor of success in college—at U.C. and everywhere else—is a combination of high school grades, SAT scores and SAT II scores. The changes Atkinson has wrought may alter instruction at the "upscale private school" he talked about in his speech, but they may be corrosive, psychometrically speaking, for the rest of the nation.
Consider the new reading section of the SAT, which will feature, for the first time, at least one fiction passage on every test. In January, after they began perusing novels to find excerpts, text hunters at Educational Testing Service (ETS), the Princeton, N.J., firm that the College Board pays to write SAT questions, put together a list of books to be avoided when picking passages. On the list were 40 or so titles often assigned in good English classes—novels such as Animal Farm, Catch-22 and Native Son. ETS had a solid psychometric rationale for shunning the books: reading-comprehension questions should measure a student's ability to analyze something new, not something already assigned in English class.
However, when the board's Reading Development Committee met in May, its chair, retired English teacher Joan Vinson of Dallas, argued against excluding those books. "These books are included in some of the best literature out there," Vinson said. "Also, if we're trying to align the SAT more with school curricula, that's something we can't do if we exclude these books." Other committee members heartily agreed that passages from William Faulkner and James Joyce—authors typically assigned only in the best schools—should also be considered for inclusion. Students who have already read these authors will have a clear advantage over those who haven't.
Which isn't a bad thing if you want to encourage students to read great books (though some may object to a private group like the College Board deciding which books). But now you're measuring not just reading ability but also the achievement of having plowed through As I Lay Dying. At ETS, measuring anything beyond developed ability used to be considered noise that disrupted the clear sound of a score. Psychometricians try to screen out all kinds of noise—questions that ask about subways, for instance, could be excluded because rural kids may not be familiar with them. Questions showing even the vaguest bias are excised; you will never find a woman measuring cups of flour in an SAT question. The concern is that girls who read such a question will be distracted by the implicit sexism, and so their answer will reflect not their ability but their distraction—that's noise.
But other kinds of noise are now to be allowed. Take the writing section, which will be divided between multiple-choice questions on grammar and style, and an essay students must write on an assigned topic. Historically, the SAT has had only multiple-choice items. As Lemann writes of the early rationale for the SAT, "Tests that require a student to write essays ... are highly susceptible to the subjective judgment of the grader and to the mood of the taker on the day of the test, so they have low reliability."
Reliability is a measure of a test's precision from one administration to the next: a gauge of how much noise, or measurement error, it has eliminated. The standard error of measurement for a typical SAT is about 30 points for each of the math and verbal sections. That's why the College Board tries to get students and admissions officers to think of scores not as pinpoints but as ranges: if you get 510 on the SAT's math section, your "true" score most likely lies somewhere between 480 and 540.
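Turning a score into a range is simple arithmetic: go one standard error of measurement (SEM) down and one up. The Python sketch below uses the SEM figures reported in this article (about 30 for a math section, 41 in field trials of the writing section); the function name is hypothetical. Under the usual textbook assumption that measurement error is roughly normal, a band of plus or minus one SEM captures the true score about two times in three, which is why the range is a likelihood rather than a guarantee.

```python
def score_band(score, sem, k=1):
    """Return (low, high): the score plus or minus k standard errors of measurement."""
    return score - k * sem, score + k * sem


print(score_band(510, 30))  # math section example: (480, 540)
print(score_band(670, 41))  # writing section, field-trial SEM: (629, 711)
```

Widening the band to two SEMs (`k=2`) would cover the true score far more often, at the cost of a range too broad to be very informative.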
Thirty points in either direction is a pretty big swing, but scores on the writing section will be even less reliable: field trials of the New SAT estimate a standard error of measurement of 41 points. That means a kid who gets a 670 may "really" be in the élite reaches of the 700s—or in the more average environs of the low 600s. There are two reasons for the writing test's imprecision: first, the multiple-choice component of the test will be just 20 to 30 minutes, compared with 70 minutes each for math and reading. Less time means fewer questions, and it's harder to wring out measurement error with a small number of items. (Think about it this way: if you taste only one dish served by a chef, you can't judge him with as much precision as if you eat everything on the menu.) Even worse, each test will feature just one essay topic; if you retake the test and get a topic you really love, your score could shoot up—a clear example of low reliability.
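Why fewer items means more measurement error can be made concrete with two standard psychometric formulas: the Spearman-Brown prophecy formula, which predicts how reliability changes when a test is lengthened or shortened with comparable items, and SEM = SD × √(1 − reliability). Neither formula is attributed to the College Board here. The score-scale SD of 100 and the reliability of 0.91 for a 70-minute section are made-up inputs chosen for illustration, and the sketch assumes the shorter section's items behave like the longer section's; the point is the direction of the effect, not a prediction of actual SAT figures.

```python
import math


def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length.

    length_factor = new length / old length (0.5 means half as many
    comparable items).
    """
    k = length_factor
    return k * reliability / (1 + (k - 1) * reliability)


def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


# Hypothetical inputs, for illustration only.
SD = 100
FULL_MINUTES = 70
full_reliability = 0.91

for minutes in (70, 30, 20):
    r = spearman_brown(full_reliability, minutes / FULL_MINUTES)
    print(f"{minutes:>2} min: reliability {r:.2f}, SEM about {sem(SD, r):.0f} points")
```

With these assumed inputs, cutting the section from 70 minutes to 20 or 30 lowers the predicted reliability and pushes the SEM well above 30 points, the same direction as the writing section's field-trial estimate.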
The other reason the writing test will be less reliable is that human beings, not machines, will grade the essay. In June, I participated in a mock grading session with members of the College Board's writing-development committee. We read 15 essays by kids who had taken a pretest; they had been given 25 minutes to write on a topic I can't reveal, since it may appear on a future SAT. We scored the essays on a scale of 1 to 6, 1 meaning "very poor" organization and development and 6 meaning the student organized her thoughts, displayed "facility" with language and "insightfully addressed the writing task." Such standards are quite rubbery, as we discovered: of the essays we read, we 15 readers uniformly agreed on a grade for none. On most of the essays, the lowest score was 3 full points away from the highest.
Our grading also rewarded the blandest essays. I gave a 5 to a kid who had written a funny, subtle first-person account of a friend who had slacked off his studies and begun dressing "like a pimp" in order to impress the cool kids. The other graders gave the essay 2s and 3s; there was one 4. (Our scores didn't count for anything. On a real test, the raw 1 to 6 score will be combined with the raw score from the multiple-choice grammar segment and translated into an overall writing score on the traditional 200-to-800 scale.)
College Board officials are acutely worried about the subjectivity. Parents have already called to say they don't understand how a 3 differs from a 4. In response, the board is conducting further studies of the test and developing more consistent, less pliable 1-to-6 scoring points. Graders hired by an Iowa City, Iowa, company, Pearson Educational Measurement, will actually score the essays. Pearson trains its scorers to follow the 1-to-6 guide closely. Two of them will read each essay, and on the basis of the firm's experience with exams in Texas and elsewhere, they will disagree only 30% of the time. When they do, a third—and, if necessary, fourth—scorer will resolve disputes. It's a thorough system, but it will be expensive. The pressure to read fast—and to reward competent but formulaic essays—will be massive.
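The two-reader system with third and fourth readers as tiebreakers can be sketched as a small function. The rule below is an assumption made for illustration, not the College Board's or Pearson's documented procedure: the function name is hypothetical, two scores count as agreeing if they are within one point, readers are added one at a time up to four, and the final score is the average of the closest-agreeing pair.

```python
from itertools import combinations


def resolve(scores, tolerance=1, max_readers=4):
    """Hypothetical essay adjudication on the 1-to-6 scale.

    scores: ratings in the order readers would be assigned.
    Returns (final_score, readers_used), or (None, readers_used) if no
    two readers agree within `tolerance`.
    """
    for n in range(2, min(len(scores), max_readers) + 1):
        pool = scores[:n]
        # Closest-agreeing pair among the readers consulted so far.
        a, b = min(combinations(pool, 2), key=lambda p: abs(p[0] - p[1]))
        if abs(a - b) <= tolerance:
            return (a + b) / 2, n
    return None, min(len(scores), max_readers)


print(resolve([4, 5]))        # first two readers agree
print(resolve([2, 5, 4]))     # third reader resolves the dispute
print(resolve([1, 6, 3, 4]))  # fourth reader needed
```

Every extra reader in this scheme is another paid read of the same essay, which is one way to see why the article expects the system to be expensive.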
In some ways, Gaston Caperton has an excruciating job these days: he must sell the New SAT even as he defends the current test against its critics. He cannot say the College Board was wrong about the SAT all these years; nor can he say the board was wholly right about it. That's why he argues, on the blade of a knife, that the SAT is not becoming a typical achievement test but that it is coming "into a great balance" between a test of "critical reading, comprehensive writing and higher mathematics" and a test of "learned skills that you use to reason."
If that balancing act is awkward, Caperton relishes another part of his job: his nascent role as the nation's curricular impresario. "I didn't want to run a testing company," he says. "But when I saw what the College Board was and, more important, what it could be, I saw the power to do much more than they were doing in the past to improve education." Under his watch, the board is issuing quite specific recommendations about what schools should teach. For instance, math lessons should include radical equations (such as 5√x + 14 = 20, for which x = 1.44). Also, the board says most students should double the amount of time they spend writing (a laudable but pricey goal: some schools will have to initiate a round of hiring, since their teachers barely have time to grade kids' work as it is). A recent College Board brochure expansively declares, "The skills evaluated by the new SAT are precisely those needed by all students today." All?
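Solving the sample radical equation 5√x + 14 = 20 takes two steps: isolate the square root, then square both sides. The Python sketch below follows those steps and checks the result by substitution, since squaring can introduce extraneous roots; the variable names are illustrative.

```python
import math

# Radical equation from the text: 5 * sqrt(x) + 14 = 20
a, b, c = 5, 14, 20

root = (c - b) / a  # isolate the radical: sqrt(x) = 6 / 5 = 1.2
x = root ** 2       # square both sides: x = 1.44

# A square root cannot be negative, and the answer must satisfy the original equation.
assert root >= 0
assert math.isclose(a * math.sqrt(x) + b, c)

print(round(x, 2))  # 1.44
```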
But in a historical sense, Caperton's ambitious agenda for the big test is appropriate: 77 years ago, the exam began life as a tool of social change. The most significant early champion of the SAT was Harvard president James Conant, who, Lemann writes, disliked achievement tests because "they favored rich boys whose parents could buy them top-flight high-school instruction." Conant helped the SAT grow into the behemoth it is today precisely because it was different.
Today Atkinson and Caperton have launched another great social experiment with the SAT. This time, the idea is that the test's rigorous new curricular demands will lift all boats—that all schools will improve because they want their students to do well on the test. Schools have long tried to prepare kids for the SAT, but education experts scorned the practice of openly teaching to the test. Now it's the mission of the College Board that every school should teach to the SAT. "I would say that the most important aspect of this test is sending a real message back to kids on how to prepare for college," says Atkinson. It's not clear what happens to students in schools that won't hear, or can't afford to heed, his message.
—With reporting by Anne Berryman/Athens, Laura Randall/Cincinnati and Jeffrey Ressner/Los Angeles