2015-10-30

Introduction and key findings

Since its inception in 2000, the Program for International Student Assessment (PISA)1—an international test of reading, math, and science—has shown that American 15-year-olds perform more poorly, on average, than 15-year-olds in many other developed countries. This finding is generally consistent with results from another international assessment of 8th graders, the Trends in International Mathematics and Science Study (TIMSS).2

International test rankings have come to dominate how politicians and pundits judge the quality of countries’ education systems, including highly heterogeneous systems such as that of the United States. While international tests and international comparisons are not without merit, international test data are notoriously limited in their ability to shed light on why students in one country score higher or lower than students in another.3 Policy prescriptions based on these test results therefore rest largely on description and on correlational evidence that offers limited and less-than-convincing proof of the factors that actually drive student performance.

Indeed, from such tests, many policymakers and pundits have wrongly concluded that student achievement in the United States lags woefully behind that in many comparable industrialized nations, that this shortcoming threatens the nation’s economic future, and that these test results therefore demand radical school reform that includes importing features of schooling in higher-scoring countries.

This report challenges these conclusions. It focuses on the relevance of comparing U.S. national student performance with average scores in other countries when U.S. students attend schools in 51 separate education systems run not by the federal government, but by states (plus the District of Columbia). To compare achievement in states with each other and with other countries, we use newly available data for student mathematics and reading performance in U.S. states from the 2011 TIMSS and 2012 PISA, as well as several years of data from the National Assessment of Educational Progress (NAEP).4 In particular, we use information on mathematics and reading performance of 15-year-olds from the PISA data, information on mathematics performance in 8th grade from the TIMSS data, and information on mathematics and reading performance of students in 4th and 8th grade from the NAEP data.

We conclude that the most important lessons U.S. policymakers can learn about improving education emerge from examining why some U.S. states have made large gains in math and reading and achieve high average test scores. The lessons embedded in how these states increased student achievement in the past two decades are much more relevant to improving student outcomes in other U.S. states than looking to high-scoring countries with social, political, and educational histories that differ markedly from the U.S. experience. No matter how great the differences among U.S. states’ social and educational conditions, they are far smaller than the differences between the United States as a whole and, say, Finland, Poland, Korea, or Singapore. As such, this report starts the process of delving into the rich data available on student academic performance in U.S. states over the past 20 years—and shows that the many major state successes should be our main guide for improving U.S. education.

The report is organized around three main arguments:

1. Policymakers are not correct in concluding—based on international tests—that U.S. students are failing to make progress in mathematics and reading.

Country averages not adjusted for major national differences in students’ family academic resources (such as the number of books in the home or mother’s education level) mistakenly attribute U.S. students’ performance entirely to the quality of U.S. education. After adjusting for these factors, U.S. students perform considerably better than the raw scores indicate.5

Focusing on national progress in average test scores obscures the fact that socioeconomically disadvantaged U.S. students in some states have made very large gains in mathematics on both the PISA and TIMSS—gains larger than those made by similarly disadvantaged students in other countries.

Student performance in some U.S. states participating in international tests is at least as high as in the high-scoring countries. Additionally, TIMSS gains made by students in several states over the past 12 years are much larger than gains in other countries. Specifically:

Students in Massachusetts and Connecticut perform roughly the same on the PISA reading test as students in the top-scoring countries (i.e., Canada, Finland, and Korea)6 and high-scoring newcomer countries (i.e., Poland and Ireland), and higher than students in the post-industrial countries (i.e., France, Germany, and the United Kingdom). Socioeconomically advantaged students in Massachusetts score at least as well in mathematics as advantaged students in high-scoring European countries.

On the 2011 TIMSS, advantaged students in Connecticut, Massachusetts, Minnesota, North Carolina, Indiana, and Colorado performed at least as well in mathematics as their counterparts in high-scoring countries/provinces such as Quebec, England, and Finland.

Over 1999–2011, students in Massachusetts and North Carolina made average TIMSS mathematics gains at least as large as students’ average gains in Finland, Korea, and England. Over 1995–2011, students in Minnesota made TIMSS mathematics gains similar to those in Korea and larger than those in England.

2. It is extremely difficult to learn how to improve U.S. education from international test comparisons.

Policy recommendations based on the features of high-scoring countries’ education systems, or on the education reforms in countries making large gains on international tests, are overwhelmingly correlational and anecdotal. There is no causal evidence that students in some Asian countries, for example, score higher on international tests mainly because of better schooling rather than because of the large investments their families make in academic activities outside of school.

Reforms in countries such as Germany and Poland, with big gains in PISA scores but with early vocational tracking, seem to have little applicability to the U.S. system. Such differences between the educational cultures of other countries and of the United States make it difficult to draw education policy lessons from student test scores.

3. Focusing on U.S. states’ experiences is more likely to provide usable education policy lessons for American schools than are comparisons with higher-scoring countries (such as Korea and Finland).

There are vast differences between the conditions and contexts of education in various countries. In contrast, among U.S. states, school systems are very similar to one another, teacher labor markets are not drastically different, and education systems are regulated under the same federal rules. If students with similar family academic resources in some states make much larger gains than in other states, those larger gains are more likely to be related to specific state policies that could then be applied elsewhere in the United States.

We analyze relative state performance using data from the NAEP mathematics and reading achievement tests over 2003–2013 in all states and over 1992–2003 in most states. We adjust scores to control for differences in the composition of students in each state (e.g., family characteristics, school poverty, and other factors); while this reduces the variation in scores among states, significant differences remain.

In general, state gains in mathematics were larger than in reading. However, there were large variations in gains across states.

Over 1992–2013, the average annual increase in NAEP 8th grade adjusted mathematics scores in the top-gaining 10 states was 1.6 points per year—double the 0.8 point annual adjusted gain in the bottom-gaining 10 states.

Over 2003–2013, a number of states made significant gains in 8th grade mathematics and reading. These include Hawaii, Louisiana, North Carolina, Massachusetts, and New Jersey. States that made large reading gains were not necessarily the same states that made large mathematics gains. For example, Texas made large gains in mathematics but not in reading.

Gains for states with improvements were not necessarily a function of the initial level of student performance. Some of these states started from a low level of student performance, others from a middle level, and others from high levels. For example, students in Hawaii, the District of Columbia, Louisiana, and North Carolina started with low adjusted test scores in 8th grade mathematics and made large gains since the 1990s; students in Massachusetts, Vermont, and Texas started with relatively higher test scores and also made large gains.

There are undoubtedly different lessons to be learned in states where students made gains from initially low starting points and in states where they made gains from initially high starting points, or where students made gains in one subject but not the other. The next stage for researchers should be to determine what those lessons are. This paper begins that process.

When we tested for possible explanations for these variations, we found that students in states with higher child poverty levels perform worse in 8th grade mathematics even when controlling for individual student poverty and the average poverty of students attending a particular school. Thus, students with similar family resources attending school in high-poverty states tend to have lower achievement than students in low-poverty states. We also found that students in states with stronger accountability systems do better on the NAEP math test, even though that test is not directly linked to the tests used to evaluate students within states.

As a suggestive strategy for further (qualitative) policy research, we paired off states with different patterns of gains in 8th grade math. This reveals, for example, that 8th grade students in Massachusetts made much larger gains after 2003 than students in Connecticut, that students in New Jersey made larger gains than students in New York after 2003, and that students in Texas already started out scoring higher in 8th grade math in 1992, but still made larger gains over 1992–2013 than students in California, especially after 2003.

These and other comparison groups could provide important insights into the kinds of policies that enabled students in some states to make much larger adjusted gains in math scores than students in states that are geographically proximate and/or that have similar population sizes.

The tables and figures referenced in the text can be found at the end of the report.

Do U.S. students perform poorly on international tests?

U.S. students score lower, on average, on international tests than students in other developed countries. This has led many policymakers and journalists to conclude that American student achievement lags woefully behind that in many comparable industrialized nations, that this shortcoming threatens the nation’s economic future, and that the United States therefore needs radical school reform.

Upon release of the 2011 TIMSS results, for example, U.S. Secretary of Education Arne Duncan called them “unacceptable,” saying that they “underscore the urgency of accelerating achievement in secondary school and the need to close large and persistent achievement gaps” (Duncan 2012). A year later, when the 2012 PISA results were released, he stated, “The big picture … is straightforward and stark: It is a picture of educational stagnation.” He continued, “The problem is not that our 15-year-olds are performing worse today than before. The problem instead is that they are not making progress. Yet students in many nations … are advancing, instead of standing still” (Duncan 2013). He highlighted the urgent nature of addressing this lack of progress by going on to argue, “In a knowledge-based, global economy, where education is more important than ever before, both to individual success and collective prosperity, our students are basically losing ground. We’re running in place, as other high-performing countries start to lap us” (Duncan 2013).

Is Secretary Duncan correct in his conclusions? In 2013, the Economic Policy Institute published a comprehensive report, What Do International Tests Really Show About American Students’ Performance? (Carnoy and Rothstein 2013), that criticized the common interpretation of U.S. students’ performance on international tests. Carnoy and Rothstein (2013, 7) wrote that the interpretation was “…oversimplified, exaggerated, and misleading. It ignores the complexity of the content of test results and may well be leading policymakers to pursue inappropriate and even harmful reforms that change aspects of the U.S. education system that may be working well and neglect aspects that may be working poorly.”

The report argued that by focusing on country averages not adjusted for major national differences in students’ family academic resources—such as the number of books in the home or mother’s education level—policymakers and journalists mistakenly attributed the poor performance of U.S. students entirely to the quality of U.S. education. The results from PISA and TIMSS tests suggest that when adjusted for differences in the family academic resources of samples in various countries, middle school and high school students in the United States as a whole do reasonably well in reading compared with students in large European countries, but do not score as highly in mathematics either on the TIMSS or the PISA as students in other developed countries, including larger European countries such as France and Germany (Carnoy and Rothstein 2013). This is particularly true for U.S. middle-class and advantaged students.

The report also argued that focusing on “national progress in average test scores” in one year failed to differentiate gains over time for disadvantaged and advantaged students and how these compared with gains over time for similarly advantaged or disadvantaged students in other countries. The report’s final argument was that different international and national tests produced different pictures of mathematics achievement gains by U.S. students in the same 1999–2011 period, and that these differences suggested using caution when basing calls for education reform upon the results of any single test. Specifically, when adjusted for the varying family academic resources of the students sampled, U.S. students’ average TIMSS mathematics scores increased substantially in the 12 years between 1999 and 2011; the same was not true for adjusted U.S. scores on the PISA mathematics test over 2000–2009. Because the full range of knowledge and skills that we describe as “mathematics” cannot possibly be covered in a single brief test, the report urged policymakers to examine carefully whether an assessment called a “mathematics” test necessarily covers knowledge and skills similar to those covered by other assessments also called “mathematics” tests, and whether performance on these different assessments can reasonably be compared.7

The most recent PISA (2012) and TIMSS (2011) results discussed below confirm Carnoy and Rothstein’s earlier findings. Compared with students in other countries, U.S. students as a whole continue to perform better, on average, on the PISA reading test than in mathematics, and the results in mathematics suggest that U.S. students, on average, are performing well below students in many other countries. The results also continue to show that when scores on the PISA and TIMSS are adjusted for differences in family academic resources of students in the samples, U.S. students do considerably better than the raw scores indicate, and the results show that disadvantaged U.S. students have made large gains in mathematics on both tests since 1999–2000.

Further, the latest PISA and TIMSS tests included results for a number of U.S. states. Three U.S. states (Massachusetts, Connecticut, and Florida) participated in the PISA in 2012, and nine U.S. states (Alabama, California, Connecticut, Colorado, Florida, Indiana, Massachusetts, Minnesota, and North Carolina) participated in the 2011 TIMSS. The state data give a much more varied picture of U.S. student performance. Students in some of these states performed at very high levels in all subjects, as high as students in the highest-scoring countries. Conversely, students in others performed considerably below the OECD average.

Thus, the PISA and TIMSS results provide a more mixed picture of U.S. student performance than Secretary Duncan claims. For the United States as a whole, average scores are low in mathematics and middle-of-the-road in reading, but disadvantaged students have made large mathematics gains in the past 12–14 years, and in a number of U.S. states students perform at least as well as students in higher-scoring countries, with much larger mathematics gains in the past decade than students in other countries. Secretary Duncan can claim “stagnation” and “lack of progress,” but the data we present below show that the truth is more nuanced.

Comparing U.S. student average performance on the 2012 PISA

To simplify our comparisons of national average PISA scores and of these scores disaggregated by family academic resources (FAR),8 we focus on comparing students in the United States with students in eight other countries—Canada, Finland, South Korea, France, Germany, the United Kingdom, Poland, and Ireland.

We refer to three of these countries (Canada, Finland, and South Korea) as “top-scoring countries” because they score much better overall than the United States in reading and math—about one-third of a standard deviation better. Canada, Finland, and South Korea are also the three “consistent high-performers” that Secretary Duncan highlighted when he released the U.S. PISA 2009 results (Duncan 2010).

We call three others (France, Germany, and the United Kingdom) “similar post-industrial countries” because they score similarly overall to the United States. They also are countries whose firms are major competitors of U.S. firms in the production of higher-end manufactured goods and services for world markets. Their firms are not the only competitors of U.S. firms, but if the educational preparation of young workers is a factor in national firms’ competitiveness, it is worth comparing student performance in these countries with student performance in the United States to see if these countries’ education systems—so different from that in the United States—play a role in their firms’ success.

We call the remaining two comparison countries (Poland and Ireland) “high-scoring newcomer countries” because they have been cited by the Organization for Economic Cooperation and Development (OECD), the sponsor of PISA, for their high performance in PISA scores, and were middle-income European countries in the 1990s that have had relative economic success since (although Ireland’s economy went into a sharp downturn in the 2008 recession).

Table 1 shows that, on average, U.S. performance on the 2012 PISA (498 points in reading and 481 points in math) was substantially worse than performance in the top-scoring countries and high-scoring newcomer countries in both math (49 and 28 points less than in the two groups, respectively) and reading (30 and 23 points less than those groups, respectively), was about the same as performance in the similar post-industrial countries in reading, and was substantially worse than performance in the similar post-industrial countries in math (19 points lower).

How we describe PISA scores in this report

PISA is scored on a scale that covers very wide ranges of ability in math and reading. When scales were created for reading in 2000 and for math in 2003, the mean for all test takers from countries in the OECD was set at 500, with a standard deviation of 100. When statisticians describe score comparisons, they generally talk about differences that are “significant.” Yet while “significance” is a useful term for technical discussion, it can be misleading for policy purposes, because a difference can be statistically significant but too small to influence policy. Therefore, in this report, we avoid describing differences in terms of statistical significance. Instead, we use terms like “better (or worse)” and “substantially better (or worse)” (both of which denote statistically significant differences), and “about the same.” We also sometimes use “substantially higher (or lower)” interchangeably with “substantially better (or worse),” etc. In general, in this report we use the term “about the same” to describe average PISA score differences that are less than 8 scale points, we use the term “better (or worse)” to describe differences that are at least 8 scale points but less than 18 scale points, and we use the term “substantially (or much) better (or worse)” to describe differences that are 18 scale points or more.
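To make the convention concrete, the thresholds above can be restated as a simple rule. The snippet below is our own illustrative restatement, not code used in the report:

```python
def describe_pisa_difference(scale_points: float) -> str:
    """Translate an absolute PISA score difference into the descriptive terms used in this report."""
    gap = abs(scale_points)
    if gap < 8:
        return "about the same"
    if gap < 18:
        return "better (or worse)"
    return "substantially (or much) better (or worse)"

# Example: the roughly 30-point reading gap between the U.S. average and the top-scoring group
print(describe_pisa_difference(30))  # -> "substantially (or much) better (or worse)"
```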

Eighteen scale points in most cases is equivalent to about 0.2 standard deviations. Policy experts generally consider an intervention with an effect size of 0.2 standard deviations or more to be effective; such an intervention, for example, would improve performance such that the typical participant would now perform better than about 57 percent of all participants performed prior to the intervention.
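As a rough check on that rule of thumb (our own back-of-the-envelope calculation, assuming approximately normally distributed scores), the percentile implied by a 0.2 standard deviation shift follows from the standard normal cumulative distribution function:

\[
\Pr\bigl(\text{score} \le \mu + 0.2\,\sigma\bigr) \;=\; \Phi(0.2) \;\approx\; 0.58,
\]

where \(\mu\) and \(\sigma\) are the pre-intervention mean and standard deviation. That is, a participant moved from the pre-intervention mean up by 0.2 standard deviations would now outscore roughly 57 to 58 percent of the pre-intervention distribution, which is the figure cited above.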

And in the case of trends, we sometimes speak of scores that were “mostly unchanged,” a phrase with the same meaning as “about the same.” Further, it is difficult to interpret and compare results from different assessments because each test has its own unique (and arbitrary) scale.

However, the variation among U.S. states is large. In reading, students in Massachusetts (527 points) and Connecticut (521 points) perform about the same as students in the top-scoring and high-scoring newcomer countries and higher than students in the post-industrial countries. (Indeed, in reading, Massachusetts students only perform worse than students in Korea.) However, students in Florida (492 points) perform worse or substantially worse than all the comparison groups of countries. In mathematics, where the U.S. average is substantially lower than in all three comparison groups of countries, students in Massachusetts and Connecticut (514 and 506 points, respectively) perform as well as or better than students in all three post-industrial countries and in one of the high-scoring newcomers, Ireland. They only perform worse than the average of the group of high-scoring countries and Poland. On the other hand, students in Florida (467 points) perform substantially worse in math (as was the case in reading) than students in all three comparison groups of countries.

Comparing the 2012 PISA performance of U.S. students with different levels of family academic resources

We next disaggregate scores in the U.S. states, in the United States as a whole, and in the eight comparison countries by an estimate of the family academic resource status of test takers, dividing them into six groups, from the least to the most advantaged.9 We refer to these as Group 1 (lowest FAR), 2 (lower FAR), 3 (lower-middle FAR), 4 (upper-middle FAR), 5 (higher FAR), and 6 (highest FAR). We also refer to Groups 1 and 2 together as disadvantaged students, to Groups 3 and 4 together as middle-class students, and to Groups 5 and 6 together as advantaged students.

From Tables 2A and 2B we can observe that more U.S. 15-year-olds (40.2 percent) and Florida 15-year-olds (48.3 percent) are in the disadvantaged FAR groups (Groups 1 and 2) than in any of the eight comparison countries or in Massachusetts and Connecticut. We can therefore see why comparisons that do not control for differences in family academic resource distributions between countries and states may differ greatly from those that do. Yet even Connecticut and Massachusetts have higher percentages of 15-year-olds in the disadvantaged groups (29.1 and 32.0 percent) than the high-scoring countries—Canada, Finland, and Korea (25.2, 22.6, and 12.9 percent, respectively). The proportions of advantaged students in the Massachusetts and Connecticut samples (21.7 and 24.2 percent, respectively) are more similar to the proportions in all the comparison countries except Poland, which has a somewhat lower proportion (17.8 percent).

Any comparison of countries’ average performance that is not adjusted for these differences is hardly a meaningful measure of learning differences that might be due to the quality of schooling; cross-country comparisons need to account for differences in student body composition. We know that family-resource-disadvantaged students universally have lower academic performance than more advantaged students. Tables 2A and 2B allow us to compare the scores of students by FAR group in the United States as a whole (and in the three U.S. states) with students in the same FAR group in other countries. The tables also show the typical “unadjusted” PISA score for each state and country and the score adjusted for differences in the proportion of students in each FAR group in each country’s sample (these scores are depicted graphically in Figure A). We make the adjustment by assuming that each country and state has the same proportion of students in each FAR group as the U.S. sample. This “adjusted” score offers a more accurate comparison of how students’ schooling influences academic performance.
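A minimal sketch of this reweighting in Python is shown below, using hypothetical FAR shares and group means. The report’s actual adjustment is computed from the PISA microdata; all numbers here are placeholders:

```python
# Adjusted score = sum over FAR groups of (U.S. share of group g) x (country's mean score in group g).

us_far_shares = {  # share of the U.S. PISA sample in each FAR group (placeholder values)
    "group1": 0.20, "group2": 0.20, "group3": 0.18,
    "group4": 0.18, "group5": 0.14, "group6": 0.10,
}

country_group_means = {  # a comparison country's mean PISA score by FAR group (placeholder values)
    "group1": 470, "group2": 490, "group3": 505,
    "group4": 520, "group5": 540, "group6": 560,
}

def far_adjusted_score(group_means: dict, reference_shares: dict) -> float:
    """Reweight a country's FAR-group means by the reference (U.S.) FAR distribution."""
    assert abs(sum(reference_shares.values()) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(reference_shares[g] * group_means[g] for g in reference_shares)

print(round(far_adjusted_score(country_group_means, us_far_shares), 1))
```

Because every country and state is reweighted to the same reference (U.S.) FAR distribution, differences in the adjusted scores are no longer driven by differences in sample composition.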

These results confirm earlier findings suggesting that disadvantaged FAR students in the United States compare more favorably to their foreign counterparts than do U.S. middle-class and advantaged students.10 The results also show that although adjusting average PISA mathematics and reading scores reduces the difference between U.S. and comparison country averages, scores remain lower in the United States than in the high-scoring and newly high-scoring countries.

However, students across all FAR groups in higher-scoring U.S. states, such as Massachusetts and Connecticut, do at least as well in reading as students in four of the highest-scoring countries (Canada, Finland, Poland, and Ireland) on the 2012 PISA test (see Table 2A). It is important to note that, in sharp contrast to the United States as a whole and to lower-scoring (and poorer) states such as Florida, the advantaged students in Massachusetts and Connecticut, who make up a substantial share of each state’s sample (more than 20 percent), generally do at least as well in mathematics as advantaged students in the highest-scoring countries (including the newly high-performing countries) and in the post-industrial countries.11 12 The only exception is that the most advantaged students in Korea (i.e., those in Group 6) score substantially higher in math than the most advantaged students in Connecticut and Massachusetts.

Even so, Massachusetts students do slightly better than Korean students on the PISA reading test when we adjust for differences in student FAR (the difference is 7 points, which we typically classify as “about the same”) (Table 2A). In addition, there is one group in Massachusetts that may be comparable to students in Korea in terms of out-of-school activities: self-identified Asian-origin students. Students who identified themselves as being of Asian origin in the Massachusetts PISA sample scored 569 on the 2012 PISA mathematics test, significantly higher than the unadjusted average in Korea (554), and about the same as students in Singapore (573).

Comparing changes in the PISA performance of low FAR and high FAR students in the United States with their counterparts in other countries, 2000–2012

The results in Table 2C show that although average U.S. PISA scores stagnated over 2000–2012, U.S. disadvantaged students (those in FAR Groups 1 and 2) made very large gains compared with similarly disadvantaged students in a number of post-industrial and high-scoring countries (in line with the findings in Carnoy and Rothstein 2013). A positive number in the table indicates that U.S. students in a particular FAR group gained that many scale points in reading or math over 2000–2012 compared with the same FAR group in the comparison country. A negative number means that U.S. students fell further behind their counterpart FAR group in the comparison country. Thus, compared with Group 1 and 2 students in other countries, U.S. students made relative gains on the PISA mathematics test ranging from 13 scale points (compared with Group 2 in Ireland) to 60 scale points (compared with Group 1 in Finland) between 2000 and 2012, and made even larger relative gains on the reading test. However, U.S. Group 5 students lost ground to students in every country except the U.K. in both reading and math.13 Furthermore, U.S. students in every group lost considerable ground to their corresponding group students in Germany and Poland.
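Stated formally, each entry in Table 2C is a difference-in-differences of FAR-group means (our notation, not the report’s):

\[
\Delta_{g,c} \;=\; \bigl(\mathrm{US}_{g,\,2012} - \mathrm{US}_{g,\,2000}\bigr) \;-\; \bigl(C_{g,\,2012} - C_{g,\,2000}\bigr),
\]

where \(g\) indexes the FAR group and \(C\) the comparison country. A positive \(\Delta_{g,c}\) means U.S. students in group \(g\) gained that many scale points relative to the same group in the comparison country; a negative value means they fell further behind.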

The results in Table 2C also confirm that high FAR students in the United States have been performing steadily worse on the PISA compared with their counterparts in other countries (a conclusion of Carnoy and Rothstein 2013). This has important implications for U.S. policymakers—implications that require a more nuanced analysis of where to focus in trying to raise U.S. students’ mathematics performance.14

Comparing average 2011 TIMSS performance of students in U.S. states, the United States as a whole, and other countries

Table 3A compares 2011 TIMSS mathematics results for the United States as a whole, the participating Canadian provinces (Alberta, Ontario, and Quebec; Canada as a whole did not participate), Finland, South Korea, and England. Table 3B presents results for the nine U.S. states that participated individually in the 2011 TIMSS (Massachusetts, Minnesota, Connecticut, Indiana, Alabama, Colorado, North Carolina, California, and Florida).

Column 6 of Tables 3A and 3B displays the published average 2011 TIMSS scores of each country, state, or province. Column 7 of both tables reweights the average scores, assuming that each country, state, or province had a FAR distribution similar to that of the United States nationwide. It shows that adjusting for FAR composition makes little difference in the overall average scores of most countries, provinces, and states. The greatest differences are in the cases of Ontario, Korea, Massachusetts, Minnesota, and Alabama.

The results in Tables 3A and 3B add support to the results we found in the PISA comparisons, contradicting the general lament that the United States performs relatively poorly in mathematics. After adjusting for FAR distribution differences by applying the overall U.S. FAR distribution to the various countries, provinces, and states, students in seven of the nine U.S. states that took the TIMSS in 2011 performed better than students in all of the comparison countries and provinces except Korea and Quebec. Students in Korea perform substantially better in every FAR group than students in any U.S. state, including Massachusetts.

Since Finland has been held up as having an exemplary education system, it is noteworthy that each FAR group in Massachusetts, Minnesota, and North Carolina outperformed or substantially outperformed comparable students in Finland. Students in Indiana, Colorado, and Florida performed the same as or better than students in Finland in every FAR group.

Overall, students in Massachusetts in every FAR group performed substantially better than comparable students in the three Canadian provinces, Finland, and England. The exceptions are the Group 1 students in Quebec, who performed as well as comparable students in Massachusetts. Students in North Carolina and Minnesota also performed as well as or better than comparable students in these provinces and countries in all but the lowest (North Carolina, Minnesota) and lower (Minnesota) FAR groups.

As in the case of Massachusetts and Connecticut in the 2012 PISA results, on the 2011 TIMSS mathematics test higher FAR students (Groups 4 and 5/6) in states such as Massachusetts, Minnesota, and Connecticut outperformed their counterparts in comparison provinces and countries to a greater degree than did disadvantaged students (Groups 1 and 2) in those states. As in the 2012 PISA, the one exception to this trend is Korea. Excluding the Korean exception, this overall result suggests two possibilities. The first is that advantaged students in higher-scoring U.S. states are being more adequately prepared in mathematics, relative to disadvantaged students, when compared with other high-scoring countries and provinces. The second is that there are factors in the disadvantaged student groups in many states (including Massachusetts, Minnesota, and North Carolina) other than the number of books in the home that may negatively affect test scores. These factors—such as limited English proficiency and some factors mediated by race—may be less prevalent in Canada’s provinces, Finland, or England.

These TIMSS results for several additional states—Colorado, Indiana, Minnesota, and North Carolina—together with those for Connecticut and Massachusetts, which also scored highly on the PISA test, suggest that students in various parts of the United States, particularly advantaged students, are performing at least as well in mathematics as their counterparts in “high-scoring” countries such as Canada and Finland, but not as well as advantaged students in Korea.

Changes in U.S. states’ TIMSS scores over time

Three U.S. states took the PISA test for the first time in 2012, but a number of U.S. states have participated in the TIMSS test since 1995—Connecticut, Massachusetts, Minnesota, North Carolina, Indiana, Missouri, and Oregon. This allows us to observe how well students in various states performed on the TIMSS mathematics test over 1995–2011, and we can compare these changes to those in several of our comparison countries/provinces that also took the TIMSS over this period.

Table 4 shows the results for those states that participated in more than one administration of the TIMSS. Students in Missouri and Oregon, which participated only in 1995 and 1999, scored lower in 1999 than in 1995. Five states made gains: Minnesota (over 1995–2011) and Connecticut, Indiana, Massachusetts, and North Carolina (over 1999–2011). The gains in North Carolina and Massachusetts are particularly large.15

Table 5A compares changes in TIMSS scores across FAR groups in 1999–2011 in Connecticut, Massachusetts, Indiana, and North Carolina with changes in the same FAR groups in Finland, Korea, England, and the United States as a whole over this period. Table 5B compares changes in Minnesota over 1995–2011 with changes in Korea, England, and the United States as a whole over this period.

Table 5A shows that the increases in TIMSS mathematics test scores in all FAR groups in Massachusetts and North Carolina were very large over 1999–2011, about 50 scale points, or one-half of a standard deviation. Mathematics gains in Connecticut and Indiana were smaller, ranging between 0.1 and 0.3 standard deviations in Connecticut and about 0.2 to 0.4 standard deviations in Indiana. In the United States as a whole, advantaged students (Groups 5/6) made smaller gains than disadvantaged and middle FAR students. Among these four states, however, only Indiana shared this pattern, and even there the gain for advantaged students was almost 0.3 of a standard deviation. In this same period, students’ performance in Finland stayed about the same, but the score of Group 1 fell 18 points, or about 0.2 standard deviations. Students in Korea also made gains, but these were much smaller than those in Massachusetts and North Carolina, and more similar to those in Connecticut and Indiana. The pattern of gains in Korea tended to be larger for advantaged students than for disadvantaged students.

As shown in Table 5B, students in Minnesota made large gains in 1995–2011 across all groups, with somewhat larger gains in middle and disadvantaged groups. The gains were about the same as in the United States as a whole, Korea, and England.

The distribution of students among FAR groups in the U.S. TIMSS samples changed radically over 1995–2011. The U.S. distribution in 1995 was more similar to the distributions in Korea’s and England’s samples, and in 1999 more similar to those in Finland’s, Korea’s, and England’s samples, than it was in 2011. The percentage of disadvantaged students in the U.S. sample increased considerably in the years after 1995 and 1999.

In column 7 of Tables 5A and 5B (shown graphically in Figures B and C), we estimate the hypothetical score each state and country would have had were the distribution of the sample among FAR groups the same as the distribution in the United States as a whole in 2011. This adjustment increases the gain for every state, for the United States as a whole, and for England. Finland’s loss stays about the same, and Korea’s gain from 1999 to 2011 drops significantly (however, Korea’s gain from 1995 to 2011, shown in Table 5B, stays about the same).

Such very large gains in mathematics for students in U.S. states, particularly those in Massachusetts and North Carolina, do not conform to the characterization of U.S. education as failing. This more positive image of U.S. education in several of these states is reinforced by the fact that, as we showed in Tables 3A and 3B, students in several states (Massachusetts, Minnesota, and North Carolina)—once their average TIMSS scores and those of comparison countries are adjusted to a common FAR sample distribution—perform at least as well in mathematics as students in the Canadian provinces, Finland, and England (but worse than in Korea).

Why it is difficult to learn much about improving U.S. schools from international test comparisons

Over the past several years, the controversy around the validity and politics of using international tests to draw conclusions for educational policy and practice has intensified. The critiques cover a broad range of issues. Some question the reliability of the PISA test and the PISA rankings (Stewart 2013). Others cast doubt on the representativeness of a key PISA sample, Shanghai, continuously held up by the OECD as a symbol of educational excellence (Loveless 2013, 2014; Harvey 2015).16 The discussions about the validity of international tests as measures of students’ knowledge and the representativeness of PISA samples reveal an important aspect of these tests.

The most cogent of the recent critiques is that international agencies—especially the OECD—are often too quick to use international test score data to argue that particular education policies are the reason test scores are high in some countries and low in others. A main policy recommendation coming out of international comparisons is to copy or adapt the policies of higher-scoring countries (see OECD 2011). For example, because students in some East Asian countries and cities—such as Korea, Japan, Singapore, and, most recently, Shanghai—achieve such high test scores, the OECD and others consistently feature these as exemplary education systems. Some reasons given for educational quality in East Asia are the high level of teacher skills, high teacher pay, and, in some countries, such as Korea, a rather equal distribution of students from different social class backgrounds across schools.17 Yet these explanations are only backed by correlational, not causal, evidence.

Furthermore, out of 55 countries that have taken the PISA mathematics test over a number of years, only 18 trend upward. Of these 18, about half are low-scoring developing countries with levels of educational and economic development very different from those of the United States. The OECD has focused heavily on the high-scoring countries and big gainers, but it has failed to balance the discussion with explanations for why students in countries with “good” school systems—such as Finland, Australia, New Zealand, Canada, Belgium, Hungary, the Czech Republic, and Sweden—did significantly worse on the PISA mathematics test in 2012 than in 2003 (OECD 2013a, Figure I.2.16).

U.S. policymakers seem to be a particular target for recommendations drawn from the PISA data on how to improve U.S. education, and these recommendations further illustrate the pitfalls of using other countries to draw lessons for U.S. policy. In the wake of the PISA 2009 score release, Secretary Duncan requested that the OECD prepare a report on lessons for the United States from international test data. In that report, Lessons from PISA for the United States, the OECD advised Duncan to follow the lead of educators in Ontario, Shanghai, Hong Kong, Japan, Germany, Finland, Singapore, and Brazil, in each case drawing “lessons” of how schooling in those countries/provinces/cities brought students to high or higher levels of performance (OECD 2011). In 2013, OECD produced a second Lessons from PISA aimed at U.S. policymakers, this time making more general recommendations drawn from correlational analyses of the 2012 PISA data (OECD 2013d).

As we have shown, there is reason to agree with international testing proponents that the U.S. education system could be improved to teach students mathematics better, and, less urgently, to make U.S. students into better readers. Detailed studies about why other countries’ education systems do well or badly can certainly provide many points of discussion for U.S. educators and policymakers about how school systems are organized, about curriculum differences, and about teacher training, among many other themes. For example, 20 years ago, based on international comparisons of 1995 TIMSS test scores, studies were able to argue convincingly that students in the United States did poorly on the 8th grade TIMSS mathematics test mainly because only one-quarter of U.S. students were exposed to algebra and an even lower share were exposed to geometry by the 8th grade (Schmidt et al. 2001). Scholars also argued that the U.S. math curriculum was a “mile wide and an inch deep” (Schmidt et al. 1997). This was a wise use of international comparisons to make policy changes in mathematics. Likewise, there may also be useful research in other countries on the impact of education policies on student outcomes. However, as we shall see, looking to other countries to find solutions for education issues involving nationally or even regionally specific and complex student-teacher-administrative interactions is certainly not the only option that U.S. education policymakers can contemplate.

Why some of the policy recommendations are not feasible

Education systems develop in social and political contexts and are inseparable from those contexts. Families in some cultures are more likely to put great emphasis on academic achievement, particularly on achievement as measured by tests. They act on this emphasis by investing heavily in their children’s out-of-school academic activities, including “cram courses,” which focus on test preparation (Ripley 2012; Bray 2006; Wantanabe 2013; Bray and Lykins 2012; Byun 2014).18 In a world that puts high value on test scores, perhaps such intensive focus on children’s academic achievement should be applauded. However, whether it is a good choice for middle and high schoolers to spend most of their waking hours studying how to do math and science problems, and whether it is likely that families in other societies would buy into this choice for their children, are highly controversial questions and certainly only somewhat related to the quality of schooling received by students in a particular society.19

There is some evidence that mathematics instruction in certain Asian countries is better than in the United States. Comparisons of mathematics instruction in Japanese and U.S. schools (Hiebert et al. 1999) support the “mile wide, inch deep” critique of U.S. mathematics education made almost 20 years ago (Schmidt et al. 1997). There is also evidence that greater “exposure to formal mathematics” has a positive effect on TIMSS and PISA mathematics performance (Schmidt et al. 2001; Schmidt et al. 2015; Carnoy et al. 2015a). However, the relevant questions are whether there is something specific we can learn from Japanese school practices that we know will increase U.S. student mathematics performance, and whether greater exposure to formal mathematics is taking place mainly in schools.20 As noted, if greater exposure outside of school is the more important reason that students in the PISA sample in, say, Korea, Japan, Hong Kong, or Singapore do better on the PISA mathematics test than students in the United States, we would need to change core behavior patterns in most U.S. families. Simply improving mathematics instruction in school would not help much if spending significantly more time on mathematics outside school is the main reason 15-year-olds in Korea score higher on the PISA test.

Finland has also been touted as having a model education system, mainly because of its highly trained teachers and the autonomy teachers and principals in the country allegedly have in their classrooms and schools.21 Teacher education and classroom teaching in Finland indeed seem very good, but neither the OECD nor the Finns ever offered any evidence approaching causality to support these claims.22 Furthermore, performance on PISA reading and mathematics tests by students in the country’s lower FAR groups has deteriorated considerably over the past decade (Table 2C), though this decline has not been addressed by PISA analyses.

Two recent examples of “successful” reforms featured in OECD reports are those undertaken by Germany and Poland (OECD 2013b, Volume II, Box II.3.2; OECD 2013c, Volume IV, Box IV.2.1). Students in those two countries scored near the OECD average in 2003 and have made large gains since. One study of the Polish reforms argues that Poland’s 1999 reform, which postponed vocational tracking from the 9th to the 10th grade, lifted the PISA reading scores of students who would otherwise have entered vocational school in the 9th grade by one standard deviation. The study argues that this explains much of Poland’s reading test score gains over 2000–2006 (Jakubowski et al. 2010). Yet, thanks to a special follow-up sample in Poland, that same study is also able to show that once those students entered the vocational track in 10th grade, they “lost” the gains they had made in 9th grade.23

Germany had a PISA-reported increase in its sample of disadvantaged students in the 2000s, which makes the substantial gains such students achieved in that decade all the more interesting and impressive. Nevertheless, the gain in average test scores from 2000 to 2009 apparently came from gains made by children from Slavic-country immigrant families. The gains of ethnic German students were negligible (Stanat et al. 2010). The reported concentration of German student performance gains among first- and second-generation immigrants, most from Russia and Eastern Europe, raises questions about whether school reforms were related to such gains and about whether lessons learned in Germany from educating Russian and Eastern European immigrants are applicable to the U.S. context, where most low-scoring immigrants are from Mexico and come to the country with considerably fewer family academic resources. An OECD report on lessons for the United States from other countries discusses German reforms but concedes that there seems to be no empirical link between those reforms and German test score gains (OECD 2011).24

These are just a few cases illustrating how a lack of relevance and causal evidence does not impede the OECD and others from drawing conclusions from PISA data on what works to increase student test scores. The OECD has made a regular practice of recommending what countries and even schools should do to increase their students’ learning even though there is no link to particular country situations and no causal evidence for these claims (OECD 2013a, Chapter 6).

When evaluating such recommendations, U.S. policymakers need first to ask whether they are relevant to the U.S. context—whether the experiences described above, such as those in Asia, Finland, Germany, and Poland, are meaningful for U.S. educational conditions. Secondly, U.S. policymakers need to ask whether the analysis underlying the recommended intervention or reform makes a reasonable inferential case that the intervention or reform is linked to improved education outcomes. It is a challenge to find any OECD recommendations in the two Lessons from PISA for the United States volumes that meet these criteria. For all these reasons, it is challenging to learn about improving U.S. schools from comparisons based on international tests.

Should U.S. policymakers look outward or inward?

Because of all these issues of comparability, the PISA and TIMSS results in various U.S. states are particularly interesting. Our analysis in the previous section suggests that the national average, even for U.S. advantaged students, who usually score lower than their counterparts abroad, may be a poor measure of the success or failure of such students in many of the administrative units that actually run U.S. education. In a number of states that have participated in international tests, advantaged students do at least as well in mathematics—in which students in the United States as a whole do not perform very well—as advantaged students in high-scoring PISA and TIMSS countries. At the same time, advantaged students in many other states do worse than their counterparts in large industrial European countries and much worse than students in high-achieving countries.

Thus, the question U.S. policymakers and educators should ask is whether they need to turn to the rest of the world to find out what will make relevant contributions to improving U.S. education—or whether they should look within the United States’ own varied, state-based education systems to find these answers.

The case for looking inward across states

The case for looking inward, across U.S. states, is compelling. When considered carefully, the concept of a “United States education system” is largely a construct of agencies administering international tests and the U.S. Department of Education. Education policies are, in the environment of the past 20 years, largely formulated by states and local school districts, even though federal legislation and programs such as the Civil Rights Act, Title I, Title IX, No Child Left Behind, Race to the Top, and Common Core are important in holding states and districts to certain requirements and steering them in certain directions. Ultimately, however, implementation varies, and the result is considerable variation in the quality of education systems even among states. As would be expected, student performance in mathematics on international tests varies greatly among U.S. states.

Secondly, successful states’ results and experiences are more relevant for drawing policy lessons for state policymakers because the conditions and context of education are more similar among U.S. states than among the United States and other countries. School systems are very similar in U.S. states, teacher labor markets are not drastically different, and the state systems are regulated under the same federal rules. If students with similar family academic resources in some states make much larger gains than in other states, those larger gains are reasonably likely to be related to specific state policies that are transferable across states.

If we are to learn lessons for improving 15-year-olds’ mathematics and reading performance, we argue that looking at the great number of successful education systems within the United States is far more useful and feasible than turning for lessons to Finland or Germany (which is also not a single education system) or Poland. If test score levels were not very high in the best-performing U.S. states, one could argue against this approach. But this is not the case. Students in Massachusetts with similar family academic resources do at least as well as students in all but a limited group of Asian countries, and were students in Texas, North Carolina, or Vermont to take the PISA test, they would probably do so as well.

Thirdly, as shown, many states have made large gains—particularly in mathematics—since the 1990s. As evidenced by the TIMSS test, in some of these states, these gains have been larger than in countries such as Finland, held up as examples of good education policy. In Indiana, for example, the largest gains on the TIMSS mathematics test appear to be among disadvantaged students, and in others, such as Connecticut, the gains appear to be largest for advantaged students. In three states—Minnesota, Massachusetts, and North Carolina—large gains are spread across all family academic resource groups. Differences in states’ policy changes may reveal important lessons for policymakers.

For all these reasons, it makes greater sense to use student performance across U.S. states to understand why students in some are able to achieve high mathematics scores on several kinds of tests than to look to other countries for educational policy direction.

What we can learn from student performance in U.S. states over time

Having made the case that systematically comparing student performance in U.S. states could be highly valuable for education policymakers, we now turn to an analysis of the differences in student achievement in mathematics and reading across U.S. states over the past 10 and 20 years, using microdata available from the National Assessment of Educational Progress (NAEP).25 We have established that there is considerable variation in student performance among U.S. states on both the PISA and TIMSS tests, and that student gains on the TIMSS mathematics test for five states also vary considerably from the 1990s to 2011. NAEP data allow us to extend this analysis to all U.S. states and to begin to suggest ways to draw lessons from the results to improve student achievement.

Comparing states’ performance on the NAEP over 1992–2013 using regression analysis

Among its multiple uses (see, for example, Lubienski 2002; Lubienski and Lubienski 2014), the NAEP, the United States’ national assessment test, is the main data source for interstate comparisons. It provides state-level data since 1990 and sufficiently consistent samples of states since 1992. The NAEP is administered to students in specific grades (4th, 8th, and 12th). We focus on the 8th grade results in mathematics and reading but also estimate results for students in the 4th grade to check on the possibility that changes in 4th grade scores in a state drive changes in 8th grade scores four years later. These are called “cohort effects.”

Student performance on the NAEP varies considerably across states in each year the NAEP is given. For example, in 8th grade mathematics in 2013, students in the lowest-performing jurisdictions, the District of Columbia and Alabama, averaged mathematics scale scores of 265 and 269. At the top of the spectrum, students in New Jersey scored 296, and students in Massachusetts scored 301. This represents a spread of about one standard deviation in average scores among individual students taking the test and about 0.8 of a standard deviation among state averages.

Just as in our analysis of TIMSS and PISA results across countries and states, average NAEP scores can also differ among states for reasons other than the quality of schooling. One of these is demographics. In states with a higher percentage of students from low-academic-resource families, average scores would tend to be lower. In states with a higher percentage of minorities such as African Americans or Hispanics—groups that because of a history of discrimination or, in the case of recent immigrants, because of limited English language ability, traditionally do less well on such tests—average scores are likely to be lower. The portion of the lower scores resulting from differences in the composition of students’ characteristics (including gender, race/ethnicity, and age), or differences in the composition of students with different levels of family resources (including language spoken in the home, mother’s education, and individual poverty, measured by eligibility for free or reduced lunch), cannot reasonably be attributed to the performance of the education system.

Using the NAEP individual- and school-level microdata, we estimate student performance on the NAEP as a function of students’ characteristics, students’ family academic resources, and variables identifying each state. The coefficients for each state show how students in that state score once we account for that part of the variation in test scores associated with these student and family resource differences among states. We know that part of the variation in observed test scores among students is related to the family-academic-related skills that students bring to school, and that the level and distribution of students’ family resources vary across states. Thus, to get a better understanding of how educational quality differences vary across states, we want to take out student demographic variation, regardless of what the ultimate source of those differences across states might be. We call this our Model 1 adjustment.

In addition to adjusting student achievement for student sample demographics, as we were able to do with the international test data, NAEP offers the opportunity to dig further into the sources of state differences. With NAEP, in addition to controlling for student family characteristics, we can control for some school- and teacher-level factors that may differ across states and are likely to be correlated with student academic performance. Specifically, we can control for the concentration in schools of free and reduced-price lunch students and of black or Hispanic students—two types of school-level “peer effects.” We call this our Model 2 adjustment. If we want to adjust further to get at how well states are doing given educational resources available in the states, we can also control for the several characteristics of teachers measured by the NAEP in the schools attended by the sampled students, and for whether the school they attended is public or private. We call this our Model 3 adjustment.
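In regression form, the three adjustments can be written roughly as follows (our notation; the variable lists are those described in the text):

\[
\begin{aligned}
\text{Model 1:}\quad y_{ijs} &= \alpha + X_{ijs}\beta + \textstyle\sum_{s}\gamma_s D_s + \varepsilon_{ijs},\\
\text{Model 2:}\quad y_{ijs} &= \alpha + X_{ijs}\beta + P_{js}\delta + \textstyle\sum_{s}\gamma_s D_s + \varepsilon_{ijs},\\
\text{Model 3:}\quad y_{ijs} &= \alpha + X_{ijs}\beta + P_{js}\delta + T_{js}\theta + \textstyle\sum_{s}\gamma_s D_s + \varepsilon_{ijs},
\end{aligned}
\]

where \(y_{ijs}\) is the NAEP score of student \(i\) in school \(j\) and state \(s\); \(X\) contains student characteristics and family academic resources; \(P\) contains the school-level peer-composition measures (shares of free or reduced-price lunch students and of black or Hispanic students); \(T\) contains the teacher and school-type controls; \(D_s\) are state indicators; and the estimated coefficients \(\gamma_s\) are the adjusted state effects.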

Furthermore, unlike PISA or TIMSS data, the NAEP data allow us to track changes in test scores over time in all the states (after 2003).26 If some states are making larger gains in test scores compared with other states and these gains are not the result of favorable changes in student demographics, we could posit that something about those states’ school systems is contributing to greater student performance gains. If we also adjust the test scores for the several available measures of teacher resources in each state, the coefficients for each state represent a “state effect” on student performance net of student family academic resources, peer effects from the way state education systems concentrate students in schools, and some measurable teacher resources.

Adjusted score rankings for states, 2003–2013

Because all states were required to take all the tests beginning in 2003, estimates of the adjusted scores for each state in both NAEP reading and mathematics are available for 2003–2013. For this period, we compare each state’s relative position and relative gains, proceeding as described above: We first estimated the regressions using the NAEP microdata files, including, stepwise, student characteristics (Model 1), then adding the social class and race concentration of students in schools (Model 2), then, in a third step, adding type of school (private, charter, or public), teacher experience, and whether teachers had majored in the tested subject (mathematics or reading) in undergraduate or graduate studies (Model 3). In order to enter all the states in the regressions as discrete variables, we needed to omit one state as a reference category; in our estimates we left out California, a large and relatively low-scoring state. The “adjusted state fixed effects” in each of these three regressions were estimated from the coefficients on the state indicator variables.27 This process (using Model 3) is followed to estimate the “adjusted state fixed effects” for students in 4th and 8th grade, in math and reading.28
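A schematic version of this stepwise estimation, written with Python’s statsmodels formula interface, is sketched below. The column names and the data frame df of student-level NAEP records are hypothetical placeholders, and the report’s actual estimation also uses NAEP sampling weights and plausible values, which this sketch omits:

```python
import pandas as pd
import statsmodels.formula.api as smf

def adjusted_state_effects(df: pd.DataFrame) -> dict:
    """Estimate adjusted state fixed effects from student-level NAEP records (schematic)."""
    # California is the omitted reference category, as in the report.
    state_term = "C(state, Treatment(reference='CA'))"

    model1 = ("naep_score ~ female + C(race) + age + home_language + mother_educ "
              f"+ free_reduced_lunch + {state_term}")
    model2 = model1 + " + school_frl_share + school_black_hisp_share"  # school-level peer effects
    model3 = model2 + " + C(school_type) + teacher_experience + teacher_majored_in_subject"

    fits = {name: smf.ols(formula, data=df).fit()
            for name, formula in [("Model 1", model1), ("Model 2", model2), ("Model 3", model3)]}

    # The coefficients on the state indicators are the "adjusted state fixed effects":
    # each state's average score relative to California, net of that model's controls.
    return {name: fit.params.filter(like="C(state") for name, fit in fits.items()}
```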
