DOI: 10.3102/0162373707299550
Using Covariates to Improve Precision for Studies That Randomize Schools to Evaluate Educational Interventions
MDRC
This article examines how controlling statistically for baseline covariates, especially pretests, improves the precision of studies that randomize schools to measure the impacts of educational interventions on student achievement. Empirical findings from five urban school districts indicate that (1) pretests can reduce the number of randomized schools needed for a given level of precision to about half of what would be needed otherwise for elementary schools, one fifth for middle schools, and one tenth for high schools, and (2) school-level pretests are as effective in this regard as student-level pretests. Furthermore, the precision-enhancing power of pretests (3) declines only slightly as the number of years between the pretest and posttests increases; (4) improves only slightly with pretests for more than 1 baseline year; and (5) is substantial, even when the pretest differs from the posttest. The article compares these findings with past research and presents an approach for quantifying their uncertainty.
Key Words: randomizing schools; cluster randomization; precision; intraclass correlation; educational interventions; pretests
The best way to measure the impacts of many educational interventions is to randomize schools to a treatment group that receives the intervention or a control group that does not and compare future outcomes for the two groups. This design is especially appropriate for evaluating whole-school reforms, which are intended to change how schools operate.1 Randomizing schools is also the design of choice for evaluating classroom-level innovations, if they are likely to "spill over" within schools from treatment classrooms to control classrooms.2 The principal drawback of the approach, however, is its limited statistical power or precision and the corresponding need to randomize many schools (often 40–60, as discussed later) to identify with confidence intervention effects that are educationally meaningful (Bloom, 2005; Bloom, Bos, & Lee, 1999; Schochet, 2005).
One of the most promising ways to improve the precision of such designs is to use multiple regression analysis (also referred to as analysis of covariance) to control for the characteristics of schools and/or students during a baseline period before randomization. Such baseline covariates (in the past often referred to as concomitant variables) can include demographic characteristics, socioeconomic characteristics, and measures of past student performance (pretests). Although using covariates to improve the precision of randomized experiments is not a new idea (its use dates back to Fisher, 1937/1947), little is known about the effectiveness of this approach for studies that randomize schools, because this is an empirical question. The main purpose of the present article is therefore to provide such empirical evidence to help researchers design future studies.3
Part 1 of this article introduces the types of covariates considered. Part 2 describes our statistical framework and empirical analysis.
Parts 3 and 4 present our empirical findings. Part 5 compares these findings with those of past research and presents an approach for quantifying the uncertainty surrounding the proposed estimators.
Table 1 lists the research questions addressed in the present article. These questions consider the extent to which different types and combinations of covariates correlate with, and thus predict, student outcomes. The greater this predictive power, the more covariates improve precision (discussed in detail later).
Core Questions
The first question addressed by the present analysis is, How much can pretests improve precision? If precision is improved a lot, many fewer schools need to be randomized for a given study, the costs of the study will be reduced, and more studies can be supported. An important, related subsidiary question is, How much, if any, precision is lost by using school-level pretests instead of student-level pretests? Data on school-level pretests (mean baseline scores for schools) can often be obtained quickly and cheaply from electronic reports posted on state or local Web sites. Data on student-level pretests (individual baseline scores) must be obtained from state or local administrative records, which is more difficult and expensive.
There are two reasons to expect school-level covariates to perform as well as student-level covariates. First, correlations across aggregate entities tend to be much higher than those across individuals.4 This suggests that correlations across schools will be much higher than correlations across students. Second, because the school-level variance is usually the binding constraint on precision (discussed later), covariates that correlate highly with school outcomes are likely to be more important in explaining variance than covariates that correlate highly only with student outcomes.
The next core research question acknowledges the fact that most educational interventions take considerable time to implement, and thus their evaluations often must span several follow-up years. In designing such evaluations, it is therefore important to ensure adequate precision not only for the 1st follow-up year but for subsequent years as well. This raises the issue of how the predictive power of a baseline covariate declines as the gap in time between it and follow-up measures increases. The faster this predictive power declines, the larger the study sample must be to ensure adequate precision for later follow-up years.
The next three questions in Table 1 consider how precision varies across subjects (reading and math), education levels (elementary school, middle school, and high school), and local school districts (the five urban districts in the present analysis). Findings for reading and math are important because of the need to evaluate interventions for both subjects. Findings for different education levels are important because of the need to evaluate interventions for them. Findings for different school districts are important for assessing the usefulness of these results for planning future studies. If findings vary little across districts, researchers can be fairly confident in using them to plan future studies. But if these findings vary a lot across districts, researchers will have to depend more on estimating planning parameters from data for the districts they are studying (which is the preferred approach when feasible). The final core research question considers how parameters that determine precision vary across years in a school district. To the extent that these parameters are stable over time, it is safe to plan a study on past estimates. To the extent that these parameters vary over time, researchers must be especially conservative about their likely future values.
Further Questions
The first two questions in this category consider the improvement in precision that can be achieved by using two pretests. The first question concerns pretests for two baseline years instead of one, and the second concerns using a school-level pretest and a student-level pretest together. It stands to reason that two pretests should have greater predictive power and thus yield greater precision than one. But it is an empirical question as to how much difference is made by the addition of a pretest for a 2nd baseline year and how much difference is made by using a school-level pretest and a student-level pretest together.
The third question considers how much precision can be achieved if pretest data are not available and only demographic characteristics can be used as covariates. This is not likely to occur for evaluation studies based on data from local school districts, but it might occur for studies based on data from national surveys. The fourth question takes a different tack with respect to using demographic data. It considers the extent to which adding demographic covariates to a pretest can improve precision.
The fifth question considers situations for which the pretest differs from the posttest. This occurs frequently when states and districts change the tests they use to assess student or school performance. One might expect less predictive power, and thus less precision, when the pretest differs from the posttest. But much of the predictive power of school-level pretests might reflect general "school effects" that are fairly stable over time and tests. Thus, precision might be almost as high when posttests and pretests differ as when they are the same.
The last two questions in Table 1 consider what precision is likely to be if a study focused on subgroups of schools within a district that have especially high concentrations of low-income students (typically measured by the proportion of students receiving free lunch) and those with especially low past student performance. Such schools are the most frequent focus of evaluation studies funded by the U.S. Department of Education and private foundations.
This part of the article provides a brief overview of the statistical framework and empirical analysis for the present findings.
Analytic Approaches
The second approach, longitudinal analysis, follows a specific student cohort or group of student cohorts over time. It might, for example, follow all students who were in second grade when an intervention was launched, regardless of whether they moved away or stayed in their original schools. Another version of longitudinal analysis would follow all students who were in a particular grade when their schools were randomized and did not change schools subsequently.5 This approach might, for example, follow all students who were in second grade when their schools were randomized and did not move away. For both types of longitudinal samples, intervention effects could be estimated as the difference in mean outcomes for the treatment group and control group during each follow-up year.6
The statistical model in Equation 1 below provides a simple way to estimate the difference of mean outcomes at a given point in time for either a repeated cross-sectional analysis or a longitudinal analysis:
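$$ y_{ij} = \alpha + \beta_0 T_j + e_j + \varepsilon_{ij} \tag{1} $$
where yij is the outcome for student i from school j; Tj equals 1 for students from treatment schools (intervention schools) and 0 for students from control schools; ej is a random error for school j, which is assumed to be independently and identically distributed across schools; and εij is a random error for student i from school j.
The intercept, α, is the expected mean outcome for control schools, and the coefficient β0 is the difference in mean outcomes between treatment schools and control schools (the intervention effect).
The second error, εij, captures variation across students within schools.
Denoting the variance of the school error term, ej, as τ² and the variance of the student error term, εij, as σ², the intraclass correlation is ρ = τ²/(τ² + σ²).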
Using Covariates
The other path to take is to collect information about sample members' baseline characteristics. There are several ways to use baseline information to improve the precision of impact estimates for cluster-randomized designs. One way is to create matched pairs or stratified blocks of clusters on the basis of similarities in their baseline characteristics and then to randomize clusters within pairs or blocks. This approach, which has been in the literature for decades (beginning with Fisher, 1937/1947) and is widely used, has important strengths and weaknesses, which are discussed by Raudenbush et al. (2007) and a number of other authors.8
Another approach, which is the basis for the present article, is to control statistically for baseline characteristics by including them as covariates in a regression model such as Equation 2 or 3 below. Doing so reduces the random variation in one or both of the two error terms, thereby increasing the precision of impact estimates:
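$$ y_{ij} = \alpha + \beta_0 T_j + \beta_1 x_{ij} + e_j^{*} + \varepsilon_{ij}^{*} \tag{2} $$
or
$$ y_{ij} = \alpha + \beta_0 T_j + \beta_1 X_j + e_j^{*} + \varepsilon_{ij}^{*} \tag{3} $$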
where xij is an individual-level covariate for student i from school j, and Xj is an aggregate covariate for all students in a particular grade from school j. The most effective covariate is typically a pretest, because pretests reflect observable and unobservable factors that determine future outcomes. A student-level pretest represents individual past performance. A school-level pretest represents the mean performance of past students in the same grade of a school. Another type of baseline covariate is demographic characteristics, such as student gender, over age for grade, race or ethnicity, or eligibility for subsidized meals. A convenient way to report the precision of a research design is its minimum detectable effect or minimum detectable effect size.9 Intuitively, a minimum detectable effect is the smallest true effect that a research design can detect with confidence. Formally, a minimum detectable effect is the smallest true effect that has a given level of statistical power for a given level of statistical significance. Education evaluations often measure intervention effects in standardized "effect size" units. This metric reports effects as a multiple of the standard deviation of the outcome measure across students in the study sample. For example, an effect size of 0.25 represents an impact equal to one quarter of the student-level standard deviation. Equation 4 below is a particularly useful expression for representing the factors that determine the minimum detectable effect size of a research design that randomizes schools and uses a baseline covariate or set of covariates.10
$$ \mathrm{MDES} = M_{J-K} \sqrt{ \frac{\rho \, (1 - R_C^2)}{P(1-P)J} + \frac{(1-\rho)(1 - R_I^2)}{P(1-P)Jn} } \tag{4} $$
where J is the total number of schools randomized, n is the number of students per school in the grade of interest, P is the proportion of schools randomized to treatment, K is the number of school-level covariates included in the model, MJ–K is a multiplier based on the t distribution that accounts for the number of degrees of freedom (J–K) for the standard error of the impact estimate, ρ is the unconditional intraclass correlation (without covariates), and R2C and R2I are the proportions of the school-level and student-level variance, respectively, that are explained by the covariates.
The first term on the right-hand side of Equation 4, MJ–K, is a multiplier that reflects how the t distribution (used to test the statistical significance of the impact estimate) varies with the number of degrees of freedom (J–K). This relationship depends on the statistical significance level used, the statistical power level desired, and whether a one-tailed or two-tailed test is conducted. When the number of degrees of freedom exceeds about 20, the value of the multiplier is approximately 2.8 for a two-tailed test and 2.5 for a one-tailed test, given 80% statistical power and .05 statistical significance (Bloom, 1995).
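The approximate multiplier values quoted above can be checked with a short calculation. The sketch below assumes that the multiplier is the sum of the critical value for the significance test and the quantile for the desired power, and it replaces the t distribution with the standard normal, which is reasonable only when the number of degrees of freedom is large. The function name `mdes_multiplier` is illustrative and does not come from the article.

```python
from statistics import NormalDist


def mdes_multiplier(alpha=0.05, power=0.80, two_tailed=True):
    """Large-sample (normal) approximation to the multiplier M.

    Assumption: M = (critical value of the test) + (quantile for power).
    With few degrees of freedom the t distribution gives larger values.
    """
    z = NormalDist()  # standard normal, mean 0 and sd 1
    tail_alpha = alpha / 2 if two_tailed else alpha
    return z.inv_cdf(1 - tail_alpha) + z.inv_cdf(power)


print(round(mdes_multiplier(two_tailed=True), 2))   # two-tailed test
print(round(mdes_multiplier(two_tailed=False), 2))  # one-tailed test
```

Under these assumptions the two-tailed value is close to 2.8 and the one-tailed value is close to 2.5, matching the figures cited from Bloom (1995).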
Equation 4 also illustrates how the number, size, and allocation of schools (J, n, and P, respectively) influence precision. Because J appears in both denominators under the square root sign, the minimum detectable effect size is roughly inversely proportional to the square root of the number of schools randomized. Because the number of students per school, n, appears in only one denominator, it has less (often much less) influence on precision. The proportions of schools randomized to the treatment group and control group, P and 1–P, respectively, also appear in both denominators of the equation. As explained by Bloom (2005), this illustrates why a balanced design (with equal numbers of schools in the treatment group and control group) produces the smallest possible minimum detectable effect. It also illustrates why the precision of an unbalanced design does not differ substantially from that of a balanced design unless the design is highly imbalanced.
Most relevant for the present discussion are the roles played by R2C and R2I in Equation 4. These terms represent the proportion of the school-level random variance and student-level random variance that is reduced or "explained" by the covariate or covariates. Specifically,
$$ R_C^2 = 1 - \frac{\tau^{*2}}{\tau^{2}} \tag{5} $$
and
$$ R_I^2 = 1 - \frac{\sigma^{*2}}{\sigma^{2}} \tag{6} $$
where τ² and σ² are the unconditional school-level and student-level variances (without covariates), and τ*² and σ*² are the corresponding conditional variances (with covariates).
School-level covariates can reduce random variation between schools only because their values are constant for all students in a school. Thus, R2I is zero for designs with school-level covariates only. Student-level covariates can reduce random variation between schools and across students within schools because their individual values can vary across students within schools, and their mean values can vary between schools. Nonetheless, as will be shown later, some school-level covariates can reduce minimum detectable effect sizes by as much as or more than student-level covariates.12
Table 2 illustrates how
The minimum detectable effect size in the upper left-hand corner of each panel in Table 2 represents a school-randomized design without covariates and thus values of zero for R2C and R2I. For example, when the intraclass correlation, ρ, equals .15, the minimum detectable effect size for a design with no covariates is 0.37.
Now consider what happens when a school-level covariate is added to the analysis. First recall that such covariates can increase R2C but cannot affect R2I. In Table 2, this is equivalent to moving from left to right in a row. When doing so, the minimum detectable effect size declines rapidly. Thus, increasing R2C produces dramatic improvements in precision, all else being equal. For example, when R2C reaches .8 (given ρ = .15 and R2I = .0), the minimum detectable effect size falls to 0.19, which is roughly half of its original value. This improvement in precision is equivalent to that which would be produced by a fourfold increase in the number of schools randomized.14
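The Table 2 example can be reproduced approximately with a few lines of code. This is a sketch under stated assumptions, not the authors' computation: it takes the minimum detectable effect size to be M × sqrt[ρ(1 − R2C)/(P(1 − P)J) + (1 − ρ)(1 − R2I)/(P(1 − P)Jn)], fixes the multiplier M at the approximate value of 2.8 rather than the exact t-based value, and assumes 40 schools with 60 students each, which the passage above does not state. The function names are illustrative.

```python
from math import sqrt


def mdes(J, n, rho, r2_c=0.0, r2_i=0.0, P=0.5, M=2.8):
    """Minimum detectable effect size for a school-randomized design.

    J: schools randomized; n: students per school; rho: unconditional
    intraclass correlation; r2_c, r2_i: shares of school-level and
    student-level variance explained by covariates; P: share of schools
    assigned to treatment; M: multiplier (2.8 is an approximation).
    """
    school_term = rho * (1 - r2_c) / (P * (1 - P) * J)
    student_term = (1 - rho) * (1 - r2_i) / (P * (1 - P) * J * n)
    return M * sqrt(school_term + student_term)


def equivalent_school_multiplier(mdes_before, mdes_after):
    """Factor by which J would have to grow to yield the same gain,
    treating the MDES as inversely proportional to sqrt(J)."""
    return (mdes_before / mdes_after) ** 2


# Assumed design: 40 schools, 60 students per school, rho = .15
base = mdes(J=40, n=60, rho=0.15)
with_pretest = mdes(J=40, n=60, rho=0.15, r2_c=0.8)
print(round(base, 2), round(with_pretest, 2))
print(round(equivalent_school_multiplier(base, with_pretest), 1))
```

With these assumptions the two values come out near 0.36 and 0.19, and the equivalent increase in schools is a little under fourfold. The small gap from the reported 0.37 is expected if the published figure uses a t-based multiplier slightly above 2.8.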
Now consider what happens when a student-level covariate is added to the analysis. Recall that such covariates can increase both R2I and R2C. In Table 2, increasing R2I is equivalent to moving down a column in a panel. This makes very little difference for the minimum detectable effect size. For example, moving down the first column in the middle panel indicates that when R2I equals .8 (given
Last, consider how the unconditional intra-class correlation (
Empirical Analysis
Table 3 describes the districts, schools, and students in the study sample. The districts had 25–168 elementary schools, 17–41 middle schools, and 11–32 high schools. The average elementary school had 57–75 third grade students who were tested in a given year, the average middle school had 196–297 eighth grade students tested, and the average high school had 234–269 tenth grade students tested. In two districts, students were predominantly Black; in two districts, they were a mix of Blacks and Hispanics or Whites; and in one district, demographic information was not readily available. In the three districts in which data on economic status were available for elementary schools, the percentage of students who were categorized as low income (on the basis of their eligibility for free lunch) ranged from 41% to 79%. Table 3 also lists the type of standardized test used by each district.
The first step in the present analysis for a given grade, subject, district, and year was to estimate the unconditional values of the school-level and student-level variance components (without covariates) and use these estimates to compute the unconditional intraclass correlation. The second step was to estimate the conditional values of the two variance components for different baseline covariate specifications. For each specification, the relationships between the conditional and unconditional values of the two variances were used to compute R2C and R2I in Equations 5 and 6. The mean values of these parameter estimates (across years for a given grade, subject, and district) are presented in a series of tables to provide an empirical guide for planning future evaluation studies. In addition, the mean estimated values of the three empirical parameters (ρ, R2C, and R2I) are used to compute minimum detectable effect sizes for alternative sample designs for each grade and subject.
In what follows, we present detailed results of analyses for third grade reading followed by a summary of results for third grade math plus reading and math for the other grades examined. Bloom, Richburg-Hayes, and Black (2005) present detailed findings for reading and math in all grades examined.
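The two analysis steps described above can be sketched as follows. The sketch assumes that the variance components have already been estimated elsewhere (for example, with a two-level model), that the intraclass correlation is the school-level share of total unconditional variance, and that R2C and R2I are one minus the ratio of conditional to unconditional variance at each level. The variance values are made-up numbers for illustration only, and `planning_parameters` is a hypothetical helper name.

```python
from statistics import mean


def planning_parameters(tau2_uncond, sigma2_uncond, tau2_cond, sigma2_cond):
    """Return (rho, R2C, R2I) from estimated variance components.

    tau2_*: school-level variance; sigma2_*: student-level variance;
    'uncond' = no covariates, 'cond' = after adding covariates.
    """
    rho = tau2_uncond / (tau2_uncond + sigma2_uncond)
    r2_c = 1 - tau2_cond / tau2_uncond
    r2_i = 1 - sigma2_cond / sigma2_uncond
    return rho, r2_c, r2_i


# Hypothetical estimates for two years in one district, with a
# school-level pretest (student-level variance is left unchanged).
yearly_estimates = [
    (15.0, 85.0, 4.5, 85.0),
    (18.0, 82.0, 3.6, 82.0),
]
params = [planning_parameters(*year) for year in yearly_estimates]

# Average each parameter across years, as done for the planning tables.
mean_rho = mean(p[0] for p in params)
mean_r2_c = mean(p[1] for p in params)
mean_r2_i = mean(p[2] for p in params)
print(round(mean_rho, 3), round(mean_r2_c, 3), round(mean_r2_i, 3))
```

Averaging the parameters first and then computing the minimum detectable effect size mirrors the order of operations described in the text. Averaging year-specific effect sizes instead would generally give slightly different numbers.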
Third Grade Reading
This section presents detailed findings for third grade reading. For reference, Table 4 presents definitions of the symbols used to represent findings for each of the covariates examined.
Precision with a single pretest
Table 5 presents estimates of minimum detectable effect sizes for a research design with no covariates or a single pretest, given the mean estimated values (across years) of ρ, R2C, and R2I for these covariate specifications in each district. Minimum detectable effect sizes are based on the assumptions of 80% statistical power and .05 statistical significance for a two-tailed hypothesis test with 60 third graders per school. Results in the top, middle, and bottom panels are for samples of 20, 40, and 60 schools, respectively, with half of the schools in each case randomized to the treatment group.
The first five columns in Table 5 present findings by district. The last column presents the mean values of the district results with each district weighted equally. Means that are not based on data for all districts are presented in parentheses. Although these findings for subsets of districts are important in their own right, they are not fully comparable with other findings for all districts.
Each row in a panel in Table 5 presents findings for a particular covariate specification. The first row presents findings for a design without covariates. The next three rows present findings for school-level pretests lagged 1, 2, and 3 years (Y–1, Y–2, and Y–3). Findings for these school-level pretests reflect the precision that might be expected for the 1st, 2nd, and 3rd follow-up years of a study, respectively. The final three rows in each panel present corresponding results for student-level pretests lagged 1, 2, and 3 years (y–1, y–2, and y–3).
Before interpreting these results, it is necessary to digress briefly and consider the question, How much precision is needed for an educational evaluation?16 In other words, how small must its minimum detectable effect size be? There is no universal standard for making such judgments. However, one widely used approach is that of Cohen (1977), who proposed that minimum detectable effect sizes of roughly 0.20, 0.50, and 0.80 be considered small, medium, and large, respectively. Lipsey (1990) provided empirical support for this characterization by examining the actual distribution of 102 mean effect size estimates reported in 186 meta-analyses that together represent 6,700 studies with 800,000 sample members. Consistent with Cohen's categorization, the bottom third of this distribution ranged from 0.00 to 0.32, the middle third ranged from 0.33 to 0.55, and the top third ranged from 0.56 to 1.20.
However, recent research suggests that, at least for education interventions (and perhaps for other types of interventions as well), much smaller effect sizes should be considered substantively important. Thus greater precision might be needed than is suggested by Cohen's (1977) categories. Foremost among the findings motivating these new expectations are those from the Tennessee Class Size Experiment. These findings indicate that changing elementary school classes from a standard size of 22–26 students to a reduced size of 13–17 students increases average student performance by an effect size of roughly 0.1–0.2 (Nye, Hedges, & Konstantopoulos, 1999). This landmark study of a major educational intervention suggests that even big changes in schools produce what, by previous standards, would have been considered small effects on student achievement.
Another important piece of related research is that by Kane (2004), who found that, on average nationwide, a full year of elementary school attendance increases students' reading and math achievement by an effect size of only 0.25. Thus, an education intervention that has a positive effect size half as large as this (0.125) would seem to qualify as a noteworthy success. Further reinforcing these findings are results published by the National Center for Education Statistics (1997) indicating that, on average nationwide, high school students increase their reading achievement by an effect size of 0.17 annually and math achievement by 0.26 annually. This gain represents the effect of attending school plus the effect of all other factors that influence student development throughout a year. Thus, again the message is clear: Program effect sizes for student achievement of as little as 0.10–0.20 might be policy relevant.
At the present time, standards for interpreting the magnitudes of educational impacts and thus determining the requisite precision of educational evaluations are in a state of flux.
However, because numerous recent evaluations have been designed to detect effect sizes of roughly 0.20, the present article uses this value as a benchmark.17 Now consider the findings in Table 5, beginning with those for a design without covariates. The mean values of the minimum detectable effect size for this most basic design are 0.57 for 20 randomized schools (ranging from 0.47 to 0.63 across districts), 0.39 for 40 randomized schools (ranging from 0.33 to 0.44 across districts), and 0.32 for 60 randomized schools (ranging from 0.27 to 0.35 across districts). Hence, the design does not appear to be capable of achieving the prevailing benchmark for precision without randomizing far more than 60 schools (about 150), which most likely would be prohibitively expensive. The next three rows in each panel of Table 5 illustrate how an aggregate pretest lagged 1, 2, or 3 years (Y–1, Y–2, and Y–3) can vastly improve precision for the 1st, 2nd, or 3rd follow-up years of an evaluation study. During the 1st follow-up year, when the time lag between the posttest and pretest is 1 year, the average minimum detectable effect size (MDES) for all districts is 0.37, 0.26, and 0.21 for 20, 40, and 60 randomized schools, respectively.18 Thus, according to these estimates, randomizing 60 schools when using such a covariate would achieve the prevailing benchmark for precision, and randomizing 40 schools would approach doing so. (Note that to obtain these samples might require operating a study in multiple districts.) During the 2nd follow-up year of an evaluation study, when the time lag between the posttest and pretest is 2 years, the mean MDES for all districts is slightly larger: 0.40, 0.28, and 0.23 for 20, 40, and 60 randomized schools, respectively. This represents the slightly lower predictive power of a pretest for a 2-year time period. 
During the 3rd follow-up year, the mean MDES is slightly larger yet, although it is not directly comparable with the others because it represents only three of the five school districts in the analysis. Overall, the mean findings suggest that by randomizing 40–60 schools, one can approach or attain the prevailing standard for precision during the first 3 years of an evaluation. But there is noticeable variation in the findings across districts, and hence their applicability to other districts is uncertain. Now consider whether student-level pretests, which are more difficult and costly to obtain, can improve precision by appreciably more than school-level pretests. The findings in Table 5 suggest that the answer to this question is no. This can be seen by comparing the minimum detectable effect size during the 1st follow-up year (the only time for which data from all districts are available) for a student-level pretest (y–1) and a school-level pretest (Y–1). For example, with 40 randomized schools, the mean MDES during the 1st follow-up year is 0.26 for a school-level and a student-level pretest. And in no district does the student-level covariate appreciably outperform the school-level covariate.
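The number of schools needed to reach the 0.20 benchmark can be approximated from the Table 5 means. The sketch below assumes that the minimum detectable effect size is inversely proportional to the square root of the number of schools randomized, holding everything else fixed. That rule ignores the small change in the t-based multiplier as the sample grows, so the results are rough planning figures. The function name `schools_needed` is illustrative.

```python
from math import ceil


def schools_needed(J_current, mdes_current, mdes_target):
    """Approximate number of schools required to reach mdes_target,
    assuming MDES scales with 1 / sqrt(J) and nothing else changes."""
    return ceil(J_current * (mdes_current / mdes_target) ** 2)


# Mean MDES values for 60 randomized schools, as reported above.
print(schools_needed(60, 0.32, 0.20))  # no covariates
print(schools_needed(60, 0.21, 0.20))  # school-level pretest, 1-year lag
```

The first figure is in line with the "about 150" schools cited for a design without covariates, and the second shows why 60 schools with a 1-year-lagged pretest come close to the benchmark.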
Precision with other covariate specifications and school samples
A more encouraging result occurs with the addition of a school-level pretest to a student-level pretest or vice versa (Y–1 and y–1). This is perhaps because the two sources of information being combined differ more from each other than is the case for two pretests of the same kind. Adding a student-level pretest to a school-level pretest reduces the mean MDES from 0.27 to 0.25. Adding a school-level pretest to a student-level pretest reduces the mean MDES from 0.28 to 0.25. Findings for all but one district are consistent with this pattern. The next two rows in the table present estimates of minimum detectable effect sizes when school-level or student-level math scores (Z–1 or z–1) are used as a pretest for a third grade reading posttest. These findings provide conservative estimates of the precision that one might expect when a pretest and posttest represent different tests in the same subject. This situation can arise when school districts change their student assessments, which they do frequently. The results in Table 6 indicate that even if a pretest is in the "wrong" subject, it can improve precision dramatically. A school-level math pretest reduces the mean MDES for a reading posttest from 0.41 without covariates to 0.29. This is equivalent to doubling the number of schools randomized. Similarly, a student-level math pretest reduces the mean MDES to 0.31. In both cases, the resulting precision is almost as good as that for a pretest and posttest in the same subject. The last three rows in the top panel of Table 6 present estimates of minimum detectable effect sizes that would result if student demographic characteristics, X, were used as covariates either alone or in conjunction with a school-level or student-level pretest. To properly interpret these findings, it is necessary to focus on results for Districts A, B, and C, because demographic data were not available for District D. 
Consider first the results when demographic characteristics are used alone as covariates. In this case, the estimated minimum detectable effect sizes for Districts A, B, and C are 0.35, 0.29, and 0.27, respectively, compared with 0.36, 0.20, and 0.23, respectively, for a school-level pretest. Hence, demographics improve precision by less than pretests. Now consider how precision changes if individual student demographic characteristics are added as covariates to a school-level or student-level pretest. The estimates in the table for Districts A, B, and C suggest that adding this baseline information can improve precision slightly. The next panel in Table 6, for the subsample of low-income schools in each district (identified by the percentage of their students who were eligible for free lunch), indicates that narrowing the potential schools to be randomized to a much more homogeneous pool does not necessarily improve precision when one is using a pretest as a covariate. This is the case for all three districts (A, B, and C) for which data were available to identify low-income schools. The final panel in Table 6, for the subsample of low-achieving schools in each district, presents similar results. Once again, the precision for this much more homogeneous subsample of schools is no better than that for the full sample. For example, the mean estimated minimum detectable effect size for the full sample of schools and this subsample both equal 0.27 when a school-level pretest is used. The findings in the last two panels of Table 6 have important implications. First, they suggest that narrowing the pool of schools to be randomized on the basis of their economic status or past performance may not provide more precision than simply using these factors as covariates. (The reasons for this result are explored later.) 
Second, these findings suggest that the basic pattern of minimum detectable effect sizes for different covariate specifications that were observed for the full sample of schools holds for the subsamples as well.
Variation in precision across years
Table 7 presents the range across years of minimum detectable effect sizes implied by the estimated parameters for each district in the present analysis during years with available data. Because data for the two most basic covariate specifications were available for more than 1 year in every district, findings for these specifications are presented. As can be seen, sometimes there is considerable variability from year to year in the likely minimum detectable effect size for a given district, and sometimes there is little variation. This is the case for the full sample of schools from each district, its subsample of low-income schools, and its subsample of low-performing schools. Unfortunately, there is no known way to predict where and when precision might be variable or stable. Therefore, when planning a study, one probably should be conservative.
Parameter estimates
The next panel in Table 8 presents values of R2C and R2I when school-level pretests are the only covariates used. First note that R2I is zero for all of these covariate specifications. This is because the value of a school-level covariate is the same for all students from a given school and thus cannot covary with their test scores. Next, note that R2C varies in predictable ways across the different types and combinations of pretests. It declines as the gap in time increases between posttests and pretests, and it is larger for combinations of pretests than for single pretests. Last, note that for any given covariate specification, R2C varies substantially across districts. For example, it ranged from a low of .31 in District A to a high of .77 in District B for a single school-level pretest lagged 1 year (Y–1). The middle panel in Table 8 presents values of R2C and R2I when student-level pretests are the only covariates used. Because these pretests can vary across and within schools, their values for both R2C and R2I are nonzero. These values also vary in predictable ways, declining as the time lag between pretests and posttests increases and being higher for combinations of pretests than for single pretests.
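The reason R2I is zero for a school-level covariate can be checked directly. The following is a minimal Python sketch using invented data (three hypothetical schools with five students each; none of the values come from the present analysis). It computes the within-school cross-product between a covariate and the posttest, which is the quantity that would have to be nonzero for a covariate to explain any within-school variation. The function name and data layout are assumptions made for illustration only.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical data: one school-level pretest value per school (not from the article).
school_pretest = {"s1": 48, "s2": 52, "s3": 55}

# Each row: (school, school-level pretest, student-level pretest, posttest)
students = []
for school, school_value in school_pretest.items():
    for _ in range(5):
        pre = school_value + random.gauss(0, 8)   # student-level pretest
        post = pre + random.gauss(2, 5)           # posttest related to the pretest
        students.append((school, school_value, pre, post))

def within_school_cross_product(rows, x_index, y_index=3):
    """Sum over students of (x - school mean of x) * (y - school mean of y)."""
    total = 0.0
    for school in {r[0] for r in rows}:
        group = [r for r in rows if r[0] == school]
        x_bar = mean(r[x_index] for r in group)
        y_bar = mean(r[y_index] for r in group)
        total += sum((r[x_index] - x_bar) * (r[y_index] - y_bar) for r in group)
    return total

# The school-level pretest is constant within each school, so every deviation is zero.
school_level = within_school_cross_product(students, x_index=1)
# The student-level pretest varies within schools, so it can covary with the posttest.
student_level = within_school_cross_product(students, x_index=2)

print(school_level, student_level)
```

Because every student in a school shares the same school-level value, each within-school deviation of that covariate is exactly zero, and so is the cross-product. Only the student-level pretest has any within-school variation with which to explain individual outcomes.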
It is particularly useful to compare the effects of student-level and school-level pretests on R2C and R2I, because doing so illustrates why student-level pretests do not provide superior precision. The simplest and clearest way to make this comparison is to focus on pretests lagged 1 year (y–1 and Y–1), for which data from all districts are available. In terms of R2C, the school-level pretest has a slight advantage in all but one district, in which it has a considerable advantage. In terms of R2I, the student-level pretest has an advantage that ranges from small to substantial. Recall, however, that R2C represents the reduction in ...
The bottom panel in Table 8 presents values of R2C and R2I for the other major covariate specifications included in the present analysis. These findings are presented for researchers who are considering the use of such specifications.
Parameter ranges across school samples and time
Differences between parameter estimates for the full samples of schools and those for subsamples explain why precision is no better for the subsamples than for the full samples, even though the subsamples are considerably more homogeneous. Because of the greater homogeneity, the intraclass correlation, ρ, is typically much smaller for the subsamples. However, it is also the case that R2C is typically much lower for the subsamples of schools than for the full samples. This is because the subsamples have less variation in student achievement outcomes (because of the greater student homogeneity), which leaves less room for pretest covariation. Another way to explain this phenomenon is that the restricted variation in outcomes for schools in the subsamples contains less "signal to noise" than is the case for the full samples. Because moving to a more homogeneous subsample of schools reduces both ρ and R2C, the overall effect on precision is negligible. Also note that moving to a subsample of schools has little or no effect on R2I. This is because restricting the range of variation in outcomes across schools by choosing a subsample of them does not necessarily affect the variation in individual outcomes within schools and thus does not necessarily affect the margin for covariation with individual pretests.23
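The offsetting effect described above can be illustrated numerically. The following is a minimal Python sketch, not the computation used in the present analysis. It assumes the expression commonly used for two-level designs that randomize schools (e.g., Bloom, 2005), in which the minimum detectable effect size depends on the intraclass correlation, R2C, R2I, the number of schools, and the number of students per school. As a further simplification, it replaces the t-based multiplier with a normal approximation (about 2.8 for a two-tailed test at the .05 level with 80% power). All parameter values are hypothetical and were chosen only so that the intraclass correlation and R2C fall together in the subsample.

```python
from math import sqrt
from statistics import NormalDist

def mdes(rho, r2_c, r2_i, n_schools, n_students,
         p_treat=0.5, alpha=0.05, power=0.80):
    """Approximate minimum detectable effect size for a school-randomized design.

    rho        : unconditional intraclass correlation
    r2_c, r2_i : proportions of school-level and student-level variance
                 explained by covariates
    Uses a normal approximation to the multiplier (an assumption of this sketch).
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    denom = p_treat * (1 - p_treat) * n_schools
    school_term = rho * (1 - r2_c) / denom
    student_term = (1 - rho) * (1 - r2_i) / (denom * n_students)
    return multiplier * sqrt(school_term + student_term)

# Hypothetical full sample: larger intraclass correlation, strong school-level pretest.
full = mdes(rho=0.20, r2_c=0.60, r2_i=0.0, n_schools=40, n_students=60)
# Hypothetical homogeneous subsample: both rho and R2C are smaller.
subsample = mdes(rho=0.10, r2_c=0.20, r2_i=0.0, n_schools=40, n_students=60)

print(round(full, 2), round(subsample, 2))
```

With these invented values, the unexplained school-level variance is the same in both cases (0.20 × 0.40 = 0.10 × 0.80 = 0.08), so the two results are nearly identical, which is the pattern the text describes.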
Last, note that Table 9 illustrates appreciable variation over time in estimates of
Summary for Elementary Schools
Nevertheless, the findings in Table 10 indicate an extraordinary degree of consistency across grades and subjects. Consider the results for a design with no covariates. The mean estimated minimum detectable effect size ranges from 0.38 to 0.40 when 40 schools are randomized. This implies that the mean estimated unconditional intraclass correlation, ρ, is almost identical for the grades and subjects being compared. Results for school-level pretests are equally consistent. When 40 schools are randomized, the MDES for Y–1 is 0.26 in all cases, and its counterpart for Y–2 ranges from 0.28 to 0.29. This implies that the average estimated values of R2C are highly consistent across grades and subjects. Corresponding results for student-level pretests are only slightly less consistent, ranging from 0.26 to 0.30 for y–1 when 40 schools are randomized.
Overall then, the findings indicate that in the absence of specific local data, the best guess is that randomizing 20 elementary schools with a single school-level or student-level pretest will produce a minimum detectable effect size of about 0.38 or 0.39, randomizing 40 schools will produce a minimum detectable effect size of about 0.26 or 0.27, and randomizing 60 schools will produce a minimum detectable effect size of about 0.21 or 0.22.
Table 11 presents the mean estimated minimum detectable effect sizes for the remaining covariate specifications and for subsamples of low-income schools or low-performing schools given 40 randomized schools. Only findings for Districts A to D are available for third grade, and only findings for Districts A to C are available for fifth grade. Furthermore, data were not available from all of these districts for some covariate specifications or subsamples of schools. These findings are reported in parentheses.
Even with their smaller samples of districts, the results in Table 11 exhibit a high level of consistency across grades and subjects. In addition, patterns of findings across covariate specifications and subsamples that were reported earlier for third grade reading hold with striking regularity for the other grades and subjects.
This section examines the likely precision of studies that randomize middle schools or high schools to measure the effects of educational interventions on student achievement. To do so, it presents summary estimates of minimum detectable effect sizes for 8th grade and 10th grade reading and math. These estimates are computed in the same way as those for elementary schools, except that they assume 250 students in a grade per school (instead of 60 for elementary schools), and they could be estimated only for Districts A and C given available data. Table 12 presents estimated minimum detectable effect sizes for designs with no covariates or a single pretest given 20, 40, or 60 randomized schools. Comparing these findings with those for elementary schools in Table 10 suggests that pretests reduce minimum detectable effect sizes by proportionately much more for middle schools and high schools than for elementary schools. Indeed, precision with a pretest improves consistently and substantially by more as one moves from elementary schools to middle schools to high schools. This progression implies a corresponding increase in the values of R2C. To see this, compare the minimum detectable effects for Y–1 in Table 12 for middle schools and high schools with their counterparts in Table 10 for elementary schools. They are largest for elementary schools, appreciably smaller for middle schools, and appreciably smaller yet for high schools. These differences are not due to the fact that elementary school findings are averaged across all five districts whereas those for middle schools and high schools are averaged only across Districts A and C. Detailed findings (not presented) indicate that even within these two districts, for which data for all educational levels are available, there is a pronounced reduction in minimum detectable effect sizes as one moves from elementary to middle to high schools.24
Perhaps the most important feature of these findings is what they imply for the number of middle schools or high schools that must be randomized to attain the prevailing standard of 0.20 for minimum detectable effect sizes. Recall that the findings in Table 10 indicate that roughly 60 elementary schools are needed to achieve this standard. But the findings in Table 12 indicate that only about 40 middle schools or 20 high schools would be needed to do so. Because there are no existing distinctions between standards of precision for secondary schools and elementary schools, the present findings suggest that experimental samples can be much smaller for secondary schools.
On the other hand, there is a small but growing body of evidence suggesting that greater precision may be needed for secondary schools than for elementary schools. The reason for this is that developmental trajectories for reading and basic math are much flatter in later grades than in early grades. Hence, annual gains for reading or math in effect size are much larger for elementary school students than for high school students (Bloom, Lipsey, Hill, & Black, 2006; Kane, 2004). Therefore, the ability of interventions to create impacts on these outcomes (their added value) might be more limited for later grades. To address this issue is well beyond the scope of the present article, however.25
Another factor to consider when assessing these findings that is beyond the scope of the present analysis is the extent to which they do or do not apply to other outcomes that are important for secondary schools, such as measures of credits accumulated, rates of on-time transitions from one grade to the next (especially 9th grade to 10th grade), or achievement in more advanced subjects that are only taught at the secondary level.
Table 13 presents estimates of minimum detectable effect sizes given 40 randomized schools for a broad range of covariate specifications and alternative subsamples of schools.
The basic pattern of findings across alternative designs roughly mirrors that for elementary schools in Table 11. But the magnitudes of the minimum detectable effect sizes are considerably smaller for middle schools than for elementary schools and considerably smaller yet for high schools than for middle schools.
Table 14 provides a bird's-eye view of the key parameter estimates for elementary schools, middle schools, and high schools to identify what produced their similarities and differences in minimum detectable effect sizes. The top panel in the table reports the mean estimated values of the unconditional intraclass correlation (ρ) by grade, subject, and district. The bottom panel reports corresponding estimates of R2C for a school-level covariate lagged 1 year (Y–1). (Values of R2I are zero for this covariate.) A comparison of these results for elementary, middle, and high schools indicates why their precision is similar without a covariate but vastly different when a pretest is used. For this purpose, it is most useful to compare findings for the same school district. This restricts such comparisons to Districts A and C.
With respect to estimates of ρ, there is no clear pattern across educational levels. In District A, the estimates are lower for secondary schools than for elementary schools, whereas in District C, the reverse is true. On average, across districts these values are fairly similar for elementary schools and secondary schools. However, there are large and consistent differences in the values of R2C across educational levels. In District A, these values range from .31 to .54 for elementary schools, .77 to .78 for middle schools, and .93 to .97 for high schools. In District C, they range from .61 to .81 for elementary schools, .83 to .91 for middle schools, and .91 to .95 for high schools. It is these large differences in R2C that produce the large differences in minimum detectable effect sizes reported earlier, which in turn produce the large differences in the numbers of randomized schools needed to achieve the prevailing standard of 0.20 for minimum detectable effect sizes.
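The link between R2C and the number of schools needed can be sketched by inverting the usual minimum detectable effect size expression for designs that randomize schools. The following Python sketch is illustrative only. It assumes that expression (e.g., Bloom, 2005), half of the schools assigned to treatment, and a normal approximation to the multiplier, which understates the number of schools somewhat when few schools are randomized. The intraclass correlation of 0.15 is hypothetical, and the R2C values are round numbers chosen to be broadly in line with the ranges quoted above; the per-school student counts of 60 and 250 follow the assumptions stated earlier for elementary and secondary schools. The results are therefore not the article's estimates.

```python
from math import ceil
from statistics import NormalDist

def schools_needed(target_mdes, rho, r2_c, r2_i, n_students,
                   p_treat=0.5, alpha=0.05, power=0.80):
    """Approximate number of randomized schools needed to reach target_mdes.

    Solves the two-level MDES expression for the number of schools, using a
    normal approximation to the multiplier (an assumption of this sketch).
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Unexplained variance per school: school-level part plus the
    # student-level part averaged over n_students.
    per_school_variance = rho * (1 - r2_c) + (1 - rho) * (1 - r2_i) / n_students
    return ceil(multiplier ** 2 * per_school_variance
                / (p_treat * (1 - p_treat) * target_mdes ** 2))

# Hypothetical inputs: same rho at every level, R2C rising with educational level.
elementary = schools_needed(0.20, rho=0.15, r2_c=0.50, r2_i=0.0, n_students=60)
middle = schools_needed(0.20, rho=0.15, r2_c=0.80, r2_i=0.0, n_students=250)
high = schools_needed(0.20, rho=0.15, r2_c=0.95, r2_i=0.0, n_students=250)

print(elementary, middle, high)
```

Holding the intraclass correlation fixed, the required number of schools falls sharply as R2C rises, which mirrors the ordering of elementary, middle, and high schools described in the text.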
How the Present Findings Relate to Past Research
There is a large and growing body of empirical research on the magnitudes of intraclass correlations with respect to public health outcomes and the incidence of risk behaviors (smoking, drinking, drug abuse, sexual activity, etc.) in communities, firms, hospitals, group medical practices, schools, and so on (e.g., Murray & Blitstein, 2003; Murray & Short, 1995; Siddiqui, Hedeker, Flay, & Hu, 1996; Ukoumunne, Gulliford, Chinn, Sterne, & Burney, 1999). The intraclass correlations for these clusters and outcomes are typically much smaller than those for measures of student achievement within schools. They tend to range from about .01–.05 and only occasionally reach .10.
For evaluating educational interventions there are only a few studies that focus empirically on the parameters that determine precision: those of Hedges and Hedberg (2007), Schochet (2005), Gargani and Cook (2005), and Bloom et al. (1999).
Hedges and Hedberg (2007) report on an ongoing project to construct a "variance almanac" using data from four large national databases: the National Education Longitudinal Study, the Early Childhood Longitudinal Study, the Prospects Study, and the Longitudinal Study of American Youth. Their findings comprise estimates of intraclass correlations for standardized test scores of students within schools for the United States as a whole; for the Midwest, Northeast, South, and West; and for schools within these regions that are located in urban, suburban, and rural areas. In addition, they report intraclass correlations for subgroups of schools based on their levels of poverty and performance. Estimates of intraclass correlations without covariates and with covariates that control for student gender and race and ethnicity plus pretests are presented. These estimates are reported for reading and math tests in grades K-12. Hedges and Hedberg (2007) present a large number of estimated intraclass correlations that vary widely.
One important feature of these findings is that intraclass correlations for urban schools are consistently higher than those for suburban or rural schools. Within the study's samples of urban schools, the overwhelming majority of estimated unconditional intraclass correlations range from about .15 to .30. However, because these intraclass correlations include school differences across districts, they are not directly comparable with those reported in the present article for schools within districts. Furthermore, estimates of the explanatory power of the demographic covariates are not reported.
Schochet (2005) presents a summary of findings from past empirical studies of intraclass correlations for achievement outcomes for students within schools plus results based on new tabulations of data from three evaluation studies: (a) the Longitudinal Evaluation of School Change and Performance, representing 71 Title I (low-income) elementary schools from 18 districts in seven states (for reading and math achievement in Grades 3, 4, and 5); (b) an evaluation of Teach for America, representing 17 elementary schools in six cities (for reading and math achievement in Grades 2, 3, and 4); and (c) an evaluation of the 21st Century Community Learning Centers Program, representing 30 elementary schools in 12 districts (for reading and math achievement in Grades 1, 3, and 5). Estimates from the first database indicate that adjustments for district effects reduce intraclass correlations substantially. This suggests that using Hedges and Hedberg's (2007) findings to predict intraclass correlations for schools within districts might overstate their magnitudes. Adjustments for district effects are not reported for the other two databases.
On the basis of the findings surveyed and presented, Schochet (2005) concluded that "the examined data sources suggest that values for ..."
Bloom et al. (1999) present findings from a study of reading and math test scores in Grades 3 and 6 for 25 elementary schools from Rochester, New York, during 2 years. Seven of the eight estimated intraclass correlations from their analysis range between .14 and .21; one equals .08. The authors tested the ability of numerous covariate specifications to increase precision. These findings are included as part of the present article.
Gargani and Cook (2005) analyze reading scores for a single grade (not specified) in 1 year for 88 elementary schools from Louisville, Kentucky. They estimated the unconditional intraclass correlation to be .11. When they controlled statistically for a single school-level pretest, they obtained an R2C value equal to .85. On the basis of these results, the authors concluded that randomizing only 22 elementary schools could produce a minimum detectable effect size of 0.20.
The overall results for elementary schools from the present article are generally consistent (to the extent that they can be compared) with those of Hedges and Hedberg (2007), Schochet (2005), and Bloom et al. (1999). They suggest that on average, using data for a school-level or a student-level pretest and randomizing about 60 elementary schools can produce a minimum detectable effect size of 0.20. This differs substantially from the conclusion of Gargani and Cook (2005).
The findings of the present study for middle schools and high schools have (to our knowledge) no direct counterparts in the existing literature. These findings, as noted earlier, suggest that on average, randomizing about 40 middle schools or 20 high schools and using data for a pretest can produce a minimum detectable effect size of 0.20.
All of the findings in the present study and the existing literature reflect estimates that vary across school districts and years.
To a certain extent, this variation reflects ran


$\varepsilon_{ij}$ is a random error for student i from school j, which is assumed to be independently and identically distributed across students within schools.
The intercept, $\alpha$, in the model equals the mean value of the outcome measure for the control group. The regression coefficient, $\beta_0$, equals the difference between the mean outcome for the treatment group and control group. Hence, it is the impact of the intervention on the outcome. In these two regards, Equation 1 is the same as statistical models for experiments that randomize individuals. What makes it different is the presence of two random errors instead of one.
Given the variance of the school error term, $\tau^2$, and the variance of the student error term, $\sigma^2$, the intraclass correlation, $\rho$, equals
$$\rho = \frac{\tau^2}{\tau^2 + \sigma^2}$$



