Strategies for Improving Precision in Group-Randomized Experiments
University of Chicago
University of Michigan
Interest has rapidly increased in studies that randomly assign classrooms or schools to interventions. When well implemented, such studies eliminate selection bias, providing strong evidence about the impact of the interventions. However, unless expected impacts are large, the number of units to be randomized needs to be quite large to achieve adequate statistical power, making these studies potentially quite expensive. This article considers when and to what extent matching or covariance adjustment can reduce the number of groups needed to achieve adequate power and when these approaches actually reduce power. The presentation is nontechnical.
Key Words: group-randomized experiments multilevel research design
Interest has rapidly increased in studies that randomly assign social units to alternative treatment conditions. Such units include whole schools (Borman et al., 2005; Cook, Hunt, & Murphy, 2000; Flay, 2000; Mosteller, Light, & Sachs, 1996; Porter, Blank, Smithson, & Osthoff, 2005), housing projects (Bloom & Riccio, 2005; Sikkema, 2005), neighborhoods or whole communities (Hannan, Murray, Jacobs, & McGovern, 1994; Sherman & Weisburd, 1995; Teruel & Davis, 2000; Weisburd, 2000, 2005), and physician practices (Donner & Klar, 2000; Grimshaw, Eccles, Campbell, & Elbourne, 2005; Leviton & Horbar, 2005; Murray, 1998).
During the 1980s, the U.S. Department of Labor began supporting randomized trials as a means for understanding which employment and training programs effectively increased earnings. Similarly, during the 1990s, the U.S. Department of Health and Human Services started supporting randomized trials to determine which drug prevention programs and other risk reduction programs were effective (Bloom, 2005; Mosteller & Boruch, 2002). In 2002, with the creation of the new Institute for Education Sciences, the U.S. Department of Education established the conduct of randomized trials as a priority (Education Sciences Reform Act, 2002). In fact, most of these studies treat whole classrooms or schools rather than students as the units of randomization.
The decision to assign groups rather than individuals to treatments is essential when interventions are designed to treat entire collectives of persons. This is true, for example, in whole-school reform efforts designed to engage all of the teachers in a school in a joint effort to improve instruction (Borman et al., 2005) or when a violence prevention program is intended to operate by changing the normative climate in an entire school (Flay, 2000). 
Group-based trials are also often preferred when there is a danger of "spillover effects" or when logistical, ethical, or political considerations discourage person-based randomization (Bloom, 2005). Cook (2005) suggested that by targeting the group level, an intervention may improve group-level processes, resulting in greater long-term impacts on individuals who come into contact with the group.
In most group-based studies, the statistical power to detect treatment effects depends more strongly on the number of groups available than on the number of persons per group (Bloom, Bos, & Lee, 1999; Donner & Klar, 2000; Murray, 1998; Raudenbush, 1997). At the same time, the number of groups to be sampled will also typically drive costs. For example, in a school-based intervention, it is far more expensive to recruit a single school and to sustain that school's engagement than it is to sample and assess an additional student once a school has agreed to participate. Thus, the number of groups drives cost and power more than the total sample size of students.
An evaluation of statistical power for a typical group-randomized study can create the impression that the study will be exceedingly expensive. Consider, for example, a hypothetical study in which J groups are to be assigned either to an experimental or to a control condition (J/2 groups in each), with n = 100 persons per group assessed after the treatment on a continuous outcome variable. Assume that the true mean difference between the two groups is equivalent to 0.25 on the scale of the outcome standard deviation. Such effect sizes (ESs) are widely regarded as nonnegligible in education research focused on academic achievement (Bloom, 2005). Finally, assume that 85% of the variation lies between persons within groups, with the remaining 15% lying between groups within treatment conditions. 
This variance ratio is common in school-based research, particularly for achievement outcomes in elementary schools (Bloom, Richburg-Hayes, & Black, 2005). The data will be analyzed by means of a simple t test using group means as the outcomes. Under these conditions, 82 schools (41 per condition) must be sampled to achieve statistical power of 0.80 to reject the null hypothesis of no program effect at the 5% level of significance (Raudenbush, 1997). In many research settings, the cost of studying 82 groups will be daunting. For example, in a study of whole-school reform, the cost of recruiting 82 schools, implementing the intervention in half of them, sustaining the involvement of those schools in data collection, and dispatching data collectors can be expected to be many millions of dollars. Given current research funding, such per study costs would severely limit the number of such studies and therefore limit the prospects of a plan to rely on school-based randomized studies for information about school improvement. In the hypothetical example above, no attempt was made to identify or use prior information that might have predicted participant outcomes. Yet it is well known that such information can often be ascertained at comparatively low cost and that when this information is used effectively, the sample-size requirements for experiments can be reduced, sometimes substantially. Given the high cost of sampling groups, it becomes essential to find efficient experimental designs and analytic approaches that capitalize on prior information to increase statistical power. Early in the past century, the pioneers of experimental design in agriculture learned that pretreatment information can often be exploited to improve statistical power (Cochran, 1957; Fisher, 1926, 1936, 1949). Two key approaches are available:
In sum, both approaches use prior information to reduce uncertainty about outcomes. Prerandomization blocking builds this information into the design such that randomization occurs within blocks. In contrast, analysis of covariance (ANCOVA) exploits the linear association between a covariate and the outcome in the analysis phase of the study. The key aim of the current article is to clarify the conditions under which blocking and covariance adjustment can reduce the number of groups required to achieve adequate power in group-randomized studies. This effort may be regarded as the extension of classic Fisherian principles of experimental design to settings in which social units ("groups"), rather than individuals, are the units of randomization and treatment. Throughout the discussion, we draw on the key principles at play when persons are the unit of randomization and show how they extend nicely to the case of group-randomized studies, with certain key modifications that we shall highlight.
This is not the first article to consider such issues. Several authors have provided useful accounts of the trade-offs that arise in selecting alternative designs and analytic methods in group-randomized trials (for reviews, see Bloom, 2005; Donner & Klar, 2000; Hughes, 2005; Martin, Diehr, Perrin, & Koepsell, 1993; Murray, 1998; Raudenbush, 1997). Our aim is to provide a precise specification of how the power of two alternative approaches, blocking and ANCOVA, compares with the power of a standard design that does not exploit pretreatment information. We make comparisons for specific sample sizes and parameter values. 
We hope to enable readers to use these results and the freely available software we describe to improve the planning of group-randomized studies (Raudenbush, Liu, Spybrook, Martinez, & Congdon, 2006).
For simplicity, we focus on the case in which the aim is to compare two treatments: a novel "experimental" approach and a more standard "control" approach. In this case, blocks each consisting of two units may be formed. These blocks are called matched pairs. We consider the question of when and to what extent such matching will boost the statistical power for a given sample size or reduce the sample size needed to achieve a given level of power. We also consider the utility of ANCOVA in such a two-treatment setting. In this setting, the aim is to provide sharp answers to specific questions about trade-offs in selecting approaches to design and analysis. We ask the following questions:
We restrict our attention to "two-level designs" (e.g., persons within groups, with groups as the unit of randomization) and draw on the literature from "single-level designs" (e.g., persons as units of randomization). We reserve a brief discussion of more complex designs (e.g., students within classrooms within schools that are assigned to treatments, repeated-measures designs) for a concluding section, noting that the software described can handle a number of more complex designs. This article is organized as follows. In the next section, we consider matching in the context of group-randomized designs, addressing Question 1 above. In the following section, we turn to covariance adjustment (Question 2). Next, we compare matching and covariance adjustment (Question 3). A concluding section summarizes the results, provides guidance for planning new studies, and briefly explores more complex designs.
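Before turning to matching, the baseline power calculation for the hypothetical study in the introduction (J groups, n = 100 persons per group, ES = 0.25, 15% of the variation between groups) can be sketched as follows. This is a minimal illustration, not the authors' software: it standardizes the total outcome variance to 1, takes the variance of the difference in arm means to be 4(ICC + (1 - ICC)/n)/J, and approximates the noncentral t distribution by a normal distribution shifted by the noncentrality parameter. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def t_cdf(t, df, steps=2000):
    """Student t CDF for t >= 0, using Simpson's rule on the density over [0, t]."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps  # steps is even, as Simpson's rule requires
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3


def t_critical(df, alpha=0.05):
    """Two-sided critical value of the t test, found by bisection on the CDF."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def cr_power(J, n, es, icc, alpha=0.05):
    """Approximate power of a t test on group means in a completely randomized design.

    Assumption: total outcome variance is 1, so each group mean has variance
    icc + (1 - icc) / n, and J / 2 groups are assigned to each condition.
    """
    var_group_mean = icc + (1 - icc) / n
    se = math.sqrt(4 * var_group_mean / J)   # SE of the difference in arm means
    noncentrality = es / se
    # Shifted-normal approximation to the noncentral t (an assumption of this sketch).
    return NormalDist().cdf(noncentrality - t_critical(J - 2, alpha))


for J in (20, 40, 60, 82):
    print(J, round(cr_power(J, n=100, es=0.25, icc=0.15), 3))
```

Under this approximation the power for J = 82 comes out close to the 0.80 cited from Raudenbush (1997), and it falls off quickly as J shrinks, which is the cost problem the rest of the article addresses.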
The potential benefits associated with matching in cases in which individual participants are randomized to treatment groups have been a topic of methodological research for many decades (Cochran, 1957; Federer, 1955; Fisher, 1926, 1936, 1949; Kempthorne, 1952). Excellent summaries can be found in classic texts on experimental design (see, e.g., Kirk, 1982). The basic principle of blocking (of which matching is a special case) was elucidated by Fisher (1926) in the context of agricultural research. He criticized a simple experiment in which plots of land were assigned at random to receive one of five varieties of seed to assess the causal effect of such varieties on plant growth:
"On most land, however, we shall obtain a smaller standard error, and consequently a more valuable experiment, if we proceed otherwise. The land is divided first into seven blocks, which, for the present purpose, should be as compact as possible; each of these blocks is divided into five plots, and these are assigned in this case to five varieties, independently, and wholly at random. If this is done, those components of soil heterogeneity which produce differences in fertility between plots of the same block will be completely randomized, while those components which produce differences in fertility between blocks will be completely eliminated." (p. 509)
Blocking is not always helpful, however. If little variation lies between blocks, blocking can actually reduce power because it entails a loss of degrees of freedom. Even when blocking does little to boost power, it can be useful because of a potential increase in the "face validity" of an experiment. We discuss these issues below in the context of person-randomized trials before turning to the main focus of this section: matching in group-randomized trials.
Loss of Degrees of Freedom
In contrast, suppose that prior to randomization, the individuals are rank ordered on the basis of some information X collected beforehand. Next, the experimenters match the individuals such that the first two in the ranking constitute Pair 1, the next two constitute Pair 2, and so on. Within each pair, individuals are then assigned at random to treatment and control conditions, the treatment conditions are implemented, and an estimate of the treatment effect is obtained from each pair. In this case, prior information on the individuals is explicitly embedded into the design. The general name for such a design is a randomized block design, and in the special case in which each block contains two units, it is called a matched-pairs (MP) design.
The number of quantities computed here is (M/2) + 1 (a treatment effect for each pair and an average treatment effect), and the available degrees of freedom are then M – [(M/2) + 1] = (M/2) – 1, half those in the CR design. When the overall sample size M is small, losing degrees of freedom can substantially increase the critical value of the t test. So if matching is ineffective, this loss of degrees of freedom will reduce power. This penalty becomes negligible as M increases.
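The degrees-of-freedom accounting above can be written out as a short worked example. The CR figure of M – 2 is inferred from the statement that the MP design has half as many degrees of freedom; the critical values are standard two-sided 5% t-table entries; and the sample sizes M = 10 and M = 40 are illustrative assumptions.

```latex
\begin{aligned}
df_{\mathrm{CR}} &= M - 2, \qquad df_{\mathrm{MP}} = \tfrac{M}{2} - 1 \\[4pt]
M = 10:&\quad df_{\mathrm{CR}} = 8,\ t_{.975} \approx 2.306; \qquad df_{\mathrm{MP}} = 4,\ t_{.975} \approx 2.776 \\
M = 40:&\quad df_{\mathrm{CR}} = 38,\ t_{.975} \approx 2.024; \qquad df_{\mathrm{MP}} = 19,\ t_{.975} \approx 2.093
\end{aligned}
```

With M = 10 the critical value rises by roughly a fifth under matching, whereas with M = 40 the increase is only a few percent, which illustrates why the penalty fades as M grows.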
Improving Face Validity
Matching in Group-Randomized Studies
Example
Under the CR design, intact schools, rather than isolated students, are assigned at random to treatment and control conditions. The literacy programs are administered, and students are then assessed on an outcome variable Y, the score on a standardized reading test. The comparison of outcomes needs to take into account the nesting of the students within the schools.
Similarly, under the MP group-randomized design, researchers first locate pairs of schools, rather than of individuals, that are likely to be similar on their mean reading outcomes. Researchers then rank order the schools on the basis of expected achievement and create the pairs: the first two schools constitute Pair 1, the next two schools constitute Pair 2, and so on. Within each pair, one school is assigned at random to the new literacy program condition while the other is kept as a control. The programs are then implemented, and students are assessed on Y. Again, the comparison of outcomes must take into account the nesting of the individuals within the groups and of the groups within the pairs.
The question that naturally arises is whether the MP design has substantially more power than the CR design. Intuitively, the answer will be yes if X (the matching variable) strongly predicts Y (reading outcomes). If so, schools within any given pair, which are by design very similar on X, will also tend to be very similar on Y in the absence of the treatment effect. In the extreme case in which matches on X within pairs are perfect and in which X perfectly predicts Y in the absence of treatment, the difference between pair members on Y will perfectly reflect the impact of the new literacy program within that pair, and the average of these differences will give an excellent estimate of the population average effect of the new program.
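The matched-pairs assignment procedure described in this example (rank the schools on expected achievement, pair adjacent schools, then randomize within each pair) can be sketched in a few lines. The school labels, scores, and function name below are hypothetical and serve only to illustrate the procedure.

```python
import random


def matched_pair_assignment(expected_scores, seed=0):
    """Rank schools on the matching variable X, pair adjacent ranks,
    and randomize treatment within each pair.

    expected_scores: dict mapping a school id to its value of X.
    Returns a list of (treated_school, control_school) tuples, one per pair.
    """
    rng = random.Random(seed)
    ranked = sorted(expected_scores, key=expected_scores.get, reverse=True)
    if len(ranked) % 2:
        raise ValueError("an even number of schools is needed to form pairs")
    pairs = []
    for i in range(0, len(ranked), 2):
        a, b = ranked[i], ranked[i + 1]
        # A fair coin flip decides which member of the pair receives the program.
        pairs.append((a, b) if rng.random() < 0.5 else (b, a))
    return pairs


# Hypothetical prior mean reading scores for six schools.
schools = {"A": 52.1, "B": 47.3, "C": 51.8, "D": 39.9, "E": 47.0, "F": 41.2}
pairs = matched_pair_assignment(schools, seed=1)
for treated, control in pairs:
    print(f"treated: {treated}  control: {control}")
```

Whatever the seed, the pairs themselves are fixed by the ranking (here A with C, B with E, and F with D); only the within-pair assignment is random.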
Generalization
Factor 6, the within-pair correlation, is also equivalent to the proportion of variance in the latent mean outcomes that lies between pairs. The fact that a correlation is equivalent to a variance ratio may seem odd, but that is easily shown to be the case. Factor 7 is referred to as the ES variability (ESV). The ESV quantifies how much the treatment effect varies across pairs. For example, the school differences in treatment implementation may result in school differences in the intervention effect. This variation in the treatment effect is captured in the design by the ESV. Within the past 15 years, a number of authors have compared the power of the CR design with that of the MP design in the context of group-randomized studies (Freedman, Green, & Byar, 1990; Gail, Mark, Carroll, Green, & Pee, 1996; Martin et al., 1993). Freedman et al. (1990) examined how the MP design increased the precision of the estimate of the treatment effect relative to the CR design, restricting their attention to a comparison of the standard errors of the treatment effect (as noted by Martin et al., 1993). This approach, although useful, does not incorporate the effect on power of the loss of degrees of freedom using the MP design. Martin et al. (1993) then considered the power of CR and MP designs for small group-randomized studies, in which the loss of degrees of freedom would likely matter. They concluded that if there were fewer than 20 groups, matching should be used only if the correlation between the matching variable and the outcome is greater than about .45. Otherwise, the loss of degrees of freedom in the MP design would result in a less powerful design than the CR design (see also Hughes, 2005). Bloom (2005) provided calculations showing the required predictive power of matching to increase the power of the test relative to a CR design for varying numbers of groups in the context of social studies.
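The equivalence between the within-pair correlation and a variance ratio, noted in the discussion of Factor 6, can be shown in a few lines. The notation here is introduced for illustration and is an assumption rather than the article's own: the latent mean of school j in pair k is written as a grand mean plus a pair effect and a school-within-pair deviation, with the two components independent.

```latex
\begin{aligned}
\mu_{jk} &= \gamma + \pi_k + e_{jk}, \qquad
\operatorname{Var}(\pi_k) = \tau_\pi, \quad \operatorname{Var}(e_{jk}) = \tau_e \\[4pt]
\operatorname{Cov}(\mu_{1k}, \mu_{2k}) &= \operatorname{Var}(\pi_k) = \tau_\pi \\[4pt]
\operatorname{Corr}(\mu_{1k}, \mu_{2k}) &= \frac{\tau_\pi}{\tau_\pi + \tau_e}
\end{aligned}
```

The two schools in a pair share only the pair effect, so their covariance is the between-pair variance, and dividing by the total variance of a latent mean gives the proportion of variance lying between pairs.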
Making the comparison precise
Power in the MP design will also depend on how much the ES varies from pair to pair. This ESV is set to either 0.01 or 0.05. To clarify, suppose ES = 0.15 and ESV = 0.01. If the pair-specific ESs vary randomly across pairs according to the normal distribution, one would expect 95% of the pairs to yield ESs in the range of 0.15 ± 1.96√0.01 ES units, that is, over the interval (–0.05, 0.35). An ESV larger than 0.01 would yield quite large plausible value intervals, perhaps larger than is realistic, although 0.05 is considered as an upper bound.
Looking at the upper left panel of Figure 1, when J = 20 (10 matched pairs) and ES = 0.15, power for the CR design is extremely low, at about 0.10. Power for the MP design is not much better, even when matching explains 90% of the outcome variation. When ES = 0.30 (upper right panel), things are not much better. The CR design yields about 0.25 power, and the MP design does not help much unless it explains a large fraction of the variation. Still, even if 90% of the variation is explained, power is unacceptable, at about 0.60. It is clear that the ES will need to be quite large if such a small study (J = 20) is to yield adequate power.
Notice that the utility of matching also depends on the ESV, but this dependence is weak. Indeed, the curves for ESV = 0.01 and ESV = 0.05 are virtually indistinguishable despite the fact that ESV = 0.05 is very large indeed. Across all the scenarios shown in the figure, the smaller ESV of 0.01 gives slightly greater power than the larger ESV of 0.05.
Increasing the number of schools will naturally add power to the study. Consider the center right panel of Figure 1, in which ES = 0.30 and J = 40 (20 matched pairs). Power for the CR design is about 0.47. The utility of matching depends on the percentage of outcome variation explained. Until about 25% of the outcome variation is explained, matching actually makes things slightly worse. 
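The plausible value interval for pair-specific effect sizes mentioned at the start of this passage follows from treating the ESV as a variance, so that its square root is the standard deviation of the pair-specific ESs. A minimal sketch, with a hypothetical function name:

```python
import math


def plausible_es_interval(es, esv, z=1.96):
    """95% plausible value interval for pair-specific effect sizes,
    assuming they are normally distributed with mean `es` and variance `esv`."""
    half_width = z * math.sqrt(esv)
    return es - half_width, es + half_width


for esv in (0.01, 0.05):
    low, high = plausible_es_interval(0.15, esv)
    print(f"ESV = {esv}: ({low:.2f}, {high:.2f})")
```

For ESV = 0.01 this reproduces the interval of roughly (–0.05, 0.35) given in the text; for ESV = 0.05 the interval is considerably wider, which is why that value is treated as an upper bound.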
But matching increasingly helps as the percentage of variance explained increases toward 1.0 and can even lead to a well-powered study if it successfully explains about 80% of the variation in the outcome. Looking now at the bottom pair of panels, the benefit of increasing J to 60, especially for the larger ES, can be seen. The benefits of matching are again clear. When ES = 0.30, power increases beyond 0.80 if about 60% of the variance in school means is explained.
Recall that Figure 1 portrays a "bad case" in that the ICC is .20. When the ICC is large, there is considerable variation between schools, meaning that the CR design requires many schools to achieve adequate power for a given ES. Matching reduces the need to have so many schools by accounting for some of the large variation between schools. However, matching may actually hurt unless
Figure 2 gives a somewhat more optimistic scenario because now the ICC is .10 rather than .20. This means that about 10% of the overall variation in outcomes lies between schools. Power is uniformly higher than in Figure 1, holding constant J and ES. However, the benefit from matching is smaller in Figure 2 than in Figure 1. Now,
These principles are even clearer when Figure 3 is examined, in which an ICC of only .05 is assumed. Now, the CR design achieves nearly respectable power when ES = 0.30 and J = 40 and quite respectable power when J = 60. Notice that now, there is little to gain from matching because little of the variation lies between schools, and matching schools can help only in reducing that small fraction of variation further.
The implications of this exercise are clear. Matching helps most when between-school variation is large and therefore a big problem for the CR design. The CR design then requires a very large number of schools to "cut down the noise" caused by random variation in school means within treatments. Matching helps by explaining some or most of this variation, so that fewer schools are required under matching than under the CR design, as long as the percentage of variation in school means explained by matching achieves a threshold. This threshold is higher when few schools are sampled and goes up further as the ICC declines. When little variation lies between schools in the absence of matching, matching will not help much, because there is little variation for matching to explain. Indeed, matching is more likely to hurt when the ICC is small than when the ICC is large.
Matching prior to randomization is one way to increase power. A second major strategy for increasing the power of an experiment is the use of ANCOVA. The efficacy of ANCOVA in the context of group-randomized studies has been discussed by several authors (see, e.g., Bloom, 2005). A question arising in this literature is whether the covariate should be a person-level characteristic or a group characteristic. According to Bloom (2005), correlations at the group level in the social sciences are typically higher than correlations at the individual level. In addition, group-level data are often more accessible and may be less expensive to acquire. For example, consider a study designed to test the effect of a reading intervention in schools in which reading achievement is the outcome and all third grade classrooms within a school are assigned to the same treatment condition. A useful covariate would be last year's third grade scores on the reading test. Finding last year's average third grade scores for each school will typically be inexpensive. Examples of this type arise frequently, so we focus here on covariance adjustment for group-level covariates.
Example
ANCOVA modifies the script in that the investigators decide to exploit the availability of information on W, the prior mean achievement of school j. Many studies have found a strong linear association between school-level mean achievement and school-level mean outcome, so such a W is an excellent candidate as a covariate. As in matching, the experimenter uses ANCOVA to exploit the existence of prior information about schools, information that hopefully is helpful in predicting school-level mean outcomes. One key difference is that although matching builds this prior information into the design of the study (by first matching and then randomizing within pairs), ANCOVA uses a statistical model to "hold constant" the prior variables when evaluating the impact of the treatment on the outcome. In fact, operationally, the ANCOVA design is identical to the CR design: Units (here, schools) are simply randomly assigned to either the experimental or the control group with no regard for any prior information. However, the analytic model is different. Specifically, the outcome, Y, is regarded as a linear function of two things: the covariate, W, and treatment group membership, indicated by a dummy variable, Z. The aim is to predict how two schools with the same value of W would differ if one received the experimental treatment (Z = 1) and one received the control treatment (Z = 0). Because treatments are assigned at random, only the treatment effect plus chance differences (other than the covariate) contribute to the predicted difference between two such schools. Thus, in the scenario described above, using linear regression, the experimenter uses W and treatment group membership Z to predict Y. In this scenario, the expected mean difference between the two groups, holding constant W, is the average impact of the treatment in the population from which the schools were sampled. 
Intuitively, ANCOVA will boost power when W (the prior mean achievement of school j) has a strong linear association with Y (the school-level mean reading test score). If so, two schools that have very similar W values but experience different treatments will have very similar predicted Y values in the absence of a treatment effect. Therefore, if, on average, schools that are similar on W but vary with respect to treatment also vary systematically on Y, one will tend to find evidence of a treatment impact.
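The regression of Y on W and Z described above can be sketched at the school level without any libraries. The estimate of the treatment effect is the difference in arm means of Y, adjusted by the pooled within-arm slope of Y on W times the difference in arm means of W; this equals the ordinary least squares coefficient on Z when an intercept, W, and Z are the predictors. The data below are artificial and noise-free (a true effect of 0.5 and a slope of 3 are assumptions chosen so the result can be checked by eye), and the sketch deliberately ignores the nesting of students within schools.

```python
def ancova_effect(y, w, z):
    """Covariate-adjusted treatment effect for school-level data.

    y: school mean outcomes; w: school-level covariate; z: 1 = experimental, 0 = control.
    Returns (adjusted_effect, pooled_within_arm_slope).
    """
    arms = {0: [], 1: []}
    for yi, wi, zi in zip(y, w, z):
        arms[zi].append((yi, wi))

    means = {}
    s_wy = s_ww = 0.0
    for arm, rows in arms.items():
        y_bar = sum(r[0] for r in rows) / len(rows)
        w_bar = sum(r[1] for r in rows) / len(rows)
        means[arm] = (y_bar, w_bar)
        # Pool cross-products and squares of deviations from each arm's own means.
        s_wy += sum((wi - w_bar) * (yi - y_bar) for yi, wi in rows)
        s_ww += sum((wi - w_bar) ** 2 for yi, wi in rows)

    slope = s_wy / s_ww
    (y1, w1), (y0, w0) = means[1], means[0]
    return (y1 - y0) - slope * (w1 - w0), slope


# Artificial example: eight schools, with a chance imbalance on W between arms.
w = [1, 2, 3, 4, 5, 6, 7, 8]
z = [0, 0, 1, 0, 1, 1, 0, 1]
y = [2 + 3 * wi + 0.5 * zi for wi, zi in zip(w, z)]

unadjusted = (sum(yi for yi, zi in zip(y, z) if zi) / z.count(1)
              - sum(yi for yi, zi in zip(y, z) if not zi) / z.count(0))
adjusted, slope = ancova_effect(y, w, z)
print("unadjusted difference:", unadjusted)
print("adjusted effect:", adjusted, "slope on W:", slope)
```

In this contrived case the unadjusted difference in means is inflated by the imbalance on W, while the adjusted estimate recovers the built-in effect. With real data the adjustment removes only the part of the noise that W predicts linearly.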
To illustrate this idea, data were generated in which W strongly predicts Y. To illustrate the logic of ANCOVA, assume an unrealistically high correlation of
Figure 4 gives a scatterplot of Y against W for the two treatment groups. The plot shows an extremely strong, positive, linear association between W and Y (note the sample prediction line within each of the two groups). Note that at nearly all values of W, the experimental group Y values tend to be elevated above those of the control group. Thus, when attention is restricted to schools that have similar values of W, it can be readily discerned that the treatment and control group means are different. The estimate of the average impact of the treatment is the vertical distance between the two nearly parallel lines in the bottom left panel of Figure 4. The test of the average treatment effect under the nested ANCOVA is positive and statistically highly significant (t = 7.01, p < .001). In effect, controlling for the covariate W has dramatically reduced the "noise" in Y, revealing the "signal," that is, the impact of the treatment.
Generalization
To see how the data look when a useless covariate is used, refer to the bottom right panel of Figure 4. Values of this covariate, labeled U, increase along the horizontal axis, with Y on the vertical axis. Confining attention to cases with similar values of U in no way reduces uncertainty about Y and therefore is of no help in discerning the treatment effect. In this case, ANCOVA, like the CR analysis, gives a nonsignificant test of the impact of the treatment (t = 0.25). The key assumptions underlying ANCOVA are that
Assumptions 1 and 2 are not needed in using the MP design. Thus, the assumptions for ANCOVA are more restrictive than those needed for the MP analysis, making the latter more flexible. As noted in the matching discussion, the power associated with the CR design depends on five factors:
Making the Comparison Precise
In many ways, the results parallel those of the preceding section. Thus, a smaller benefit of using the covariate is seen when the ICC is .10 than when it is .20, and even less benefit is seen for ICC = .05. Once again, the school-level covariate can explain only the variation between schools, and when the ICC is very small, there is little between-school variation to be explained. One obvious and important difference involves the potential negative effect of using the covariate. Recall that matching can actually undermine power unless the percentage of variance explained by matching reaches a threshold value. In contrast, Figures 5–7 suggest that the penalty for using a useless covariate is negligible in all cases. The reason is clear: Only one degree of freedom is sacrificed in using the covariate, compared with K – 1 (where K is the number of pairs) in the MP group-randomized design.
We have explored how two approaches to using prior information can improve power compared with an approach that does not make use of such prior information. As long as there is substantial variation between schools to explain, matching will substantially improve power as the correlation within pairs on the outcome increases. And ANCOVA will substantially improve power when the correlation between the covariate and the outcome increases. The question then naturally arises: Which of these two approaches is more effective in increasing power? A comparison between ANCOVA and matching makes sense only when ANCOVA is a reasonable approach, that is, when the assumptions required for ANCOVA are at least approximately correct. If the ANCOVA assumptions do hold, ANCOVA will be optimal. Matching nonetheless has benefits. Recall that matching can ensure balance between treatments on key salient variables, increasing face validity. Moreover, the weaker assumptions required for matching make it appealing. The question therefore arises as to whether matching might be nearly as good as ANCOVA when the ANCOVA assumptions hold. If so, one might argue that matching ought to be adopted to purchase its unique benefits. The available literature does not provide a conclusive answer to this question.
To date, researchers evaluating matching techniques have tended to overestimate the utility of matching (Hughes, 2005; Klar & Donner, 1998). In this literature, it is common to postulate the existence of a continuous pretreatment variable, say
First, a group-level covariate W and, for each group, a latent "true" group mean outcome, µ, were generated as bivariate normal in distribution, each with variance 1.0 and correlation
The results are graphed in Figures 8–10. These are the same results shown in Figures 5–7, except that the power for the MP design at each value of
As expected, the ANCOVA approach works best. In every case, power associated with matching increases at a rate nearly equal to that of ANCOVA, but always at a lower level. The two are most similar when J is large and the ICC is large. A key difference is that when the correlation between W and µ is small, matching can do worse than the CR design, whereas the penalty for a small correlation with ANCOVA is negligible. As seen earlier, there is virtually no penalty for using ANCOVA when X is a useless covariate, unless J is very small, smaller than studied here. In contrast, matching can reduce power, particularly when it is ineffective, when the number of pairs is small, and when the ICC is small.
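One way to see why matching trails ANCOVA, and why the gap narrows as J grows, is to compare how much between-group variance in the latent means each approach leaves unexplained. The sketch below is not the authors' simulation. It assumes W and µ are standard bivariate normal with a hypothetical correlation of 0.9, pairs groups that are adjacent in the ranking on W, and ignores both within-group sampling error and the loss of degrees of freedom. A linear adjustment for W leaves 1 – ρ² of the variance; matching leaves that amount plus a term reflecting how far apart the two members of a pair are on W.

```python
import math
import random


def residual_variances(J, rho, reps, seed=0):
    """Unexplained variance in latent group means under matching and under ANCOVA.

    Returns (matching, ancova):
      matching: half the mean squared within-pair difference in mu when groups
                are ranked on W and adjacent groups are paired;
      ancova:   1 - rho**2, the variance of mu remaining after a linear
                adjustment for W (population slope assumed known).
    """
    rng = random.Random(seed)
    noise_sd = math.sqrt(1 - rho ** 2)
    total, count = 0.0, 0
    for _ in range(reps):
        groups = []
        for _ in range(J):
            w = rng.gauss(0, 1)
            mu = rho * w + noise_sd * rng.gauss(0, 1)  # Corr(W, mu) = rho
            groups.append((w, mu))
        groups.sort()  # rank order on W
        for i in range(0, J, 2):
            diff = groups[i][1] - groups[i + 1][1]
            total += diff ** 2 / 2
            count += 1
    return total / count, 1 - rho ** 2


match_20, ancova = residual_variances(J=20, rho=0.9, reps=2000)
match_200, _ = residual_variances(J=200, rho=0.9, reps=300)
print("ANCOVA residual variance:   ", round(ancova, 3))
print("Matching residual, J = 20:  ", round(match_20, 3))
print("Matching residual, J = 200: ", round(match_200, 3))
```

With few groups, neighbors in the ranking can differ appreciably on W, so matching removes less variance than the regression adjustment; with many groups the pairs become nearly exact and the two quantities converge.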
The logic of social experimentation often requires that groups, rather than individuals, be the unit of assignment and of treatment. Assignment by randomization confers the same advantages in group-based studies as in person-based studies: A well-implemented randomized study eliminates bias by statistically balancing treatment conditions on all prior characteristics. In this case, the only differences between treatments on prior characteristics are chance differences, and standard significance tests and confidence intervals correctly quantify uncertainty about the existence and magnitude of causal effects. However, if no effort is made to identify and control prior group characteristics, group-randomized studies will tend to lack statistical power, unless many groups are recruited. This need for many groups arises because the effects of short-term interventions typically implemented in social settings will tend to be modest. The number of groups required to achieve adequate power depends also on the intraclass correlation, also interpretable as the fraction of variation in group-mean outcomes that lies between groups. Even when this fraction is modest (e.g., less than 10%), the number of groups required to achieve adequate power can be daunting if the ES is modest. In this article, we have evaluated two alternative approaches to identifying and controlling prior group characteristics: matching and ANCOVA. Under certain circumstances, such uses of prior information can substantially increase power given the number of groups or, equivalently, reduce the number of groups needed given a desired level of power. In this discussion, we first briefly summarize the key findings. Second, we provide advice on how to use existing data to assess the likely contribution of matching or ANCOVA to improve power. Third, we briefly explore other more complex designs and how available information can be used to increase the power to detect treatment effects.
Key Findings
A key factor to consider, however, is the fraction of variation that lies between groups in the absence of matching. This factor is indexed by the ICC. Matching is most helpful when the ICC is comparatively large. If the ICC is tiny, there is little variation between groups to be removed through matching. Thus, the threshold value of
ANCOVA
Matching versus ANCOVA
We were surprised to find little precise guidance in the literature on this question. On reflection, this finding is understandable in that precise mathematical comparisons are difficult, if possible at all. We therefore conducted a simulation study. In the simulation study, matching approximated the power of ANCOVA as the number of groups increased and as the ICC increased. However, for many plausible cases, ANCOVA did significantly better than matching, because in these scenarios, either the number of groups or the ICC was too small to enable matching to approximate the power of ANCOVA (see Figures 8–10).
Match and covary?
Given the trade-offs, is it possible to benefit from both matching and using X as a covariate? Although the precise study of such an option goes beyond the scope of this article, the results are suggestive. On one hand, adding a useless covariate to an MP study will cause little harm unless the number of groups is very small. On the other hand, if the covariate is powerfully related to the outcome, the benefit may be great, particularly if matching was not effective. Thus, it is quite plausible to envision a scenario in which matching is desirable to enhance face validity but adds little power, or even reduces it. Adding a strong covariate may then help. Conversely, if a good covariate is already available, matching in addition to ANCOVA would seem unpromising in most cases of interest for the purpose of increasing power. In studies with small numbers of groups, matching in addition to ANCOVA may hurt because the loss of degrees of freedom associated with matching increases the critical value of the test statistic for the treatment effect.
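To make the size of the covariate benefit concrete, the following is a minimal sketch, not the Optimal Design implementation. It assumes a balanced two-arm design, an outcome standardized to total variance 1, and a group-level covariate whose squared correlation with the true group mean is the share of between-group variance it removes. The function name `approx_power` and its parameters are hypothetical, and the normal approximation ignores the degrees-of-freedom cost of estimating the covariate slope, so it is only reasonable when the number of groups is not small.

```python
from statistics import NormalDist


def approx_power(n_groups, n_per_group, icc, effect_size,
                 r_covariate=0.0, alpha=0.05):
    """Approximate two-sided power with an optional group-level covariate.

    Assumption: the outcome has total variance 1, so the between-group
    variance equals icc and the within-group variance equals 1 - icc.
    """
    norm = NormalDist()
    # Between-group variance remaining after covariance adjustment.
    residual_between = icc * (1.0 - r_covariate ** 2)
    # Variance of one group's (adjusted) sample mean.
    var_group_mean = residual_between + (1.0 - icc) / n_per_group
    # Standard error of the treatment contrast with n_groups / 2 per arm.
    se = (4.0 * var_group_mean / n_groups) ** 0.5
    z = effect_size / se
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    # Normal approximation to the power of the two-sided test.
    return norm.cdf(z - z_crit) + norm.cdf(-z - z_crit)


if __name__ == "__main__":
    for r in (0.0, 0.5, 0.8):
        p = approx_power(60, 20, icc=0.20, effect_size=0.15, r_covariate=r)
        print(f"r = {r:.1f}: approximate power = {p:.2f}")
```

With r_covariate = 0 the function reduces to the unadjusted CR calculation, which mirrors the point that a useless covariate costs little when the number of groups is not very small.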
Our hope is that this article will improve the planning of group-randomized studies. Good planning typically requires reasonable estimates of quantities that can be known precisely only after a study has been conducted. In many cases, pilot data or data from archives can be analyzed to provide good estimates of these quantities. If those estimates become available, the reader might use the information presented in the figures to guide planning. These figures cover only a small fraction of the cases that arise in practice, however. Details of how to replicate the results in the figures using the Optimal Design software package are in Appendix A. We now consider how to use the software (Raudenbush et al., 2006) to plan studies in cases not represented in those figures. The discussion is restricted to the case of continuous outcomes. However, the software documentation shows how to use the software for dichotomous outcomes and presents all equations needed to derive the results in the figures.
CR Design
Planning group-level CR studies is a little more complicated. To obtain power, one needs not only a hypothesized ES but also an estimate of the ICC, the fraction of variation lying between groups. Prior data will be useful in this regard (see Bloom, 2005; Schochet, 2005). Power is then a function of both the number of groups, J, and the sample size per group, n.
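The dependence of power on J, n, the ICC, and the ES can be sketched in a few lines. This is an illustrative large-sample approximation rather than the calculation performed by the Optimal Design software: it assumes a balanced two-arm design, an outcome standardized to total variance 1, and a normal reference distribution in place of the t distribution, so it will be somewhat optimistic when J is small. The names `cr_power` and `groups_needed` are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def cr_power(n_groups, n_per_group, icc, effect_size, alpha=0.05):
    """Approximate two-sided power for a balanced two-arm CR design.

    Assumption: total outcome variance is 1, so the between-group
    variance is icc and the within-group variance is 1 - icc.
    """
    # Variance of one group's sample mean: between part + within part / n.
    var_group_mean = icc + (1.0 - icc) / n_per_group
    # Standard error of the difference in arm means, n_groups / 2 per arm.
    se = (4.0 * var_group_mean / n_groups) ** 0.5
    z = effect_size / se
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    # Normal approximation; ignores the degrees-of-freedom penalty.
    return _STD_NORMAL.cdf(z - z_crit) + _STD_NORMAL.cdf(-z - z_crit)


def groups_needed(n_per_group, icc, effect_size, target=0.80, alpha=0.05):
    """Smallest even number of groups whose approximate power meets target.

    Requires a nonzero effect_size; otherwise power never reaches target.
    """
    n_groups = 4
    while cr_power(n_groups, n_per_group, icc, effect_size, alpha) < target:
        n_groups += 2
    return n_groups


if __name__ == "__main__":
    print(cr_power(60, 20, icc=0.20, effect_size=0.15))
    print(groups_needed(20, icc=0.20, effect_size=0.15))
```

Because the between-group term does not shrink as n grows, increasing the sample size per group helps only up to a point; adding groups is what drives power in this approximation.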
MP Design
In the group-randomized case, planning for MP is again a bit more complicated. One needs the same quantities as required in the CR case (the standardized ES and the ICC) plus an estimate of
Analysis of Covariance
Other (More Complex) Designs
One such design is the three-level group-randomized trial with treatment at level 3. In this case, there is an additional layer of clustering between the unit of analysis and the unit of randomization. Suppose a schoolwide study involving several schools is designed to determine the effects of a whole-school reform on students' academic achievement. The reform is implemented at the school level, and the outcomes of interest are at the individual level. However, the effectiveness of the reform may depend on the quality of the teachers or on other classroom-level characteristics. Thus, taking into account the clustering at the level of the classroom is essential. To increase the power of such a study, an option might be to block the schools by district prior to randomization. This blocking can reduce the between-district variation and thus potentially increase the power. Again, whether the power increases will depend on the strength of districts as a blocking variable and on the number of schools in the study, among other variables. This type of blocking, however, may be desirable for face-validity purposes. Alternatively, a school-level covariate such as school mean prior achievement might be included in the analysis. With three levels in the design, a covariate at any level could be used; however, the school-level covariate in this case is the most logical for reasons outlined above in the discussion of covariance adjustment. Another (three-level) design not discussed here is the cluster randomized trial with repeated measures at the individual level. Such a design involves repeated measures nested within individuals nested within groups (Raudenbush & Liu, 2001). Here, too, the main findings about the power-increasing strategies discussed in this article can be extended.
All the designs illustrated in the figures, plus the three-level group-randomized trial with treatment at level 3 and the cluster-randomized trial with repeated measures at the individual level, are included in the Optimal Design software and its documentation, available from the William T. Grant Foundation's Web site or from the authors. We encourage researchers to use these resources in planning their studies of group-based initiatives. See Appendix A for details on how to use the software.
All the results in the figures included in this article may be replicated using the Optimal Design for Group Randomized Trials mode of the Optimal Design software. We include here a few examples. For a more in-depth discussion, refer to the software documentation (http://www.wtgrantfoundation.org/ or http://sitemaker.umich.edu/group-based).
Example 1: Replicating the Completely Randomized (CR) Results for Intraclass Correlation (ICC) = .20, n = 20, Effect Size (ES) = 0.15, and J = 20 (Figure 1, Top Left Panel)
Select Cluster Randomized Trial
Example 2: Replicating the Matched-Pairs (MP) Results for ICC = .20, n = 20, ES = 0.15, J = 60, and ESV = 0.01 (Figure 1, bottom left graph)
Select Multi-site CRT
Example 3: Replicating the ANCOVA Results for ICC = .20, n = 20, ES = 0.15, and J = 60 (Figure 5, Bottom Left Panel)
Select Cluster Randomized Trial
Estimating pairs for MP Designs for Group-Randomized Studies
Suppose that the planner has obtained pilot data on J schools, each with sample size n. First, the user computes a one-way random-effects analysis of variance, obtaining an estimate of the between-school variance, $\tau^2$, the within-school variance, $\sigma^2$, and, from these, $\mathrm{ICC} = \tau^2/(\tau^2 + \sigma^2)$ and $\lambda = \tau^2/(\tau^2 + \sigma^2/n)$, the reliability of the school sample mean. The next step is to construct J/2 matched pairs just as they will be constructed in the study. One then proceeds as follows:
Estimating
Now suppose that W itself is also the sample mean of a person-level covariate and is therefore a fallible estimate from the pilot sample of the "true mean W." Then let
STEPHEN W. RAUDENBUSH is a professor in the Department of Sociology at the University of Chicago, 1126 E. 59th Street, Chicago, IL 60637; sraudenb{at}uchicago.edu. He is best known for his expertise in quantitative methodology using the advanced research technique of hierarchical linear models, which allows researchers to accurately evaluate data from school performance. His research pursues the development, testing, refinement, and application of statistical methods for individual change. He also researches the effects of social settings, such as schools and neighborhoods.
ANDRES MARTINEZ is a doctoral student in the combined program in Education and Statistics at the University of Michigan and a visiting research scholar at the University of Chicago, Social Science Research Building, Room 417, 1126 E. 59th Street, Chicago, IL 60637; amzzz{at}umich.edu. His current research interests include causal inference in hierarchical settings, the measurement of educational settings, and the effectiveness of conditional cash transfer programs in education.
JESSACA SPYBROOK is a doctoral candidate in the combined program in Education and Statistics at the University of Michigan, Institute for Social Research #2050, 426 Thompson Street, Ann Arbor, MI 48106-1248; jessacah{at}umich.edu. She has been a part of the "Building Capacity for Evaluating Group-Level Interventions" project sponsored by the William T. Grant Foundation since January 2004. She co-authored documentation that accompanies the Optimal Design Software and has been a part of the consulting team that assists researchers in the design and analysis of group-randomized trials.
The work reported here was supported by the grant "Building Capacity for Evaluating Group-Level Interventions," sponsored by the William T. Grant Foundation. We are especially grateful to Bob Granger and Ed Seidman of the William T. Grant Foundation for their advice and encouragement.
Special thanks also to Howard Bloom and Xiaofeng Liu for their consultation and advice on the issues discussed here and to three anonymous reviewers for valuable advice and careful reading of previous drafts.
1 The Optimal Design software is freely available from the Web site of the William T. Grant Foundation (http://www.wtgrantfoundation.org). It can be used to estimate the power of individual- and group-randomized studies following the designs discussed in this article, among others. Software documentation containing all the formulas for the power calculations is also accessible on the Web site.
2 The model generating the data was $y_{ij} = 2 + 0.95 W_j + 0.25 Z_j + r_j + e_{ij}$, where $y_{ij}$ represents the outcome for student $i = 1, \ldots, n$ in school $j = 1, \ldots, J$; $W_j$ represents the school-level covariate, assumed normally distributed with mean 0 and variance 1; $Z_j$ represents a school-level indicator (1 for treatment, 0 for control); $r_j$
Received for publication March 20, 2006. Revision received December 6, 2006. Accepted for publication January 8, 2007.
Educational Evaluation and Policy Analysis, Vol. 29, No. 1, 5-29 (2007)
pairs; and 
0.01 ES units, that is, over the interval (–0.05, 0.35). An ESV larger than 0.01 would yield quite large plausible value intervals, perhaps larger than is realistic, although 0.05 is considered an upper bound. 
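As a worked check of the interval quoted above, the arithmetic can be written out. This assumes, as a reading of the fragment rather than a statement from the article, that the interval is centered on the mean ES of 0.15 used in the appendix examples and spans about two standard deviations of the pair-specific effects on either side:

```latex
\sqrt{\mathrm{ESV}} = \sqrt{0.01} = 0.10,
\qquad
0.15 \pm 2(0.10) = (-0.05,\ 0.35)
```

By the same arithmetic, an ESV of 0.05 would give a standard deviation of about 0.22 and an interval of roughly (−0.30, 0.60), which illustrates why larger ESV values imply very wide plausible value intervals.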







Power for the main effect of treatment (continuous outcome) 
= 0.01, B = 0.8. Clicking on the plot for an effect size of 0.15 yields power of about 0.49, which approximates the power for the MP with 0.01ESV in the bottom left graph in
=.64. Clicking on the figure for an ES of 0.15 yields power of about 0.40, which is the same as the power for the ANCOVA for a correlation of .80 ($.80^2 = .64$) in the bottom left graph in





