Intraclass Correlation Values for Planning Group-Randomized Trials in Education
Northwestern University
University of Chicago
Experiments that assign intact groups to treatment conditions are increasingly common in social research. In educational research, the groups assigned are often schools. The design of group-randomized experiments requires knowledge of the intraclass correlation structure to compute statistical power and sample sizes required to achieve adequate power. This article provides a compilation of intraclass correlation values of academic achievement and related covariate effects that could be used for planning group-randomized experiments in education. It also provides variance component information that is useful in planning experiments involving covariates. The use of these values to compute the statistical power of group-randomized experiments is illustrated.
Key Words: intraclass correlation; cluster randomized trials; experiments; statistical power
Many social interventions operate at a group level by altering the physical or social conditions. In such cases, it may be difficult or impossible to assign individuals to receive different intervention conditions. In other cases, it may be possible to assign treatments to individuals, but for practical or political reasons, the assignment of individuals to treatments is not feasible. In either situation, field experiments may assign entire intact groups (such as sites, classrooms, or schools) to the same treatment, with different intact groups being assigned to different treatments. Because these intact groups correspond to what statisticians call clusters in sampling theory, this design is often called a group-randomized or cluster-randomized design. Cluster-randomized trials have been used extensively in public health and other areas of prevention science (see, e.g., Donner & Klar, 2000; Murray, 1998). They have become important in educational research more recently, following increased interest in experiments to evaluate educational interventions (see, e.g., Mosteller & Boruch, 2002). Methods for the design and analysis of group-randomized trials have been discussed extensively by Donner and Klar (2000) and Murray (1998).
The sampling of subjects into experiments via statistical clusters introduces special considerations that need to be addressed in the analysis. For example, a sample obtained from m clusters (such as classrooms or schools) of size n randomized into a treatment group is not a simple random sample of nm individuals, even if it is based on a simple random sample of clusters. Instead, it is a two-stage sample (with one stage of clustering). Consequently, the sampling distribution of statistics on the basis of such clustered samples is not the same as that based on simple random samples of the same size. For example, suppose that the (total) variance of a population with clustered structure (such as a population of students within schools) is $\sigma_T^2$. The mean of a simple random sample of size nm from that population has variance $\sigma_T^2/nm$, but the mean of a sample of m clusters of size n has variance $(\sigma_T^2/nm)[1 + (n - 1)\rho]$, where $\rho$ is the intraclass correlation.
Several analytical strategies for cluster-randomized trials are possible, but the simplest is to treat the clusters as units of analysis, that is, to compute mean scores on the outcome (and all other variables that may be involved in the analysis) and carry out the statistical analysis as if the site (cluster) means were the data. If all cluster sample sizes are equal, this approach provides exact tests for the treatment effect, but the tests may have lower statistical power than would be obtained by other approaches (see, e.g., Blair & Higgins, 1986). More flexible and informative analyses are also available, including analyses of variance using clusters as a nested factor (see, e.g., Hopkins, 1982) and analyses involving hierarchical linear models (see, e.g., Raudenbush & Bryk, 2002). For general discussions of the design and analyses of cluster-randomized experiments, see Murray (1998); Bloom, Bos, and Lee (1999); Donner and Klar (2000); Klar and Donner (2001); Raudenbush and Bryk (2002); Murray, Varnell, and Blitstein (2004); or Bloom (2005).
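The effect of clustering on the precision of a sample mean can be illustrated with a short sketch. It assumes the standard variance inflation factor 1 + (n - 1)ρ for m equal-sized clusters of size n; the function names and the example values (n = 60, m = 10, ρ = .22) are illustrative choices, not quantities prescribed by the article.

```python
def design_effect(n: int, rho: float) -> float:
    """Variance inflation factor 1 + (n - 1) * rho for clusters of size n."""
    return 1.0 + (n - 1) * rho


def variance_of_clustered_mean(total_var: float, m: int, n: int, rho: float) -> float:
    """Variance of the mean of m clusters of size n (balanced design assumed)."""
    return total_var / (m * n) * design_effect(n, rho)


# Illustrative values only: 10 schools of 60 students, intraclass correlation .22.
deff = design_effect(n=60, rho=0.22)
effective_sample_size = (10 * 60) / deff
print(f"design effect = {deff:.2f}")
print(f"600 clustered students carry the information of about "
      f"{effective_sample_size:.0f} independent students")
```

Under these assumptions the clustered sample is far less informative than a simple random sample of the same size, which is why the intraclass correlation drives power calculations.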
Wise experimental design involves the planning of sample sizes so that the test for treatment effects has adequate statistical power to detect the smallest treatment effects that are of scientific or practical interest. There is an extensive literature on the computation of statistical power (e.g., Cohen, 1977; Kraemer & Thiemann, 1987; Lipsey, 1990). Much of this literature involves the computation of power in studies that use simple random samples. However, methods for the computation of statistical power of tests for treatment effects using the cluster mean as the unit of analysis (Blair & Higgins, 1986), analysis of variance using clusters as a nested factor (Raudenbush, 1997), and hierarchical linear model analyses (Snijders & Bosker, 1993) are available. For all of these analyses, the noncentrality parameter required to compute statistical power involves the intraclass correlation $\rho$.
Because plausible values of the intraclass correlation are needed to compute statistical power, information about them is essential for planning cluster-randomized experiments. There is much less information about intraclass correlations appropriate for studies of academic achievement as an outcome. Such information is badly needed to inform the design of experiments that measure the effects of interventions on academic achievement by randomizing schools (Schochet, 2005). One compendium of intraclass correlation values on the basis of five large urban school districts in which randomized trials have been conducted has recently become available (see Bloom, Richburg-Hayes, & Black, 2007 [this issue]). The purpose of this article is to provide a comprehensive collection of intraclass correlations of academic achievement on the basis of nationally representative samples. We hope that this compilation will be useful in choosing reference values for planning cluster-randomized experiments.
We find that across Grades K–12, the average (unadjusted) intraclass correlation is about .22 for all schools, about .19 for low–socioeconomic status (SES) schools, and about .09 for low-achievement schools. These average intraclass correlations are very similar in reading and mathematics. Note that except in low-achievement schools, these intraclass correlation values are somewhat higher than the guidelines of .05–.15 that are often used. Pretests can explain a substantial amount of the between- and within-school variance when used as covariates. Covariates can substantially increase statistical power by explaining between- and within-school variance. Pretest scores typically explain over three quarters of the between-school variance and over one half of the within-school variance in all schools and in low-SES schools, but they explain somewhat less variance in low-achievement schools. Demographic characteristics are less effective covariates, but they can explain up to one half of the between-school variance in all and low-SES schools. In general, demographic characteristics, when used in addition to pretest scores, explain little additional variance. The remainder of this article gives the methods and data sources that were used, presents the results in detail, and illustrates how to use these results to compute statistical power.
Our analyses focused on intraclass correlations for designs involving the assignment of schools to treatments. Unfortunately, there is a wide variety of designs that might be used to study education interventions, and each of these designs may have its own intraclass correlation (or conditional intraclass correlation) structure. To attempt to provide a reasonable coverage of the designs most likely to be of interest to researchers planning educational experiments, we considered four dimensions of intervention designs. The first dimension of the design is the grade level. The second dimension of the design is what achievement domain (e.g., reading or mathematics) is the dependent variable. The third dimension of the design is the set of covariates that were used in the analysis, if any. Finally, the fourth dimension is the SES or achievement status of schools sampled in the overall population of schools. These four dimensions of designs can vary independently. We examined all possible combinations of them.
Grade Level of Students and Achievement Domain
Covariates Used in the Design
The second model, which we call the demographic covariates model, involves the testing of treatment effects conditional on covariates that are ascriptive characteristics of students frequently invoked in models of educational achievement, namely, gender, race or ethnicity, and SES. This design may be appropriate when researchers can obtain prior, contemporaneous, or retrospective data from administrative records (appropriate because these covariates are unlikely to change). The third model, which we call the pretest covariates model, involves the testing of treatment effects using pretest scores on the same achievement domain (mathematics or reading) as a covariate. This design is likely to be considerably more powerful than the previous designs but involves the additional cost of collecting another wave of test data and the additional organizational burden of making that data collection in a timely manner. The fourth model, which we call the pretest and demographic covariates model, involves the testing of treatment effects using the ascriptive characteristics of students (gender, race or ethnicity, and SES) and pretest scores on the same achievement domain as the covariates. This design combines both sets of covariates used in the previous two designs.
SES or Achievement Status of Schools Within Their Settings
Researchers sometimes make decisions to carry out their studies in schools that lie within the middle range of outcomes, omitting schools that have had (or are reputed to have had) the very poorest and the very best outcomes, on the rationale that neither the very poorest schools nor the very best schools give a fair test of an intervention. We operationalized this notion by ordering, on average achievement, the entire sample of schools in a setting and selecting the middle 80% of the schools in each setting, omitting the top and bottom 10% of the schools. Some interventions are designed to be compensatory. Experimenters investigating such interventions might choose only schools within a particular context that have low mean achievement or large numbers of low-SES students to evaluate the intervention. We operationalized low achievement by ordering, on average achievement, the entire sample of schools in a setting and selecting the lower 50% of the schools, omitting the upper 50% of the schools. We operationalized low SES by ordering, on the proportion of students eligible for free or reduced-price lunch, the entire sample of schools in a setting and selecting the upper 50% of the schools, omitting the bottom 50% of the schools. One might argue for a more extreme definition of low-SES or low-achievement schools (e.g., the lower 30% of schools). We chose the lower 50% of schools to achieve a balance between the construct definition (low achievement or low SES) and sufficient sample size to obtain sufficiently precise estimates of the parameters of interest. The choice we made yields some standard errors that are on the order of .02, corresponding to a 2-SE band on either side of the estimate (a very crude 95% confidence interval) of width .08.
Because even this range is large enough to have important substantive consequences, we judged that restricting the proportion of schools in the definition of the low-SES or low-achievement sample (which would decrease sample sizes of those groups) would lead to unacceptable imprecision.
The object of this article is to estimate intraclass correlations and associated variance components for academic achievement in reading and mathematics for the United States and various subpopulations. Consequently, we relied on data from longitudinal surveys with national probability samples, all of which are described in detail elsewhere. We chose longitudinal surveys because we wished to use achievement data collected in earlier years as pretest data for evaluating conditional intraclass correlations relevant for planning studies that would use a pretest as a covariate. In some cases, more than one survey could have provided data on a given grade level. In such cases, we generally report here results on the basis of the survey with the largest sample size, although we made an exception to this principle when the larger sample was for the base year of a longitudinal study that would have provided no pretest data. Some general information about the surveys used in our main analyses is reported in Table 1.
The results reported for kindergarten, Grade 1, and Grade 3 were obtained from three waves of the Early Childhood Longitudinal Study (ECLS). The ECLS is a longitudinal study that obtained a national probability sample of kindergarten children in 1,591 schools in 1998 and followed them through the fifth grade (see Tourangeau et al., 2005). Achievement test data were collected in both fall and spring of kindergarten and first grade and in spring only in third and fifth grades. There was no data collection in second and fourth grades. Thus, fall achievement test data collected in the same year could serve as a pretest in kindergarten and first grades, while data collected in the spring of the first grade served as pretest data for the third grade. The results reported for Grade 2 were obtained from the first follow-up to the first grade (base year) sample, and those reported for Grades 4–6 were obtained from the three follow-ups of the third grade (base year) sample in the Prospects study. The results in reading in Grades 7 and 9 were obtained from the base year and the second follow-up of the seventh grade sample in the Prospects study. Prospects was actually a set of three longitudinal studies, starting with (base year) national probability samples of children in 235, 240, and 137 schools, in Grades 1, 3, and 7, respectively, conducted in 1991 (for a complete description of the study design, see Puma, Karweit, Price, Riccuti, & Vaden-Kiernan, 1997). Achievement test data were collected for 3–4 years thereafter for each sample. Thus, the three Prospects studies collected data in Grades 1 (both fall and spring), 2, and 3; Grades 3, 4, 5, and 6; and Grades 7, 8, and 9. There were pretest data in the base year for Grade 1, but no pretest data for the base years in Grades 3 and 7. For all years except the base year, the previous year's achievement test data were used as a pretest, and in Grade 1, the test data collected in fall served as a pretest.
The results reported on reading in Grades 8, 10, and 12 and mathematics in Grades 10 and 12 were obtained from the National Educational Longitudinal Study of the Eighth Grade Class of 1988, a longitudinal study that began in 1988 with a national probability sample of eighth graders in 1,050 schools and collected reading and mathematics achievement test data when the students were in Grades 8, 10, and 12 (Curtin et al., 2002). Thus, no pretest data were available for Grade 8, but for Grade 10, the Grade 8 data were used as a pretest, and for Grade 12, the Grade 10 data were used as a pretest. Finally, the results on mathematics in Grades 7, 8, 9, and 11 were obtained from the base year and follow-ups of the Longitudinal Study of American Youth (LSAY; see J. D. Miller, Hoffer, Suchner, Brown, & Nelson, 1992). The LSAY is a longitudinal study that began in 1987 with two national probability samples, one of 7th graders and one of 10th graders in 104 schools. Data were collected on mathematics and science achievement each year for 4 years, leading to samples from Grades 7 to 12. There were no pretest data in Grade 7, but the previous year's data served as the pretest for each subsequent year.
The data analysis was carried out using Stata 9.1's XTMIXED routine for mixed linear model analysis. For each sample and achievement domain, analyses were carried out on the basis of four different models, which we call the unconditional model, the pretest covariate model, the demographic covariates model, and the pretest and demographic covariates model. We describe these explicitly below in hierarchical linear model notation.
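The Unconditional Model
No covariates are used at either the individual or the school level. The Level 1 (individual-level) model for the jth observation in the kth school is
$$Y_{jk} = \beta_{0k} + \varepsilon_{jk},$$
and the Level 2 model for the intercept is
$$\beta_{0k} = \pi_{00} + \xi_{0k},$$
where $\pi_{00}$ is the grand mean, $\xi_{0k}$ is a school-level random effect with variance $\sigma_B^2$, and $\varepsilon_{jk}$ is an individual-level residual with variance $\sigma_W^2$.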
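The Pretest Covariate Model
The pretest is used as a covariate at both the individual and the school level. The Level 1 model is
$$Y_{jk} = \beta_{0k} + \beta_{1k}(X_{jk} - \bar{X}_{\cdot k}) + \varepsilon_{jk},$$
and the Level 2 model for the intercept is
$$\beta_{0k} = \pi_{00} + \pi_{01}\bar{X}_{\cdot k} + \xi_{0k},$$
where $X_{jk}$ is the achievement pretest score for the jth observation in the kth school, $\bar{X}_{\cdot k}$ is the mean pretest score in the kth school, and the covariate slope $\beta_{1k} = \pi_{10}$ is treated as a fixed effect (equal in all schools).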
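The Demographic Covariates Model
The Level 1 model is
$$Y_{jk} = \beta_{0k} + \beta_{1k}(G_{jk} - \bar{G}_{\cdot k}) + \beta_{2k}(B_{jk} - \bar{B}_{\cdot k}) + \beta_{3k}(H_{jk} - \bar{H}_{\cdot k}) + \beta_{4k}(E_{jk} - \bar{E}_{\cdot k}) + \varepsilon_{jk},$$
where $G_{jk}$, $B_{jk}$, and $H_{jk}$ are dummy variables for male gender, Black status, and Hispanic status, respectively; $E_{jk}$ is an index of mothers' and fathers' levels of education (which is a proxy for family SES); and $\bar{G}_{\cdot k}$, $\bar{B}_{\cdot k}$, $\bar{H}_{\cdot k}$, and $\bar{E}_{\cdot k}$ are the corresponding means in the kth school. The Level 2 model for the intercept is
$$\beta_{0k} = \pi_{00} + \pi_{01}\bar{G}_{\cdot k} + \pi_{02}\bar{B}_{\cdot k} + \pi_{03}\bar{H}_{\cdot k} + \pi_{04}\bar{E}_{\cdot k} + \xi_{0k},$$
and the covariate slopes $\beta_{1k}, \ldots, \beta_{4k}$ are treated as fixed effects (equal in all schools).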
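The Pretest and Demographic Covariates Model
The Level 1 model is
$$Y_{jk} = \beta_{0k} + \beta_{1k}(X_{jk} - \bar{X}_{\cdot k}) + \beta_{2k}(G_{jk} - \bar{G}_{\cdot k}) + \beta_{3k}(B_{jk} - \bar{B}_{\cdot k}) + \beta_{4k}(H_{jk} - \bar{H}_{\cdot k}) + \beta_{5k}(E_{jk} - \bar{E}_{\cdot k}) + \varepsilon_{jk},$$
where all of the symbols are defined as in the models above. The Level 2 model for the intercept is
$$\beta_{0k} = \pi_{00} + \pi_{01}\bar{X}_{\cdot k} + \pi_{02}\bar{G}_{\cdot k} + \pi_{03}\bar{B}_{\cdot k} + \pi_{04}\bar{H}_{\cdot k} + \pi_{05}\bar{E}_{\cdot k} + \xi_{0k},$$
and the covariate slopes $\beta_{1k}, \ldots, \beta_{5k}$ are treated as fixed effects (equal in all schools).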
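The (unconditional) intraclass correlation associated with the unconditional model described above is
$$\rho = \frac{\sigma_B^2}{\sigma_B^2 + \sigma_W^2} = \frac{\sigma_B^2}{\sigma_T^2},$$
where $\sigma_B^2$ is the between-school variance, $\sigma_W^2$ is the within-school variance, and $\sigma_T^2 = \sigma_B^2 + \sigma_W^2$ is the total variance. In the three models involving covariate adjustment, the (covariate-adjusted) intraclass correlation is
$$\rho_A = \frac{\sigma_{AB}^2}{\sigma_{AB}^2 + \sigma_{AW}^2},$$
where $\sigma_{AB}^2$ and $\sigma_{AW}^2$ are the covariate-adjusted between- and within-school variances.
For each combination of design dimensions (i.e., for each grade level, achievement domain, covariate set, setting, and choice of SES or achievement status within setting), we estimated the intraclass correlation (or conditional intraclass correlation) via restricted maximum likelihood using Stata and computed the standard error of that intraclass correlation estimate using the result given in Donner and Koval (1982). This resulted in 13 (grade levels) × 2 (achievement domains) × 4 (covariate sets) × 4 (SES or achievement statuses within settings) = 416 intraclass correlation estimates (each with a corresponding standard error). For designs that use covariates, we also provide values of
$$\eta_B^2 = \frac{\sigma_{AB}^2}{\sigma_B^2},$$
the proportion of between-school variance remaining, and
$$\eta_W^2 = \frac{\sigma_{AW}^2}{\sigma_W^2},$$
the proportion of within-school variance remaining, respectively, after covariate adjustment. For designs involving covariates, these two auxiliary quantities ($\eta_B^2$ and $\eta_W^2$) are needed, along with the unadjusted intraclass correlation, to compute statistical power.
Two alternative parameters that contain the same information as $\eta_B^2$ and $\eta_W^2$ are the proportions of between- and within-school variance explained by the covariates, $R_B^2 = 1 - \eta_B^2$ and $R_W^2 = 1 - \eta_W^2$.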
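The quantities tabled in this article can be computed from four variance components, as in the following sketch. It assumes the usual definitions: the intraclass correlation is the between-school share of total variance, and the two auxiliary quantities are the ratios of adjusted to unadjusted variance at each level. The function name and the numerical variance components are hypothetical and are not taken from any table in the article.

```python
def icc_summary(var_between: float, var_within: float,
                adj_var_between: float, adj_var_within: float) -> dict:
    """Summarize unadjusted and covariate-adjusted variance components."""
    rho = var_between / (var_between + var_within)
    rho_adj = adj_var_between / (adj_var_between + adj_var_within)
    eta_b2 = adj_var_between / var_between  # between-school variance remaining
    eta_w2 = adj_var_within / var_within    # within-school variance remaining
    return {
        "rho": rho,
        "rho_adj": rho_adj,
        "eta_b2": eta_b2,
        "eta_w2": eta_w2,
        "r2_between": 1.0 - eta_b2,  # between-school variance explained
        "r2_within": 1.0 - eta_w2,   # within-school variance explained
    }


# Hypothetical variance components (unadjusted, then covariate-adjusted).
summary = icc_summary(var_between=0.22, var_within=0.78,
                      adj_var_between=0.055, adj_var_within=0.39)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Note that the adjusted intraclass correlation can rise or fall after adjustment, depending on whether the covariates remove proportionally more variance between or within schools.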
Note that each of the four analyses involved slightly different variables, and there were missing values on some of these variables in our survey data. We decided to compute each analysis on the largest set of cases that had all of the necessary variables for the analysis in question. This means that each of the four analyses of a given data set is computed on a slightly different set of cases. Because the quantities $\eta_B^2$ and $\eta_W^2$ compare variance components from two different analyses, they are based on values computed from slightly different sets of cases.
Although we provide estimates of the standard errors of the intraclass correlations, they should be used with some caution for two reasons. First, the distribution of estimates of the intraclass correlations is only approximately normal. Second, not all of these values are independent of one another, and it is not immediately clear how to carry out a formal statistical analysis of differences between estimates of intraclass correlations computed from the same sample of individuals. Nevertheless, we feel that these standard errors are useful as descriptions of the uncertainty of the individual estimates of intraclass correlations.
We found that the intraclass correlations obtained in the nationally representative sample and in the schools in the middle 80% of the achievement distribution were almost identical. Consequently, we present here only the intraclass correlation data from the entire national sample of schools, those in the upper half of the free and reduced-price lunch distribution (low-SES schools), and those in the lower half of the school mean achievement distribution (low-achievement schools).
The main results of this study are presented in Tables 2–7 and discussed in the sections that follow. Each table is divided into four vertical panels of three columns each, one panel for each of the four analyses described above. The data for each grade level are given in a different row. In the row for each grade, the columns of each panel provide the estimates of the intraclass correlation ($\rho$ in the unconditional model, $\rho_A$ in the models with covariates) and, where covariates are used, $\eta_B^2$ and $\eta_W^2$.
To help interpret the tables as a whole, the bottom four rows of each table give summary statistics (across grades) of the estimates of $\rho_A$, $\eta_B^2$, and $\eta_W^2$, including the mean, the intercept (a) and slope (b) of an unweighted regression of the estimates on grade level (with kindergarten equaling Grade 0), and the correlation (r) between estimates and grade level. For example, in Table 2, the mean intraclass correlation in the unconditional model is .220, the correlation between grade and intraclass correlation is –.443, and the regression equation for predicting the unconditional intraclass correlation from grade is .242 – .004(grade).
Mathematics Achievement in the Full Population
The linear regression coefficients (the intercept a and slope b) of each of the tabled quantities on grade, given at the bottom of each column of the table, permit the computation of smoothed estimates of each quantity as a + b(grade). For example, the values of a and b for the unadjusted intraclass correlation are a = .242 and b = –.004, so that the smoothed (interpolated) value of the unadjusted intraclass correlation for Grade 11 would be .242 + (–.004)(11) = .198, somewhat higher than the tabled value of .138.
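The smoothing rule a + b(grade) can be written as a one-line function. The sketch below reproduces the Grade 11 example using the coefficients quoted in the text (a = .242, b = –.004); the function name is an illustrative choice, and kindergarten is coded as Grade 0, as in the tables.

```python
def smoothed_estimate(intercept: float, slope: float, grade: int) -> float:
    """Smoothed value of a tabled quantity from its regression on grade level."""
    return intercept + slope * grade


# Coefficients for the unadjusted intraclass correlation quoted in the text.
grade_11 = smoothed_estimate(intercept=0.242, slope=-0.004, grade=11)
print(f"smoothed Grade 11 intraclass correlation: {grade_11:.3f}")  # 0.198
```

The same function applies to any column of the tables, given that column's tabled a and b.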
The patterns of reduction of between- and within-cluster (school) variances are generally quite different in models involving different covariates. Specifically, the demographic covariate analyses typically reduced the between-cluster variance to one half to one quarter of its value in the unconditional model (e.g., produced $\eta_B^2$ values from about .25 to .5).
There is one apparent anomaly in the results reported in Table 2. The
Reading Achievement in the Full Population
There is less consistency in reading than in mathematics among the adjusted intraclass correlations for the three models involving covariates. However, the general pattern of reduction in between- versus within-cluster variance was similar in reading and in mathematics. That is, there was somewhat greater reduction in between-cluster variance and much greater reduction in within-cluster variance in the pretest covariate model than in the demographic covariates model. As in the case of mathematics achievement in the full population, the pretest and demographic covariates model leads to little additional variance explained at either the school or the individual level compared with the model using only pretest as a covariate.
Mathematics Achievement in Low-SES Schools
There is one substantial anomaly in the results reported in Table 4 that is similar to that in Table 2: The $\eta^2$ values for the pretest and demographic covariates model are sometimes larger than those for the pretest covariate model, a difference that is particularly large at Grade 6. This anomaly (like that in Table 2) appears to be a consequence of differences between the samples used to estimate the two models. As in Table 2, the same pattern is also evident, but to a lesser extent, in the fifth grade $\eta_B^2$ data. We suggest using these values only with great caution. It might be wise to use the smoothed values for the pretest and demographic covariates model in Grade 6 (which would give $\eta_B^2$ = .195 and $\eta_W^2$ = .453) and possibly in Grade 5 (which would give $\eta_B^2$ = .192 and $\eta_W^2$ = .448).
Reading Achievement in Low-SES Schools
Mathematics Achievement in Low-Achievement Schools
Table 6 is a presentation of results in mathematics computed for the schools in the bottom half of the distribution of school mean mathematics achievement and is organized in the same way as Tables 2–5. The mean (across grade levels) unconditional intraclass correlation in mathematics was .087. The intraclass correlation values in this sample are considerably smaller than those reported in Table 2 for the entire national population, a tendency that also holds for the conditional (adjusted) intraclass correlations. There is some variation of intraclass correlations across grade levels, but only the difference between Grades 4 and 5 is larger than 2 standard errors of the difference. In general, the intraclass correlations at kindergarten through Grade 4 range from about .09 to .13, in Grades 5–7 from about .05 to .08, and in Grades 8–12 from .075 to .085.
The use of covariates resulted in a much smaller reduction in both between- and within-school variances in this sample than in the unrestricted sample. Specifically, the demographic covariates analyses typically reduced the between-school variance to no less than one half of its value in the unconditional model (e.g., produced $\eta_B^2$ values from .5 to .8) but typically reduced within-cluster variance by 5% or less (e.g., produced $\eta_W^2$ values greater than .95). The pretest covariate analyses typically (but not always) resulted in modestly larger reductions in between-cluster variance (e.g., produced $\eta_B^2$ values from .3 to .8) and typically reduced within-cluster variance by a larger amount than the demographic covariates model (e.g., produced $\eta_W^2$ values from .5 to .8). As in the case of mathematics achievement in the full population, the pretest and demographic covariates model leads to little additional variance explained at either school or individual level compared with the model using only pretest as a covariate. Overall, we find that the intraclass correlation is smaller in this sample than in the full sample, but the explanatory power of pretest and other covariates is also smaller. These two tendencies have opposite effects on statistical power. The smaller intraclass correlation generally leads to larger statistical power, but the smaller explanatory power of covariates generally leads to less statistical power, one partially offsetting the effects of the other.
There is one substantial anomaly in the results reported in Table 6 that is similar to those in Tables 2 and 4: The Grade 2
Reading Achievement in Low-Achievement Schools
There is some variation of intraclass correlations across grade levels. The intraclass correlation in Grade 9 is larger (by over 3 standard errors of the difference) than that in either of the adjacent grades. Similarly, the intraclass correlation in Grade 1 is more than 2 standard errors greater than that in kindergarten but less than 2 standard errors of the difference from that in Grade 2. None of the other differences between grades is this large in comparison with their uncertainty. In general, the intraclass correlations at Grades K–4 range from about .10 to .14 and in Grades 5–8 from about .06 to .07, and in Grades 10–12, they are about .05.
As in the case of mathematics, the use of covariates resulted in a much smaller reduction in both between- and within-school variances in this sample than in the entire national sample. Specifically, the demographic covariates analyses typically reduced the between-school variance to no less than one half of its value in the unconditional model (e.g., produced $\eta_B^2$ values of .5 or greater).
There are several small anomalies in the results reported in Table 7 that are similar to those in Table 6, in which the
Although the estimates presented in this article are derived from national probability samples, few experiments actually use national probability samples. Thus, one might question if intraclass correlations obtained from national samples resemble those of experiments actually conducted in education. To obtain some empirical evidence on this question, we searched the two most prestigious education journals that publish experimental studies, the American Educational Research Journal and Educational Evaluation and Policy Analysis, from 1995 to 2005 to find the cluster-randomized experiments with academic achievement as an outcome variable. We found eight reports of experiments that had randomized schools. We were able to obtain at least one unconditional intraclass correlation estimate from seven of these experiments (which required contacting authors in several cases). The eighth study did not treat schools as a random effect in the analyses and therefore could not provide an intraclass correlation value. This yielded a total of 41 intraclass correlation estimates, 14 in mathematics outcomes and 27 in reading outcomes. They ranged from .07 to .31 in mathematics achievement (with a mean of .17) and .05 to .74 in reading achievement (with a mean of .19). Eliminating the largest estimate in reading reduced the average value, but only to .17. Some of this variation is surely due to sampling error of estimation. None of the studies provided a standard error for the intraclass correlation estimates, but the standard error is inversely proportional to the square root of the number of schools (see, e.g., Donner & Koval, 1982). Therefore, the standard errors of the experimental estimates must be considerably larger than the largest of those we report on the basis of survey data (i.e., considerably bigger than .03), because the experiments involved considerably fewer schools than our surveys.
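The dependence of the standard error on the number of schools can be made concrete with a rough sketch. It assumes a commonly cited large-sample approximation for a balanced design with m schools of n students each, SE ≈ (1 - ρ)[1 + (n - 1)ρ]√(2 / [n(n - 1)(m - 1)]); this formula is an assumption made for illustration and may differ in detail from the result in Donner and Koval (1982) used in the article. The school counts and cluster size below are hypothetical.

```python
from math import sqrt


def icc_standard_error(rho: float, m: int, n: int) -> float:
    """Approximate large-sample SE of an intraclass correlation estimate.

    Assumes m clusters of equal size n (balanced design); this is an
    illustrative approximation, not necessarily the article's exact formula.
    """
    return (1.0 - rho) * (1.0 + (n - 1) * rho) * sqrt(2.0 / (n * (n - 1) * (m - 1)))


# Hypothetical comparison: a small experiment versus a large survey.
se_experiment = icc_standard_error(rho=0.17, m=20, n=60)
se_survey = icc_standard_error(rho=0.17, m=1000, n=60)
print(f"SE with 20 schools:   {se_experiment:.3f}")
print(f"SE with 1000 schools: {se_survey:.3f}")
print(f"ratio: {se_experiment / se_survey:.1f}")
```

With everything else held fixed, the ratio of the two standard errors depends only on the number of schools, which is the basis for expecting much larger standard errors in the published experiments than in the surveys.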
The average (unconditional) intraclass correlation in Tables 2 and 3 for the full national sample is about .22, the average value in Tables 4 and 5 for low-SES schools is about .19, and the average value in Tables 6 and 7 for low-achieving schools is about .09. Therefore, the average value of the intraclass correlation estimates from the published experiments is roughly consistent with the national values for low-SES schools but somewhat larger than the national values for low-achieving schools. This is consistent with the fact that most of the published experiments explicitly targeted, or realized, substantial samples of low-SES or disadvantaged students. It would not be appropriate to draw strong conclusions from such a small sample of empirical evidence, but this evidence does not suggest that the intraclass correlations obtained in published experiments are substantially different from those obtained from corresponding national (e.g., low-SES) samples.
When it was possible to estimate intraclass correlations for the same grade and achievement domain from more than one survey, we computed estimates from all surveys from which it was possible. Table 8 is a presentation of these estimates for the unconditional and demographic covariates models, along with the difference between each pair of intraclass correlation estimates that should estimate the same value and the standard error of the difference. Too few estimates from the other models could be computed for meaningful comparisons. Because the estimated intraclass correlations are approximately normally distributed in large samples, the difference divided by its standard error should have approximately a standard normal distribution if the two estimates are estimating the same population quantity, and thus a difference larger than 2 standard errors for any particular comparison should happen only about 5% of the time by chance.
Although some of the differences are large enough to have practical implications, they are subject to considerable sampling uncertainty. We found that most of the results agreed within sampling error. Overall, 14 of the 18 differences of unadjusted intraclass correlation estimates (across both reading and mathematics) were less than 2 standard errors of the difference. Three of the 13 differences in mathematics exceeded 2 standard errors (ECLS – Prospects1 at Grade 3 and LSAY10 – NELS in Grades 10 and 12). One of the five differences in reading (ECLS – Prospects1 at Grade 3) exceeded 3 standard errors. However, it is crucial to recognize that the conceptual hypothesis of agreement among data sets that we are testing is that all of the pairs of intraclass correlations are equal. Although the criterion that "differences exceeding 2 standard errors are statistically significant at the 5% level" is (approximately) valid for any single comparison, it is not appropriate for evaluating several comparisons at the same time. To evaluate whether at least one of the comparisons implies a reliable difference, a multiple comparison procedure is needed (see, e.g., R. Miller, 1977). A Bonferroni adjustment for 13 comparisons would require a difference of 2.89 standard errors to be significant at the 5% level, and none of the differences in mathematics is that large. The difference in reading between the estimates from ECLS and Prospects1 at Grade 3 is large enough to be statistically significant, even taking multiple comparisons into account. However, we interpret these comparisons as suggesting that there is a reasonable degree of agreement among the intraclass correlations in these surveys, even though they were conducted as much as a decade apart, by different research organizations, and using different achievement measures.
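The Bonferroni threshold quoted above can be checked with the standard normal quantile function. The sketch below assumes two-sided tests and a normal approximation; the helper for the standardized difference additionally assumes that the two estimates come from independent samples, so that the variance of the difference is the sum of the squared standard errors. The function names and the example estimates are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def bonferroni_critical_z(n_comparisons: int, alpha: float = 0.05) -> float:
    """Two-sided critical value after a Bonferroni adjustment."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_comparisons))


def standardized_difference(est_1: float, se_1: float,
                            est_2: float, se_2: float) -> float:
    """Difference divided by its standard error (independent estimates assumed)."""
    return (est_1 - est_2) / sqrt(se_1 ** 2 + se_2 ** 2)


print(f"single comparison: {bonferroni_critical_z(1):.2f}")   # 1.96
print(f"13 comparisons:    {bonferroni_critical_z(13):.2f}")  # 2.89

# Hypothetical pair of estimates from two surveys at the same grade.
z = standardized_difference(0.25, 0.03, 0.19, 0.04)
print(f"standardized difference: {z:.2f}")
```

A difference would be declared reliable across the family of 13 comparisons only if its standardized value exceeded the adjusted threshold in absolute value.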
One way to summarize the implications of these results for statistical power is to use them to compute the smallest effect size for which a target design would have adequate statistical power. This effect size is often called the minimum detectable effect size (MDES; see Bloom, 1995, 2005). In computing the MDES values reported in this article, we defined adequate power as 0.80 for a two-sided test at a significance level of .05. We considered designs with no covariates and with pretest as a covariate at both the individual and group levels. We considered both reading and mathematics achievement as potential outcomes. Finally, we considered a balanced design with a sample size of n = 60 per school and m = 10, 15, 20, 25, or 30 schools randomized to each treatment group. Table 9 gives the MDESs on the basis of parameters given in Tables 2 and 3 that were estimated from the full national sample. Perhaps the most obvious finding is that the corresponding MDES values for mathematics and reading are quite similar. With no covariates, the MDES values typically exceed 0.60 for m = 10 and typically exceed 0.35 even for m = 30. However, the use of pretest as a covariate reduces the MDES values to less than 0.40 for m = 10 and 0.20 or less for m = 30. Although Cohen (1977) proposed the values 0.20 to define small-sized effects and 0.50 to define medium-sized effects, these labels can be misleading in educational policy contexts, in which effect sizes of 0.20 or smaller are often of policy interest, and consequently, experiments may well be designed to detect effects in this range. Effect sizes used in power analyses should be informed by the magnitude of effects that would be policy relevant and by prior empirical evidence about the likely effect of an intervention being evaluated.
Table 10 gives the MDESs on the basis of parameters given in Tables 4 and 5 that were estimated from the national sample of low-SES schools. These results are remarkably similar to those in Table 9.
Table 11 gives the MDESs on the basis of parameters given in Tables 6 and 7 that were estimated from the national sample of schools in the lower half of the achievement distribution. Because the unconditional intraclass correlations are lower, the MDES values for designs with no covariates are smaller. However, because the covariates are less effective in reducing between- and within-school variance in this sample, the MDES values with pretest as a covariate are not always smaller than in the national sample of all schools. With no covariates, the MDES values are typically less than 0.50 for m = 10 and less than 0.30 for m = 30. However, the use of pretest as a covariate typically reduces the MDES values to about 0.30 for m = 10 and 0.20 or less for m = 30.
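As a rough guide to how MDES values of this kind arise, the following Python sketch uses a large-sample normal approximation rather than the exact noncentral t computation, so it will understate the MDES somewhat when the number of schools is small. The function name `mdes_approx` is hypothetical, and the intraclass correlation of 0.22 and the unexplained-variance proportions of 0.3 and 0.5 are illustrative assumptions, not values taken from Tables 2 through 7.

```python
import math
from statistics import NormalDist

def mdes_approx(m, n, rho, eta_b2=1.0, eta_w2=1.0, alpha=0.05, power=0.80):
    """Approximate MDES for m schools per arm and n students per school.

    rho is the unadjusted intraclass correlation; eta_b2 and eta_w2 are the
    proportions of between- and within-school variance left unexplained by
    covariates (1.0 means no covariates at that level).
    """
    z = NormalDist()
    # Normal-approximation multiplier for a two-sided test.
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Standard error of the standardized mean difference between two arms.
    se = math.sqrt((2.0 / m) * (eta_b2 * rho + eta_w2 * (1.0 - rho) / n))
    return multiplier * se

for m in (10, 15, 20, 25, 30):
    no_cov = mdes_approx(m, 60, rho=0.22)                       # assumed rho
    with_cov = mdes_approx(m, 60, rho=0.22, eta_b2=0.3, eta_w2=0.5)  # assumed etas
    print(m, round(no_cov, 2), round(with_cov, 2))
```

Because the between-school term does not shrink as n grows, adding schools or a strong school-level covariate lowers the MDES far more than adding students within schools.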
Specialized software for computing statistical power in group-randomized designs can use the intraclass correlation values and the $R_B^2$ and $R_W^2$ values (where $R^2 = 1 - \eta^2$) presented in this article. Such programs include Optimal Design (Raudenbush & Liu, 2000) and PinT (Snijders & Bosker, 1993). However, such software is not necessary to compute power for studies that randomize schools. In this section, we illustrate the use of the results in this article to compute the statistical power of cluster-randomized experiments. Consider the two-treatment-group design with q (0 ≤ q < M – 2) group-level (cluster-level) covariates and p (0 ≤ p < N – q – 2) individual-level covariates in the analysis. Note that we specifically include the possibility that there are zero (no) covariates at a given level. A design with p = 1 and q = 1 might arise, for example, if there was a pretest that was used as an individual-level covariate and cluster means on the covariate were used as a group-level covariate. We assume also that the individual-level covariate has been centered about cluster means. The structural model for $Y_{ijk}$, the kth observation in the jth cluster in the ith treatment, might be described in analysis of covariance (ANCOVA) notation as
where µ is the grand mean, The analysis might be carried out either as an ANCOVA with clusters as a nested factor or by viewing the model as a hierarchical linear model and using software for multilevel models such as HLM. In multilevel model notation, it would be conventional to specify a Level 1 (individual-level) model as
and a Level 2 (cluster-level) model for the intercept as
where TREATMENTi is a dummy variable for the treatment group, while the covariate slopes in
The Intraclass Correlations
The object of the statistical analysis is to test the statistical significance of the intervention effect, that is, to test the following hypothesis:
Or, equivalently,
The ANCOVA t-test statistic is
where
In this case, MSAB = n When the null hypothesis is false, the test statistic tA has for this analysis a noncentral t distribution with M – q – 2 degrees of freedom and noncentrality parameter
where Alternatively (and equivalently), the F statistic has the noncentral F distribution with 1 degree of freedom in the numerator and M – q – 2 degrees of freedom in the denominator and noncentrality parameter
For the purposes of power computation, expression 7 is not convenient, because the minimum effect size of interest is likely to be known in units of the unadjusted standard deviation rather than the adjusted standard deviation; that is, we are more likely to know
To express
An alternative, but equivalent, expression of
Note that the quantity [
We illustrate the use of the t statistic. The power of the one-tailed test at level
where c(
Many tabulations (e.g., Cohen, 1977) and programs (e.g., Borenstein, Rothstein, & Cohen, 2001) are available for computing statistical power from designs involving simple random samples, but tables for computing power from the independent-groups t test are the most widely available. Following Cohen's (1977) framework, such tables typically provide power values on the basis of sample sizes $N_1^T$ and $N_2^T$ (often assumed to be equal for simplicity) and effect size $\delta^T$, where the superscript T indicates that these quantities are what is used in the power tables. The calculations on which they are based translate the sample sizes and effect size into degrees of freedom $\nu^T$ and noncentrality parameter $\lambda^T$ to compute statistical power. In the case of the two-sample t test, they do so via
and
where
Tables such as Cohen's (or the corresponding software) can be used to compute the power of the test used in the case of clustered sampling by judicious choice of sample sizes and effect size. We have to enter the table with a configuration of sample sizes and a synthetic effect size (here called the operational effect size) that will yield the appropriate degrees of freedom and noncentrality parameter.
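For reference, the quantities this matching relies on can be written compactly. The block below is a sketch in standard notation and is not a reproduction of the article's numbered equations: the symbol $\tilde{m}$ and the arrangement of terms are assumptions, and the expressions assume equal cluster size n, an effect size $\delta$ expressed in units of the unadjusted total standard deviation, and $\eta_B^2 = 1 - R_B^2$, $\eta_W^2 = 1 - R_W^2$.

```latex
\begin{align*}
\tilde{m} &= \frac{m_1 m_2}{m_1 + m_2}, \\
\lambda_A &= \delta \sqrt{\frac{\tilde{m}\, n}{\eta_W^2 + (n\,\eta_B^2 - \eta_W^2)\,\rho}}, \\
\nu^T &= N_1^T + N_2^T - 2, \qquad
\lambda^T = \delta^T \sqrt{\frac{N_1^T N_2^T}{N_1^T + N_2^T}} .
\end{align*}
```

With no covariates ($\eta_B^2 = \eta_W^2 = 1$), the denominator of $\lambda_A$ reduces to the familiar design effect $1 + (n - 1)\rho$. Choosing the table's sample sizes so that $\nu^T$ equals $M - q - 2$, and the operational effect size so that $\lambda^T$ equals $\lambda_A$, makes the tabled power equal to the power of the clustered design.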
If the actual numbers of clusters assigned are m1 and m2, then entering the power table with sample sizes N1T = m1 – q and N2T = m2 yields
where
Consider an experiment that will randomize m1 = m2 = 10 schools to receive an intervention to improve mathematics achievement so that n = 20 students in each school would be part of the experiment. There are no covariates at either individual or group level, so that p = q = 0 and $\eta_W^2 = \eta_B^2 = 1$. The analysis will involve a two-tailed t test with significance level α = .05. Suppose that the smallest educationally significant effect size for this intervention is assumed to be δ = 0.50. Suppose further that the schools were chosen in an attempt to be representative of first graders nationally.
Entering Table 2 on the first row for Grade 1 and the panel for the unconditional model (columns 2–3) gives the intraclass correlation for first graders as
so that the noncentrality parameter from Equation 7 is
Using Equation 11 and the noncentral t-distribution function (e.g., the function NCDF.T in SPSS), with M – 2 = 18 degrees of freedom, c(.05/2, 18) = 2.101, and
Alternatively, we could compute the power from tables of the power of the t test such as those given by Cohen (1977). To do so, we first compute the operational effect size given in Equation 12 as
Cohen's tables give the statistical power in terms of sample size (in each treatment group) and effect size. Examining Cohen's Table 2.3.5, we see that the operational effect size of 0.968 is between tabled effect sizes of 0.8 and 1.0. Entering the table with sample size N1T = N2T = 10, we see that a power of 0.39 is tabulated for the effect size of
Note that in this case (and many others), the operational effect size for the tests based on clustered samples is larger than the actual effect size (in this case 0.97 vs. 0.50). This does not mean that the power of the test for the design based on the clustered sample is larger than that based on a simple random sample with the same total sample size. The reason is that the test using the clustered sample has many fewer degrees of freedom in the error term. For example, a test based on an effect size of
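The arithmetic of this first example can be checked with a short Python sketch that uses only the standard library. Several things here are assumptions rather than the article's own method: the function names are hypothetical, the noncentral t distribution function is evaluated by simple numerical integration instead of a statistical package, and the intraclass correlation of 0.228 is not quoted from Table 2 but chosen because it reproduces the operational effect size of 0.968 reported in the text.

```python
import math
from statistics import NormalDist

def noncentrality(delta, m1, m2, n, rho, eta_b2=1.0, eta_w2=1.0):
    """Noncentrality parameter for a two-level design with n units per cluster.

    eta_b2 and eta_w2 are the unexplained proportions of between- and
    within-cluster variance (1.0 at both levels means no covariates).
    """
    m_tilde = m1 * m2 / (m1 + m2)
    denom = eta_w2 + (n * eta_b2 - eta_w2) * rho
    return delta * math.sqrt(m_tilde * n / denom)

def nct_cdf(t, df, nc, steps=4000):
    """P(T <= t) for a noncentral t variable (intended for df >= 3).

    Uses T = (Z + nc) / sqrt(V / df) with V chi-square on df degrees of
    freedom, and integrates over V with Simpson's rule.
    """
    phi = NormalDist().cdf
    half = df / 2.0
    log_norm = half * math.log(2.0) + math.lgamma(half)

    def integrand(v):
        if v <= 0.0:
            return 0.0
        log_pdf = (half - 1.0) * math.log(v) - v / 2.0 - log_norm
        return phi(t * math.sqrt(v / df) - nc) * math.exp(log_pdf)

    upper = df + 12.0 * math.sqrt(2.0 * df)  # chi-square mass beyond this is negligible
    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3.0

def power_two_tailed(crit, df, nc):
    """Power of a two-tailed t test with critical value crit."""
    return 1.0 - nct_cdf(crit, df, nc) + nct_cdf(-crit, df, nc)

# Example 1: m1 = m2 = 10 schools, n = 20 students, no covariates.
delta, m1, m2, n = 0.50, 10, 10, 20
rho = 0.228  # ASSUMPTION: value implied by the reported operational effect size
lam = noncentrality(delta, m1, m2, n, rho)
operational_effect = lam / math.sqrt(m1 * m2 / (m1 + m2))  # about 0.968
power = power_two_tailed(2.101, m1 + m2 - 2, lam)
print(round(operational_effect, 3), round(lam, 3), round(power, 3))
```

The resulting power falls between the tabled values for effect sizes of 0.8 and 1.0, as the table-based route would suggest. Where SciPy is available, `scipy.stats.nct.cdf` provides the same distribution function and could replace the hand-rolled `nct_cdf`.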
Consider an experiment that will randomize m1 = m2 = 10 schools to receive an intervention to improve first grade reading achievement so that n = 20 students in each school would be part of the experiment. An ANCOVA will be used with pretest as a covariate at both individual and school level (so that p = q = 1) using a two-tailed test with significance level α = .05. Suppose that the smallest educationally significant effect size for this intervention is δ = 0.25. Suppose further that the schools were chosen in an attempt to be representative of first graders nationally.
Entering Table 3 on the first row for Grade 1 and the panel for the unconditional model (columns 3–5) gives the intraclass correlation for first graders as
so that the noncentrality parameter from Equation 9 is
Using Equation 11 and the noncentral t-distribution function (e.g., the function NCDF.T in SPSS), with M – 2 – 1 = 17 degrees of freedom, c(.05/2, 17) = 2.110, and
Alternatively, we could compute the power from tables of the power of the t test such as those given by Cohen (1977). Because there is q = 1 covariate at the school level, N1T = m1 – 1 = 10 – 1 = 9 and N2T = m2 = 10. Because Cohen's tables give the statistical power in terms of equal sample sizes (in each treatment group), we will need to interpolate between sample sizes N1T = N2T = 9 and N1T = N2T = 10. Here we compute
Examining Cohen's Table 2.3.5, we see that the effect size
Entering the table with sample size N1T = N2T = 10, we see that a power of 0.56 is tabulated for the effect size of
To obtain the power associated with an effect size of
It is worth noting that if no covariates had been used at either level of this analysis (i.e., if p = q = 0 and therefore
The values of intraclass correlations and variance components presented in this article provide some guidance for the selection of intraclass correlations for planning cluster-randomized experiments. These values suggest that for experiments that have samples as diverse as the nation as a whole and for those using low-SES schools, somewhat larger values of the intraclass correlation (roughly .15–.25) may be appropriate than the .05–.15 guidelines that have sometimes been used. The guideline of .05–.15 is more consistent with the values of unadjusted intraclass correlations among low-achieving schools and those of covariate-adjusted intraclass correlations we found. In using these values, it is important to keep in mind that these analyses do not separately estimate the between-district and between-state components of variance. Therefore, these two components of variance are included here as part of the between-school variance. This is desirable if the values are to be used in connection with designs that involve schools from several districts or states. However, if the design involves schools from only a single district or state, the estimates reported here may overestimate the relevant intraclass correlations to some degree. Unfortunately, it is unclear just how much of an impact this may have. We suspect that these influences are not large, because a general rule of thumb in both sample surveys and cluster-randomized experiments is that variance components (and therefore contributions to intraclass correlations) of larger units tend to be smaller in magnitude, even though their impact on design effects may be large (because effects on variance inflation factors are proportional to the unit sample size multiplied by the intraclass correlation). Our attempts to explore this question by calculating intraclass correlations with the inclusion of state dummy variables in some of the surveys yielded only negligible effects. 
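The practical weight of the difference between the .05–.15 guideline and the roughly .15–.25 range can be seen from the variance inflation factor (design effect), which grows with the within-school sample size multiplied by the intraclass correlation. The Python sketch below is illustrative only: the function names are hypothetical, and the design of 10 schools with 60 students each is an assumed example rather than one analyzed in the article.

```python
def design_effect(n, rho):
    """Variance inflation factor for clusters of size n with intraclass correlation rho."""
    return 1.0 + (n - 1) * rho

def effective_sample_size(m, n, rho):
    """Number of independent observations that m clusters of size n are worth."""
    return m * n / design_effect(n, rho)

# Assumed design: 10 schools per arm with 60 students each.
for rho in (0.05, 0.15, 0.25):
    deff = design_effect(60, rho)
    n_eff = effective_sample_size(10, 60, rho)
    print(rho, round(deff, 2), round(n_eff, 1))
```

Under these assumptions, moving from an intraclass correlation of .05 to .25 roughly quadruples the design effect, which is why planning with too small a value can leave an experiment badly underpowered.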
Note that the inclusion of multiple districts and states in national samples is also likely to have some impact on the effectiveness of the covariates in explaining between- and within-school variation. It is likely that the somewhat greater between-school variation in national samples leads to a larger intraclass correlation but also to larger covariate effects, so that these impacts partially cancel one another in their effects on statistical power. A more detailed compilation is available from the authors providing values for regions of the country, settings with different levels of urbanicity, and regions crossed with levels of urbanicity. However, it is important to recognize that there is a trade-off between bias (estimating exactly the right value of the intraclass correlation in a particular context) and variance (the sampling uncertainty of that estimate). The variance of the intraclass correlation estimate is driven primarily by the number of clusters (in this case, schools). Although the intraclass correlations we computed in a particular region and setting are more specific and therefore likely to have less bias as estimates of the intraclass correlation in an experiment that is to be conducted within a particular region and context, the sample size used to estimate the intraclass correlations is smaller, and thus the estimate is subject to greater sampling uncertainty. Our analyses suggest that although there is often statistically significant variation in intraclass correlations between regions and settings, the magnitude of this variation is typically small. Thus, it is not completely clear whether more specific estimates are always better (i.e., more accurate) for planning purposes.
It is important to note that the power computations illustrated in this article apply to two-level experiments in which students are nested within schools. If the sampling design used is actually a three-level design (e.g., if students are sampled by classrooms within schools) then the power computations given here (or given by specialized software for computing power in two-level designs) would not be correct. Consider a sample (e.g., for a treatment group) obtained by selecting m schools, then p classrooms within each school, and then n students within each classroom. This is not a simple random sample of mpn individuals, nor is it a (two-stage) clustered sample obtained by randomly selecting pn students within each cluster (school). Instead, it is a three-stage cluster sample of m clusters (schools) and p subclusters (classrooms), with n students randomly selected within each subcluster (classroom). The sampling distribution of statistics based on such three-stage clustered samples is not the same as those based on two-stage clustered samples of the same size. For example, suppose that the (total) variance of a population with clustered structure (such as a population of students within classrooms within schools) is This difference in precision of treatment effect estimates leads to a difference in the non-centrality parameters that determine statistical power. In a two-level experiment, the treatment effects are estimated from two-stage cluster samples, leading to the noncentrality parameter (with no covariates) of
where
which is generally smaller than that computed from Equation 13. Therefore, the statistical power of three-level experiments that assign schools to treatments is generally smaller than that of the analogous experiments with two-level designs having the same number of schools and students (see Konstantopoulos, 2006). Note, however, that the issue here is not which analysis is used (two- vs. three-level) but which sampling design is used (one vs. two stages of clustering within schools, that is, a two- vs. three-stage sampling design).
Although we anticipate that the principal use of the results given in this article will be for planning randomized experiments in education that assign schools (rather than individuals) to treatments, there are other potential applications. One involves the use of information external to an experiment to adjust the degrees of freedom of significance tests in designs involving group randomization, called the df* method by its originators (see Murray, Hannan, & Baker, 1996). Although the originators of this method caution that users should have good reasons to assume that any external estimates used estimate the same intraclass correlation as that in the experiment, there may be situations in which data from this compilation meet that assumption. Because they are based on relatively large samples, the intraclass correlation estimates reported in this article tend to have small standard errors. Consequently, if they are thought to be appropriate for use in a particular df* computation, they should substantially increase the degrees of freedom used in the test for treatment effects.
A second potential application is to evaluate whether the conclusions of statistical analyses that incorrectly ignored clustering might have changed if those significance tests had taken clustering into account. Hedges (in press-a) has shown how to compute the actual significance level of the usual t statistic when it has been computed from clustered samples (by incorrectly ignoring clustering). The computation of this actual significance level depends on
A third potential application involves the computation of standardized effect size estimates and their standard errors in group-randomized trials. There are several approaches to the computation of effect size estimates in multilevel designs, but in some cases, the computation of estimates and the computation of standard errors requires knowledge of
LARRY V. HEDGES is currently Board of Trustees Professor of Statistics, Professor of Education and Social Policy, and faculty fellow at the Institute for Policy Research at Northwestern University, 2040 North Sheridan Road, Evanston, IL 60610; l-hedges{at}northwestern.edu. His interests include methods for educational and social policy research. E. C. HEDBERG is currently an advanced graduate student in the Department of Sociology at the University of Chicago, NORC Research Centers, 1155 East 60th Street, Chicago, IL 60637; ech{at}uchicago.edu. He is part of many projects that span a wide variety of interests that include the sociology of family and the life course, education and methods. His dissertation research focuses on using context-effect models and dyadic analysis to understand familial social exchange between kin. This material is based upon work supported in part by the National Science Foundation under Grant No. 0129365 and the Spencer Foundation Grant Number 200100308. Received for publication March 16, 2006. Revision received December 13, 2006. Accepted for publication January 2, 2007.
Educational Evaluation and Policy Analysis, Vol. 29, No. 1, 60-87 (2007)