Educational Evaluation and Policy Analysis

Intraclass Correlation Values for Planning Group-Randomized Trials in Education

Larry V. Hedges

Northwestern University

E. C. Hedberg

University of Chicago


    Abstract
 TOP
 Abstract
 Key Findings
 Dimensions of Designs Considered
 Data Sets Used
 Analysis Procedures
 The Intraclass Correlation Data
 Results
 Comparison With Published...
 Agreement Among Intraclass...
 Minimum Detectable Effect Sizes
 Using the Results of...
 Hypothesis Testing
 Using Power Tables and...
 Example With No Covariates...
 Example With Pretest as...
 Conclusions
 References
 
Experiments that assign intact groups to treatment conditions are increasingly common in social research. In educational research, the groups assigned are often schools. The design of group-randomized experiments requires knowledge of the intraclass correlation structure to compute statistical power and sample sizes required to achieve adequate power. This article provides a compilation of intraclass correlation values of academic achievement and related covariate effects that could be used for planning group-randomized experiments in education. It also provides variance component information that is useful in planning experiments involving covariates. The use of these values to compute the statistical power of group-randomized experiments is illustrated.

Key Words: intraclass correlation • cluster randomized trials • experiments • statistical power

MANY social interventions operate at a group level by altering the physical or social conditions. In such cases, it may be difficult or impossible to assign individuals to receive different intervention conditions. In other cases, it may be possible to assign treatments to individuals, but for practical or political reasons, the assignment of individuals to treatments is not feasible. In either situation, field experiments may assign entire intact groups (such as sites, classrooms, or schools) to the same treatment, with different intact groups being assigned to different treatments. Because these intact groups correspond to what statisticians call clusters in sampling theory, this design is often called a group-randomized or cluster-randomized design. Cluster-randomized trials have been used extensively in public health and other areas of prevention science (see, e.g., Donner & Klar, 2000; Murray, 1998). Cluster-randomized trials have become more important in educational research more recently, following increased interest in experiments to evaluate educational interventions (see, e.g., Mosteller & Boruch, 2002). Methods for the design and analysis of group-randomized trials have been discussed extensively by Donner and Klar (2000) and Murray (1998).

The sampling of subjects into experiments via statistical clusters introduces special considerations that need to be addressed in the analysis. For example, a sample obtained from $m$ clusters (such as classrooms or schools) of size $n$ randomized into a treatment group is not a simple random sample of $nm$ individuals, even if it is based on a simple random sample of clusters. Instead, it is a two-stage sample (with one stage of clustering). Consequently, the sampling distribution of statistics based on such clustered samples is not the same as that based on simple random samples of the same size. For example, suppose that the (total) variance of a population with clustered structure (such as a population of students within schools) is $\sigma_T^2$ and that this total variance is decomposable into a between-cluster variance, $\sigma_B^2$, and a within-cluster variance, $\sigma_W^2$, so that $\sigma_T^2 = \sigma_B^2 + \sigma_W^2$. Then the variance of the mean of a simple random sample of size $mn$ from that population would be $\sigma_T^2/(mn)$. However, the variance of the mean of a sample of $m$ clusters, each of size $n$, from that population (with the same total sample size $mn$) would be $[1 + (n - 1)\rho]\,\sigma_T^2/(mn)$, where $\rho = \sigma_B^2/(\sigma_B^2 + \sigma_W^2)$ is the intraclass correlation. Thus, the variance of the mean computed from a clustered sample is larger by a factor of $[1 + (n - 1)\rho]$, which is often called the design effect (Kish, 1965) or variance inflation factor (Donner, Birkett, & Buck, 1981).
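
As a rough illustration of the design effect described above, the following Python sketch computes the variance inflation factor and the resulting variance of a clustered-sample mean. The function names and the planning values (40 schools of 25 students each, an intraclass correlation of .20, and a total variance of 1) are hypothetical and chosen only for illustration; they are not taken from the article's tables.

```python
def design_effect(n: int, rho: float) -> float:
    """Variance inflation factor 1 + (n - 1) * rho for clusters of size n."""
    return 1.0 + (n - 1) * rho


def variance_of_clustered_mean(total_var: float, m: int, n: int, rho: float) -> float:
    """Variance of the mean of m clusters of size n: deff * total_var / (m * n)."""
    return design_effect(n, rho) * total_var / (m * n)


# Hypothetical planning values (assumptions, not values from the article).
m, n, rho, total_var = 40, 25, 0.20, 1.0

deff = design_effect(n, rho)
srs_var = total_var / (m * n)  # simple random sample of the same total size
clustered_var = variance_of_clustered_mean(total_var, m, n, rho)
effective_n = m * n / deff     # number of independent observations with equal precision

print(f"design effect = {deff:.1f}")
print(f"variance of mean, simple random sample = {srs_var:.4f}")
print(f"variance of mean, clustered sample     = {clustered_var:.4f}")
print(f"effective sample size = {effective_n:.0f}")
```

Under these assumed values the design effect is 5.8, so the 1,000 sampled students carry roughly the information of about 172 independently sampled students.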

Several analytical strategies for cluster-randomized trials are possible, but the simplest is to treat the clusters as units of analysis, that is, to compute mean scores on the outcome (and all other variables that may be involved in the analysis) and carry out the statistical analysis as if the site (cluster) means were the data. If all cluster sample sizes are equal, this approach provides exact tests for the treatment effect, but the tests may have lower statistical power than would be obtained by other approaches (see, e.g., Blair & Higgins, 1986). More flexible and informative analyses are also available, including analyses of variance using clusters as a nested factor (see, e.g., Hopkins, 1982) and analyses involving hierarchical linear models (see, e.g., Raudenbush & Bryk, 2002). For general discussions of the design and analyses of cluster-randomized experiments, see Murray (1998); Bloom, Bos, and Lee (1999); Donner and Klar (2000); Klar and Donner (2001); Raudenbush and Bryk (2002); Murray, Varnell, and Blitstein (2004); or Bloom (2005).

Wise experimental design involves the planning of sample sizes so that the test for treatment effects has adequate statistical power to detect the smallest treatment effects that are of scientific or practical interest. There is an extensive literature on the computation of statistical power (e.g., Cohen, 1977; Kraemer & Thiemann, 1987; Lipsey, 1990). Much of this literature involves the computation of power in studies that use simple random samples. However, methods for the computation of statistical power of tests for treatment effects using the cluster mean as the unit of analysis (Blair & Higgins, 1986), analysis of variance using clusters as a nested factor (Raudenbush, 1997), and hierarchical linear model analyses (Snijders & Bosker, 1993) are available. For all of these analyses, the noncentrality parameter required to compute statistical power involves the intraclass correlation $\rho$ (which was defined above but will be defined formally in Equation 1). More complex analyses involving covariates require corresponding information (covariate effects or the conditional intraclass correlations after adjustment for covariates). Thus, the computation of statistical power in cluster-randomized trials requires knowledge of the intraclass correlation $\rho$.

Because plausible values of $\rho$ are essential for power and sample-size computations in planning cluster-randomized experiments, there have been systematic efforts to obtain information about reasonable values of $\rho$ in realistic situations. One strategy for obtaining information about reasonable values of $\rho$ is to obtain these values from cluster-randomized trials that have been conducted. Murray and Blitstein (2003) reported a summary of intraclass correlations obtained from 17 articles reporting cluster-randomized trials in psychology and public health, and Murray et al. (2004) gave references to 14 very recent studies that provide data on intraclass correlations for health-related outcomes. Another strategy for obtaining information on reasonable values of $\rho$ is to analyze sample surveys that have used a cluster-sampling design involving the clusters of interest. Gulliford, Ukoumunne, and Chinn (1999) and Verma and Lee (1996) presented values of intraclass correlations on the basis of surveys of health outcomes.

There is much less information about intraclass correlations appropriate for studies of academic achievement as an outcome. Such information is badly needed to inform the design of experiments that measure the effects of interventions on academic achievement by randomizing schools (Schochet, 2005). One compendium of intraclass correlation values on the basis of five large urban school districts in which randomized trials have been conducted has recently become available (see Bloom, Richburg-Hayes, & Black, 2007 [this issue]). The purpose of this article is to provide a comprehensive collection of intraclass correlations of academic achievement on the basis of nationally representative samples. We hope that this compilation will be useful in choosing reference values for planning cluster-randomized experiments.


    Key Findings
We find that across Grades K–12, the average (unadjusted) intraclass correlation is about .22 for all schools, about .19 for low–socioeconomic status (SES) schools, and about .09 for low-achievement schools. These average intraclass correlations are very similar in reading and mathematics. Note that except in low-achievement schools, these intraclass correlation values are somewhat higher than the guidelines of .05–.15 that are often used. Pretests can explain a substantial amount of the between- and within-school variance when used as covariates. Covariates can substantially increase statistical power by explaining between- and within-school variance. Pretest scores typically explain over three quarters of the between-school variance and over one half of the within-school variance in all schools and in low-SES schools, but they explain somewhat less variance in low-achievement schools. Demographic characteristics are less effective covariates, but they can explain up to one half of the between-school variance in all and low-SES schools. In general, demographic characteristics, when used in addition to pretest scores, explain little additional variance. The remainder of this article gives the methods and data sources that were used, presents the results in detail, and illustrates how to use these results to compute statistical power.


    Dimensions of Designs Considered
Our analyses focused on intraclass correlations for designs involving the assignment of schools to treatments. Unfortunately, there is a wide variety of designs that might be used to study education interventions, and each of these designs may have its own intraclass correlation (or conditional intraclass correlation) structure. To attempt to provide a reasonable coverage of the designs most likely to be of interest to researchers planning educational experiments, we considered four dimensions of intervention designs. The first dimension of the design is the grade level. The second dimension of the design is what achievement domain (e.g., reading or mathematics) is the dependent variable. The third dimension of the design is the set of covariates that were used in the analysis, if any. Finally, the fourth dimension is the SES or achievement status of schools sampled in the overall population of schools. These four dimensions of designs can vary independently. We examined all possible combinations of them.

Grade Level of Students and Achievement Domain
We examined each grade level from kindergarten through Grade 12 and both mathematics and reading achievement at each grade level, with one exception. The exception was reading achievement in Grade 11, for which data on a nationally representative sample were not available to us.

Covariates Used in the Design
We consider four data analysis models involving different covariate sets that we believe are likely to be of considerable interest to educational researchers. The first, the unconditional model, involves the testing of treatment effects with no covariates. This is the minimal design but one that is likely to be of interest in many settings in which researchers have little opportunity to collect prior information about the individuals participating in the experiment.

The second model, which we call the demographic covariates model, involves the testing of treatment effects conditional on covariates that are ascriptive characteristics of students frequently invoked in models of educational achievement, namely, gender, race or ethnicity, and SES. This design may be appropriate when researchers can obtain prior, contemporaneous, or retrospective data from administrative records (appropriate because these covariates are unlikely to change).

The third model, which we call the pretest covariates model, involves the testing of treatment effects using pretest scores on the same achievement domain (mathematics or reading) as a covariate. This design is likely to be considerably more powerful than the previous designs but involves the additional cost of collecting another wave of test data and the additional organizational burden of making that data collection in a timely manner.

The fourth model, which we call the pretest and demographic covariates model, involves the testing of treatment effects using the ascriptive characteristics of students (gender, race or ethnicity, and SES) and pretest scores on the same achievement domain as the covariates. This design combines the covariate sets of the previous two designs.

SES or Achievement Status of Schools Within Their Settings
Some experimenters undoubtedly wish to use a representative sample of schools within whatever setting they choose to study. Consequently, one population of schools we considered was the entire collection of schools within a setting.

Researchers sometimes make decisions to carry out their studies in schools that lie within the middle range of outcomes, omitting schools that have had (or are reputed to have had) the very poorest and the very best outcomes, on the rationale that neither the very poorest schools nor the very best schools give a fair test of an intervention. We operationalized this notion by ordering, on average achievement, the entire sample of schools in a setting and selecting the middle 80% of the schools in each setting, omitting the top and bottom 10% of the schools.

Some interventions are designed to be compensatory. Experimenters investigating such interventions might choose only schools within a particular context that have low mean achievement or large numbers of low-SES students to evaluate the intervention. We operationalized low achievement by ordering, on average achievement, the entire sample of schools in a setting and selecting the lower 50% of the schools, omitting the upper 50% of the schools. We operationalized low SES by ordering, on the proportion of students eligible for free or reduced-price lunch, the entire sample of schools in a setting and selecting the upper 50% of the schools, omitting the bottom 50% of the schools. One might argue for a more extreme definition of low-SES or low-achievement schools (e.g., the lower 30% of schools). We chose the lower 50% of schools to balance the construct definition (low achievement or low SES) against the need for a sample large enough to yield sufficiently precise estimates of the parameters of interest. The choice we made yields some standard errors that are on the order of .02, corresponding to a 2-SE band on either side of the estimate (a very crude 95% confidence interval) of width .08. Because even this range is large enough to have important substantive consequences, we judged that restricting the proportion of schools in the definition of the low-SES or low-achievement sample (which would decrease sample sizes of those groups) would lead to unacceptable imprecision.
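
The school subsets described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `School` record, the function names, and the toy data are hypothetical, and the handling of ties and of samples whose size is not divisible by 10 (here, simple integer truncation) is an assumption, because the article does not describe those details.

```python
from typing import List, NamedTuple


class School(NamedTuple):
    school_id: str
    mean_achievement: float
    prop_free_lunch: float  # proportion eligible for free or reduced-price lunch


def middle_80_percent(schools: List[School]) -> List[School]:
    """Order schools on mean achievement and drop the top and bottom 10%."""
    ranked = sorted(schools, key=lambda s: s.mean_achievement)
    cut = len(ranked) // 10  # assumed rounding rule
    return ranked[cut:len(ranked) - cut]


def low_achievement(schools: List[School]) -> List[School]:
    """Keep the lower 50% of schools on mean achievement."""
    ranked = sorted(schools, key=lambda s: s.mean_achievement)
    return ranked[:len(ranked) // 2]


def low_ses(schools: List[School]) -> List[School]:
    """Keep the upper 50% of schools on the free or reduced-price lunch proportion."""
    ranked = sorted(schools, key=lambda s: s.prop_free_lunch, reverse=True)
    return ranked[:len(ranked) // 2]


# Toy data: ten made-up schools in which achievement rises as poverty falls.
schools = [School(f"S{i}", 40.0 + 2 * i, 0.9 - 0.08 * i) for i in range(10)]

middle_ids = [s.school_id for s in middle_80_percent(schools)]
low_ach_ids = [s.school_id for s in low_achievement(schools)]
low_ses_ids = sorted(s.school_id for s in low_ses(schools))
print(middle_ids, low_ach_ids, low_ses_ids)
```

In the toy data the low-achievement and low-SES subsets coincide only because the two made-up variables are perfectly inversely related; in real samples the two subsets would generally differ.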


    Data Sets Used
The object of this article is to estimate intraclass correlations and associated variance components for academic achievement in reading and mathematics for the United States and various subpopulations. Consequently, we relied on data from longitudinal surveys with national probability samples, all of which are described in detail elsewhere. We chose longitudinal surveys because we wished to use achievement data collected in earlier years as pretest data for evaluating conditional intraclass correlations relevant for planning studies that would use a pretest as a covariate. In some cases, more than one survey could have provided data on a given grade level. In such cases, we generally report here results on the basis of the survey with the largest sample size, although we made an exception to this principle when the larger sample was for the base year of a longitudinal study that would have provided no pretest data. Some general information about the surveys used in our main analyses is reported in Table 1.


TABLE 1 Characteristics of Data Sets Used in This Analysis

 
The results reported for kindergarten, Grade 1, and Grade 3 were obtained from three waves of the Early Childhood Longitudinal Study (ECLS). The ECLS is a longitudinal study that obtained a national probability sample of kindergarten children in 1,591 schools in 1998 and followed them through the fifth grade (see Tourangeau et al., 2005). Achievement test data were collected in both fall and spring of kindergarten and first grade and in spring only in third and fifth grades. There was no data collection in second and fourth grades. Thus, fall achievement test data collected in the same year could serve as a pretest in kindergarten and first grades, while data collected in the spring of the first grade served as pretest data for the third grade.

The results reported for Grade 2 were obtained from the first follow-up to the first grade (base year) sample, and those reported for Grades 4–6 were obtained from the three follow-ups of the third grade (base year) sample in the Prospects study. The results in reading in Grades 7 and 9 were obtained from the base year and the second follow-up of the seventh grade sample in the Prospects study. Prospects was actually a set of three longitudinal studies, starting with (base year) national probability samples of children in 235, 240, and 137 schools, in Grades 1, 3, and 7, respectively, conducted in 1991 (for a complete description of the study design, see Puma, Karweit, Price, Riccuti, & Vaden-Kiernan, 1997). Achievement test data were collected for 3–4 years thereafter for each sample. Thus, the three Prospects studies collected data in Grades 1 (both fall and spring), 2, and 3; Grades 3, 4, 5, and 6; and Grades 7, 8, and 9. There were pretest data in the base year for Grade 1, but no pretest data for the base years in Grades 3 and 7. For all years except the base year, the previous year’s achievement test data were used as a pretest, and in Grade 1, the test data collected in fall served as a pretest.

The results reported on reading in Grades 8, 10, and 12 and mathematics in Grades 10 and 12 were obtained from the National Educational Longitudinal Study of the Eighth Grade Class of 1988, a longitudinal study that began in 1988 with a national probability sample of eighth graders in 1,050 schools and collected reading and mathematics achievement test data when the students were in Grades 8, 10, and 12 (Curtin et al., 2002). Thus, no pretest data were available for Grade 8, but for Grade 10, the Grade 8 data were used as a pretest, and for Grade 12, the Grade 10 data were used as a pretest.

Finally, the results on mathematics in Grades 7, 8, 9, and 11 were obtained from the base year and follow-ups of the Longitudinal Study of American Youth (LSAY; see J. D. Miller, Hoffer, Suchner, Brown, & Nelson, 1992). The LSAY is a longitudinal study that began in 1987 with two national probability samples, one of 7th graders and one of 10th graders in 104 schools. Data were collected on mathematics and science achievement each year for 4 years, leading to samples from Grades 7 to 12. There were no pretest data in Grade 7, but the previous year’s data served as the pretest for each subsequent year.


    Analysis Procedures
The data analysis was carried out using Stata 9.1’s XTMIXED routine for mixed linear model analysis. For each sample and achievement domain, analyses were carried out on the basis of four different models, which we call the unconditional model, the pretest covariate model, the demographic covariates model, and the pretest and demographic covariates model. We describe these explicitly below in hierarchical linear model notation.

The Unconditional Model
The unconditional model involves no covariates at either the individual or school (cluster) level. The Level 1 model for the kth observation in the jth school can be written as


$$Y_{jk} = \beta_{0j} + \varepsilon_{jk}$$

and the Level 2 model for the intercept is


$$\beta_{0j} = \pi_{00} + \zeta_j$$

where $Y_{jk}$ is the achievement score, $\pi_{00}$ is the overall intercept, $\varepsilon_{jk}$ is an individual-level residual for the kth person in the jth school, and $\zeta_j$ is a random effect (a Level 2 residual) associated with the jth school. In this analysis, the between-person, within-school variance component is $\sigma_W^2$ (the variance of $\varepsilon_{jk}$), and the between-school variance component is $\sigma_B^2$ (the variance of $\zeta_j$).
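
To make the two variance components of the unconditional model concrete, here is a minimal Python sketch that estimates them from balanced data with the classical one-way analysis-of-variance (method-of-moments) estimator. This is a deliberately simpler stand-in and not the mixed-model routine used for the article's estimates; the function name and the toy scores are hypothetical, and the sketch assumes every school contributes the same number of students.

```python
from statistics import mean
from typing import List, Tuple


def anova_variance_components(groups: List[List[float]]) -> Tuple[float, float]:
    """Return (between-school, within-school) variance estimates for balanced data."""
    m = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal cluster sizes")
    school_means = [mean(g) for g in groups]
    grand_mean = mean(school_means)
    # Mean squares between and within schools.
    ms_between = n * sum((sm - grand_mean) ** 2 for sm in school_means) / (m - 1)
    ms_within = sum(
        (y - sm) ** 2 for g, sm in zip(groups, school_means) for y in g
    ) / (m * (n - 1))
    # E[ms_between] = within + n * between, so solve for the between component
    # and truncate negative estimates at zero.
    between = max((ms_between - ms_within) / n, 0.0)
    return between, ms_within


# Toy data: three made-up schools with two students each.
toy_scores = [[1.0, 3.0], [5.0, 7.0], [9.0, 11.0]]
sigma_b2, sigma_w2 = anova_variance_components(toy_scores)
rho = sigma_b2 / (sigma_b2 + sigma_w2)
print(sigma_b2, sigma_w2, round(rho, 3))
```

For the toy scores the between-school component is 15 and the within-school component is 2, so nearly all of the variance lies between schools; real achievement data show a far smaller between-school share.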

The Pretest Covariate Model
If pretest scores on achievement are available, they can be a powerful covariate and considerably increase power in experimental designs. The pretest covariate model involves using as a covariate the cluster-centered pretest score at the individual level and the school mean pretest score at the school level. We used group (school) mean centering because it leads to more stable estimates of variance components when, as in the present analyses, the covariate values vary substantially across schools (see Raudenbush & Bryk, 2002, p. 143). Thus, the Level 1 model for the kth observation in the jth school can be written as


$$Y_{jk} = \beta_{0j} + \beta_{1j}(X_{jk} - \bar{X}_j) + \varepsilon_{jk}$$

and the Level 2 model for the intercept is


$$\beta_{0j} = \pi_{00} + \pi_{01}\bar{X}_j + \zeta_j$$

where $Y_{jk}$ is the achievement score, $X_{jk}$ is the achievement pretest score for the kth observation in the jth school, $\bar{X}_j$ is the pretest mean for the jth school, $\pi_{00}$ and $\pi_{01}$ are fixed Level 2 coefficients, $\varepsilon_{jk}$ is an individual-level residual, and $\zeta_j$ is a random effect of the jth school (a Level 2 residual); the covariate slope $\beta_{1j}$ was treated as equal in all clusters (schools). In this analysis, the covariate-adjusted between-person, within-school variance component is $\sigma_{AW}^2$ (the variance of $\varepsilon_{jk}$), and the covariate-adjusted between-school variance component is $\sigma_{AB}^2$ (the variance of $\zeta_j$).
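
The group (school) mean centering used for the pretest covariate can be sketched as follows. The function name and the toy pretest scores are hypothetical; the sketch simply splits each pretest score into a school mean (the school-level covariate) and a deviation from that mean (the individual-level covariate).

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def group_mean_center(
    records: List[Tuple[str, float]]
) -> List[Tuple[str, float, float]]:
    """Map (school_id, pretest) to (school_id, centered pretest, school mean)."""
    by_school: Dict[str, List[float]] = defaultdict(list)
    for school_id, pretest in records:
        by_school[school_id].append(pretest)
    school_means = {s: mean(values) for s, values in by_school.items()}
    return [(s, x - school_means[s], school_means[s]) for s, x in records]


# Toy data: two made-up schools with two students each.
pretests = [("A", 10.0), ("A", 14.0), ("B", 20.0), ("B", 30.0)]
centered = group_mean_center(pretests)
for row in centered:
    print(row)
```

Because the centered scores sum to zero within every school, the individual-level covariate carries only within-school variation, and all between-school variation in the pretest is carried by the school means.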

The Demographic Covariates Model
Sometimes pretest scores are not available, but other background information about individuals is available to serve as covariates. The demographic covariates model includes four covariates at each of the individual and group (cluster) levels. At the individual level, the covariates are dummy variables for male gender and for Black or Hispanic status and an index of mother’s and father’s levels of education as a proxy for SES. As recommended by Raudenbush and Bryk (2002), each of these individual-level covariates was group centered, that is, transformed by subtracting the group mean as shown in the equation for the Level 1 model below. The school-level covariates were the means of the individual-level variables for each school (cluster). Therefore, the Level 1 model for the kth observation in the jth school can be written as


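$$Y_{jk} = \beta_{0j} + \beta_{1j}(G_{jk} - \bar{G}_j) + \beta_{2j}(B_{jk} - \bar{B}_j) + \beta_{3j}(H_{jk} - \bar{H}_j) + \beta_{4j}(E_{jk} - \bar{E}_j) + \varepsilon_{jk}$$

where $Y_{jk}$ is the achievement score; $G_{jk}$, $B_{jk}$, and $H_{jk}$ are dummy variables for male gender, Black status, and Hispanic status, respectively; $E_{jk}$ is an index of mother’s and father’s levels of education (which is a proxy for family SES); and $\bar{G}_j$, $\bar{B}_j$, $\bar{H}_j$, and $\bar{E}_j$ are the means of $G$, $B$, $H$, and $E$ in the jth school (cluster). The Level 2 model for the intercept is


$$\beta_{0j} = \pi_{00} + \pi_{01}\bar{G}_j + \pi_{02}\bar{B}_j + \pi_{03}\bar{H}_j + \pi_{04}\bar{E}_j + \zeta_j$$

where the $\pi$ terms are fixed Level 2 coefficients, and the covariate slopes $\beta_{1j}$, $\beta_{2j}$, $\beta_{3j}$, and $\beta_{4j}$ were treated as equal in all clusters (schools). In this analysis, the covariate-adjusted between-person, within-school variance component is $\sigma_{AW}^2$ (the variance of $\varepsilon_{jk}$), and the covariate-adjusted between-school variance component is $\sigma_{AB}^2$ (the variance of $\zeta_j$).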

The Pretest and Demographic Covariates Model
The pretest and demographic covariates model combines the use of an achievement pretest and the individual characteristics of gender, minority group status, and parent’s education as individual- and school-level covariates. Therefore, the Level 1 model for the kth observation in the jth school can be written as


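$$Y_{jk} = \beta_{0j} + \beta_{1j}(X_{jk} - \bar{X}_j) + \beta_{2j}(G_{jk} - \bar{G}_j) + \beta_{3j}(B_{jk} - \bar{B}_j) + \beta_{4j}(H_{jk} - \bar{H}_j) + \beta_{5j}(E_{jk} - \bar{E}_j) + \varepsilon_{jk}$$

where all of the symbols are defined as in the models above. The Level 2 model for the intercept is


$$\beta_{0j} = \pi_{00} + \pi_{01}\bar{X}_j + \pi_{02}\bar{G}_j + \pi_{03}\bar{B}_j + \pi_{04}\bar{H}_j + \pi_{05}\bar{E}_j + \zeta_j$$

and the covariate slopes $\beta_{1j}$, $\beta_{2j}$, $\beta_{3j}$, $\beta_{4j}$, and $\beta_{5j}$ were treated as equal in all clusters (schools). In this analysis, the covariate-adjusted between-person, within-school variance component is $\sigma_{AW}^2$ (the variance of $\varepsilon_{jk}$), and the covariate-adjusted between-school variance component is $\sigma_{AB}^2$ (the variance of $\zeta_j$).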


    The Intraclass Correlation Data
The (unconditional) intraclass correlation associated with the unconditional model described above is


$$\rho = \frac{\sigma_B^2}{\sigma_B^2 + \sigma_W^2} = \frac{\sigma_B^2}{\sigma_T^2} \tag{1}$$

where $\sigma_T^2 = \sigma_B^2 + \sigma_W^2$ is the (unconditional) total variance. Note that the residuals $\varepsilon_{jk}$ and $\zeta_j$ correspond to the within- and between-cluster random effects in an experiment that assigned schools to treatments. Consequently, the variance components associated with these random effects and the intraclass correlation correspond to those in a cluster-randomized experiment that assigned schools to treatments and analyzed the data with no covariates.

In the three models involving covariate adjustment, the (covariate-adjusted) intraclass correlation is


$$\rho_A = \frac{\sigma_{AB}^2}{\sigma_{AB}^2 + \sigma_{AW}^2} = \frac{\sigma_{AB}^2}{\sigma_{AT}^2} \tag{2}$$

where $\sigma_{AT}^2 = \sigma_{AB}^2 + \sigma_{AW}^2$ is the (covariate-adjusted) total variance. Note that the residuals $\varepsilon_{jk}$ and $\zeta_j$ correspond to the within- and between-cluster random effects in an experiment that assigned schools to treatments and used the same covariates as were used in the models with covariates. Consequently, the variance components associated with these random effects and the conditional intraclass correlation $\rho_A$ correspond to those in a cluster-randomized experiment that assigned schools to treatments and analyzed the data with these (individual and school mean) characteristics as covariates.

For each combination of design dimensions (i.e., for each grade level, achievement domain, covariate set, setting, and choice of SES or achievement status within setting), we estimated the intraclass correlation (or conditional intraclass correlation) via restricted maximum likelihood using Stata and computed the standard error of that intraclass correlation estimate using the result given in Donner and Koval (1982). This resulted in 13 (grade levels) × 2 (achievement domains) × 4 (covariate sets) × 4 (SES or achievement statuses within settings) = 416 intraclass correlation estimates (each with a corresponding standard error).
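
The count of estimates follows from fully crossing the four design dimensions, which can be checked with a short enumeration. The labels below are shorthand chosen for this sketch rather than the article's exact wording.

```python
from itertools import product

# Shorthand labels for the four design dimensions (naming is an assumption).
grades = ["K"] + [str(g) for g in range(1, 13)]  # kindergarten through Grade 12
domains = ["mathematics", "reading"]
covariate_sets = ["unconditional", "demographic", "pretest", "pretest and demographic"]
school_populations = ["all schools", "middle 80%", "low SES", "low achievement"]

design_cells = list(product(grades, domains, covariate_sets, school_populations))
print(len(design_cells))  # 13 * 2 * 4 * 4 = 416
print(design_cells[0])
```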

For designs that use covariates, we also provide values of


$$\eta_B^2 = \frac{\sigma_{AB}^2}{\sigma_B^2} \tag{3}$$

the proportion of between-school variance remaining, and


$$\eta_W^2 = \frac{\sigma_{AW}^2}{\sigma_W^2} \tag{4}$$

the proportion of within-school variance remaining, respectively, after covariate adjustment. For designs involving covariates, these two auxiliary quantities ($\eta_B^2$ and $\eta_W^2$) are useful in computing statistical power. Their use is illustrated in a subsequent section of this article.

Two alternative parameters that contain the same information as $\eta_B^2$ and $\eta_W^2$ are $R_B^2 = 1 - \eta_B^2$ and $R_W^2 = 1 - \eta_W^2$, the proportions of between- and within-school variance explained by the covariate. We chose to tabulate the $\eta^2$ values instead of the $R^2$ values because the relation of the $\eta^2$ values to the noncentrality parameters used in power analysis is simpler.
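
The relations among the unconditional and covariate-adjusted quantities defined in this section can be sketched in Python. The function names and the four variance components below are hypothetical values chosen for easy arithmetic, not estimates from the article.

```python
from typing import Dict


def icc(between_var: float, within_var: float) -> float:
    """Intraclass correlation: the between-school share of the total variance."""
    return between_var / (between_var + within_var)


def covariate_summary(
    sigma_b2: float, sigma_w2: float, sigma_ab2: float, sigma_aw2: float
) -> Dict[str, float]:
    """Summarize unadjusted and covariate-adjusted variance components."""
    eta_b2 = sigma_ab2 / sigma_b2  # between-school variance remaining
    eta_w2 = sigma_aw2 / sigma_w2  # within-school variance remaining
    return {
        "rho": icc(sigma_b2, sigma_w2),
        "rho_a": icc(sigma_ab2, sigma_aw2),
        "eta_b2": eta_b2,
        "eta_w2": eta_w2,
        "r2_b": 1.0 - eta_b2,  # between-school variance explained
        "r2_w": 1.0 - eta_w2,  # within-school variance explained
    }


# Hypothetical variance components (assumptions for illustration only).
s = covariate_summary(sigma_b2=0.20, sigma_w2=0.80, sigma_ab2=0.04, sigma_aw2=0.40)

# The adjusted intraclass correlation can also be recovered from rho and the etas.
rho_a_from_etas = s["eta_b2"] * s["rho"] / (
    s["eta_b2"] * s["rho"] + s["eta_w2"] * (1.0 - s["rho"])
)
print({k: round(v, 4) for k, v in s.items()}, round(rho_a_from_etas, 4))
```

The last computation shows that the unconditional intraclass correlation together with the two proportions of remaining variance is enough to recover the covariate-adjusted intraclass correlation.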

Note that each of the four analyses involved slightly different variables, and there were missing values on some of these variables in our survey data. We decided to compute each analysis on the largest set of cases that had all of the necessary variables for the analysis in question. This means that each of the four analyses of a given data set is computed on a slightly different set of cases. Because the quantities $\eta_W^2$ and $\eta_B^2$ involve a comparison of two different analyses (one with and one without a particular set of covariates), we believed that it was important to make this comparison using estimates derived from exactly the same set of cases. Consequently, for each of the analyses that involved covariates, we recomputed the estimates of the unadjusted variance components, $\sigma_W^2$ and $\sigma_B^2$, using only the cases that were used to compute the adjusted variance components $\sigma_{AW}^2$ and $\sigma_{AB}^2$ and used these particular estimates to compute the $\eta_W^2$ and $\eta_B^2$ values given here.
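
The case-matching rule described above amounts to selecting complete cases separately for each analysis and then reusing the covariate model's cases for the unadjusted comparison. A minimal sketch, with a hypothetical function name, hypothetical variable names, and made-up records, might look like this:

```python
from typing import Dict, List, Optional, Sequence

Record = Dict[str, Optional[float]]


def complete_cases(records: List[Record], required: Sequence[str]) -> List[Record]:
    """Keep only the records with no missing value on any required variable."""
    return [r for r in records if all(r.get(name) is not None for name in required)]


# Made-up records; None marks a missing value.
records: List[Record] = [
    {"school": 1.0, "score": 52.0, "pretest": 48.0},
    {"school": 1.0, "score": 47.0, "pretest": None},
    {"school": 2.0, "score": 61.0, "pretest": 59.0},
    {"school": 2.0, "score": None, "pretest": 40.0},
]

# Largest usable sample for the unconditional analysis on its own.
unconditional_cases = complete_cases(records, ["school", "score"])

# Cases usable for the pretest model. For the comparison of adjusted and
# unadjusted variance components, BOTH sets of components would be estimated
# on this same subset.
pretest_cases = complete_cases(records, ["school", "score", "pretest"])

print(len(unconditional_cases), len(pretest_cases))
```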

Although we provide estimates of the standard errors of the intraclass correlations, they should be used with some caution for two reasons. First, the distribution of estimates of the intraclass correlations is only approximately normal. Second, not all of these values are independent of one another, and it is not immediately clear how to carry out a formal statistical analysis of differences between estimates of intraclass correlations computed from the same sample of individuals. Nevertheless, we feel that these standard errors are useful as descriptions of the uncertainty of the individual estimates of intraclass correlations.


    Results
We found that the intraclass correlations obtained from the nationally representative sample and from the schools in the middle 80% of the achievement distribution were almost identical. Consequently, we present here only the intraclass correlation data from the entire national sample of schools, from schools in the upper half of the free and reduced-price lunch distribution (low-SES schools), and from schools in the lower half of the school mean achievement distribution (low-achievement schools).

The main results of this study are presented in Tables 2–7 and discussed in the sections that follow. Each table is divided into four vertical panels of three columns each, one panel for each of the four analyses described above. The data for each grade level are given in a different row. In the row for each grade, the columns of each panel provide the estimates of the intraclass correlation ({rho}), the standard error of the estimate of {rho}, and (for all but the unconditional model given in the first panel on the left-hand side) estimates of {eta}B2 and {eta}W2. For example, consider the data in Table 2 for the pretest covariate model for Grade 1, given in the third panel of the table. On the row associated with Grade 1, the values in the columns of the third panel (columns 8–11 of the table) are .125, .0135, .177, and .376, respectively, which correspond to estimates for {rho}A, the standard error of the estimate of {rho}A, {eta}B2, and {eta}W2.


TABLE 2 Intraclass Correlations (ICCs) and Variance Components for Mathematics Achievement: All Schools

 

TABLE 7 Intraclass Correlations (ICCs) and Variance Components for Reading Achievement: Low-Achievement Schools

 
To help interpret the tables as a whole, the bottom four rows of each table give summary statistics (across grades) of the estimates of {rho}A, {eta}B2, and {eta}W2, including the mean, the intercept (a) and slope (b) of an unweighted regression of the estimates on grade level (with kindergarten equaling Grade 0), and the correlation (r) between estimates and grade level. For example, in Table 2, the mean intraclass correlation in the unconditional model is .220, the correlation between grade and intraclass correlation is –.443, and the regression equation for predicting the unconditional intraclass correlation from grade is .242 – .004(grade).

Mathematics Achievement in the Full Population
Table 2 presents results from the entire national sample in mathematics. The average unconditional intraclass correlation estimate across all grades is .220. Although the intraclass correlations tend to be larger at lower grades, in general, there are not large changes across adjacent grade levels. Few of these differences exceed 2 standard errors of the difference. A notable exception is the unadjusted intraclass correlation for Grade 11, for which the difference between Grade 11 and either of the adjacent grades is about 3 standard errors of the difference. None of the differences between adjusted intraclass correlations in adjacent grades is as large as 3 standard errors of the difference, but the values for Grade 2 are somewhat higher (by over 2 standard errors of the difference) and those for Grade 3 somewhat lower than those of adjacent grades.

The linear regression coefficients (the intercept a and slope b) of each of the tabled quantities on grade, given at the bottom of each column of the table, permit the computation of smoothed estimates of each quantity as a + b(grade). For example, the values of a and b for the unadjusted intraclass correlation are a = .242 and b = –.004, so that the smoothed (interpolated) value of the unadjusted intraclass correlation for Grade 11 would be .242 + (–.004)(11) = .198, somewhat higher than the tabled value of .138.
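The smoothing rule a + b(grade) can be sketched in a few lines of Python. The function name is hypothetical and chosen only for illustration; the coefficients are the Table 2 values quoted above for the unconditional model, with kindergarten coded as Grade 0.

```python
def smoothed_estimate(intercept, slope, grade):
    """Return the regression-smoothed value a + b * grade.

    `grade` uses the coding described in the text (kindergarten = 0).
    """
    return intercept + slope * grade


# Unconditional ICC in mathematics, full national sample (Table 2 summary rows).
a, b = 0.242, -0.004
icc_grade_11 = smoothed_estimate(a, b, 11)
print(f"Smoothed ICC for Grade 11: {icc_grade_11:.3f}")
```

The same function applies to any tabled quantity ({rho}, {eta}B2, or {eta}W2) once the corresponding a and b are read from the bottom rows of the relevant table.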

The patterns of reduction of between- and within-cluster (school) variances are generally quite different in models involving different covariates. Specifically, the demographic covariate analyses typically reduced the between-cluster variance to one half to one quarter of its value in the unconditional model (e.g., produced {eta}B2 values from .25 to .5), but typically reduced within-cluster variance by 10% or less (e.g., produced {eta}W2 values greater than 0.9). Thus the use of ascriptive characteristics as covariates (as in the demographic covariates model) may lead to increased statistical power. The residualized analyses using pretest score as a covariate typically resulted in larger reductions in between-cluster variance (e.g., produced {eta}B2 values from .1 to .3) and typically also reduced within-cluster variance by a much larger amount than the demographic covariates model (e.g., produced {eta}W2 values from .25 to .5). In general, demographic characteristics explain little additional variance (at either the student or the school level) beyond what is explained by the pretest, and thus their inclusion in analysis models does not appear to be useful if pretest scores are available.

There is one apparent anomaly in the results reported in Table 2. The {eta}2 values for the pretest and demographic covariates model are often larger than those for the pretest covariate model. This is equivalent to saying that the estimated variance accounted for decreases when ascriptive characteristics are added as covariates to the model that already has pretest as a covariate. It is theoretically possible for this to occur in multilevel models when the actual differences are negligible, as they appear to be here. The difference is particularly large in the sixth grade data, however, and appears to be a consequence of differences between the samples used to estimate the two models. For unknown reasons, there is a considerable amount of missing data on the demographic covariates used to create the demographic covariates and pretest and demographic covariates models in the survey providing the sixth grade data (the third follow-up of the Prospects cohort that began in third grade). The same pattern is evident, but to a lesser extent, in the fifth grade {eta}B2 data (based on the second follow-up of the Prospects cohort that began in third grade). We suggest using these values only with great caution. It might be wise to use the smoothed values for the pretest and demographic covariates model in Grade 6 (which would give {eta}B2 = .207 and {eta}W2 = .377) and possibly in Grade 5 (which would give {eta}B2 = .222 and {eta}W2 = .393).

Reading Achievement in the Full Population
Table 3 presents results from the entire national sample in reading, organized in the same way as Table 2, which reports results for mathematics. The intraclass correlation and adjusted intraclass correlation values in reading are generally quite similar to those in mathematics. The mean (across grade levels) unconditional intraclass correlation in reading was .224. As in mathematics, there is a tendency of the intraclass correlations in reading to become smaller at higher grades, but the changes across adjacent grade levels are often larger. The results for Grade 9 are particularly inconsistent with (having larger values of the intraclass correlations than) the results from either Grade 8 or Grade 10. The results from Grade 2 are also somewhat different (having smaller values of the intraclass correlations) than the results from either Grade 1 or Grade 3. Several of these differences exceed 3 standard errors of the difference. Few of the other differences exceed 2 standard errors of the difference.


TABLE 3 Intraclass Correlations (ICCs) and Variance Components for Reading Achievement: All Schools

 
There is less consistency in reading than in mathematics among the adjusted intraclass correlations for the three models involving covariates. However, the general pattern of reduction in between- versus within-cluster variance was similar in reading and in mathematics. That is, there was somewhat greater reduction in between-cluster variance and much greater reduction in within-cluster variance in the pretest covariate model than in the demographic covariates model. As in the case of mathematics achievement in the full population, the pretest and demographic covariates model leads to little additional variance explained at either the school or the individual level compared with the model using only pretest as a covariate.

Mathematics Achievement in Low-SES Schools
Table 4 is a presentation of results in mathematics computed for the schools in the bottom half of the school SES distribution (operationalized by the proportion of students eligible for free or reduced-price lunch) and is organized in the same way as Tables 2 and 3. The mean (across grade levels) unconditional intraclass correlation in mathematics was .195. There is a tendency for the intraclass correlation values in this sample to be a bit smaller than those reported in Table 2 for the entire national population, a tendency that does not hold for the conditional (adjusted) intraclass correlations.


TABLE 4 Intraclass Correlations (ICCs) and Variance Components for Mathematics Achievement: Low–Socioeconomic Status Schools

 
There is one substantial anomaly in the results reported in Table 4 that is similar to that in Table 2: The {eta}2 values for the pretest and demographic covariates model are sometimes larger than those for the pretest covariate model, a difference that is particularly large at Grade 6. This anomaly (like that in Table 2) appears to be a consequence of differences between the samples used to estimate the two models. As in Table 2, the same pattern is also evident, but to a lesser extent, in the fifth grade {eta}B2 data. We suggest using these values only with great caution. It might be wise to use the smoothed values for the pretest and demographic covariates model in Grade 6 (which would give {eta}B2 = .195 and {eta}W2 = .453) and possibly in Grade 5 (which would give {eta}B2 = .192 and {eta}W2 = .448).

Reading Achievement in Low-SES Schools
Table 5 is a presentation of results in reading computed for the schools in the bottom half of the school SES distribution (operationalized by the proportion of students eligible for free or reduced-price lunch) and is organized in the same way as Tables 2–4. The mean (across grade levels) unconditional intraclass correlation is .193. As in the case of mathematics, there is a tendency for the intraclass correlation values in this sample to be a bit smaller than those reported in Table 3 for the entire national population, a tendency that does not hold for the conditional (adjusted) intraclass correlations.


TABLE 5 Intraclass Correlations (ICCs) and Variance Components for Reading Achievement: Low–Socioeconomic Status Schools

 
Mathematics Achievement in Low-Achievement Schools
Table 6 is a presentation of results in mathematics computed for the schools in the bottom half of the distribution of school mean mathematics achievement and is organized in the same way as Tables 2–5. The mean (across grade levels) unconditional intraclass correlation in mathematics was .087. The intraclass correlation values in this sample are considerably smaller than those reported in Table 2 for the entire national population, a tendency that also holds for the conditional (adjusted) intraclass correlations. There is some variation of intraclass correlations across grade levels, but only the difference between Grades 4 and 5 is larger than 2 standard errors of the difference. In general, the intraclass correlations at kindergarten through Grade 4 range from about .09 to .13, in Grades 5–7 from about .05 to .08, and in Grades 8–12 from .075 to .085.


TABLE 6 Intraclass Correlations (ICCs) and Variance Components for Mathematics Achievement: Low-Achievement Schools

 
The use of covariates resulted in a much smaller reduction in both between- and within-school variances in this sample than in the unrestricted sample. Specifically, the demographic covariates analyses typically reduced the between-school variance to no less than one half of its value in the unconditional model (e.g., produced {eta}B2 values from .5 to .8) but typically reduced within-cluster variance by 5% or less (e.g., produced {eta}W2 values greater than .95). The analyses using pretest score as a covariate typically (but not always) resulted in modestly larger reductions in between-cluster variance (e.g., produced {eta}B2 values from .3 to .8) but typically reduced within-cluster variance by a larger amount than the demographic covariates model (e.g., produced {eta}W2 values from .5 to .8). As in the case of mathematics achievement in the full population, the pretest and demographic covariates model leads to little additional variance explained at either school or individual level compared with the model using only pretest as a covariate. Overall, we find that the intraclass correlation is smaller in this sample than in the full sample, but the explanatory power of pretest and other covariates is also smaller. These two tendencies have opposite effects on statistical power. The smaller intraclass correlation generally leads to larger statistical power, but the smaller explanatory power of covariates generally leads to less statistical power, one partially offsetting the effects of the other.

There is one substantial anomaly in the results reported in Table 6 that is similar to those in Tables 2 and 4: The Grade 2 {eta}2 values for the pretest and demographic covariates model are larger than those for the pretest covariate model. This anomaly (like that in Tables 2 and 4) appears to be a consequence of differences between the samples used to estimate the two models. We suggest using the values for the pretest and demographic covariates model with some caution. It might be wise to use the smoothed values for the pretest and demographic covariates model in Grade 6 (which would give {eta}B2 = .195 and {eta}W2 = .453) and possibly in Grade 5 (which would give {eta}B2 = .192 and {eta}W2 = .448).

Reading Achievement in Low-Achievement Schools
Table 7 is a presentation of results in reading computed for the schools in the bottom half of the distribution of school mean reading achievement and is organized in the same way as Tables 2–6. The mean (across grade levels) unconditional intraclass correlation in reading was .093, and as in the case of mathematics, the intraclass correlation values in this sample are considerably smaller than those reported in Table 3 for the entire national population, a tendency that also holds for the conditional (adjusted) intraclass correlations.

There is some variation of intraclass correlations across grade levels. The intraclass correlation in Grade 9 is larger (by over 3 standard errors of the difference) than that in either of the adjacent grades. Similarly, the intraclass correlation in Grade 1 is more than 2 standard errors greater than that in kindergarten but less than 2 standard errors of the difference from that in Grade 2. None of the other differences between grades is this large in comparison with their uncertainty. In general, the intraclass correlations at Grades K–4 range from about .10 to .14 and in Grades 5–8 from about .06 to .07, and in Grades 10–12, they are about .05.

As in the case of mathematics, the use of covariates resulted in a much smaller reduction in both between- and within-school variances in this sample than in the entire national sample. Specifically, the demographic covariates analyses typically reduced the between-school variance to no less than one half of its value in the unconditional model (e.g., produced {eta}B2 values from .5 to .8) but typically reduced within-cluster variance by 5% or less (e.g., produced {eta}W2 values greater than .95). The analyses using pretest score as a covariate typically (but not always) resulted in modestly larger reductions in between-cluster variance (e.g., produced {eta}B2 values from .3 to .8) and typically reduced within-cluster variance by a larger amount than the demographic covariates model (e.g., produced {eta}W2 values from .5 to .8). As in the case of mathematics achievement in the full population, the use of both pretest and demographic covariates leads to little additional variance explained at either the school or the individual level compared with the model using only pretest as a covariate. Thus, we find, as in the case of mathematics, that the intraclass correlation is smaller in this sample, but the explanatory power of pretest and other covariates is also smaller, one of these differences partially offsetting the effects of the other on statistical power.

There are several small anomalies in the results reported in Table 7 that are similar to those in Table 6, in which the {eta}B2 values for the pretest and demographic covariates model are slightly larger than those for the pretest covariate model. These anomalies (like those in Table 6) appear to be a consequence of instability in variance component estimates in the sample of low-achievement schools.


    Comparison With Published Experiments
Although the estimates presented in this article are derived from national probability samples, few experiments actually use national probability samples. Thus, one might question if intraclass correlations obtained from national samples resemble those of experiments actually conducted in education. To obtain some empirical evidence on this question, we searched the two most prestigious education journals that publish experimental studies, the American Educational Research Journal and Educational Evaluation and Policy Analysis, from 1995 to 2005 to find the cluster-randomized experiments with academic achievement as an outcome variable. We found eight reports of experiments that had randomized schools. We were able to obtain at least one unconditional intraclass correlation estimate from seven of these experiments (which required contacting authors in several cases). The eighth study did not treat schools as a random effect in the analyses and therefore could not provide an intraclass correlation value. This yielded a total of 41 intraclass correlation estimates, 14 in mathematics outcomes and 27 in reading outcomes. They ranged from .07 to .31 in mathematics achievement (with a mean of .17) and .05 to .74 in reading achievement (with a mean of .19). Eliminating the largest estimate in reading reduced the average value, but only to .17. Some of this variation is surely due to sampling error of estimation. None of the studies provided a standard error for the intraclass correlation estimates, but the form of the standard error is inversely proportional to the square root of the number of schools (see, e.g., Donner & Koval, 1982). Therefore, these standard errors of the experimental estimates must be considerably larger than the largest of those we report on the basis of survey data (i.e., considerably bigger than .03), because the experiments involved considerably fewer schools than our surveys.

The average (unconditional) intraclass correlation in Tables 2 and 3 for the full national sample is about .22, the average value in Tables 4 and 5 for low-SES schools is about .19, and the average value in Tables 6 and 7 for low-achieving schools is about .09. Therefore, the average value of the intraclass correlation estimates from the published experiments is roughly consistent with the national values for low-SES schools but somewhat larger than the national values for low-achieving schools. This is consistent with the fact that most of the published experiments explicitly targeted, or realized, substantial samples of low-SES or disadvantaged students. It would not be appropriate to draw strong conclusions from such a small sample of empirical evidence, but this evidence does not suggest that the intraclass correlations obtained in published experiments are substantially different than those obtained from corresponding national (e.g., low-SES) samples.


    Agreement Among Intraclass Correlation Estimates From Different Data Sets
When it was possible to estimate intraclass correlations for the same grade and achievement domain from more than one survey, we computed estimates from all surveys from which it was possible. Table 8 is a presentation of these estimates for the unconditional and demographic covariates models, along with the difference between each pair of intraclass correlation estimates that should estimate the same value and the standard error of the difference. Too few estimates from the other models could be computed for meaningful comparisons. Because the estimated intraclass correlations are approximately normally distributed in large samples, the difference divided by its standard error should have approximately a standard normal distribution if the two estimates are estimating the same population quantity, and thus a difference larger than 2 standard errors for any particular comparison should happen only about 5% of the time by chance.
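The comparison described above can be sketched as a simple z test. This is only an illustration: the function name and the numeric inputs are hypothetical, and the standard error of the difference is computed here under the assumption that the two surveys are independent samples, which the text does not spell out.

```python
from math import sqrt
from statistics import NormalDist


def icc_difference_z(rho1, se1, rho2, se2):
    """Compare two ICC estimates of the same quantity from different surveys.

    Assumes the two estimates are independent and approximately normal,
    so the variance of the difference is the sum of the two variances.
    """
    diff = rho1 - rho2
    se_diff = sqrt(se1 ** 2 + se2 ** 2)
    z = diff / se_diff
    # Two-sided p value from the standard normal reference distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, se_diff, z, p_value


# Hypothetical inputs, not values taken from Table 8.
diff, se_diff, z, p = icc_difference_z(0.25, 0.03, 0.19, 0.04)
print(f"difference = {diff:.3f}, SE = {se_diff:.3f}, z = {z:.2f}, p = {p:.3f}")
```

A value of |z| above 2 corresponds to the "larger than 2 standard errors" criterion used for any single comparison.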


TABLE 8 Comparisons of Intraclass Correlations (ICCs) Estimated From Different Surveys

 
Although some of the differences are large enough to have practical implications, they are subject to considerable sampling uncertainty. We found that most of the results agreed within sampling error. Overall, 14 of the 18 differences of unadjusted intraclass correlation estimates (across both reading and mathematics) were less than 2 standard errors of the difference. Three of the 13 differences in mathematics exceeded 2 standard errors (ECLS – Prospects1 at Grade 3 and LSAY10 – NELS in Grades 10 and 12). One of the five differences in reading (ECLS – Prospects1 at Grade 3) exceeded 3 standard errors.

However, it is crucial to recognize that the conceptual hypothesis of agreement among data sets that we are testing is that all of the pairs of intraclass correlations are equal. Although the criterion that "differences exceeding 2 standard errors are statistically significant at the 5% level" is (approximately) valid for any single comparison, it is not appropriate for evaluating several comparisons at the same time. To evaluate whether at least one of the comparisons implies a reliable difference, a multiple comparison procedure is needed (see, e.g., R. Miller, 1977). A Bonferroni adjustment for 13 comparisons would require a difference of 2.89 standard errors to be significant at the 5% level, and none of the differences in mathematics is that large. The difference in reading between the estimates from ECLS and Prospects1 at Grade 3 is large enough to be statistically significant, even taking multiple comparisons into account. However, we interpret these comparisons as suggesting that there is a reasonable degree of agreement among the intraclass correlations in these surveys, even though they were conducted as much as a decade apart, by different research organizations, and using different achievement measures.
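The Bonferroni threshold quoted above can be checked with a short calculation. The sketch below assumes a two-sided test against the standard normal distribution; the function name is hypothetical.

```python
from statistics import NormalDist


def bonferroni_critical_z(n_comparisons, alpha=0.05):
    """Two-sided normal critical value after a Bonferroni adjustment.

    The familywise level `alpha` is split evenly over the comparisons,
    and each comparison is tested two-sided at alpha / n_comparisons.
    """
    per_test_alpha = alpha / n_comparisons
    return NormalDist().inv_cdf(1 - per_test_alpha / 2)


# For the 13 mathematics comparisons this comes out close to the 2.89 in the text.
print(f"13 comparisons: {bonferroni_critical_z(13):.3f}")
# A single comparison recovers the familiar value of about 1.96.
print(f" 1 comparison:  {bonferroni_critical_z(1):.3f}")
```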


    Minimum Detectable Effect Sizes
One way to summarize the implications of these results for statistical power is to use them to compute the smallest effect size for which a target design would have adequate statistical power. This effect size is often called the minimum detectable effect size (MDES; see Bloom, 1995, 2005). In computing the MDES values reported in this article, we used the value 0.8 with a two-sided test at a significance level of .05 as the definition of adequate power. We considered designs with no covariates and with pretest as a covariate at both the individual and group levels. We considered both reading and mathematics achievement as potential outcomes. Finally, we considered a balanced design with a sample size of n = 60 per school and m = 10, 15, 20, 25, or 30 schools randomized to each treatment group.

Table 9 gives the MDESs on the basis of parameters given in Tables 2 and 3 that were estimated from the full national sample. Perhaps the most obvious finding is that the corresponding MDES values for mathematics and reading are quite similar. With no covariates, the MDES values typically exceed 0.60 for m = 10 and typically exceed 0.35 even for m = 30. However, the use of pretest as a covariate reduces the MDES values to less than 0.40 for m = 10 and 0.20 or less for m = 30. Although Cohen (1977) proposed the values 0.20 to define small-sized effects and 0.50 to define medium-sized effects, these labels can be misleading in educational policy contexts, in which effect sizes of 0.20 or smaller are often of policy interest, and consequently, experiments may well be designed to detect effects in this range. Effect sizes used in power analyses should be informed by the magnitude of effects that would be policy relevant and by prior empirical evidence about the likely effect of an intervention being evaluated.
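A rough sense of how these MDES values arise can be had from a normal-approximation sketch. This is not the computation behind Table 9: it replaces t quantiles with normal quantiles (so it will somewhat understate the MDES when few schools are randomized), it assumes effect sizes are standardized by the total unadjusted standard deviation, and it assumes {eta}B2 and {eta}W2 act as multipliers on the between- and within-school variance components. The function name and arguments are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_mdes(rho, n, m_per_group, eta_b2=1.0, eta_w2=1.0,
                alpha=0.05, power=0.80):
    """Normal-approximation MDES for a balanced two-arm school-randomized design.

    rho          unconditional intraclass correlation
    n            students per school
    m_per_group  schools randomized to each treatment group
    eta_b2/w2    proportion of between/within variance left after covariates
                 (1.0 means no covariates)
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Variance of the standardized difference in treatment-group means,
    # with total unadjusted variance scaled to 1.
    var_contrast = (2 / m_per_group) * (eta_b2 * rho + eta_w2 * (1 - rho) / n)
    return multiplier * sqrt(var_contrast)


# rho = .22 is roughly the full-sample average reported in the text.
for m in (10, 15, 20, 25, 30):
    print(m, f"{approx_mdes(0.22, 60, m):.2f}")
```

Passing {eta} values below 1 (for example, values read from the pretest covariate panel of a table) shows how a covariate lowers the approximate MDES for the same number of schools.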


TABLE 9 Minimum Detectable Effect Sizes With Power 0.80 and n = 60 as a Function of m: All Schools

 
Table 10 gives the MDESs on the basis of parameters given in Tables 4 and 5 that were estimated from the national sample of low-SES schools. These results are remarkably similar to those in Table 9.


TABLE 10 Minimum Detectable Effect Sizes With Power 0.80 and n = 60 as a Function of m: Low–Socioeconomic Status Schools

 
Table 11 gives the MDESs on the basis of parameters given in Tables 6 and 7 that were estimated from the national sample of schools in the lower half of the achievement distribution. Because the unconditional intraclass correlations are lower, the MDES values for designs with no covariates are smaller. However, because the covariates are less effective in reducing between- and within-school variance in this sample, the MDES values with pretest as a covariate are not always smaller than in the national sample of all schools. With no covariates, the MDES values are typically less than 0.50 for m = 10 and less than 0.30 for m = 30. However, the use of pretest as a covariate typically reduces the MDES values to about 0.30 for m = 10 and 0.20 or less for m = 30.


TABLE 11 Minimum Detectable Effect Sizes With Power 0.80 and n = 60 as a Function of m: Low-Achievement Schools

 

    Using the Results of This Study to Compute the Statistical Power of Cluster-Randomized Experiments
Specialized software for computing statistical power in group-randomized designs can use the intraclass correlation values and the RB2 and RW2 values (where R2 = 1 – {eta}2) presented in this article. Such programs include Optimal Design (Raudenbush & Liu, 2000) and PinT (Snijders & Bosker, 1993). However, such software is not necessary to compute power for studies that randomize schools. In this section, we illustrate the use of the results in this article to compute the statistical power of cluster-randomized experiments. Consider the two-treatment-group design with q (0 ≤ q < M – 2) group-level (cluster-level) covariates and p (0 ≤ p < N – q – 2) individual-level covariates in the analysis. Note that we specifically include the possibility that there are zero (no) covariates at a given level. For example, a design with p = 1 and q = 1 might arise if there was a pretest that was used as an individual-level covariate and cluster means on the covariate were used as a group-level covariate. We assume also that the individual-level covariate has been centered about cluster means. The structural model for Yijk, the kth observation in the jth cluster in the ith treatment, might be described in analysis of covariance (ANCOVA) notation as


\[ Y_{ijk} = \mu + \alpha_{Ai} + \boldsymbol{\theta}_I' \mathbf{x}_{ijk} + \boldsymbol{\theta}_G' \mathbf{z}_{ij} + \gamma_{A(i)j} + \varepsilon_{Aijk} \]

where µ is the grand mean, {alpha}Ai is the covariate-adjusted effect of the ith treatment, {theta}I = ({theta}I1, . . . , {theta}Ip)' is a vector of p individual-level covariate effects, {theta}G = ({theta}G1, . . . , {theta}Gq)' is a vector of q group-level covariate effects, xijk is a vector of p group (cluster) centered individual-level covariate values for the kth observation in the jth cluster in the ith treatment, zij is a vector of q group-level (cluster-level) covariate values for the jth cluster in the ith treatment, {gamma}A(i)j is the random effect of cluster j within treatment i, and {varepsilon}Aijk is the covariate-adjusted within-cell residual. Here, we assume that both of the random effects (clusters and the residual) are normally distributed.

The analysis might be carried out either as an ANCOVA with clusters as a nested factor or by viewing the model as a hierarchical linear model and using software for multilevel models such as HLM. In multilevel model notation, it would be conventional to specify a Level 1 (individual-level) model as


\[ Y_{ijk} = \beta_{0j} + \boldsymbol{\beta}_j' \mathbf{x}_{ijk} + \varepsilon_{Aijk} \]

and a Level 2 (cluster-level) model for the intercept as


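\[ \beta_{0j} = \pi_{00} + \pi_{A01}\,\mathrm{TREATMENT}_i + \boldsymbol{\pi}_{02}' \mathbf{z}_{ij} + \zeta_{Aj} \]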

where TREATMENTi is a dummy variable for the treatment group, the covariate slopes in betaj would be treated as fixed effects (betaj = {theta}I), and {zeta}Aj is the random effect of the jth cluster (a Level 2 residual). With the appropriate constraints on the ANCOVA model (i.e., setting {alpha}Ai = 0 for the control group and constraining the mean of the {gamma}A(i)j values to be 0), these two models are identical, and there is a one-to-one correspondence between the parameters and the random effects in the two models. That is, µ = {pi}00, {alpha}Ai = {pi}A01, {theta}G = {pi}02, {theta}I = betaj (for all j), {gamma}A(i)j = {zeta}Aj (with a suitable redefinition of the index j), and {varepsilon}Aijk is identical in both models. The variance components associated with this analysis are {sigma}AW2 (the variance of {varepsilon}Aijk) and {sigma}AB2 (the variance of {zeta}Aj), where the A in the subscript denotes that these variance components are adjusted for the covariate.

The Intraclass Correlations
Note that if in the experiment, schools were sampled at random, students were sampled at random within schools, and q = p = 0, then {rho} = {sigma}B2/({sigma}B2 + {sigma}W2) is exactly the intraclass correlation that would obtain in a survey that sampled first schools and then students at random. Similarly, if there are covariates in the experiment, schools were sampled at random, students were sampled at random within schools, and q != 0 or p != 0, then {rho}A = {sigma}AB2/({sigma}AB2 + {sigma}AW2) is exactly the adjusted intraclass correlation that would obtain in the analysis of the survey (with appropriate covariates) that sampled first schools and then students at random.
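As a minimal sketch of the two definitions above, the intraclass correlation can be computed directly from the between- and within-school variance components. The function name and the numeric variance components below are illustrative assumptions, not values taken from the article's tables.

```python
def intraclass_correlation(var_between: float, var_within: float) -> float:
    """Return rho = sigma_B^2 / (sigma_B^2 + sigma_W^2).

    Passing covariate-adjusted components (sigma_AB^2, sigma_AW^2) gives
    the adjusted intraclass correlation rho_A instead.
    """
    total = var_between + var_within
    if var_between < 0 or var_within < 0 or total <= 0:
        raise ValueError("variance components must be non-negative with a positive sum")
    return var_between / total


# Hypothetical variance components, chosen only for illustration.
rho = intraclass_correlation(var_between=22.8, var_within=77.2)
rho_adj = intraclass_correlation(var_between=4.8, var_within=27.8)
print(f"rho = {rho:.3f}, rho_A = {rho_adj:.3f}")
```

The same function serves both cases because the adjusted and unadjusted intraclass correlations share one functional form; only the variance components passed in differ.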


    Hypothesis Testing
 
The object of the statistical analysis is to test the statistical significance of the intervention effect, that is, to test the following hypothesis:


Formula

Or, equivalently,


Formula

The ANCOVA t-test statistic is


$$t_A = \frac{\sqrt{m}\,\left(\bar{Y}_{A1\bullet\bullet} - \bar{Y}_{A2\bullet\bullet}\right)}{S_A} \tag{5}$$

where m is defined in terms of the number of clusters assigned to the treatment and control groups (m1 and m2, respectively) as


$$m = \frac{m_1 m_2}{m_1 + m_2},$$

YA1•• and YA2•• are the adjusted means, SA is the pooled within-treatment-groups adjusted standard deviation of cluster means, and the subscript A is used to denote that the means and standard deviation are adjusted for the covariates. The F-test statistic from a one-way ANCOVA using cluster means is of course


$$F_A = \frac{MS_{AB}}{MS_{AC}} \tag{6}$$

In this case, MSAB = nm(YA1•• – YA2••)2 and MSAC = nSA2, where SA is the pooled within-treatment-groups standard deviation of the covariate-adjusted cluster means (the standard deviation of the Level 2 residuals). If the null hypothesis is true, the test statistic tA has Student’s t distribution with M – q – 2 degrees of freedom. Equivalently, the test statistic FA has the central F distribution with 1 degree of freedom in the numerator and M – q – 2 degrees of freedom in the denominator when the null hypothesis is true.
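The cluster-means test can be sketched in a few lines for the simplest case. The sketch assumes no covariates (q = 0), so that the adjusted means reduce to plain means of the cluster means, and it assumes balanced cluster sizes; the function name and the cluster means are hypothetical.

```python
import math
from statistics import mean


def cluster_mean_t_test(means_treat, means_ctrl):
    """t and F statistics for a two-arm design analyzed on cluster means (q = 0)."""
    m1, m2 = len(means_treat), len(means_ctrl)
    m = m1 * m2 / (m1 + m2)                  # m = m1*m2 / (m1 + m2)
    g1, g2 = mean(means_treat), mean(means_ctrl)
    # Pooled within-treatment-group sum of squares of the cluster means.
    ss = sum((y - g1) ** 2 for y in means_treat) + sum((y - g2) ** 2 for y in means_ctrl)
    df = m1 + m2 - 2                         # M - q - 2 with q = 0
    s = math.sqrt(ss / df)                   # pooled SD of cluster means
    t = math.sqrt(m) * (g1 - g2) / s
    return t, t * t, df                      # F on (1, df) equals t squared


# Hypothetical school means, five schools per arm.
t_stat, f_stat, df = cluster_mean_t_test(
    [52.0, 55.5, 49.0, 58.0, 53.5],
    [50.0, 47.5, 51.0, 46.0, 49.5],
)
print(f"t = {t_stat:.3f}, F = {f_stat:.3f}, df = {df}")
```

Because MSAB and MSAC both carry the factor n, it cancels in the F ratio, which is why the cluster size does not appear in the function.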

When the null hypothesis is false, the test statistic tA has for this analysis a noncentral t distribution with M – q – 2 degrees of freedom and noncentrality parameter


$$\lambda_A = \delta_A\sqrt{\frac{mn}{1 + (n-1)\rho_A}} \tag{7}$$

where {delta}A = ({alpha}A1 – {alpha}A2)/{sigma}AT.

Alternatively (and equivalently), the F statistic has the noncentral F distribution with 1 degree of freedom in the numerator and M – q – 2 degrees of freedom in the denominator and noncentrality parameter


$$\lambda_A^2 = \frac{mn\,\delta_A^2}{1 + (n-1)\rho_A}.$$

For the purposes of power computation, expression 7 is not convenient, because the minimum effect size of interest is likely to be known in units of the unadjusted standard deviation rather than the adjusted standard deviation; that is, we are more likely to know {delta} = ({alpha}1 – {alpha}2)/{sigma}T than {delta}A = ({alpha}A1 – {alpha}A2)/{sigma}AT. In a randomized experiment, covariate adjustment should not affect the treatment effect parameter, so that {alpha}A1 – {alpha}A2 = {alpha}1 – {alpha}2, but the covariate adjustment necessarily affects the standard deviation. This is true even if the covariates operate at only one level of the design. Because {sigma}AT2 = {sigma}AB2 + {sigma}AW2, a covariate adjustment at the individual level will affect {sigma}AT2 via {sigma}AW2, and a covariate adjustment at the cluster level will affect {sigma}AT2 through {sigma}AB2.

To express {lambda}A in terms of {delta}, we need only express {sigma}AT in terms of {sigma}T. A direct derivation shows that


Formula(8)

An alternative, but equivalent, expression of {lambda}A that is considerably more revealing involves {eta}B2, {eta}W2, and the unadjusted intraclass correlation {rho}. This expression is


$$\lambda_A = \delta\sqrt{\frac{mn}{\eta_W^2 + \left(n\eta_B^2 - \eta_W^2\right)\rho}} \tag{9}$$

Note that the quantity [{eta}W2 + (n{eta}B2 – {eta}W2){rho}] is analogous to [1 + (n – 1){rho}], Kish’s (1965) design effect. We see that [{eta}W2 + (n{eta}B2 – {eta}W2){rho}] reduces to [1 + (n – 1){rho}] in the analysis without covariates (because {eta}W2 = {eta}B2 = 1), and Equation 9 reduces to the expression given (e.g., in Blair & Higgins, 1986) for the t test conducted using cluster means as the unit of analysis.
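The step from the adjusted to the unadjusted parameterization can be sketched as follows. The sketch assumes that the variance ratios are defined as $\eta_B^2 = \sigma_{AB}^2/\sigma_B^2$ and $\eta_W^2 = \sigma_{AW}^2/\sigma_W^2$, and that the adjusted noncentrality parameter has the usual cluster-means form in $\delta_A$ and $\rho_A$; it ends at the bracketed quantity discussed above.

```latex
\begin{aligned}
\lambda_A &= \delta_A \sqrt{\frac{mn}{1 + (n-1)\rho_A}}
  = \frac{\alpha_1 - \alpha_2}{\sigma_{AT}}
    \sqrt{\frac{mn\,\sigma_{AT}^2}{\sigma_{AW}^2 + n\,\sigma_{AB}^2}} \\
 &= (\alpha_1 - \alpha_2)
    \sqrt{\frac{mn}{\eta_W^2\sigma_W^2 + n\,\eta_B^2\sigma_B^2}} \\
 &= \delta \sqrt{\frac{mn}{\eta_W^2(1-\rho) + n\,\eta_B^2\rho}}
  = \delta \sqrt{\frac{mn}{\eta_W^2 + \left(n\eta_B^2 - \eta_W^2\right)\rho}} .
\end{aligned}
```

The second equality uses $1 + (n-1)\rho_A = (\sigma_{AW}^2 + n\sigma_{AB}^2)/\sigma_{AT}^2$, and the last line divides numerator and denominator by $\sigma_T^2$, using $\sigma_B^2/\sigma_T^2 = \rho$ and $\sigma_W^2/\sigma_T^2 = 1 - \rho$.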

We illustrate the use of the t statistic. The power of the one-tailed test at level {alpha} is


$$p_1 = 1 - H\left[c(\alpha, M - q - 2),\; M - q - 2,\; \lambda_A\right] \tag{10}$$

where c({alpha}, {nu}) is the level {alpha} one-tailed critical value of the t distribution with {nu} degrees of freedom (e.g., c[.05, 10] = 1.81), and H(x, {nu}, {lambda}) is the cumulative distribution function of the non-central t distribution with {nu} degrees of freedom and noncentrality parameter {lambda}. The power of the two-tailed test at level {alpha} is


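$$p_2 = 1 - H\left[c(\alpha/2, M - q - 2),\; M - q - 2,\; \lambda_A\right] + H\left[-c(\alpha/2, M - q - 2),\; M - q - 2,\; \lambda_A\right] \tag{11}$$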
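A minimal sketch of these power computations follows, using only the Python standard library. It is not the routine used in the article: the noncentral t distribution function H is approximated here by numerical integration (Simpson's rule), and the critical value c is assumed to be supplied by the user from a t table. All function names are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def nct_cdf(x: float, nu: int, lam: float, steps: int = 4000) -> float:
    """Approximate H(x, nu, lam), the noncentral t CDF, by Simpson's rule.

    Uses T = (Z + lam) / S with Z standard normal and S = sqrt(chi2_nu / nu),
    so that P(T <= x) = E[Phi(x * S - lam)], integrating over the density of S.
    `steps` must be even.
    """
    log_norm = math.log(2.0) + 0.5 * nu * math.log(0.5 * nu) - math.lgamma(0.5 * nu)
    upper = 1.0 + 12.0 / math.sqrt(2.0 * nu)  # S has negligible mass beyond this
    h = upper / steps

    def integrand(s: float) -> float:
        if s <= 0.0:
            # The density of S at 0 is zero unless nu == 1.
            return 0.0 if nu > 1 else math.exp(log_norm) * normal_cdf(-lam)
        log_density = log_norm + (nu - 1) * math.log(s) - 0.5 * nu * s * s
        return math.exp(log_density) * normal_cdf(x * s - lam)

    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3.0


def power_one_tailed(crit: float, nu: int, lam: float) -> float:
    """One-tailed power: 1 - H(c, nu, lam)."""
    return 1.0 - nct_cdf(crit, nu, lam)


def power_two_tailed(crit: float, nu: int, lam: float) -> float:
    """Two-tailed power: 1 - H(c, nu, lam) + H(-c, nu, lam)."""
    return 1.0 - nct_cdf(crit, nu, lam) + nct_cdf(-crit, nu, lam)


# Inputs taken from the article's no-covariate example: 18 df, c = 2.101, lambda = 2.165.
print(round(power_two_tailed(crit=2.101, nu=18, lam=2.165), 3))
```

The printed value can be compared with the two-sided power of 0.53 reported in the no-covariate example. In practice a statistical library's noncentral t routine would be the more natural choice; the integration here only keeps the sketch free of dependencies.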


    Using Power Tables and Power Calculation Software
 
Many tabulations (e.g., Cohen, 1977) and programs (e.g., Borenstein, Rothstein, & Cohen, 2001) are available for computing statistical power from designs involving simple random samples, but tables for computing power from the independent-groups t test are the most widely available. Following Cohen’s (1977) framework, such tables typically provide power values on the basis of sample sizes N1T and N2T (often assumed to be equal for simplicity) and effect size {Delta}T, where the superscript T indicates that these quantities are what is used in the power tables. The calculations on which they are based translate the sample sizes and effect size into degrees of freedom {nu}T and non-centrality parameter {lambda}T to compute statistical power. In the case of the two-sample t test, they do so via


$$\nu^T = N_1^T + N_2^T - 2$$

and


$$\lambda^T = \sqrt{\tilde{N}^T}\,\Delta^T,$$

where


$$\tilde{N}^T = \frac{N_1^T N_2^T}{N_1^T + N_2^T}.$$

Tables such as Cohen’s (or the corresponding software) can be used to compute the power of the test used in the case of clustered sampling by judicious choice of sample sizes and effect size. We have to enter the table with a configuration of sample sizes and a synthetic effect size (here called the operational effect size) that will yield the appropriate degrees of freedom and noncentrality parameter.

If the actual numbers of clusters assigned are m1 and m2, then entering the power table with sample sizes N1T = m1 – q and N2T = m2 yields {nu}T = (N1T + N2T – 2) = M – q – 2, the correct degrees of freedom for the test. Of course, many other combinations of sample sizes will also yield the correct degrees of freedom and will yield equivalent results as long as the operational effect size is modified in a corresponding manner. The relevant operational effect size using our choice of degrees of freedom is


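$$\Delta^T = \delta\sqrt{\frac{mn}{\tilde{N}^T}}\sqrt{\frac{1}{\eta_W^2 + \left(n\eta_B^2 - \eta_W^2\right)\rho}} \tag{12}$$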

where {delta} is the unadjusted effect size, {rho} is the unadjusted intraclass correlation, and {eta}B2 and {eta}W2 are defined in Equations 5 and 6. If the analysis makes a covariate adjustment at the cluster level, {eta}B2 is the appropriate value given in the tables of this article, but if the analysis makes no covariate adjustment at the cluster level (i.e., q = 0), then {eta}B2 {equiv} 1. Similarly, if the analysis makes a covariate adjustment at the individual (within-cluster) level, {eta}W2 is the appropriate value given in the tables of this article, but if the analysis makes no covariate adjustment at the individual level (i.e., if p = 0), then {eta}W2 {equiv} 1. Note that the value of {Delta}T given in Equation 12 is appropriate, because when this is multiplied by the square root of ÑT, it yields the noncentrality parameter {lambda}A given in Equation 9. Using {rho} or {rho}A, the cluster sample size n, and the variance ratios {eta}B2 and {eta}W2 to compute operational effect size makes it possible to compute statistical power and sample size requirements for analyses on the basis of clustered samples using these tables and computer programs designed for the two-group t test.
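The translation into power-table inputs can be sketched as below. The function and argument names are hypothetical; the formulas follow the description above, with the table sample sizes chosen as m1 – q and m2 and the operational effect size defined so that multiplying it by the square root of ÑT recovers the noncentrality parameter.

```python
import math


def noncentrality(delta, rho, n, m1, m2, eta_b2=1.0, eta_w2=1.0):
    """lambda_A = delta * sqrt(m*n / (eta_W^2 + (n*eta_B^2 - eta_W^2)*rho))."""
    m = m1 * m2 / (m1 + m2)
    return delta * math.sqrt(m * n / (eta_w2 + (n * eta_b2 - eta_w2) * rho))


def operational_effect_size(delta, rho, n, m1, m2, q=0, eta_b2=1.0, eta_w2=1.0):
    """Return (N1_T, N2_T, Delta_T) for entering a two-sample t-test power table."""
    n1_t, n2_t = m1 - q, m2                      # gives nu_T = M - q - 2
    n_tilde = n1_t * n2_t / (n1_t + n2_t)
    lam = noncentrality(delta, rho, n, m1, m2, eta_b2, eta_w2)
    return n1_t, n2_t, lam / math.sqrt(n_tilde)  # sqrt(n_tilde) * Delta_T == lambda_A


# Inputs from the article's two worked examples (10 schools per arm, 20 students each).
no_cov = operational_effect_size(0.50, 0.228, 20, 10, 10)
pretest = operational_effect_size(0.25, 0.239, 20, 10, 10, q=1, eta_b2=0.210, eta_w2=0.360)
print(no_cov, pretest)
```

Setting both variance ratios to 1 and q to 0 (the defaults) gives the no-covariate case, in which the denominator is Kish's design effect.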


    Example With No Covariates at Either Level
 
Consider an experiment that will randomize m1 = m2 = 10 schools to receive an intervention to improve mathematics achievement, with n = 20 students in each school taking part in the experiment. There are no covariates at either individual or group level, so that p = q = 0 and {eta}W2 = {eta}B2 = 1. The analysis will involve a two-tailed t test with significance level {alpha} = .05. Suppose that the smallest educationally significant effect size for this intervention is assumed to be {delta} = 0.50. Suppose further that the schools were chosen in an attempt to be representative of first graders nationally.

Entering Table 2 on the first row for Grade 1 and the panel for the unconditional model (columns 2–3) gives the intraclass correlation for first graders as {rho} = .228. Then the variance inflation factor is


$$1 + (n - 1)\rho = 1 + (20 - 1)(0.228) = 5.332,$$

so that the noncentrality parameter from Equation 7 is


$$\lambda = \frac{\sqrt{(5)(20)}\,(0.50)}{\sqrt{5.332}} = 2.165.$$

Using Equation 11 and the noncentral t-distribution function (e.g., the function NCDF.T in SPSS), with M – 2 = 18 degrees of freedom, c(.05/2, 18) = 2.101, and {lambda} = 2.165, we obtain a two-sided power of p2 = 1 – 0.467 + 0.000 = 0.53.

Alternatively, we could compute the power from tables of the power of the t test such as those given by Cohen (1977). To do so, we first compute the operational effect size given in Equation 12 as


$$\Delta^T = 0.50\sqrt{\frac{(5)(20)}{5}}\sqrt{\frac{1}{5.332}} = 0.968.$$

Cohen’s tables give the statistical power in terms of sample size (in each treatment group) and effect size. Examining Cohen’s Table 2.3.5, we see that the operational effect size of 0.968 is between tabled effect sizes of 0.8 and 1.0. Entering the table with sample size N1T = N2T = 10, we see that a power of 0.39 is tabulated for the effect size of {Delta}T = 0.80, and a power of 0.56 is tabulated for an effect size of {Delta}T = 1.00. Interpolating between these two values, we obtain a power of 0.53 for {Delta}T = 0.97.

Note that in this case (and many others), the operational effect size for the tests based on clustered samples is larger than the actual effect size (in this case 0.97 vs. 0.50). This does not mean that the power of the test for the design based on the clustered sample is larger than that based on a simple random sample with the same total sample size. The reason is that the test using the clustered sample has many fewer degrees of freedom in the error term. For example, a test based on an effect size of {Delta}T = 0.50 and a simple random sample of nm = (10)(20) = 200 in each group would have power essentially 1.0.


    Example With Pretest as a Covariate at Both Individual and Cluster Levels
 
Consider an experiment that will randomize m1 = m2 = 10 schools to receive an intervention to improve first grade reading achievement and that n = 20 students in each school would be part of the experiment. An ANCOVA will be used with pretest as a covariate at both individual and school level (so that p = q = 1) using a two-tailed test with significance level {alpha} = .05. Suppose that the smallest educationally significant effect size for this intervention is {delta} = 0.25. Suppose further that the schools were chosen in an attempt to be representative of first graders nationally.

Entering Table 3 on the first row for Grade 1 and the panel for the unconditional model (columns 3–5) gives the intraclass correlation for first graders as {rho} = .239. Entering Table 2 on the second row for Grade 1 and the panel for the pretest and demographic covariates model (columns 9–11) gives the between- and within-school variance ratios after covariate adjustment as {eta}B2 = .210 and {eta}W2 = .360. Then the variance inflation factor is


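$$\eta_W^2 + \left(n\eta_B^2 - \eta_W^2\right)\rho = 0.360 + \left[(20)(0.210) - 0.360\right](0.239) = 1.278,$$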

so that the noncentrality parameter from Equation 9 is


$$\lambda_A = \frac{\sqrt{(5)(20)}\,(0.25)}{\sqrt{1.278}} = 2.211.$$

Using Equation 11 and the noncentral t-distribution function (e.g., the function NCDF.T in SPSS), with M – 2 – 1 = 17 degrees of freedom, c(.05/2, 17) = 2.110, and {lambda}A = 2.211, we obtain a two-sided power of p2 = 1 – 0.450 + 0.000 = 0.55.

Alternatively, we could compute the power from tables of the power of the t test such as those given by Cohen (1977). Because there is q = 1 covariate at the school level, N1T = m1 – q = 10 – 1 = 9 and N2T = m2 = 10. Because Cohen’s tables give the statistical power in terms of equal sample sizes (in each treatment group), we will need to interpolate between sample sizes N1T = N2T = 9 and N1T = N2T = 10. Here we compute m = (10 x 10)/(10 + 10) = 5. For N1T = 9 and N2T = 10, ÑT = (9 x 10)/(9 + 10) = 4.737, and the operational effect size is


$$\Delta^T = 0.25\sqrt{\frac{(5)(20)}{4.737}}\sqrt{\frac{1}{1.278}} = 1.02.$$

Examining Cohen’s Table 2.3.5, we see that the effect size {Delta}T = 1.02 is between tabled values of effect sizes of 1.0 and 1.2. Entering the table with sample size N1T = N2T = 9, we see that a power of 0.51 is tabulated for the effect size of {Delta}T = 1.0, and a power of 0.65 is tabulated for an effect size of {Delta}T = 1.2. Interpolating between the two power values (0.51 and 0.65) for N1T = N2T = 9, we obtain a power of 0.524 for {Delta}T = 1.02. This value (0.524) corresponds to the power associated with the effect size of {delta} = 0.25 and a test based on 16 degrees of freedom.

Entering the table with sample size N1T = N2T = 10, we see that a power of 0.56 is tabulated for the effect size of {Delta}T = 1.00, and a power of 0.71 is tabulated for an effect size of {Delta}T = 1.20. Interpolating between the two power values (0.56 and 0.71) for N1T = N2T = 10, we obtain a power of 0.575 for {Delta}T = 1.02. This value (0.575) corresponds to the power associated with the effect size of {delta} = 0.25 and a test based on 18 degrees of freedom.

To obtain the power associated with an effect size of {delta} = 0.25 and a test based on 17 degrees of freedom, we must interpolate once again between these two values (0.524 and 0.575), and we obtain a power value for N1T = 9 and N2T = 10 of p2 = 0.55.
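The two-stage interpolation just described can be sketched as follows. The tabled powers are the values quoted in the text; the helper name `lerp` is hypothetical, and linear interpolation in both the effect size and the degrees of freedom is the simplifying assumption.

```python
def lerp(x, x0, y0, x1, y1):
    """Linear interpolation of y at x between the points (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


delta_t = 1.02  # operational effect size

# Step 1: interpolate across tabled effect sizes 1.0 and 1.2 at each sample size.
power_n9 = lerp(delta_t, 1.0, 0.51, 1.2, 0.65)    # N1_T = N2_T = 9, 16 df
power_n10 = lerp(delta_t, 1.0, 0.56, 1.2, 0.71)   # N1_T = N2_T = 10, 18 df

# Step 2: interpolate across degrees of freedom to reach 17 df.
power_df17 = lerp(17, 16, power_n9, 18, power_n10)

print(f"{power_n9:.3f} {power_n10:.3f} {power_df17:.2f}")
```

The three printed values correspond to the 0.524, 0.575, and 0.55 given in the text.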

It is worth noting that if no covariates had been used at either level of this analysis (i.e., if p = q = 0 and therefore {eta}B2 = {eta}W2 = 1), the power would have been 0.17. If the pretest as a covariate had been used only at the individual level (i.e., if p = 1, q = 0, {eta}B2 = 1, but {eta}W2 = .360), the power would have increased to 0.18. But if the pretest had been used as a covariate only at the school level (i.e., if p = 0, q = 1, {eta}W2 = 1, but {eta}B2 = .210), the power would have increased to 0.43. This illustrates the fact that covariates at the cluster (group) level can have far more impact on the power than covariates at the individual level.
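The source of this asymmetry can be seen by comparing the noncentrality parameter across the four covariate configurations. The sketch below uses the example's inputs ({delta} = 0.25, {rho} = .239, n = 20, m = 5) and the noncentrality expression in terms of the variance ratios; the function and dictionary names are hypothetical. It compares noncentrality parameters only and does not account for the degree of freedom lost when q = 1.

```python
import math


def noncentrality(delta, rho, n, m, eta_b2=1.0, eta_w2=1.0):
    """lambda_A = delta * sqrt(m*n / (eta_W^2 + (n*eta_B^2 - eta_W^2)*rho))."""
    return delta * math.sqrt(m * n / (eta_w2 + (n * eta_b2 - eta_w2) * rho))


scenarios = {
    "no covariates": dict(eta_b2=1.0, eta_w2=1.0),
    "individual level only": dict(eta_b2=1.0, eta_w2=0.360),
    "school level only": dict(eta_b2=0.210, eta_w2=1.0),
    "both levels": dict(eta_b2=0.210, eta_w2=0.360),
}

lambdas = {
    name: noncentrality(0.25, 0.239, 20, 5, **ratios)
    for name, ratios in scenarios.items()
}
for name, lam in lambdas.items():
    print(f"{name:22s} lambda_A = {lam:.3f}")
```

The noncentrality parameters come out at roughly 1.06, 1.11, 1.88, and 2.21. The individual-level ratio {eta}W2 enters the denominator without the factor n, whereas the cluster-level ratio {eta}B2 is multiplied by n, so reducing between-school variance moves the noncentrality parameter far more.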


    Conclusions
 
The values of intraclass correlations and variance components presented in this article provide some guidance for the selection of intraclass correlations for planning cluster-randomized experiments. These values suggest that for experiments that have samples as diverse as the nation as a whole and for those using low-SES schools, somewhat larger values of the intraclass correlation (roughly .15–.25) may be appropriate than the .05–.15 guidelines that have sometimes been used. The guideline of .05–.15 is more consistent with the values of unadjusted intraclass correlations among low-achieving schools and those of covariate-adjusted intraclass correlations we found.

In using these values, it is important to keep in mind that these analyses do not separately estimate the between-district and between-state components of variance. Therefore, these two components of variance are included here as part of the between-school variance. This is desirable if the values are to be used in connection with designs that involve schools from several districts or states. However, if the design involves schools from only a single district or state, the estimates reported here may overestimate the relevant intraclass correlations to some degree. Unfortunately, it is unclear just how much of an impact this may have. We suspect that these influences are not large, because a general rule of thumb in both sample surveys and cluster-randomized experiments is that variance components (and therefore contributions to intraclass correlations) of larger units tend to be smaller in magnitude, even though their impact on design effects may be large (because effects on variance inflation factors are proportional to the unit sample size multiplied by the intraclass correlation). Our attempts to explore this question by calculating intraclass correlations with the inclusion of state dummy variables in some of the surveys yielded only negligible effects. Note that the inclusion of multiple districts and states in national samples is also likely to have some impact on the effectiveness of the covariates in explaining between-and within-school variation. It is likely that the somewhat greater between-school variation in national samples leads to a larger intraclass correlation but also to larger covariate effects, so that these impacts partially cancel one another in their effects on statistical power.

A more detailed compilation is available from the authors providing values for regions of the country, settings with different levels of urbanicity, and regions crossed with levels of urbanicity. However, it is important to recognize that there is a trade-off between bias (estimating exactly the right value of the intraclass correlation in a particular context) and variance (the sampling uncertainty of that estimate). The variance of the intraclass correlation estimate is driven primarily by the number of clusters (in this case, schools). Although the intraclass correlations we computed in a particular region and setting are more specific and therefore likely to have less bias as estimates of the intraclass correlation in an experiment that is to be conducted within a particular region and context, the sample size used to estimate the intraclass correlations is smaller, and thus the estimate is subject to greater sampling uncertainties. Our analyses suggest that although there is often statistically significant variation in intraclass correlations between regions and settings, the magnitude of this variation is typically small. Thus, it is not completely clear whether more specific estimates are always better (i.e., more accurate) for planning purposes.

It is important to note that the power computations illustrated in this article apply to two-level experiments in which students are nested within schools. If the sampling design used is actually a three-level design (e.g., if students are sampled by classrooms within schools) then the power computations given here (or given by specialized software for computing power in two-level designs) would not be correct. Consider a sample (e.g., for a treatment group) obtained by selecting m schools, then p classrooms within each school, and then n students within each classroom. This is not a simple random sample of mpn individuals, nor is it a (two-stage) clustered sample obtained by randomly selecting pn students within each cluster (school). Instead, it is a three-stage cluster sample of m clusters (schools) and p subclusters (classrooms), with n students randomly selected within each subcluster (classroom). The sampling distribution of statistics based on such three-stage clustered samples is not the same as those based on two-stage clustered samples of the same size. For example, suppose that the (total) variance of a population with clustered structure (such as a population of students within classrooms within schools) is {sigma}T2, and that this total variance is decomposable into a between-school variance {sigma}S2, a between-classroom variance {sigma}C 2, and a within-classroom variance {sigma}W2, so that {sigma}T2 = {sigma}S2 + {sigma}C2 + {sigma}W2. Then the variance of the mean of a simple random sample of size mpn from this population would be {sigma}T2/mpn, and the variance of the mean of a two-stage cluster sample of m clusters, each of size pn from that population (with the same sample size pn per school and the same total sample size mpn) would be [1 + (pn–1){rho}S] {sigma}T2/mpn, where {rho}S = {sigma}S2/{sigma}T2 is the cluster-level (school-level) intraclass correlation. 
The variance of the mean computed from a three-stage clustered sample of m schools, p classrooms within each school, and n students within each classroom would be [1 + (pn–1) {rho}S + (n–1){rho}C] {sigma}T2/mpn, where {rho}C = {sigma}C2/{sigma}T2 is the subcluster-level (classroom-level) intraclass correlation. Note that the design effect in the three-stage cluster sample [1 + (pn–1){rho}S + (n–1){rho}C] is larger than that in the two-stage cluster sample of the same size [1 + (pn–1){rho}S], which implies that the estimated treatment effect (which is just a difference between means) estimated from the three-stage cluster sample, is less precise.
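The three variance expressions in this comparison can be sketched as follows. The function names and all numeric inputs are hypothetical and chosen only to show the ordering of the three variances.

```python
def design_effect_two_stage(p, n, rho_s):
    """1 + (pn - 1) * rho_S: schools sampled, then pn students per school."""
    return 1.0 + (p * n - 1) * rho_s


def design_effect_three_stage(p, n, rho_s, rho_c):
    """1 + (pn - 1) * rho_S + (n - 1) * rho_C: schools, classrooms, then students."""
    return 1.0 + (p * n - 1) * rho_s + (n - 1) * rho_c


def variance_of_mean(sigma_t2, m, p, n, design_effect=1.0):
    """Variance of the mean of m*p*n observations under a given design effect."""
    return design_effect * sigma_t2 / (m * p * n)


# Hypothetical design: 10 schools, 2 classrooms per school, 10 students per classroom.
m, p, n = 10, 2, 10
rho_s, rho_c, sigma_t2 = 0.20, 0.10, 1.0

de2 = design_effect_two_stage(p, n, rho_s)
de3 = design_effect_three_stage(p, n, rho_s, rho_c)
var_srs = variance_of_mean(sigma_t2, m, p, n)
var_two = variance_of_mean(sigma_t2, m, p, n, de2)
var_three = variance_of_mean(sigma_t2, m, p, n, de3)
print(var_srs, var_two, var_three)
```

With these inputs the design effects are 4.8 and 5.7, so the three-stage sample gives the least precise mean even though all three samples contain the same 200 students.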

This difference in precision of treatment effect estimates leads to a difference in the non-centrality parameters that determine statistical power. In a two-level experiment, the treatment effects are estimated from two-stage cluster samples, leading to the noncentrality parameter (with no covariates) of


Formula(13)

where {delta} is the effect size (mean difference standardized by {sigma}T). In a three-level experiment, the treatment effects are estimated from three-stage cluster samples, leading to the noncentrality parameter (with no covariates) of


Formula(14)

which is generally smaller than that computed from Equation 13. Therefore, the statistical power of three-level experiments that assign schools to treatments is generally smaller than that of the analogous experiments with two-level designs having the same number of schools and students (see Konstantopoulos, 2006). Note, however, that the issue here is not which analysis is used (two- vs. three-level) but which sampling design is used (one vs. two stages of clustering within a two- vs. three-stage sampling design).

Although we anticipate that the principal use of the results given in this article will be for planning randomized experiments in education that assign schools (rather than individuals) to treatments, there are other potential applications. One involves the use of information external to an experiment to adjust the degrees of freedom of significance tests in designs involving group randomization, called the df* method by its originators (see Murray, Hannan, & Baker, 1996). Although the originators of this method caution that it is important that users should have good reasons to assume that any external estimates used should estimate the same intraclass correlation as that in the experiment, there may be situations in which data from this compilation meet that assumption. Because they are based on relatively large samples, the intraclass correlation estimates reported in this article tend to have small standard errors. Consequently, if they are thought to be appropriate for use in a particular df* computation, they should substantially increase the degrees of freedom used in the test for treatment effects.

A second potential application is to evaluate whether the conclusions of statistical analyses that incorrectly ignored clustering might have changed if those significance tests had taken clustering into account. Hedges (in press-a) has shown how to compute the actual significance level of the usual t statistic when it has been computed from clustered samples (by incorrectly ignoring clustering). The computation of this actual significance level depends on {rho}. The values in this compilation provide some guidelines on values of {rho} that might be used for sensitivity analyses to see if a conclusion about the statistical significance of a treatment effect might not have held if clustering had been taken into account.

A third potential application involves the computation of standardized effect size estimates and their standard errors in group-randomized trials. There are several approaches to the computation of effect size estimates in multilevel designs, but in some cases, the computation of estimates and the computation of standard errors requires knowledge of {rho} (see Hedges, in press-b). In cases in which the report of the experiment itself does not include information that can be used to compute an estimate of {rho}, this compilation may provide some idea of a range of plausible values to incorporate into sensitivity analyses used in connection with effect sizes from experiments that assign schools to treatment.


    Footnotes
 
LARRY V. HEDGES is currently Board of Trustees Professor of Statistics, Professor of Education and Social Policy, and faculty fellow at the Institute for Policy Research at Northwestern University, 2040 North Sheridan Road, Evanston, IL 60610; l-hedges{at}northwestern.edu. His interests include methods for educational and social policy research.

E. C. HEDBERG is currently an advanced graduate student in the Department of Sociology at the University of Chicago, NORC Research Centers, 1155 East 60th Street, Chicago, IL 60637; ech{at}uchicago.edu. He is part of many projects that span a wide variety of interests that include the sociology of family and the life course, education and methods. His dissertation research focuses on using context-effect models and dyadic analysis to understand familial social exchange between kin.

This material is based upon work supported in part by the National Science Foundation under Grant No. 0129365 and the Spencer Foundation Grant Number 200100308.

Received for publication March 16, 2006. Revision received December 13, 2006. Accepted for publication January 2, 2007.


    References
 

  • Blair, RC, & Higgins, JJ. (1986). Comment on "Statistical power with group mean as the unit of analysis." Journal of Educational Statistics, 11, 161-169
  • Bloom, HS. (1995). Minimum detectable effects: A simple way to report statistical power of experimental designs. Evaluation Review, 19, 547-556
  • Bloom, HS. (2005). Randomizing groups to evaluate place-based programs. In Bloom, HS (Ed.), Learning more from social experiments: Evolving analytic approaches. New York: Russell Sage
  • Bloom, HS, Bos, JM, & Lee, SW. (1999). Using cluster random assignment to measure program impacts: Statistical implications for the evaluation of educational programs. Evaluation Review, 23, 445-469
  • Bloom, HS, Richburg-Hayes, L, & Black, AR. (2007). Using covariates to improve precision for studies that randomize schools to evaluate educational interventions. Educational Evaluation and Policy Analysis, 29(1), 30-59
  • Borenstein, M, Rothstein, H, & Cohen, J. (2001). Power and precision. Teaneck, NJ: Biostat
  • Cohen, J. (1977). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press
  • Curtin, TR, Ingels, SJ, Wu, S, & Heuer, R. (2002). User’s manual: NELS:88 base-year to fourth followup. Washington, DC: National Center for Education Statistics
  • Donner, A, Birkett, N, & Buck, C. (1981). Randomization by cluster. American Journal of Epidemiology, 114, 906-914
  • Donner, A, & Klar, N. (2000). Design and analysis of cluster randomization trials in health research. London: Arnold
  • Donner, A, & Koval, JJ. (1982). Design considerations in the estimation of intraclass correlation. Annals of Human Genetics, 46, 271-277
  • Gulliford, MC, Ukoumunne, OC, & Chinn, S. (1999). Components of variance and intraclass correlations for the design of community-based surveys and intervention studies. Data from the Health Survey for England 1994. American Journal of Epidemiology, 149, 876-883
  • Hedges, LV. Correcting a significance test for clustering. Journal of Educational and Behavioral Statistics. in press-a
  • Hedges, LV. Effect sizes in cluster randomized designs. Journal of Educational and Behavioral Statistics. in press-b
  • Hopkins, KD. (1982). The unit of analysis: Group means versus individual observations. American Educational Research Journal, 19, 5-18
  • Kish, L. (1965). Survey sampling. New York: Wiley
  • Klar, N, & Donner, A. (2001). Current and future challenges in the design and analysis of cluster randomization trials. Statistics in Medicine, 20, 3729-3740
  • Konstantopoulos, S. (2006). Statistical power in three-level designs (Working paper). Evanston, IL: Northwestern University, Institute for Policy Research
  • Kraemer, HC, & Thiemann, S. (1987). How many subjects? Statistical power analysis in research. Newbury Park, CA: Sage
  • Lipsey, MW. (1990). Design sensitivity: Statistical power analysis for experimental research. Newbury Park, CA: Sage
  • Miller, JD, Hoffer, T, Suchner, RW, Brown, KG, & Nelson, C. (1992). LSAY codebook. DeKalb: Northern Illinois University
  • Miller, R. (1977). Simultaneous statistical inference. New York: Springer-Verlag
  • Mosteller, F, & Boruch, R. (Eds.) (2002). Evidence matters: Randomized trials in education research. Washington, DC: Brookings Institution
  • Murray, DM. (1998). Design and analysis of group-randomized trials. New York: Oxford University Press
  • Murray, DM, & Blitstein, JL. (2003). Methods to reduce the impact of intraclass correlation in group-randomized trials. Evaluation Review, 27, 79-103
  • Murray, DM, Hannan, PJ, & Baker, WL. (1996). A Monte Carlo study of alternative responses to intraclass correlation in community trials. Evaluation Review, 20, 313-337
  • Murray, DM, Varnell, SP, & Blitstein, JL. (2004). Design and analysis of group-randomized trials: A review of recent methodological developments. American Journal of Public Health, 94, 423-432
  • Puma, MJ, Karweit, N, Price, C, Riccuti, A, & Vaden-Kiernan, M. (1997). Prospects: Final report on student outcomes, Vol. II: Technical report. Cambridge, MA: Abt Associates
  • Raudenbush, SW. (1997). Statistical analysis and optimal design for cluster-randomized experiments. Psychological Methods, 2, 173-185
  • Raudenbush, SW, & Bryk, AS. (2002). Hierarchical linear models. Thousand Oaks, CA: Sage
  • Raudenbush, SW, & Liu, X. (2000). Statistical power and optimal design for multisite randomized trials. Psychological Methods, 5, 199-213
  • Schochet, PZ. (2005). Statistical power for random assignment evaluations of educational programs. Princeton, NJ: Mathematica Policy Research
  • Snijders, T, & Bosker, R. (1993). Standard errors and sample sizes for two-level research. Journal of Educational Statistics, 18, 237-259
  • Tourangeau, K, Brick, M, Le, T, Nord, C, West, J, & Hausken, EG. (2005). Early childhood longitudinal study, kindergarten class of 1998–99. Washington, DC: National Center for Education Statistics
  • Verma, V, & Lee, T. (1996). An analysis of sampling errors for demographic and health surveys. International Statistical Review, 64, 265-294

Educational Evaluation and Policy Analysis, Vol. 29, No. 1, 60-87 (2007)
DOI: 10.3102/0162373707299706




This article has been cited by other articles:


  • J. A. Durlak. How to Select, Calculate, and Interpret Effect Sizes. J. Pediatr. Psychol., October 1, 2009; 34(9): 917-928
  • J. Spybrook and S. W. Raudenbush. An Examination of the Precision and Technical Accuracy of the First Wave of Group-Randomized Trials Funded by the Institute of Education Sciences. Educational Evaluation and Policy Analysis, September 1, 2009; 31(3): 298-318
  • R. Gersten, D. J. Chard, M. Jayanthi, S. K. Baker, P. Morphy, and J. Flojo. Mathematics Instruction for Students With Learning Disabilities: A Meta-Analysis of Instructional Components. Review of Educational Research, September 1, 2009; 79(3): 1202-1242
  • S. Konstantopoulos. Incorporating Cost in Power Analysis for Three-Level Cluster-Randomized Designs. Eval Rev, August 1, 2009; 33(4): 335-357
  • M. A. Rotondi and A. Donner. Sample Size Estimation in Cluster Randomized Educational Trials: An Empirical Bayes Approach. Journal of Educational and Behavioral Statistics, June 1, 2009; 34(2): 229-237
  • P. Z. Schochet. Statistical Power for Regression Discontinuity Designs in Education Evaluations. Journal of Educational and Behavioral Statistics, June 1, 2009; 34(2): 238-266
  • S. W. Raudenbush. Advancing Educational Policy by Advancing Research on Instruction. American Educational Research Journal, March 1, 2008; 45(1): 206-230
  • L. V. Hedges. Effect Sizes in Cluster-Randomized Designs. Journal of Educational and Behavioral Statistics, December 1, 2007; 32(4): 341-370
  • L. V. Hedges. Correcting a Significance Test for Clustering. Journal of Educational and Behavioral Statistics, June 1, 2007; 32(2): 151-179
  • H. S. Bloom, L. Richburg-Hayes, and A. R. Black. Using Covariates to Improve Precision for Studies That Randomize Schools to Evaluate Educational Interventions. Educational Evaluation and Policy Analysis, March 1, 2007; 29(1): 30-59

