Original articles
Analysis of Case-Cohort Designs
Introduction
Large cohort designs with few observed failures may require enormous resources to ascertain covariate values. Case-cohort designs can reduce data collection by efficiently sampling the censored (nondiseased) individuals. Unlike the nested case-control design, sampling is done a priori without regard to case status or time. While conceptually simple, the analysis of a case-cohort design is nontrivial and may be daunting if relying on the statistical literature for guidance. This article describes the techniques needed to fit such models and the software that can be used.
Analysis of case-cohort data resembles a standard Cox [1] regression approach with some modification. We assume that if data on the full cohort were available, then a standard Cox regression analysis would be used. Observed failures are typically more influential on the parameter estimates than censored observations. Accordingly, Prentice [2] proposed the case-cohort design to analyze cohort data efficiently when most observations are censored. Conceptually, a random sample of the cohort, or “subcohort,” is designated prospectively as the source of comparison observations for the observed failures. All failures are included whether they occur in the random sample or not, but censored observations are included only if in the subcohort.
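The selection rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dictionary keys `id` and `failed`, and the parameter `alpha` (the subcohort sampling fraction) are hypothetical and do not come from the article or its software.

```python
import random

def case_cohort_sample(subjects, alpha, seed=0):
    """Return the subjects whose covariates would need to be ascertained.

    subjects: list of dicts with keys 'id' and 'failed' (hypothetical layout).
    alpha:    subcohort sampling fraction (hypothetical parameter name).
    """
    rng = random.Random(seed)
    n_sub = round(alpha * len(subjects))
    # The subcohort is drawn without regard to failure status or time.
    subcohort_ids = {s["id"] for s in rng.sample(subjects, n_sub)}
    selected = []
    for s in subjects:
        in_sub = s["id"] in subcohort_ids
        # Keep every failure; keep a censored subject only if in the subcohort.
        if s["failed"] or in_sub:
            selected.append({**s, "in_subcohort": in_sub})
    return selected

# Toy cohort: 100 subjects, the first 5 of whom fail.
cohort = [{"id": i, "failed": i < 5} for i in range(100)]
sample = case_cohort_sample(cohort, alpha=0.2, seed=1)
```

In this toy cohort all 5 failures are retained along with the 20 subcohort members, so covariates are needed for at most 25 of the 100 subjects.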
The design appears to be very efficient because controls can be used in all risk sets for which they qualify. Furthermore, as the random sample subcohort is chosen without regard to outcome, several failure time outcomes can be analyzed with the same comparison group. Despite these advantages, the design is used infrequently in practice, with most investigators choosing nested case-control designs instead. For the period 1990–1998 MEDLINE shows 484 occurrences of the keywords “nested case-control” versus 55 occurrences of “case-cohort.” Some investigators may have been deterred by the difficult variance estimation and lack of software. Others may have been influenced by arguments that nested case-control designs may be more statistically efficient in some circumstances [3, 4]. Finally, others may be simply unaware of the design and its potential advantages.
This article describes methods of analysis including different weighting schemes used in estimation. We describe how a robust covariance matrix is computed to give standard errors for the parameter estimates. Details of how to fit the models in standard software are given. The nickel refinery dataset described in Breslow and Day [5] is used to illustrate both the case-cohort and nested case-control designs. This is an occupational cohort with staggered entry and fixed covariates. We perform a small simulation that considers different sampling fractions and shows how the estimated βs, standard errors, and efficiency vary with the analytic method.
Case-cohort design
The term “case-cohort” was coined by Prentice [2] to describe a design that is a cross between a cohort design and a case-control design, incorporating the best features of both. The design was actually proposed earlier by Miettinen [6] and called the “case-base” design, but Prentice extended the design to include failure time analysis. We describe the case-cohort design as if a prospective study was being conducted, although in many cases a retrospective study is actually performed. Consider a
Case-cohort analysis
Consider a proportional hazards model with no ties among the observed failure times. For individual i at time t, let zi(t) be the covariate vector (possibly time-dependent), and let Yi(t) indicate whether person i is at risk at time t. We assume a standard exponential form for the relative risk. If covariates are evaluated on everyone, a standard Cox model is used. If person i fails at time tj, then the contribution to the partial likelihood, assuming no tied failure times, is
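For reference, under the stated assumptions this contribution has the familiar Cox form. It is written here with β denoting the regression coefficient vector and n the number of cohort members; both symbols are assumptions, since the excerpt ends before the formula appears.

```latex
\frac{\exp\{\beta^{T} z_i(t_j)\}}
     {\sum_{k=1}^{n} Y_k(t_j)\,\exp\{\beta^{T} z_k(t_j)\}}
```

The denominator sums over everyone at risk at time t_j, which is why a full-cohort analysis requires covariates on every cohort member.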
Robust estimation of the variance
The score contributions from the pseudolikelihood maximization are not independent owing to the method of sampling [2]. Intuitively, the correlation arises because a case outside the subcohort suddenly appears at its own failure time and was not previously included in the earlier failure times. Consequently, martingale theory cannot be directly applied and more complicated asymptotics are required [2, 7]. Prentice [2] proposed a variance estimator that corrects for this correlation among score
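As a rough illustration of what a robust covariance computation can look like, the sketch below forms a sandwich-type estimate as the sum of outer products of per-subject influence values (often called dfbeta residuals). This is one common form of robust estimator and is an assumption here: the excerpt does not spell out the article's estimator, and the function names and input layout are hypothetical.

```python
import math

def robust_covariance(dfbeta):
    """Sum of outer products of per-subject influence vectors.

    dfbeta: list of rows, one per subject, each of length p
            (approximate change in each coefficient due to that subject).
    Returns a p x p covariance matrix as nested lists.
    """
    p = len(dfbeta[0])
    cov = [[0.0] * p for _ in range(p)]
    for row in dfbeta:
        for a in range(p):
            for b in range(p):
                cov[a][b] += row[a] * row[b]
    return cov

def robust_se(dfbeta):
    """Standard errors: square roots of the diagonal of the robust covariance."""
    cov = robust_covariance(dfbeta)
    return [math.sqrt(cov[a][a]) for a in range(len(cov))]

# Made-up influence values for three subjects and two coefficients.
example = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
cov = robust_covariance(example)
```

Because each subject contributes a single outer product, the estimate does not rely on the independence of individual score terms.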
Design considerations
In the introduction the case-cohort design was described for a prospective study with everyone entering at time 0. The design is also applicable to open cohorts with staggered entry. In this case it is necessary to have each new member of the cohort have probability α of being a member of the subcohort. The exact variance estimate of Prentice [2] is difficult to compute in open cohorts, so the robust variance estimate is preferred.
In some cases, the subcohort may become small after many
Software for modeling case-cohort data
Some recent changes in software make fitting case-cohort data much easier. Improvements in S-Plus and the SAS procedure PHREG now allow direct modeling with appropriate construction of the dataset. We have written an SAS macro that computes the weighted estimates and the robust covariance matrix. This macro is available on the internet through Statlib (http://lib.stat.cmu.edu/general/robphreg). The program allows any of the three weighting schemes discussed here.
To use the SAS macro, it is
Comparison to the nested case-control design
The nested case-control design is often used in the same setting in which a case-cohort design might be used. After all outcomes are determined, risk sets are formed at each failure time that enumerate all controls at risk at that same time point [12]. For a 1:m nested design, m controls are selected at random from those available at that time point. Within the risk set, sampling is without replacement, but individuals are selected with replacement across time points. Individuals are included only
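The 1:m risk-set sampling described above can be sketched as follows. This is a minimal illustration that assumes no tied failure times; the function name and the keys `id`, `entry`, `exit`, and `failed` are hypothetical and are not taken from the article.

```python
import random

def nested_case_control(subjects, m, seed=0):
    """Draw up to m controls from the risk set at each failure time.

    subjects: list of dicts with 'id', 'entry', 'exit', 'failed'
              (hypothetical layout; 'exit' is the failure or censoring time).
    """
    rng = random.Random(seed)
    failures = sorted((s for s in subjects if s["failed"]),
                      key=lambda s: s["exit"])
    sampled_sets = []
    for case in failures:
        t = case["exit"]
        # At risk at t: entered before t and still under observation at t.
        at_risk = [s for s in subjects
                   if s["id"] != case["id"] and s["entry"] < t <= s["exit"]]
        # Without replacement within a risk set; a fresh draw at each
        # failure time means a subject can be reused across time points.
        controls = rng.sample(at_risk, min(m, len(at_risk)))
        sampled_sets.append({"time": t, "case": case["id"],
                             "controls": [c["id"] for c in controls]})
    return sampled_sets

# Toy cohort: 8 subjects entering at time 0; subjects 0 and 3 fail.
cohort = [{"id": i, "entry": 0, "exit": 10 + i, "failed": i in (0, 3)}
          for i in range(8)]
risk_sets = nested_case_control(cohort, m=2, seed=1)
```

Note that a subject who fails later can still be drawn as a control at an earlier failure time, because the risk set is defined by who is under observation at that moment.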
Example
As an example, consider the Welsh nickel refinery workers and the subsequent development of nasal cancer. Breslow and Day [5, p. 223] show an analysis that uses “years since first employed” as the time axis. We consider the same model using four continuous variables: (1) log(age at first employment − 10 years); (2) (year first employed − 1915)/10; (3) (year first employed − 1915)^2/100; and (4) log(exposure + 1). The full cohort results are shown at the bottom of Table 5.10 in Breslow and Day [5].
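The four covariate transformations can be written out directly. The sketch below is illustrative only: the function and argument names are hypothetical, and the use of the natural logarithm is an assumption, since the text does not state the base.

```python
import math

def nickel_covariates(age_first_employed, year_first_employed, exposure):
    """Return the four model covariates for one worker.

    Natural logarithms are assumed; argument names are hypothetical.
    """
    centered_year = year_first_employed - 1915
    return (
        math.log(age_first_employed - 10),  # (1) log(age at first employment - 10)
        centered_year / 10,                 # (2) (year first employed - 1915)/10
        centered_year ** 2 / 100,           # (3) (year first employed - 1915)^2/100
        math.log(exposure + 1),             # (4) log(exposure + 1)
    )

# A worker first employed in 1925 at age 20 with zero recorded exposure.
x = nickel_covariates(20, 1925, 0)
```

Adding 1 before taking the log of exposure keeps the fourth covariate defined (and equal to zero) for unexposed workers.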
Simulation results
Suppose that rather than obtain covariate information for the entire nickel refinery cohort, sampling of the cohort was performed. We conducted a small-scale simulation comparing the sampling methods under different sampling fractions. For each sampling scheme, 200 samples were drawn from the full cohort and an analysis with four continuous covariates was performed. Our interest centers on the parameter estimate of log exposure, its standard error, and the associated hypothesis test.
For the
Discussion
The limited simulation suggests that the case-cohort design may be more efficient in some applications than the nested case-control design and that the unweighted analysis may be preferable. The weighted analysis may be appealing intuitively, but it could be biased away from the null hypothesis.
Langholz and Thomas [3, 4] reported that the efficiency of the case-cohort design compared with the nested case-control design was less than expected in standard survival analyses and could even be
Acknowledgements
This research was supported in part by National Cancer Institute grants CA61114 and CA63731 and the Centers for Disease Control RFP 200-95-0947. The SAS program is modeled on SAS Version 6.10 sample program PHR610EX.SAS. SAS procedure PHREG is described in the SAS Institute publication SAS/STAT Software: Changes and Enhancements through Release 6.11, 1996. The S-Plus function coxph is described in Splus4: Guide to Statistics, 1997, from Mathsoft Corporation in Seattle, WA. The Epicure program
References (25)
Nested case-control studies. Prev Med (1994).
et al. A large scale prospective cohort study on diet and cancer in The Netherlands. J Clin Epidemiol (1990).
Regression models and life tables. J R Stat Soc Series B (1972).
A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika (1986).
et al. Nested case-control and case-cohort methods of sampling from a cohort: A critical comparison. Am J Epidemiol (1990).
et al. Efficiency of cohort sampling designs: Some surprising results. Biometrics (1991).
et al. Statistical Methods in Cancer Research, Volume 2: The Design and Analysis of Cohort Studies (1987).
Design options in epidemiologic research: An update. Scand J Work Environ Health (1982).
et al. Asymptotic distribution theory and efficiency results for case-cohort studies. Ann Stat (1988).
Robust variance estimation for the case-cohort design. Biometrics (1994).