Article Text

## Abstract

**Objective:** To discuss the use of discrete time hazards models for the analysis of occupational and environmental cohort data.

**Methods:** Analytical data structures and regression methods for discrete time hazards models are described. This approach is illustrated via analyses of data from a study of mortality in a cohort of chemical workers exposed to dioxin.

**Results and conclusions:** Analyses employing a discrete time hazards model facilitate examination of observed and expected counts, the calculation of attributable fractions, and empirical description of the estimated hazard rates. In addition, this approach can be used to fit non-multiplicative models, such as the linear hazards ratio model (which has been employed in epidemiological analyses of a variety of environmental and occupational exposures).

## Statistics from Altmetric.com

The Cox proportional hazards model is widely used in epidemiological analyses of cohort data.1 Unlike Poisson regression methods applied to grouped data, the Cox proportional hazards model permits an analysis in which survival time is treated as a continuous variable and explanatory variables that were originally measured on a continuous scale may be analysed without forming categories.2 Another attractive aspect of Cox’s proportional hazards model is that it requires no assumptions about the shape of the distribution of survival times in the study population. This approach allows control for potential confounding by a temporal variable (eg, attained age) that serves as the primary scale for the analysis via conditioning on this time factor.

Cox (1972) proposed a discrete time analogue to the continuous time proportional hazards model.3 The logit model for discrete time can be fitted via standard logistic regression methods, and Prentice and Gloeckler (1978) showed that by specifying the complementary log-log link you obtain comparable parameter estimates to those obtained via fitting an underlying proportional hazards model in continuous time.4 5

#### What this paper adds

While the Cox proportional hazards model is widely used in analyses of epidemiological data, there is a useful discrete time analogue to the continuous time proportional hazards model.

A data analyst can employ standard logistic regression methods to fit a discrete time hazards model, which may be more familiar than Cox regression; and the analyst may find that this discrete time approach simplifies basic aspects of the analysis, particularly construction of time-dependent covariates.

This approach may be used in settings where event occurrence may be thought of in a discrete time framework (eg, events ascertained during annual screenings); this approach may also be an attractive option for the analysis of longitudinal data in settings where event times are recorded quite precisely.

Discrete time hazards models offer an important complement to the set of tools available for analysis of occupational and environmental cohort data.

There are some settings in which the study data naturally lead to thinking about event occurrence in a discrete time framework. One such setting arises when cases only are ascertained at regular, discrete points in time. In an occupational context, for example, if disease status is ascertained only during annual screenings or periodic surveying of a study cohort then a data analyst may not know the date of incidence, but rather simply know that the event occurred during a particular discrete time interval. Another setting in which discrete time models are frequently used is when continuous time data have been grouped into intervals, for example, due to administrative practices (eg, recording of events in intervals defined by billing periods), confidentiality concerns (eg, truncation of dates to protect privacy), or data collection constraints (eg, respondents may not be able to supply dates more precise than year). In such settings two or more individuals may have an identical event time; such occurrences are referred to as tied events. Discrete time models offer a framework for readily handling situations in which tied events occur.

Some investigators, however, find discrete time models attractive even for the analysis of longitudinal data in which event times are recorded quite precisely and time is viewed as a continuous metric. A data analyst can employ a technique, logistic regression, that may be more familiar than Cox regression; and, when contrasted with Cox regression, some people find that the discrete time hazards approach simplifies basic aspects of the analysis, particularly if the analysis involves time-dependent covariates.5 6

This paper discusses the use of discrete time hazards models specifically in the context of analysis of occupational and environmental cohort data. Examples are provided of some of the attractive aspects of this approach. These methods are illustrated using data for a cohort study of workers exposed to dioxin and followed over time to ascertain deaths due to cancer.

## Methods

Consider a hypothetical cohort study in which incident cases of a non-repeatable event (eg, death, or first diagnosis of disease) have been ascertained over a period of follow-up. Suppose that a data analyst is interested in the association between an exposure variable, *X*, and disease incidence in this cohort. If it is a closed cohort (ie, a cohort with no staggered entry or censoring) then the association between disease status and *X* might be quantified by contrasting the incidence proportion for each level of the exposure variable (as long as exposure status was not time-varying). Unfortunately, cohorts typically are not closed, but rather suffer the problems of staggered entry and censoring. Censoring of observations may occur because people are lost to follow-up or simply because many people remain disease-free and under study at the end of follow-up. Given these limitations, a data analyst cannot estimate incidence proportions in their study cohort over the entire study period.

Over a narrow interval of observation, however, the epidemiological cohort data may conform quite well to a closed cohort. Suppose that continuous time is divided into a sequence of *j* contiguous time periods. People observed at the start of an interval may be observed over the span of that interval with little or no loss to follow-up. By partitioning the study period into a sequence of narrow intervals, a data analyst may be able to validly estimate the incidence proportion over each of these sub-periods.7 The group of people observed at the start of the interval who are at risk of the event is often referred to as a *risk set*; let *P _{ij}* denote the probability that person

*i*in risk set

*j*will experience the target event during that time interval, conditional on their event-free survival up to the start of time interval

*j*. Willet and Singer refer to

*P*as the discrete time hazard; it has the form of the expected number of events per unit interval of time, and consequently is sometimes called a rate.

_{ij}### Data structure for discrete time analyses

A standard person-oriented data structure is one in which each person in the study cohort has one record in the analytical data file (ie, there is one row of data per person under study). This row of data would typically record the entry and exit times from the study, a binary indicator of censoring status, and a set of explanatory variables of interest.

In contrast, an analysis of the discrete time hazards function requires transforming a person-oriented data structure into a person-period data structure, in which each person has multiple records in the data file.5 6 Specifically, person *i* contributes one row of data for each time period that person *i* is at risk of the event. A binary indicator of case status is associated with each observation. The indicator is assigned a value of “0” for each observation until the date of last observation; at that point, a value of “1” is assigned to cases while a value of “0” is assigned to censored observations. Each record for a person-period is associated with an indicator of the time period of observation, for example, as indexed by attained age. Other time-varying (eg, cumulative exposure) or fixed (eg, sex) explanatory variables are readily accommodated in this data structure.

### Regression models for discrete time hazards

Since *P _{ij}* is bounded by 0 and 1 it can be modelled as having a logistic dependence on a set of predictors as,

where *t*_{1ij}-*t _{Jij}* are binary indicator variables representing the

*J*time periods of observation and

*x*

_{1ij}-

*x*are explanatory variables of substantive interest. This model allows for a parametric description of the baseline hazard function via the parameters

_{Pij}*α*and estimation of the impact of the explanatory variables on the log odds of disease via a vector of parameters

_{1}-α_{J}*β*. The model implies that the predictor variables are linearly associated with the logistic transform of the hazard.

_{1}– β_{P}For some occupational and environmental cohort analyses, however, the complementary log-log transformation may be preferable to the logit transformation of the hazard. Survival times typically are measured on a continuous scale and if tied events occur it is usually a consequence of grouping or coarse measurement of the time scale. In such settings, Prentice and Gloeckler (1978) showed that fitting a discrete time hazards model with a complementary log-log link (sometimes abbreviated as clog-log), of the form log[−log(1−*P _{ij}*)] =

*α*

_{1}

*t*

_{1ij+}

*α*

_{2}

*t*

_{2ij+…+}

*α*

_{J}t_{Jij+}β_{1}

*χ*

_{1ij+}

*β*

_{2}χ_{2ij+…+}

*β*results in parameters,

_{P}χ_{Pij}*β*, that are equivalent to the population parameters for the underlying proportional hazards model presumed to generate the observed data.4

The antilog of a coefficient from a model with a logit link is an estimator of a ratio of hazard odds, while the antilog of a coefficient from a model with a complementary log-log link is an estimator of a hazard ratio. Analyses using the logit and complementary log-log link functions tend to result in similar parameter estimates and test statistics as long as the conditional probability of case failure is not large.5

### Modelling the baseline hazard

Singer and Willet describe a flexible form of the discrete time hazards model that includes a separate binary indicator variable for each time period of observation.6 In the SAS statistical package, a complementary log-log model for discrete time hazards with time intervals indexed by the variable *period*, an explanatory variable *X*, and a binary outcome variable *case*, could be fitted via PROC LOGISTIC, as follows:

proc logistic data = ;

class period (param = ref);

model case(event = “1”) = period X/LINK = CLOGLOG; run;

Estimation of a unique parameter for each time interval of observation in order to describe the baseline hazard function may be inefficient; in fact, a regression model may not converge if there is a time interval without observed events. Consequently, it is often useful to explore the fit of a restricted model for the dependence of the baseline hazard function on time. Consider an example in which person-periods are enumerated by intervals of attained age. Rather than including a series of indicator variables for these time intervals, a data analyst might assign a score to each period (eg, the period-specific mean value of the time scale) and utilise a model in which the complementary log-log hazard is constrained to be linear in age, *t*, as follows:

proc logistic data = ;

model case(event = “1”) = t X/LINK = CLOGLOG; run;

Alternatively, a data analyst might model the baseline hazard function using a polynomial, fractional polynomial, or spline function of the time scale *t*.

### Examination of baseline hazards, fitted values and attributable fractions

For any specified covariate pattern a data analyst can calculate the fitted hazard from the discrete time hazards model. Tabulations of observed versus predicted counts provide useful information about model fit. Predicted numbers of events, or fitted values, may be calculated by using the estimated parameters for a model (multiplied by the appropriate covariate pattern) to compute the estimated hazard probability for each person-period. The summation of the hazard probabilities of all person-periods in a group defined by a covariate pattern of interest is the predicted number of events for that group. Fitted values for an estimated model may be output via most standard statistical packages. Using the SAS statistical software package, fitted values, *fv*, may be written to a file named *fitted* as follows:

proc logistic data = ;

model case(event = “1”) = t X/LINK = CLOGLOG;

output out = fitted p = fv; run;

Consider a regression analysis of the association between an exposure *X* of primary interest, adjusting for the baseline variation in the hazard with time as a linear function of *t*, via a model of the form log[−log(1−*P _{ij}*)] =

*α*

_{0}+

*α*

_{1}

*t*+

_{ij}*β*

_{1}

*χ*. The calculation of the hazard for a specified covariate pattern under the complementary log-log model is obtained by the inverse transformation,

_{ij}*P*= 1−exp(−exp(

_{ij}*α*

_{0}+

*α*

_{1}

*t*

_{ij}+

*β*

_{1}

*χ*

_{ij})), yielding the fitted value for each person-period. Other useful quantities include the estimated number of background events,

*bk*(ie, the hazard estimated in the absence of exposure to the agent of interest, 1−exp(−exp(

*α*

_{0}+

*α*

_{1}

*t*

_{ij})), and the estimated number of excess events,

*ex*(the fitted value minus the background,

*ex*=

*fv*−

*bk*). The ratio of excess to fitted values may be referred to as the attributable fraction,

*AF*= (

*ex*/

*fv*).

The calculation of the hazard for any specified covariate pattern under the complementary log-log model is obtained via the same inverse transformation. Using this approach, one can calculate the estimated hazard rate for a given attained age, for example, compare the fitted baseline age-specific rates to expectations based upon the literature, or estimate the impact of exposure in terms of absolute rates (ie, rate differences) rather than relative rates (ie, hazard ratios). Such results can aid in interpretation of findings and assessment of their consistency with prior research.

### Non-multiplicative models

The above presentations are exponential models in which the log hazard of case occurrence is modelled as a function of a linear predictor. One useful alternative to the exponential model is the linear hazards ratio model of the form

Specialised software, such as the PEANUTS module of the EPICURE statistical package, is typically employed to fit linear hazards ratio models.8 However, in the discrete time hazards framework, maximum likelihood estimates for each of the parameters in the above model, along with their associated standard errors, can be obtained via SAS PROC NLMIXED.9 Consider a model in which the time scale is modelled as a linear function of the variable *t*, with an explanatory variable *X*, which may be fitted via SAS as follows:

**proc nlmixed** data = ;

nu = exp(a0+a1*t)*(1+b1*X);

p = **1**−exp(−nu);

model case ∼ binary(p);

**run**;

#### Example: a cohort study of cancer mortality among workers exposed to 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD)

The study cohort includes 3538 male workers employed at eight plants that produced chemicals contaminated with TCDD.10 11 All male workers who had ever been employed in a TCDD-exposed job between 1 January 1942 and 31 December 1984 were eligible to be included in the study; vital status follow-up was conducted through 31 December 1993. Start of follow-up began on date of employment and continued to date of death, date last observed, or the study’s end, whichever came first. The outcome of interest in these analyses is cancer mortality, defined by underlying cause death coded to the International Classification of Diseases, 8th Revision (codes 140–207). The exposure variable of interest is estimated TCDD exposure levels. A job-exposure matrix was developed for these workers to derive a daily exposure score; for each worker these scores were accumulated over time to derive annual exposure scores.

A person-period data structure was generated with one record per person-month of observation. The association between cancer mortality and TCDD was examined via fitting of discrete time hazards model with a complementary log-log link of the form log[−log(1−*P _{ij}*)] =

*α*

_{1}

*t*

_{1ij+}

*α*

_{2}

*t*

_{2ij+…+}

*α*

_{J}t_{Jij+}β_{1}

*χ*

_{1ij+}

*β*

_{2}χ_{2ij+…+}

*β*. A parsimonious baseline model was developed to describe the relationship between attained age and cancer risk. This model included a linear-quadratic function of the natural logarithm of attained age; for consistency with prior analyses we also included indicator terms for birth cohort (year of birth in quartiles). For comparability with previously reported regression analyses of these data, the primary analyses examined the association between cancer mortality and TCDD score calculated under a 15-year exposure lag assumption; an exposure-response trend was estimated in which TCDD score was log transformed (with 0.001 added to each worker’s cumulative exposure score to avoid taking the logarithm of 0).10 The results obtained via discrete time hazards regression using SAS PROC LOGISTIC, employing a complementary log-log link, were compared with those obtained via fitting of Cox proportional hazards regression models to these data via the SAS PHREG procedure. Observed and expected counts were tabulated and reported to assess model fit, and excess cases were compared with fitted values in order to derive an estimate of the attributable fraction among those exposed to TCDD.

_{P}χ_{Pij}In order to illustrate the use of this method for analyses that employ a non-multiplicative model form, we also estimated the association between TCDD score and cancer mortality via a linear hazards ratio model of the form

The results obtained via discrete time hazards regression using SAS PROC NLMIXED were compared with those obtained via fitting of a proportional hazards regression model via the statistical program PEANUTS in the EPICURE software package.10

## Results

A total of 256 deaths due to cancer were ascertained among the 3538 workers included in this analysis. Table 1 shows the distribution of birth cohort among cases and non-cases in the study cohort. More than half of the cancer deaths were ascertained among workers born prior to 1922, while less than a quarter of the non-cases were born in that period. Decedents due to cancer were employed for a longer average duration than non-cases, tended to start employment at older ages than non-cases, and accrued higher average cumulative exposures than non-cases.

Table 2 reports the association between cancer mortality and the natural logarithm of cumulative exposure under a 15-year lag assumption, as estimated via standard log-linear Cox proportional hazards regression model. Table 2 also reports the results of analysis of these data implemented via a discrete time hazards model; unlike the Cox model, the discrete time hazards model explicitly includes parameters to describe the baseline hazard function in terms of a linear-quadratic function of the logarithm of attained age. The estimated association between cancer mortality and cumulative exposure score derived via the discrete time hazards model was very similar to the estimate obtained via Cox regression.

Table 3 reports the observed, background, and excess number of events by categories of cumulative exposure. A large proportion of the observed person-periods (55%) had cumulative exposure scores <1 unit; consequently, the logarithm of these exposure scores was negative (the smallest negative value being −6.91, since 0.001 was added to all scores to avoid taking the logarithm of zero). As shown in table 3, given the negative exposure scores, coupled with a positive dose-response trend, the fitted excess number of cases in the lowest cumulative exposure category is negative. Overall, the fitted model implies that 42 of the 256 cancer deaths were exposure-associated excess cases; the attributable fraction of deaths among workers was 16.7%.

Table 4 reports the results of a fitted linear hazards ratio model. The estimated association between TCDD exposure score (log transformed and under a 15-year lag assumption), as described by the log hazard ratio obtained via fitting of a discrete time hazard model with a complementary log-log link was similar to that obtained via fitting a linear proportional hazards model to the data via the PEANUTS program in the EPICURE statistical package.

## Discussion

This paper discusses the use of discrete hazards models for analyses of epidemiological cohort data. These methods employ a full likelihood and exact methods for tied data. These methods permit a data analyst to directly estimate the baseline hazard function and to test hypotheses about the dependence of the hazard function on time. Furthermore, as illustrated in table 3, tabulations of fitted values can aid in regression model diagnostics and calculation of quantities such as the attributable fraction. In the illustrative example, examination of fitted values highlighted the implications of the positive regression coefficient for the logarithm of TCDD score in a setting where many workers had negative TCDD scores (due to taking the logarithm of values less than unity). Furthermore, this paper illustrates how analyses employing discrete time hazards regression methods can accommodate multiplicative (ie, log-linear) models as well as non-multiplicative models such as a linear hazards ratio model.

In the occupational epidemiology literature, Poisson regression methods remain a widely used approach to analysis of rates. Typically, Poisson regression models are fitted to grouped data (ie, a multidimensional table in which person-time and events are cross-classified by categories of predictor variables). Given that all covariates are categorised for the tabulation of person-time and events into a multidimensional table, estimation of an exposure-response trend using a continuous exposure metric requires that a score must be assigned to each exposure category. Poisson regression results may be sensitive to decisions about the cut points used to categorise continuous exposure variables and the method used to assign scores to exposure categories.12 The use of the discrete hazards model for analyses of epidemiological cohort data does not require categorisation of exposure variables; the only grouping of the study data that is necessitated is grouping of time into intervals. It should be noted that the same person-period data structure (ie, ungrouped person-time data) may be analysed via Poisson regression methods; in such an approach the data analyst can derive estimates of rate ratios from a log-linear Poisson regression model fitting.13 The discrete time hazard models described in this paper are similar to the Poisson regression approach applied to a person-period data structure. A potential appeal to analysing such data via the complementary log-log model, as Prentice has shown, is that regression analysis under the complementary log-log model leads to direct estimation of the population parameters under the assumption that the data are derived from a proportional hazards model.4

The methods illustrated in this paper may be extended to allow direct fitting of models that incorporate parametric latency functions.14 That paper described analyses of data derived from the case-control study design; the use of a discrete time hazards model permits extension of that approach to data derived from the cohort study design.

Prior analyses of the data for the occupational cohort of chemical workers examined in this paper demonstrated a statistically significant increasing trend in relative risk of all cancer mortality with increasing exposure level (on a log scale, under a 15-year exposure lag assumption).10 11 These results were readily replicated via discrete time hazards regression, as well as extended by calculation of the estimated absolute hazard rate for given exposure patterns, the observed and expected case counts, the attributable fraction of cancer deaths among those exposed, and the estimated association under a linear hazards ratio model.

One limitation of discrete time models for survival analysis is that the required person-period data structure often will include a large number of records; this is especially true when the number of individuals under study is large and the time intervals are small. In some settings this can become burdensome, although computational issues related to handling data files are of less concern now than in the past. Another potential limitation of discrete time models for survival analysis is that they require that the data analyst model the effects of all temporal factors; in contrast, the Cox regression model allows control for potential confounding by the temporal variable that serves as the primary scale for the analysis. Nonetheless, discrete time hazards models offer an important complement to the available approaches for occupational cohort data analysis.

## Footnotes

Funding This project was supported by grant R01-CA117841 from the National Cancer Institute, National Institutes of Health.

Provenance and Peer review Not commissioned; externally peer reviewed.

Competing interests None.