Smoothing is soothing, and splines are fine
K Steenland

Correspondence to: Dr K Steenland, Emory University, Rollins School of Public Health, 1518 Clifton Road, Atlanta, GA 30322, USA

Commentary on the paper by Eisen et al (Occup Environ Med, October 2004)*

Eisen and colleagues have provided a good example of the use of smoothing splines in a thorough analysis of exposure-response data, for a study of lung cancer in relation to silica exposure.1 Exposure-response data are increasingly important for two reasons.

First, as noted by Bradford Hill, a positive exposure-response provides support for a causal interpretation of an association. In the case of silica and lung cancer, evidence of a positive exposure-response in several studies has provided important support for the original 1997 IARC judgement that silica is a Group 1 (definite) carcinogen. That judgement has remained controversial because in some studies the exposed population has not had a higher lung cancer rate than the non-exposed comparison group. Some have argued that this may be because the surface properties of silica differ between settings and confer different toxicities, so that in some cases silica may not increase lung cancer risk. However, the explanation may simply be that some cohorts did not include enough highly exposed subjects. Our own exposure-response analysis of 10 silica exposed cohorts (60 000 workers) indicated that there is indeed a positive exposure-response for silica, but that the increase in risk is seen primarily at higher exposures, and that the overall slope of the exposure-response curve is relatively low compared to classic lung carcinogens such as nickel and asbestos.5 This relatively low slope may be the reason why it has been difficult to show that silica does indeed cause lung cancer.

Second, exposure-response data are what regulators need to conduct quantitative risk assessment. Regulators want to know the amount of excess risk incurred due to exposure at different levels, and only good exposure-response data can answer this question. For example, the US Occupational Safety and Health Administration (OSHA) typically sets limits based on a level of exposure over a working lifetime which permits an excess risk of at most 1 per 1000 (0.1%) above background risk. The background lifetime risk of lung cancer is about 5%. OSHA would therefore seek a permissible limit for silica exposure which would allow a lifetime risk of no greater than 5.1%. The current limit for silica exposure is 0.1 mg/m³. It is clear from exposure-response analyses of silica, including the paper by Eisen and colleagues,1 that the current limit is too permissive. Several analyses indicate that 40 years of exposure at the current standard results in an excess lifetime risk of lung cancer of the order of 1–2%, rather than the goal of 0.1%.
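
The arithmetic behind this comparison can be laid out in a few lines. The sketch below uses only the figures quoted above; the variable names are ours and purely illustrative.

```python
# Figures quoted in the text; variable names are illustrative only
background_risk = 0.05            # background lifetime risk of lung cancer (about 5%)
target_excess = 0.001             # regulatory goal: at most 1 per 1000 excess risk
estimated_excess = (0.01, 0.02)   # excess risk of the order of 1-2% at the current limit

# Maximum total lifetime risk permitted under the goal
max_lifetime_risk = background_risk + target_excess

# How many times the estimated excess exceeds the goal
fold_over_goal = tuple(e / target_excess for e in estimated_excess)

print(f"Permitted lifetime risk: {max_lifetime_risk:.1%}")
print(f"Estimated excess is about {fold_over_goal[0]:.0f} to "
      f"{fold_over_goal[1]:.0f} times the goal")
```

On these figures the estimated excess is an order of magnitude above the regulatory goal.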

Splines and other types of smoothing functions are a middle ground between two traditional approaches. One is the categorical analysis of exposure, which avoids any parametric assumptions and lets the “data speak for itself”. The other is the parametric analysis, which uses exposure as a continuous variable in a model in which the investigator, sometimes without realising it, imposes a shape on the exposure-response curve. For example, in using logistic or Cox regression with exposure entered as a continuous term, the investigator is assuming that the log of the rate ratio is a linear function of exposure. This assumption requires justification, because the model may not fit the data, and a thorough search for the best fitting model needs to be conducted.
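
Written out, the log-linear assumption takes roughly the following form. The symbols are notation introduced here for illustration rather than taken from the paper: x denotes exposure, RR(x) the rate ratio at that exposure relative to no exposure, and β the fitted coefficient.

```latex
\log \mathrm{RR}(x) = \beta x
\quad\Longleftrightarrow\quad
\mathrm{RR}(x) = e^{\beta x}
```

Under this form, each additional unit of exposure multiplies the rate ratio by the same factor, e^β, wherever it falls in the exposure range. That is the shape the investigator is imposing.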

Categorical analyses have their own limitations. The investigator must choose the number and placement of the cutpoints defining the categories; these choices may be arbitrary and may heavily influence the apparent “shape” of the exposure-response. Furthermore, a categorical analysis assumes a single exposure effect within each category, that is, a rate ratio that is constant across the category. This assumption is obviously false when the category is reasonably wide.
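
A small sketch makes the step-function behaviour concrete. The cutpoints and rate ratios below are hypothetical values chosen for illustration, not estimates from any study.

```python
import numpy as np

# Hypothetical cutpoints and per-category rate ratios (not from any study)
cutpoints = np.array([1.0, 2.0, 4.0])          # category boundaries, exposure units
category_rr = np.array([1.0, 1.3, 1.6, 2.0])   # one rate ratio per category; first is the referent

def categorical_rr(exposure):
    """Return the rate ratio a categorical model assigns to each exposure."""
    # digitize gives 0 below the first cutpoint, up to len(cutpoints) above the last
    idx = np.digitize(exposure, cutpoints)
    return category_rr[idx]

exposure = np.array([0.5, 2.1, 3.9, 4.1])
rr = categorical_rr(exposure)
print(rr)
```

Exposures of 2.1 and 3.9 fall in the same category and so receive the same rate ratio, whereas 3.9 and 4.1 sit on either side of a cutpoint and are assigned different values despite being nearly identical.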

Smoothing functions do not impose the form of a simple parametric model on the data, yet they avoid some of the pitfalls of categorical analysis by being less dependent on the choice of cutpoints and by providing a continuous curve rather than a step function. They are primarily useful graphically, for seeing the shape of the exposure-response curve. The shape of the curve may provide a hint for choosing the best simple parametric model, which will in turn provide a concise summary of the exposure-response and be useful for quantitative risk assessment. Splines, one type of smoothing function, may themselves be used for quantitative risk assessment, because they permit a quantitative estimate of risk at any specific level of exposure.

The idea of smoothing functions stems from using a simple moving average of “y” across local regions of “x”, often a weighted average in which the centre points in the region have more weight than the outermost points, and the average is calculated for one region after another as one moves across the x-axis. This produces a smooth curve in which the investigator imposes minimal constraints on the shape of the curve. Such curves have a long history, including, for example, the common moving average of the stock market calculated across time. These are non-parametric curves, in that there is no simple function with a few parameters which can summarise the curve.
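
The weighted moving average described above can be sketched as follows. This is a minimal illustration under our own assumptions: the triangular weighting scheme, the window half-width, and the simulated data are not drawn from any particular analysis.

```python
import numpy as np

def weighted_moving_average(x, y, half_width):
    """Smooth y over x with a triangular-weighted local average."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    smoothed = np.empty_like(y)
    for i, centre in enumerate(x):
        distance = np.abs(x - centre)
        # Triangular weights: 1 at the centre, falling linearly to 0 at the window edge
        weights = np.clip(1.0 - distance / half_width, 0.0, None)
        smoothed[i] = np.sum(weights * y) / np.sum(weights)
    return smoothed

# Simulated data: a smooth trend plus noise (hypothetical)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 101)
y = np.log1p(x) + rng.normal(scale=0.3, size=x.size)

y_smooth = weighted_moving_average(x, y, half_width=1.5)
```

A wider window gives a smoother curve at the cost of blurring genuine local features. The half-width is the only constraint the investigator imposes on the shape.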

Splines are an extension of this idea in which a regression of “y” on “x” is carried out in each local region as one moves across the “x” axis. Cubic splines are one common type of spline in which the effect measure (for example, the log of the rate ratio) is regressed on a cubic function of exposure (the x-axis) within each of several regions or categories of exposure that together span the entire range of exposure. The pieces are joined to produce a single smooth curve across these regions. Penalised splines, as used by Eisen and colleagues,1 are another variant of splines in which there is a penalty for rapid change in the slope of the curve in any given region of the x-axis. At this point it is not clear whether they offer any particular advantage over more traditional cubic or quadratic splines. The details of the differences between types of spline functions need not overly concern investigators, as long as they understand the basic idea. Statistical software increasingly makes spline functions available to investigators, although the epidemiologist may still need a statistician’s help with the programming.
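
For readers who want to see the basic idea in code, the following is a minimal sketch of one common way to build a penalised cubic spline. It fits a truncated power basis with a ridge penalty on the knot coefficients, which shrinks the fit towards a single global cubic. It is not necessarily the formulation used by Eisen and colleagues, and the knots, penalty values, and simulated data are all hypothetical.

```python
import numpy as np

def truncated_cubic_basis(x, knots):
    """Design matrix: 1, x, x^2, x^3, then (x - k)^3_+ for each knot k."""
    x = np.asarray(x, dtype=float)
    polynomial = np.vander(x, 4, increasing=True)
    hinges = np.clip(x[:, None] - np.asarray(knots)[None, :], 0.0, None) ** 3
    return np.hstack([polynomial, hinges])

def fit_penalised_spline(x, y, knots, penalty):
    """Least squares with a ridge penalty on the knot (hinge) coefficients only."""
    basis = truncated_cubic_basis(x, knots)
    n_knots = len(knots)
    # Extra rows implement the penalty; the global cubic terms are left unpenalised
    penalty_rows = np.hstack([np.zeros((n_knots, 4)),
                              np.sqrt(penalty) * np.eye(n_knots)])
    augmented_basis = np.vstack([basis, penalty_rows])
    augmented_y = np.concatenate([np.asarray(y, dtype=float), np.zeros(n_knots)])
    coef, *_ = np.linalg.lstsq(augmented_basis, augmented_y, rcond=None)
    return coef

def predict(x, knots, coef):
    """Evaluate the fitted spline at the given exposures."""
    return truncated_cubic_basis(x, knots) @ coef

# Simulated data (hypothetical): exposure rescaled to 0-1, noisy log rate ratio
rng = np.random.default_rng(1)
exposure = np.linspace(0.0, 1.0, 60)
log_rr = np.log1p(3.0 * exposure) + rng.normal(scale=0.15, size=exposure.size)
knots = np.array([0.25, 0.5, 0.75])

coef_light = fit_penalised_spline(exposure, log_rr, knots, penalty=1e-6)
coef_heavy = fit_penalised_spline(exposure, log_rr, knots, penalty=1e3)
curve = predict(exposure, knots, coef_heavy)
```

With a very small penalty the fit behaves like an ordinary cubic regression spline. A large penalty suppresses the knot terms, so the curve cannot change shape abruptly from one region to the next. In practice the penalty is usually chosen by a data-driven criterion in the software rather than set by hand.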

One important point of the analysis by Eisen et al is the influence of outlier observations, in this case two non-cases with very high exposure values. These two controls produced the downward turn of the exposure-response curve in the highest regions of exposure. Eisen et al analyse their data with and without these two outliers. Without them, the curve tends to continue to increase at the highest exposures. They note that the shape of the curve in the low and medium dose region does not really change whether the outliers are included or not. This is important because it is this low and medium dose region that in practice matters to risk assessors. It is quite likely that measurement error is greater in such extreme high dose regions, where there are few data. An alternative approach, also illustrated by Eisen et al, is to consider the log of exposure rather than exposure itself. Taking logs tends to reduce the influence of the highest exposures. A log transformation of exposure tends to produce curves in which the rate ratio stops increasing or plateaus at the highest exposures, a phenomenon which seems consistent with data observed for a large number of occupational carcinogens.4 There are a number of plausible reasons for such a plateau, including mismeasurement at the highest exposures, an exhaustion of susceptibles at high exposure, and saturation of biological pathways.
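
The effect of the log transformation on the shape of the curve can be seen in a short comparison. The coefficients and exposure values below are hypothetical and chosen only to show the contrast, and the sketch assumes strictly positive exposures so that the log is defined.

```python
import numpy as np

def rr_log_linear(exposure, beta):
    """Model with untransformed exposure: log RR = beta * exposure."""
    return np.exp(beta * exposure)

def rr_log_log(exposure, beta):
    """Model with log exposure: log RR = beta * log(exposure); referent is exposure = 1."""
    return np.exp(beta * np.log(exposure))

# Equally spaced, strictly positive exposures (hypothetical units)
exposure = np.array([1.0, 11.0, 21.0, 31.0, 41.0])

linear_curve = rr_log_linear(exposure, beta=0.05)   # hypothetical coefficient
log_curve = rr_log_log(exposure, beta=0.3)          # hypothetical coefficient

# Increase in rate ratio for each successive 10-unit step in exposure
linear_steps = np.diff(linear_curve)
log_steps = np.diff(log_curve)
print("untransformed exposure:", np.round(linear_steps, 2))
print("log exposure:          ", np.round(log_steps, 2))
```

With untransformed exposure, each successive step adds more to the rate ratio than the last. With log exposure, each step adds less, so the curve flattens towards the top of the range and the highest exposures carry less weight.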

Clearly it is incumbent on epidemiologists to collect the best exposure data possible. But the job does not end there. We must use the rich exposure data to the fullest in our exposure-response analyses, and new analysis techniques have become available for this.2,3,6
