
Original article
Inside the black box: starting to uncover the underlying decision rules used in a one-by-one expert assessment of occupational exposure in case-control studies
David C Wheeler,1,2 Igor Burstyn,3 Roel Vermeulen,4 Kai Yu,5 Susan M Shortreed,6 Anjoeka Pronk,7 Patricia A Stewart,8 Joanne S Colt,1 Dalsu Baris,1 Margaret R Karagas,9 Molly Schwenn,10 Alison Johnson,11 Debra T Silverman,1 Melissa C Friesen1
1Occupational and Environmental Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA
2Department of Biostatistics, Virginia Commonwealth University, Richmond, Virginia, USA
3Department of Environmental and Occupational Health, Drexel University, Philadelphia, Pennsylvania, USA
4Environmental and Occupational Health Division, Institute for Risk Assessment Sciences, Utrecht University, Utrecht, The Netherlands
5Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA
6Biostatistics Unit, Group Health Research Institute, Seattle, Washington, USA
7TNO, Zeist, The Netherlands
8Stewart Exposure Assessments, LLC, Arlington, Virginia, USA
9Department of Community and Family Medicine, Dartmouth Medical School, Hanover, New Hampshire, USA
10Maine Cancer Registry, Augusta, Maine, USA
11Vermont Cancer Registry, Burlington, Vermont, USA

Correspondence to Dr Melissa C Friesen, Occupational and Environmental Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 6120 Executive Blvd, Room 8106 MSC 7240, Bethesda, MD 20892-7240, USA; friesenmc@mail.nih.gov

Abstract

Objectives Evaluating occupational exposures in population-based case-control studies often requires exposure assessors to review each study participant's reported occupational information job-by-job to derive exposure estimates. Although such assessments likely have underlying decision rules, they usually lack transparency, are time consuming and have uncertain reliability and validity. We aimed to identify the underlying rules to enable documentation, review and future use of these expert-based exposure decisions.

Methods Classification and regression trees (CART, predictions from a single tree) and random forests (predictions from many trees) were used to identify the underlying rules from the questionnaire responses and an expert's exposure assignments for occupational diesel exhaust, for several metrics: binary exposure probability, and ordinal exposure probability, intensity and frequency. Data were split into training (n=10 488 jobs), testing (n=2247) and validation (n=2248) datasets.

Results The CART and random forest models’ predictions agreed with 92–94% of the expert's binary probability assignments. For ordinal probability, intensity and frequency metrics, the two models extracted decision rules more successfully for unexposed and highly exposed jobs (86–90% and 57–85%, respectively) than for low or medium exposed jobs (7–71%).

Conclusions CART and random forest models extracted decision rules and accurately predicted an expert's exposure decisions for the majority of jobs, and identified questionnaire response patterns that would require further expert review if the rules were applied to other jobs in the same or different study. This approach makes the exposure assessment process in case-control studies more transparent, and creates a mechanism to efficiently replicate exposure decisions in future studies.


What this paper adds

  • Expert-based exposure assessment of occupational risk factors in population-based case-control studies is challenging and time consuming, and is criticised for its lack of transparency. Evaluating exposures in these studies often requires exposure assessors to review each study participant's reported occupational information job by job, to derive exposure estimates.

  • The structured format of occupational history and job-specific modules in questionnaires, however, makes it possible to identify underlying expert exposure decision rules.

  • The present study is the first to use the statistical learning techniques of classification and regression trees and random forests to identify the underlying decision rules of an exposure assessor.

  • The good agreement between the model predictions and the one-by-one expert evaluations supports extracting transparent, identifiable decision rules from previously made expert assessments. This approach can be used to focus expert review efforts on the questionnaire response patterns for which the models replicated the expert's decisions less well.

Introduction

Exposure assessment of occupational risk factors in population-based studies is challenging. These studies rely on subject-reported lifetime occupational histories, and in some studies, on the subjects’ responses to more detailed questions in job-specific or industry-specific modules. Typically, one or more exposure assessors review the questionnaire responses one job at a time to ascertain exposure—a time-consuming activity when each subject reports an average of six jobs over a lifetime.1–3 While experts may document their decision rules, these exposure decisions are rarely explicitly published, and thus provide no mechanism for others to evaluate or reproduce these assessments. This lack of transparency faces substantial criticism.4,5 As a result, alternative approaches are being implemented that use structured model-based exposure assessments to apply expert-based decision rules based on patterns in questionnaire responses.1,6

If decision rules can be successfully applied to questionnaire responses in epidemiologic studies, it raises the question of whether we can learn from the patterns of exposure decisions previously made by experts and apply them to other studies. We refer to these patterns as latent or underlying decision rules, because although the experts have used rules in making their assessments, the explicit rules that relate the questionnaire response patterns to the exposure decision may not have been documented. These underlying decision rules are valuable, given the time-consuming nature of the assessments and the limited number of available experts with broad knowledge about historical occupational situations.

To determine if underlying decision rules can be identified, we applied two statistical learning approaches designed to extract patterns and relationships between variables7,8—classification and regression trees (CART) and random forests—to questionnaire responses, and the associated expert-based exposure estimates for occupational diesel exhaust exposure in the population-based New England Bladder Cancer case-control study.9 Uncovering these rules has several important benefits. First, it can provide a mechanism for replicating the decision rules for other subjects within or across studies. Second, it can reduce the burden on the expert if the existing rules can be applied to the questionnaire responses and then reviewed by an expert, as was shown in the development of an asthma-specific job-exposure matrix.10 Third, it makes the decision rules transparent, thus providing a way for other experts to evaluate and improve the rules.

Methods

Model overview

We focused on tree-based statistical learning approaches because decision trees are able to handle non-linearity, interactions and missing values, while producing interpretable decision rules.11–13 Tree-based approaches predict decisions based on a sequential splitting pattern that resembles an upside-down tree, with the ‘root’ at the top, below which are nodes that divide observations into branches. At the bottom are ‘leaves’ that provide the predicted assignment (figure 1). The nodes are selected iteratively, with the most predictive variable at each node used to split the observations into two branches according to that variable. Within each branch, the splitting continues until the model meets specified stopping criteria, such as a complexity parameter set to control the growth of the tree, or a minimum number of observations per leaf.

Figure 1

Illustrative decision tree for the classification of 100 jobs by diesel exhaust exposure. The terminal nodes at the bottom of the tree are leaves with labels for exposure classification (0=unexposed, 1=exposed), number of jobs in leaf, and percent agreement of tree-based classifications with exposure status assigned by an expert.

To illustrate, we show a fictional simple decision tree in figure 1 to classify 100 jobs into unexposed and exposed to diesel exhaust. The decision rules represented by the tree are revealed by starting at the root and evaluating the condition at each node to determine which branch to follow until a leaf (an exposure decision) is encountered (0=unexposed; 1=exposed, 1st label from top). Each leaf also reports the number of jobs assigned to the leaf (2nd label), and the proportion of jobs accurately classified within the leaf (3rd label). This tree has four leaves, a depth of two nodes, and uses three variables to classify jobs as exposed or unexposed. In this example, the most predictive variable is ‘smelled exhaust’. If the subject neither smelled exhaust nor worked on a construction site in a particular job, the model classified the job as unexposed (0), which agreed with the expert's assignments for 90% of the 10 jobs in that leaf. Conversely, if a subject smelled exhaust and was a truck driver, the model classified the job as exposed (1), with 90% agreement with the expert's assignments for the 30 jobs in that leaf.
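As a concrete illustration (a minimal sketch, not the authors' code), a small classification tree of this kind can be fitted with the rpart package in R; the simulated data and variable names (smelled_exhaust, construction_site, truck_driver) are hypothetical stand-ins for the questionnaire-derived variables in figure 1.

```r
## Minimal sketch of fitting a small classification tree like figure 1.
## All data below are simulated; variable names are hypothetical.
library(rpart)

set.seed(1)
smelled_exhaust   <- rbinom(100, 1, 0.5)
construction_site <- rbinom(100, 1, 0.3)
truck_driver      <- rbinom(100, 1, 0.2)
## Simulate an expert who tends to call a job exposed when exhaust was
## smelled or the job was on a construction site (with 10% noise).
noise  <- runif(100) < 0.10
signal <- pmax(smelled_exhaust, construction_site)
toy <- data.frame(
  exposed = factor(ifelse(noise, 1 - signal, signal)),
  smelled_exhaust, construction_site, truck_driver
)

fit <- rpart(exposed ~ ., data = toy, method = "class")
fit                                  # prints the splits and leaf counts
plot(fit); text(fit, use.n = TRUE)   # quick rendering of the tree
```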

Each decision tree can also be written as a series of conditions that provides a clear interpretation of the questionnaire response patterns that lead to an exposure assessment decision. For example, the conditions that created the second leaf from the left in figure 1 are:

Rule: (exposure classification = 1; probability job assigned exposure by expert = 80%)
  Smelled exhaust = no
  Construction site = yes

The probability in the first line of the decision rule above is the relative frequency of jobs in that leaf (those not smelling exhaust but working at a construction site) for which the expert's assignment agreed with the classification of exposed (exposure classification=1)—32/40 in this case.
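Continuing the toy sketch above, one way (an assumption about tooling, not necessarily the authors' workflow) to write each leaf out as an explicit rule of this form is to trace the root-to-leaf paths with rpart's path.rpart function:

```r
## Print the chain of conditions leading to each leaf of the toy tree.
leaf_nodes <- as.integer(rownames(fit$frame)[fit$frame$var == "<leaf>"])
path.rpart(fit, nodes = leaf_nodes)

## The leaf probability quoted in the rule (80% in the example) is the
## within-leaf class frequency, available via the fitted probabilities:
head(predict(fit, type = "prob"))
```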

Random forest models are based on CART, but can improve the predictive performance of CART by averaging the predictions across many simple trees.14 In CART, the entire training dataset and all entered variables are evaluated to derive one tree. By contrast, random forest models develop hundreds of trees, where each tree is trained on a random subset of the observations (jobs) and on a random subset of the input (questionnaire response) variables.15

Study population

To examine whether CART and random forest models could identify an expert's decision rules, we used 14 983 jobs reported by the subjects from the New England Bladder Cancer Study (n=1213 cases, 1418 controls).9 Each participant completed a lifetime occupational history (OH) questionnaire. The OH had open-ended questions asking for the job title, the name and location of the employer, the type of service or product provided, the years the job started and stopped, the work frequency, the activities and tasks, the tools and equipment used, and the chemicals and materials handled. In addition, ‘did you ever work near diesel engines or other types of engines’ and ‘did you ever smell diesel exhaust or other types of engine exhaust’ were asked for each job. Answers in the occupational histories could trigger one of 67 job-specific or industry-specific modules that asked more detailed diesel exhaust and non-diesel exhaust questions; a module was completed for 64% of the reported jobs.

Diesel exhaust exposure estimates

The jobs were reviewed one by one by an industrial hygienist to assign the probability, intensity and frequency of diesel exhaust exposure.6 Probability was assessed as the estimated proportion of workers likely exposed to diesel exhaust given the reported information, including task, job or industry and decade, with cut points of <5% (none/negligible, category 0), 5–49% (low, 1), 50–79% (medium, 2) and ≥80% (high, 3). Approximately 75% of the jobs were assessed as having none or a negligible probability of exposure. Intensity was assessed on a continuous scale as the estimated average level of respirable elemental carbon (REC, μg/m3) in the workers' breathing zone during tasks where diesel exhaust exposure occurred, and categorised with cut points of <0.25 (none/incidental, category 0), 0.25 to <5 (low, 1), 5 to <20 (medium, 2) and ≥20 (high, 3) μg/m3 REC. Frequency was assessed on a continuous scale as the estimated average number of hours per week exposed to diesel exhaust, and categorised with cut points of <0.25 (none/negligible, category 0), 0.25 to <8 (low, 1), 8 to <20 (medium, 2) and ≥20 (high, 3) hours per week.
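As a sketch of how continuous expert estimates map onto these ordinal categories (cut points taken from the text; the example values and object names are hypothetical), the categorisation can be reproduced with R's cut function:

```r
## Categorise hypothetical continuous expert estimates using the cut
## points described above; right = FALSE gives intervals closed on the
## left, e.g. [0.25, 5) for the low-intensity category.
rec_ugm3  <- c(0.1, 2.3, 7.5, 45)    # average REC, ug/m3 (hypothetical)
intensity <- cut(rec_ugm3, breaks = c(-Inf, 0.25, 5, 20, Inf),
                 labels = 0:3, right = FALSE)

hours_per_week <- c(0, 4, 12, 30)    # hours/week exposed (hypothetical)
frequency <- cut(hours_per_week, breaks = c(-Inf, 0.25, 8, 20, Inf),
                 labels = 0:3, right = FALSE)
```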

Identifying questions related to diesel exhaust

We reviewed the responses to the occupational histories and the job-specific and industry-specific modules to identify variables that could be potential determinants of an expert's exposure assignment. All categorical variables were recoded into dichotomous variables. This recoding does not change the information provided to the expert; it merely changes the form of the variables to a more convenient structure for modelling. From the free-text responses in the occupational histories, we coded diesel exhaust information into standardised variables,6 resulting in 51 dichotomous OH variables, such as ‘job had traffic exposure’, ‘job used diesel equipment’ and ‘job start year’. We included variables for 83 two-digit and 169 three-digit standardised industry codes,16 and 61 two-digit and 134 three-digit standardised occupation codes.17 From the module responses, we coded 67 dichotomous variables identifying the administered module, 1 variable indicating that no module was completed, 1 variable indicating that a module with diesel exhaust-related questions was completed, and 154 variables derived from questions directly or indirectly related to diesel exhaust exposure. Examples of module variables included ‘traffic-exposed job’, ‘equipment powered by diesel’ and ‘industry=heavy construction’. Overall, 498 variables were extracted from the occupational histories and 223 from the modules.
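A minimal sketch of the dichotomous recoding described above (the standardised-code values shown are hypothetical): each level of a categorical questionnaire response becomes its own 0/1 indicator, for example with R's model.matrix:

```r
## Recode a categorical variable (here, a two-digit standardised
## industry code) into one dichotomous indicator per level.
occ <- data.frame(sic2 = factor(c("07", "15", "42", "15")))
indicators <- model.matrix(~ sic2 - 1, data = occ)  # one 0/1 column per code
colnames(indicators)                                # "sic207" "sic215" "sic242"
```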

Model development

We used the rattle package18,19 in R,20 which interfaces with the rpart21 and randomForest22 R packages to develop CART and random forests, respectively. Both approaches were used to predict a binary probability metric (none/low=0 vs medium/high=1) to evaluate the models’ ability to separate the jobs into exposed and unexposed categories, so that, at a minimum, the model predictions could focus the expert review on the more likely exposed jobs. The models were also used to predict ordinal metrics (0–3) for probability, intensity and frequency of exposure, which were treated as discrete non-ordered categories in the model.
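For concreteness, a hedged sketch of constructing the binary probability metric from the ordinal assignment (the data frame and column names are assumptions, not the authors' code):

```r
## Collapse the ordinal probability (0-3) into the binary metric:
## none/low (0-1) = 0 vs medium/high (2-3) = 1.
jobs$prob_binary <- factor(as.integer(jobs$prob_ordinal >= 2))
```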

We randomly split the jobs into three datasets to get unbiased estimates for prediction errors for each model: (1) a training dataset comprising 70% of the data (n=10 488) to build the models; (2) a testing dataset comprising 15% of the data (n=2247) to choose the optimal model among candidate models within a given class of models based on the estimated prediction error and (3) a validation dataset comprising 15% of the data (n=2248) used to evaluate the final model predictions. The prediction errors in the testing dataset were used to determine the final set of explanatory variables to input into the model.
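A sketch of this split under the stated proportions (jobs is assumed to hold all 14 983 reported jobs); with floor rounding, the pieces reproduce the counts above:

```r
## Random 70/15/15 split into training, testing and validation sets.
set.seed(42)
n     <- nrow(jobs)                                    # 14,983 jobs
sizes <- c(train = floor(0.70 * n),                    # 10,488
           test  = floor(0.15 * n))                    #  2,247
sizes <- c(sizes, valid = n - sum(sizes))              #  2,248
role  <- sample(rep(names(sizes), times = sizes))      # shuffled labels

train <- jobs[role == "train", ]
test  <- jobs[role == "test",  ]
valid <- jobs[role == "valid", ]
```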

Building a CART model requires the user to define tuning parameters that control the tree size. We kept constant values for the minimum number of jobs to allow a split within a node (20), the minimum number of jobs within a leaf (7), and the maximum node depth (30), the default settings in the rpart package. For each metric, we examined complexity parameters ranging from 0.0001 (most complex model) to 0.1 (simplest model). We selected the model with the lowest relative cross-validated error using 10-fold cross-validation in the training dataset8,23,24 to prevent the tree from overfitting the training data at the expense of the fit of the testing and validation data.
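A sketch of this tuning with rpart (the formula and object names are assumptions): grow a deep tree at the smallest complexity parameter, then prune back to the value with the lowest 10-fold cross-validated error recorded in the model's cptable:

```r
library(rpart)

## Fixed tuning values from the text (the rpart defaults), a small
## starting cp, and 10-fold cross-validation within the training data.
## In practice the other outcome columns would be dropped from the
## predictors before using the prob_binary ~ . formula.
ctrl <- rpart.control(minsplit = 20, minbucket = 7, maxdepth = 30,
                      cp = 0.0001, xval = 10)
cart <- rpart(prob_binary ~ ., data = train, method = "class",
              control = ctrl)

printcp(cart)   # cross-validated error (xerror) for each candidate cp
best_cp <- cart$cptable[which.min(cart$cptable[, "xerror"]), "CP"]
pruned  <- prune(cart, cp = best_cp)   # final tree used for prediction
```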

Random forest models combine many trees, each built from a random sample of the training data. The training data left out when building a particular tree are referred to as the out-of-bag sample. An average prediction error for a random forest model can be calculated by averaging the prediction errors from the out-of-bag samples of the hundreds of trees. We used 300 trees because the average out-of-bag prediction error stabilised after 100 trees. We used the square root of the number of input variables as the number of variables to consider at each split when building the individual trees.24 We set the complexity parameter in the random forest model to the same value used for the best identified CART model.
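A corresponding sketch of the random forest fit (object names are assumptions): 300 trees, with the number of candidate variables per split set to the square root of the number of inputs:

```r
library(randomForest)

p  <- ncol(train) - 1                 # number of input variables (assumed)
rf <- randomForest(prob_binary ~ ., data = train,
                   ntree = 300, mtry = floor(sqrt(p)))
rf   # printed summary includes the out-of-bag (OOB) error estimate
```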

Model evaluation

We evaluated the predictions of the best identified CART and random forest models against the expert's assignments within the validation dataset, based on the overall agreement with the expert's assignments and the percent agreement for each exposure category.
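A sketch of these agreement measures (object names follow the earlier sketches): cross-tabulate model predictions against the expert's assignments in the validation set, then read off overall and per-category agreement:

```r
## Percent agreement with the expert in the validation dataset.
pred <- predict(pruned, newdata = valid, type = "class")
tab  <- table(expert = valid$prob_binary, model = pred)

overall_agreement <- sum(diag(tab)) / sum(tab)   # e.g. the 92-94% reported
per_category      <- diag(tab) / rowSums(tab)    # agreement by expert category
```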

We conducted two sensitivity analyses. First, we examined the agreement of the models’ predictions compared with the expert's assignments for models restricted to only OH variables (including the two supplementary diesel exposure questions), rather than all potential variables, to examine the reliability of the models’ predictions when fewer or non-specific occupational data are collected. Second, we examined the sensitivity of the prediction reliability of the CART model (with complexity parameter=0.01) to the size of the training dataset by systematically increasing the number of jobs used in the training set in increments of 5%, to determine if we could reliably predict exposure assignments if we had exposure decisions for only a subset of the data. For each training set size, the prediction error was calculated based on the remaining data. We resampled 100 training sets of each size to estimate the distribution of prediction errors.
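A sketch of the second sensitivity analysis (names are assumptions): at each training fraction, draw 100 random training sets, refit the cp=0.01 CART model and score the prediction error on the held-out jobs:

```r
library(rpart)

fractions <- seq(0.05, 0.95, by = 0.05)   # 5% increments, as in the text
err <- sapply(fractions, function(f) {
  replicate(100, {                        # 100 resampled training sets
    in_train <- sample(nrow(jobs), size = floor(f * nrow(jobs)))
    fit  <- rpart(prob_binary ~ ., data = jobs[in_train, ],
                  method = "class",
                  control = rpart.control(cp = 0.01))
    pred <- predict(fit, newdata = jobs[-in_train, ], type = "class")
    mean(pred != jobs$prob_binary[-in_train])   # prediction error
  })
})

## One boxplot of 100 errors per training fraction (cf. figure 3).
boxplot(err, names = fractions,
        xlab = "Training fraction", ylab = "Prediction error")
```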

Results

Decision rules

We first present a simple CART model that classified jobs into binary probability categories, with the complexity parameter user-constrained to a high value (0.01) to limit the growth of the tree (figure 2). The tree had a prediction agreement of 93.2% in the validation set. The most predictive variable was ‘worked near or smelled exhaust’, which was constructed from the two diesel exhaust-related questions in the occupational histories.

Figure 2

Classification and regression trees decision tree classifying jobs into unexposed (0) and exposed (1) categories. The labels in each leaf, in order, are the predicted exposure category, the number of jobs in the leaf, and the percent of predictions in the leaf that agree with the expert estimate. Variables from the occupational history (OH) are designated OH; variables from the modules are designated M.

We observed somewhat better agreement when we allowed the tree to grow larger. Variables identified in decision rules for these more complex CART models are listed in supplemental material, table 1. Decision rules from the CART models are available by contacting the corresponding author.

Model performance

Binary probability

For binary exposure, both the CART and random forest models exhibited high overall agreement with the expert's ratings (92–94%) in the validation dataset (table 1). Agreement was higher for jobs assessed as negligible/low exposed by the expert (93–95%) than for jobs assessed as medium/high exposed (79–92%). The models restricted to the OH variables had about 10% lower agreement in the medium/high category than the full models, but the restricted and full models had similar agreement in the negligible/low category. The random forest and CART models had the same agreement in the negligible/low category, but agreement in the medium/high category was 1–4% higher for the random forest models than for the CART models.

Table 1

Proportion of exposure predictions from CART and random forest models that agreed with the expert exposure estimate in the validation dataset (n=2248)

Ordinal probability

For ordinal exposure probability, the CART and random forest models’ predictions agreed with 85–89% of the expert's assignments (table 1). The agreement was highest (97–98%) for jobs assessed as unexposed by the expert. For jobs with a high rating, the agreement dropped from 85% when all variables were used to 68–72% when restricted to OH variables. The agreement was 7–43% for jobs assessed by the expert as having a low or medium rating. CART models had higher agreement with the expert's assignments than random forest models for jobs assessed as having low (32% vs 23%) and medium (21% vs 14%) ratings when all variables were used. Poorer agreement was observed in the categories with lower prevalence.

Intensity

For exposure intensity, the CART and random forest models’ predictions agreed with 87–90% of the expert's assignments (table 2). Both the CART and random forest models predicted jobs with no exposure well (agreement 96–98%), and had moderate to moderately high agreement with the expert's ratings for jobs with low (64–71%), medium (41–57%) and high (60–65%) intensity.

Table 2

Cross-tabulation of the CART model-predicted assignments versus expert assignments and proportion of predicted estimates that agreed with expert assignments in the validation dataset (n=2248)

Frequency

For exposure frequency, the CART and random forest models’ predictions agreed with 83–87% of the expert's assignments (table 2). The predictions for jobs assessed as having no exposure frequency agreed well (97–98%) with the expert's assignments. Agreement was poor to moderate for jobs rated as low (26–52%) or medium (12–39%) frequency, and moderate for jobs rated as high frequency (57–65%). Agreement was consistently higher for the models fit using all variables than for those using only the OH variables, but no consistent pattern was observed for the random forest models compared with the CART models.

Pattern of disagreements

The cross-tabulations of the predicted estimates for probability, intensity and frequency from the CART model, compared with the expert's estimates for the validation dataset, are shown in table 2. Similar patterns were observed for these three metrics. When a disagreement occurred, the CART model tended to predict a lower exposure rating than the expert for the two middle categories. It was rare for the CART model to predict a medium or high exposure rating when the expert assigned an unexposed rating (eg, probability metric: 23 jobs), or for the CART model to predict a low or unexposed rating when the expert assigned a high rating (eg, probability metric: 40 jobs). Similar patterns were observed for the random forest models (not shown).

Training set size

The CART model's prediction error generally decreased as the number of jobs used in the training dataset increased, with a plateau occurring when at least 3750 randomly chosen jobs (25% of the nearly 15 000 jobs) were used (figure 3). The largest median validation error occurred when using 5% of the data for training. The variance in prediction error was generally largest at the extreme training sample sizes, where there were few jobs available either to train the model (5%) or to evaluate its performance (95%).

Figure 3

Classification and regression trees prediction errors in the validation dataset, as the size of the training set varies for four exposure metrics: binary exposure probability, ordinal probability, intensity and frequency of exposure. Each boxplot is based on 100 randomly selected training sets to estimate the model using all variables (complexity parameter=0.01), with the prediction error estimated on the validation set.

Discussion

We applied statistical learning methods to explain and predict an expert's exposure estimates derived from subjects’ responses to an occupational questionnaire in a case-control study of occupational diesel exhaust exposure and bladder cancer. We found that the models had excellent ability to reproduce the expert's assignments for the binary probability metric and for the unexposed category of the three ordinal metrics. For the exposed categories of the ordinal metrics, the models had poor to moderately high ability to reproduce the expert's assignments. However, the models identified the groups of questionnaire response patterns where agreement was poor. Thus, we recommend a two-stage process for applying the resulting decision rules to unassessed jobs: initial assignment using the decision rules, followed by expert review of the jobs identified by the model as more difficult to classify correctly.

Our CART and random forest models had predictive ability similar to that of the artificial neural network (ANN) models used by Black et al25 to predict dichotomous benzene exposure status. ANN models are also a statistical learning approach, but ANNs use internal weights that cannot be easily reviewed for plausibility by outside experts.10,11 We instead used tree-based methods, such as CART, because tree-based models provide both a visual representation and an easily understood set of the rules underlying the expert's exposure decisions. The extracted rules do not necessarily represent the decision process used by the expert; rather, they capture the questionnaire response patterns that best predict the expert's exposure decisions. Black and colleagues25 suggest that 60% of an assessor's time can be saved by applying ANN models to identify unexposed jobs. We likewise anticipate a substantial reduction in the exposure assessment burden from using CART models to assign exposure in subsequent studies. Exposure assessors can focus their efforts on evaluating jobs that the tree-based methods found more difficult to classify, such as when the probability assessment for a leaf straddles the assignment cut point (eg, 20–80%).

While random forests generally outperform CART models in prediction,8 the CART and random forest approaches used here performed similarly in predicting the expert's assignments. Any slight reduction in performance of the CART models compared with the random forest models is a trade-off for the CART models’ greatly improved interpretability, that is, having only one decision tree rather than hundreds. The CART models’ predictive abilities across the exposed categories might be improved further if the ordinal nature of the exposure metrics is considered instead of the categorical treatment used in the functions called by the R package rattle.

Overall, 66 of the 498 OH variables and 40 of the 223 module variables were predictive in a CART model for at least one exposure metric (see supplemental material, table 1). Coding the OH questions was a time-consuming but essential step in developing the input variables for the models, and required an occupational health professional. Without it, our potential explanatory determinants from the occupational histories would have been restricted to standardised occupation codes, standardised industry codes, job start and stop years, and the two supplementary questions, whereas the extracted decision rules revealed that the coded OH variables were important determinants. Limiting the models to the OH variables had little effect on the ability to reproduce the expert's classification of jobs as unexposed, but generally decreased the ability to reproduce the expert's classification of jobs into exposed categories. The adequate predictive ability of the models based only on the occupational histories likely reflects the inclusion of the two engine/exhaust-related questions in the occupational histories, because the variable constructed from these two study-specific questions was generally the most predictive variable in all models. These two questions represent, in part, the subjects’ self-assessment of exposure. However, the questions related to all types of engines and exhausts, not solely diesel exhaust, and thus the self-assessment was not a perfect predictor of exposure status. The classification trees revealed that the expert's review considered whether the responses contained additional supporting information for diesel exhaust sources.

The increased ability of the models to replicate the expert's assignments when using all variables (the OH and the modules) supports the use of modules in population-based studies to capture important within-job differences; however, modules can impose a substantial time burden on the study participant and add substantial study costs for the interview and exposure assessment. The extracted decision rules identified the most important diesel-related questions, which can be used to simplify subsequent questionnaires for similar populations. This may reduce participant burden without losing much of the ability to reliably assign exposures using expert judgment.

Our sensitivity analyses revealed only small differences in performance between models. For example, overall agreement improved by only 1% for the binary probability metric when the complexity parameter moved from 0.01 to 0.0006, although the number of rules needed to explain the model increased from 11 to 55, indicating that even a simple model predicted binary exposure status well. Similarly, when we varied the number of jobs used in the training dataset, the prediction error plateaued for all metrics when at least 25% of the jobs (3750 jobs) were used. This suggests that an expert may be able to assess a random subset of the jobs, after which CART models can be developed to provide reliable exposure predictions for the remaining jobs. The required size may, however, vary with the number of jobs, the prevalence of exposure and the predictive ability of the model. Additional sensitivity analyses could be used to evaluate appropriate minima for the number of observations per node and leaf. Some important determinants may not be captured by CART models when few subjects answered a particular question or when the minima are set too high. Thus, the determinants extracted here reflect common, not rare, exposure scenarios. Our focus on specificity, rather than sensitivity, in capturing determinants is appropriate, because high specificity generally minimises the expected attenuation in exposure-response associations when exposure prevalence is low.26

The predictions of CART and random forest models are likely only as valid as the expert's exposure assessment,27 although the models could reduce some error from inconsistently applied rules. Our measures of agreement therefore do not provide reassurance that exposures were classified correctly; they provide only insights into the reliability of the estimates if a CART model were used to assign exposures instead of a one-by-one expert assessment. However, because no gold standard exists, extracting these rules is an important first step in opening the black box to provide transparent decision rules, so that other exposure assessors can review and revise the rules and thereby improve the quality of the assessments. Review of the models can be used to recognise discrepancies within the expert's estimates, and to determine whether the decision rules could be improved by identifying additional explanatory variables, whether a condition is so rare that it cannot be captured in the model, or whether the expert's estimates should be improved.28 After these improvements are made from the internal and external reviews, new models can be developed to improve upon previous models.

Extrapolation outside the scope of the study should be done carefully, and may require important modifications for secular and geographic trends in exposure. Diesel exhaust may also represent a best-case scenario: exposure to it is relatively common compared with other agents often evaluated in population-based case-control studies, and it may be easier for subjects to identify and recall because of the general population's familiarity with diesel exhaust. In addition, the job-specific and industry-specific modules used in this study were specifically developed to collect information on diesel exhaust exposure. Our study provides a first step in demonstrating that CART models can extract underlying patterns between questionnaire responses and an expert's ratings for an agent that was the focus of the questionnaires and had a reasonable exposure prevalence. Future evaluations are needed to examine the utility of these models for extracting decision rules for agents that have lower exposure prevalences and were not the primary focus of the questionnaires. Future evaluations are also needed to determine whether similar decision rules would be extracted from exposure estimates provided by multiple independent experts or by panels of experts.

Statistical learning approaches, such as CART and random forest models, offer great promise for explaining and predicting expert-based exposure estimates. Our approach was specific to extracting decision rules from previously made exposure assessments, and the extracted rules can only be applied in settings similar to those from which they were derived. For evaluating new exposures or new settings, we encourage exposure assessors to develop deterministic rules based on the questionnaire responses and to program transparent assessments.1,6 The statistical learning approaches used here are straightforward to apply using the free graphical user interface (GUI) package rattle within R, making these approaches accessible. The resulting models had excellent specificity, allowing the expert's assessment of unexposed jobs to be reproduced with a high level of fidelity. The sensitivity was generally moderate but could, nonetheless, reduce review time, especially if the models’ estimates of the probability of belonging to an exposure category were used to triage jobs for further expert review. We encourage other researchers to apply these types of models to expert-based exposure assessments to describe the underlying decision rules. Doing so will provide important insights into the rationales for exposure decisions, identify where exposure decisions may be inconsistent, and identify the most important information used by the expert to make an exposure decision. Building this body of knowledge will allow us to refine questionnaires to reduce subject burden and to more rapidly provide exposure estimates in subsequent studies, thereby testing the reproducibility of findings across populations.

References

Supplementary materials

  • Supplementary Data


Footnotes

  • Contributors DCW and MCF designed the statistical learning analysis approach to assess the participants’ exposure to diesel exhaust. DCW conducted all statistical analyses. IB, RV, KY and SMS assisted in the statistical design. DCW, MCF, IB, RV, KY, SMS, AP, PAS and DTS provided interpretation of the methods and their application. PAS, JSC, DB, MRK, MS, AJ, SC and DTS initiated and designed the bladder cancer case-control study, including the development of tools to collect occupational information, and supervised all aspects of data collection and uses of the study data. DCW and MCF drafted and revised the paper based on feedback provided from all authors.

  • Funding The research was funded by the Intramural Research Program of the National Institutes of Health, National Cancer Institute, Division of Cancer Epidemiology and Genetics.

  • Competing interests None.

  • Ethics approval National Cancer Institute.

  • Provenance and peer review Not commissioned; externally peer reviewed.