Original article
Inside the black box: starting to uncover the underlying decision rules used in a one-by-one expert assessment of occupational exposure in case-control studies
  1. David C Wheeler1,2,
  2. Igor Burstyn3,
  3. Roel Vermeulen4,
  4. Kai Yu5,
  5. Susan M Shortreed6,
  6. Anjoeka Pronk7,
  7. Patricia A Stewart8,
  8. Joanne S Colt1,
  9. Dalsu Baris1,
  10. Margaret R Karagas9,
  11. Molly Schwenn10,
  12. Alison Johnson11,
  13. Debra T Silverman1,
  14. Melissa C Friesen1
  1. Occupational and Environmental Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA
  2. Department of Biostatistics, Virginia Commonwealth University, Richmond, Virginia, USA
  3. Department of Environmental and Occupational Health, Drexel University, Philadelphia, Pennsylvania, USA
  4. Environmental and Occupational Health Division, Institute for Risk Assessment Sciences, Utrecht University, Utrecht, The Netherlands
  5. Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA
  6. Biostatistics Unit, Group Health Research Institute, Seattle, Washington, USA
  7. TNO, Zeist, Netherlands
  8. Stewart Exposure Assessments, LLC, Arlington, Virginia, USA
  9. Department of Community and Family Medicine, Dartmouth Medical School, Hanover, New Hampshire, USA
  10. Maine Cancer Registry, Augusta, Maine, USA
  11. Vermont Cancer Registry, Burlington, Vermont, USA
Correspondence to Dr Melissa C Friesen, Occupational and Environmental Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 6120 Executive Blvd, Room 8106 MSC 7240, Bethesda, MD 20892-7240, USA; friesenmc{at}


Objectives Evaluating occupational exposures in population-based case-control studies often requires exposure assessors to review each study participant's reported occupational information job-by-job to derive exposure estimates. Although such assessments likely have underlying decision rules, they usually lack transparency, are time consuming and have uncertain reliability and validity. We aimed to identify the underlying rules to enable documentation, review and future use of these expert-based exposure decisions.

Methods Classification and regression trees (CART, predictions from a single tree) and random forests (predictions from many trees) were used to identify the underlying rules from the questionnaire responses and an expert's assignments of occupational diesel exhaust exposure for several metrics: binary exposure probability, and ordinal exposure probability, intensity and frequency. Data were split into training (n=10 488 jobs), testing (n=2247) and validation (n=2248) datasets.
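The two ideas in the Methods, a three-way data split and a forest that pools the predictions of many single trees, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the function names, the questionnaire keys and the three toy rules are hypothetical, and the total of 14 983 jobs is simply the sum of the three reported dataset sizes. The sketch shows only the vote-pooling step of a forest, not how the trees are grown from data.

```python
import random
from collections import Counter

def split_jobs(job_ids, n_train, n_test, seed=0):
    """Shuffle job IDs and cut them into training, testing and validation lists."""
    ids = list(job_ids)
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    validation = ids[n_train + n_test:]  # whatever remains after the first two cuts
    return train, test, validation

def forest_predict(trees, responses):
    """Majority vote over the predictions of several single-tree rule functions."""
    votes = Counter(tree(responses) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical single-tree rules; the questionnaire keys are illustrative only.
def tree_a(r):
    return "exposed" if r.get("drove_diesel_vehicle") else "unexposed"

def tree_b(r):
    return "exposed" if r.get("worked_near_engines") else "unexposed"

def tree_c(r):
    either = r.get("drove_diesel_vehicle") or r.get("worked_near_engines")
    return "exposed" if either else "unexposed"

# 14983 is the sum of the three dataset sizes reported above.
train, test, validation = split_jobs(range(14983), n_train=10488, n_test=2247)
prediction = forest_predict([tree_a, tree_b, tree_c], {"drove_diesel_vehicle": True})
```

In this toy case two of the three rules vote "exposed", so the pooled prediction is "exposed"; a single tree would instead return the answer of one rule path only.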

Results The CART and random forest models’ predictions agreed with 92–94% of the expert's binary probability assignments. For ordinal probability, intensity and frequency metrics, the two models extracted decision rules more successfully for unexposed and highly exposed jobs (86–90% and 57–85%, respectively) than for low or medium exposed jobs (7–71%).
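Agreement figures of this kind can be computed as the share of jobs for which the model's prediction matches the expert's assignment, overall and within each expert-assigned category. The Python sketch below assumes that simple percent agreement is the measure; the function name and the ten ratings are made up for illustration and are not data from the study.

```python
from collections import defaultdict

def agreement_by_category(expert, predicted):
    """Return overall % agreement and % agreement within each expert category."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for e, p in zip(expert, predicted):
        totals[e] += 1
        hits[e] += (e == p)  # adds 1 when the model matches the expert, else 0
    overall = 100.0 * sum(hits.values()) / len(expert)
    per_category = {c: 100.0 * hits[c] / totals[c] for c in totals}
    return overall, per_category

# Made-up ratings for illustration only; not data from the study.
expert = ["none", "none", "none", "none", "low",
          "low", "medium", "high", "high", "high"]
predicted = ["none", "none", "none", "none", "none",
             "low", "low", "high", "high", "medium"]

overall, per_category = agreement_by_category(expert, predicted)
```

Reporting agreement per category, rather than only overall, is what reveals a pattern like the one described above, where the extreme categories are recovered more reliably than the middle ones.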

Conclusions CART and random forest models extracted decision rules and accurately predicted an expert's exposure decisions for the majority of jobs, and identified questionnaire response patterns that would require further expert review if the rules were applied to other jobs in the same or a different study. This approach makes the exposure assessment process in case-control studies more transparent, and creates a mechanism to efficiently replicate exposure decisions in future studies.
