Abstract
Objectives We applied machine learning approaches to help multiple experts estimate occupational lead exposure efficiently and transparently in a case-control study of renal cell carcinoma.
Method We used hierarchical cluster models to classify the 7154 study jobs with occupational history and job/industry questionnaires into 360 groups with similar responses. Each group was reviewed independently by two or three experts, who assigned a probability of lead exposure (<5%, ≥5% to <50%, or ≥50%) for each of three time periods (<1980, 1980–1994, ≥1995). When a group’s mean response pattern suggested within-group exposure variability, experts either identified programmable conditions that defined the rating differences, where possible, or flagged the group for further review. After jobs that overlapped time periods were split at the calendar cut points, each of the resulting 9992 job/time periods was assigned the relevant expert/group/time period estimate. Classification and regression tree (CART) models were then developed to predict each expert’s expected assignment from that expert’s previous decisions; these predictions supplied estimates for jobs in groups that the expert had not assessed and for jobs requiring further review.
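The splitting step can be illustrated with a minimal sketch. It assumes that jobs are recorded with whole start and end years (inclusive) and that the cut points fall at the start of 1980 and 1995. The function name `split_job` and the `PERIODS` table are hypothetical and are not taken from the study's software.

```python
from typing import List, Optional, Tuple

# Period labels with inclusive year bounds (None = open-ended).
# The three periods come from the text; whole-year resolution is an assumption.
PERIODS: List[Tuple[str, Optional[int], Optional[int]]] = [
    ("<1980", None, 1979),
    ("1980-1994", 1980, 1994),
    (">=1995", 1995, None),
]


def split_job(start_year: int, end_year: int) -> List[Tuple[str, int, int]]:
    """Split one job into job/time periods at the calendar cut points.

    Returns one (period_label, first_year, last_year) tuple for every
    period that the job overlaps, in chronological order.
    """
    if end_year < start_year:
        raise ValueError("end_year must not precede start_year")
    segments = []
    for label, lo, hi in PERIODS:
        # Clip the job's span to the bounds of this period.
        seg_start = start_year if lo is None else max(start_year, lo)
        seg_end = end_year if hi is None else min(end_year, hi)
        if seg_start <= seg_end:
            segments.append((label, seg_start, seg_end))
    return segments


# A job held from 1975 through 1998 overlaps all three periods.
segments_long = split_job(1975, 1998)
# A job held from 1985 through 1990 stays in a single period.
segments_short = split_job(1985, 1990)
```

Under these assumptions, a job that spans a cut point contributes one record per period, which is why the number of job/time periods (9992) exceeds the number of jobs (7154).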
Results In preliminary analyses, CART models correctly predicted 91–96% of the experts’ pre-1995 estimates and 77–96% of their ≥1995 estimates. CART estimates were assigned to 3–48% of the job/time periods, depending on the expert. Overall, 92% of the job/time periods were assigned the same estimate by at least two experts.
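An agreement figure of this kind, the share of job/time periods given the same estimate by at least two experts, could be computed along the following lines. This is a sketch only: the function name `share_in_agreement`, the category labels, and the four toy records are invented for illustration and are not study data.

```python
from collections import Counter
from typing import Dict, List


def share_in_agreement(estimates: Dict[str, List[str]], min_experts: int = 2) -> float:
    """Fraction of job/time periods where at least `min_experts` experts
    assigned the same exposure probability category."""
    if not estimates:
        raise ValueError("no job/time periods supplied")
    agreeing = 0
    for ratings in estimates.values():
        # Count of the most frequently assigned category for this record.
        top_count = Counter(ratings).most_common(1)[0][1] if ratings else 0
        if top_count >= min_experts:
            agreeing += 1
    return agreeing / len(estimates)


# Toy input (hypothetical): job/time period id -> one category per expert.
toy = {
    "job1/<1980": ["<5%", "<5%", "5-<50%"],
    "job1/1980-1994": ["<5%", "<5%", "<5%"],
    "job2/>=1995": [">=50%", "5-<50%", "<5%"],
    "job3/<1980": [">=50%", ">=50%"],
}
agreement = share_in_agreement(toy)
```

In the toy input, three of the four records have at least two matching ratings, so `agreement` is 0.75.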
Conclusions Our framework reduced the number of exposure decisions needed from each expert compared with job-by-job assessment. Future work will use CART models to identify differences between experts that need to be resolved and will incorporate estimates of the frequency and intensity of lead exposure.