Abstract
Objective Free-text job descriptions from lifetime occupational history questionnaires are the starting point for nearly all occupational exposure assessment activities in epidemiologic studies. These descriptions are coded into standardized occupational classification (SOC) systems. We describe updates to SOCcer, an algorithm that incorporates natural language processing to automatically code job descriptions to SOC-2010.
Methods We recently released SOCcer 2.0. It improved on the original algorithm by 1) expanding the training data set to include job descriptions from population-based epidemiologic studies and 2) revising the scoring algorithm to account for nonlinearity in the classifiers. However, perfect prediction is not possible, both because there is no gold-standard approach on which to train the algorithm and because the job descriptions provided by participants are brief and may fit multiple codes. We have adapted SOCcer for use during data collection, allowing study participants to serve as their own coders when completing a web-based occupational questionnaire. SOCcer reads the participant's open-ended job title and task responses in real time and proposes a short list of best-fitting SOC-2010 codes for each job. The study participant reviews the list and selects the code that best fits their job.
Results In a validation set of 11,943 jobs, SOCcer's highest-scoring code had 50% and 63% agreement with a consensus expert-assigned code at the 6- and 3-digit level, respectively. Agreement increased linearly with algorithm score. The expert's code was among SOCcer's top 3 scoring codes for >70% of the jobs, which supports providing a short list of codes for study participants to review. Pilot testing is underway.
Conclusion Automated coding, especially in real time, has the potential to substantially reduce the effort needed to code jobs in large epidemiologic studies and to improve the accuracy of the codes.
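
The agreement measures reported in the Results (top-scoring code versus expert code at the 6- and 3-digit level, and expert code within the top 3) can be illustrated with a minimal sketch. This is not SOCcer's implementation: the function names `soc_prefix`, `top_k`, and `agreement` are hypothetical, and the scores and expert codes in `jobs` are invented solely to show the calculation. It assumes each job comes with a mapping from candidate SOC-2010 codes to algorithm scores and a single expert-assigned code.

```python
from typing import Dict, List, Tuple


def soc_prefix(code: str, digits: int) -> str:
    """Return the first `digits` digits of a SOC code written like '11-1011'."""
    return code.replace("-", "")[:digits]


def top_k(scores: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k highest-scoring candidate codes, best first."""
    return sorted(scores, key=lambda code: scores[code], reverse=True)[:k]


def agreement(jobs: List[Tuple[Dict[str, float], str]], digits: int = 6, k: int = 1) -> float:
    """Fraction of jobs whose expert code matches any of the top-k codes
    when both are truncated to `digits` digits."""
    hits = 0
    for scores, expert_code in jobs:
        shortlist = top_k(scores, k)
        target = soc_prefix(expert_code, digits)
        if any(soc_prefix(code, digits) == target for code in shortlist):
            hits += 1
    return hits / len(jobs)


# Invented example data (hypothetical scores and expert codes, for illustration only).
jobs = [
    ({"29-1141": 0.62, "29-1171": 0.21, "31-1014": 0.10}, "29-1141"),
    ({"47-2061": 0.40, "47-2031": 0.35, "47-2111": 0.05}, "47-2031"),
    ({"31-1014": 0.50, "29-1141": 0.30, "29-1171": 0.10}, "43-4051"),
]

six_digit_top1 = agreement(jobs, digits=6, k=1)
three_digit_top1 = agreement(jobs, digits=3, k=1)
six_digit_top3 = agreement(jobs, digits=6, k=3)
```

In this toy data the second job shows why a short list can help: the top-scoring code disagrees with the expert at the 6-digit level but agrees at the 3-digit level, and the expert's code appears among the top 3 candidates.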