O-15 Occupational Health: A Multi-Cohort Job Title Cleaning Project by Algorithm
  1. Ellen Sweeney1,
  2. Christopher Baker,
  3. Mohammad Sadnan Al Manir,
  4. Deborah Addey,
  5. Yunsong Cui,
  6. Jason Hicks,
  7. Cheryl Peters,
  8. Grace Shen Tu,
  9. Jennifer Vena,
  10. Anil Adisesh
  1. 1Dalhousie University, Canada

Abstract

Introduction Occupational data in prospective cohort studies are often underutilized because of the human and financial resources required to code open-ended text, such as job titles. Recognizing the value of occupational data in health research, as well as the potential for error in manual coding, the Automated Coding Algorithm (ACA)-NOC was developed using a Natural Language Processing approach.

Objectives We tested the ACA-NOC algorithm on two regional cohorts of a pan-Canadian cohort study, which represents the largest dataset an algorithm of this kind has been applied to. This process will harmonize and greatly expand the utility of the occupational data, enrich the research platforms, and further refine the efficiency of the algorithm.

Methods The ACA-NOC algorithm was tested on data from the Canadian Partnership for Tomorrow’s Health (CanPath), a longitudinal cohort examining the role of genetic, environmental, lifestyle, and behavioural factors in the development of cancer and chronic disease. Using an iterative and interactive approach, the algorithm was applied to job title data from 111,000 questionnaires from two regional cohorts, coding the data to the Canadian National Occupational Classification (NOC) system. The algorithm was refined after each round of analysis, increasing the quantity of accurately coded data.
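The abstract does not describe the internals of the ACA-NOC algorithm, so the following is only a minimal sketch of the general idea of coding free-text job titles against a classification lexicon. It assumes a simple pipeline (text normalization, exact lookup, then a fuzzy fallback using Python's standard `difflib`), which is not necessarily the authors' method. The lexicon, the placeholder codes (`X001` and so on), and the function names `normalize` and `code_title` are all hypothetical.

```python
import re
from difflib import get_close_matches

# Hypothetical lexicon: cleaned job title -> placeholder code.
# These are NOT real NOC codes; they only illustrate the lookup.
LEXICON = {
    "registered nurse": "X001",
    "carpenter": "X002",
    "elementary school teacher": "X003",
}


def normalize(title: str) -> str:
    """Lowercase, replace non-letters with spaces, collapse whitespace."""
    title = re.sub(r"[^a-z\s]", " ", title.lower())
    return re.sub(r"\s+", " ", title).strip()


def code_title(title: str, cutoff: float = 0.8):
    """Return (code, method): exact lookup first, then a fuzzy fallback.

    Titles that match nothing are returned as (None, "uncoded") so they
    can be reviewed manually and used to extend the lexicon in the next
    round, mirroring the iterative refinement described above.
    """
    cleaned = normalize(title)
    if cleaned in LEXICON:
        return LEXICON[cleaned], "exact"
    close = get_close_matches(cleaned, list(LEXICON), n=1, cutoff=cutoff)
    if close:
        return LEXICON[close[0]], "fuzzy"
    return None, "uncoded"


if __name__ == "__main__":
    for raw in ["Registered Nurse.", "carpentr", "astronaut"]:
        print(raw, "->", code_title(raw))
```

In a sketch like this, the uncoded and fuzzy-matched titles are the natural input to each refinement round: reviewing them and adding entries to the lexicon raises the share of titles coded exactly.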

Results The ACA-NOC algorithm could be refined, with a 10% overall improvement in exact matching over the baseline algorithm. In some instances the algorithm outperformed manual coding. Using the algorithm offers significant savings in time, human resources and cost compared with a solely manual coding approach.
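To make the "exact matching" metric concrete, the sketch below computes the share of records where an algorithm-assigned code equals the manually assigned code, for a baseline and a refined run. The function name `exact_match_rate` and the toy codes are hypothetical and are not study data; the abstract also does not say whether the 10% improvement is absolute or relative, so the example makes no claim about that.

```python
def exact_match_rate(algorithm_codes, manual_codes):
    """Fraction of records where the algorithm code equals the manual code."""
    if len(algorithm_codes) != len(manual_codes):
        raise ValueError("code lists must have the same length")
    if not manual_codes:
        return 0.0
    matches = sum(a == m for a, m in zip(algorithm_codes, manual_codes))
    return matches / len(manual_codes)


if __name__ == "__main__":
    # Toy placeholder codes for four records (illustrative only).
    manual = ["X001", "X002", "X003", "X004"]
    baseline = ["X001", "X009", "X003", None]
    refined = ["X001", "X002", "X003", None]

    print("baseline:", exact_match_rate(baseline, manual))
    print("refined: ", exact_match_rate(refined, manual))
```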

Conclusions The coding and harmonization of these multi-cohort data demonstrate the value of the ACA-NOC algorithm, while increasing the utility of the CanPath data and of research related to occupational health. Future research may involve comparisons between CanPath and international cohorts.
