Article Text
Abstract
Introduction Ongoing studies into the use of algorithms for the automated coding of job titles to the Canadian National Occupational Classification (NOC) have shown coding accuracy at least equivalent to that of manual coding. Moreover, automated coding provides significant time savings. These studies have identified that both natural language processing (NLP) and machine learning algorithms are effective for auto coding. Whereas NLP-based and machine learning approaches both rely on bespoke rules and existing data sets, machine learning models can propagate bias from their training data if it is not corrected.
Objectives The goal of the study is to explore the impact of altering sex/gender ratios in training data sets on the overall performance of machine-learning-based prediction of NOC codes from patient-provided job titles.
Methods Using participant data provided by Atlantic PATH, training data sets were prepared for 100 4-digit NOC categories. The data sets were prepared with sex/gender ratios of 50/50, 30/70, and 70/30. Each data set was used to train the ENENOC machine learning platform, which was then tested on a set of manually coded job titles provided by Atlantic PATH/CanPATH. Performance levels were contrasted for all 4-digit NOC categories used in the study.
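The preparation of ratio-controlled training sets described above can be illustrated with a minimal sketch. It is not the procedure used in the study: the record fields (`job_title`, `noc_code`, `sex`), the binary `"F"`/`"M"` coding, the function name `resample_by_ratio`, and the choice to skip categories with too few records are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def resample_by_ratio(records, female_share, per_category, seed=0):
    """Draw a training set with a fixed sex/gender ratio per NOC category.

    records       -- list of dicts with hypothetical keys
                     'job_title', 'noc_code' and 'sex' ('F' or 'M')
    female_share  -- target share of 'F' records, e.g. 0.5, 0.3 or 0.7
    per_category  -- number of records to draw for each NOC code
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible

    # Group records by (NOC code, sex) so each group can be sampled separately.
    by_group = defaultdict(list)
    for rec in records:
        by_group[(rec["noc_code"], rec["sex"])].append(rec)

    n_female = round(per_category * female_share)
    n_male = per_category - n_female

    sample = []
    for code in sorted({rec["noc_code"] for rec in records}):
        females = by_group[(code, "F")]
        males = by_group[(code, "M")]
        # Assumption: a category that cannot meet the target ratio is skipped
        # rather than filled by sampling with replacement.
        if len(females) < n_female or len(males) < n_male:
            continue
        sample.extend(rng.sample(females, n_female))
        sample.extend(rng.sample(males, n_male))
    return sample
```

Calling the function with `female_share` set to 0.5, 0.3, and 0.7 on the same pool of records would give three training sets that differ only in their sex/gender ratio, so any difference in coding accuracy can be attributed to that ratio rather than to set size.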
Results Initial results from this preliminary study indicate that sex and gender are variables that can influence auto coding performance; however, their impact on overall coding accuracy is relatively minor. Further studies with larger training sets are required to fully explore the extent to which sex and gender contribute to bias in ENENOC.
Conclusion We initiated studies to investigate the impact of sex and gender bias on the performance of the ENENOC algorithm. Together, ENENOC and the contributed training and test sets provide a suitable framework for ongoing work in this area.