Article Text
Abstract
Introduction Occupational encoding is a technique that allows job titles provided by study participants to be categorized according to their role in the labor force. Encoding has primarily been a slow, error-prone manual process that is ripe for automation.
Objectives Our goal was to design and test an automated coding prototype using machine learning techniques.
Methods The prototype classification system, ENENOC (the ENsemble Encoder for the National Occupational Classification), comprises a series of steps: data cleaning, exact-match search, multi-classifier ensembling, hierarchical classification, and multiple-output selection. When a job title does not exactly match any NOC category description, it is embedded using TF-IDF and Doc2Vec. The embeddings are fed into a hierarchical ensemble classifier built from classical machine learning techniques: Random Forests, Support Vector Machines, and K-Nearest Neighbours. The ensemble combines their predictions by majority vote. Classification proceeds in two tiers: the first tier predicts the first digit of the NOC code, and the second tier predicts the second, third, and fourth digits. The combined approach produces a single 4-digit code as the top choice, plus four alternate NOC codes that serve as additional ranked choices based on the Doc2Vec model.
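The control flow described in the Methods (exact-match lookup, then a two-tier majority vote) can be sketched roughly as follows. This is a minimal illustration and not the ENENOC implementation: the names `clean`, `majority_vote`, and `encode` are hypothetical, the classifiers are stand-in callables rather than trained Random Forest, SVM, or K-Nearest Neighbour models, the tie-breaking rule is an assumption, and keying the second tier on the predicted first digit is one possible reading of the hierarchy. The codes in the usage example are placeholders.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

# Hypothetical type: a classifier maps a cleaned job title to a label string.
Classifier = Callable[[str], str]


def clean(title: str) -> str:
    """Stand-in for the data-cleaning step: lowercase and collapse whitespace."""
    return " ".join(title.lower().split())


def majority_vote(votes: Sequence[str]) -> str:
    """Return the most frequent label.

    Assumption: ties are broken in favour of the earliest vote; the source
    does not state a tie-breaking rule.
    """
    if not votes:
        raise ValueError("majority_vote needs at least one vote")
    counts = Counter(votes)
    best = max(counts.values())
    for vote in votes:
        if counts[vote] == best:
            return vote
    raise AssertionError("unreachable")


def encode(
    title: str,
    exact_lookup: Dict[str, str],
    tier1: List[Classifier],
    tier2: Dict[str, List[Classifier]],
) -> str:
    """Return a 4-digit code for a job title.

    exact_lookup: cleaned category description -> 4-digit code.
    tier1: classifiers that each predict the first digit.
    tier2: for each first digit, classifiers that each predict digits 2-4
           (assumed structure; the source only says the second tier
           predicts the remaining three digits).
    """
    cleaned = clean(title)
    if cleaned in exact_lookup:  # exact-match search comes first
        return exact_lookup[cleaned]
    first_digit = majority_vote([clf(cleaned) for clf in tier1])
    remaining = majority_vote([clf(cleaned) for clf in tier2[first_digit]])
    return first_digit + remaining


# Usage with toy stand-in classifiers and placeholder codes.
exact = {"registered nurse": "3012"}
tier1 = [lambda t: "7", lambda t: "7", lambda t: "6"]
tier2 = {"7": [lambda t: "271", lambda t: "271", lambda t: "241"]}
print(encode("  Registered   NURSE ", exact, tier1, tier2))
print(encode("carpenter", exact, tier1, tier2))
```

Splitting the prediction into tiers lets each vote be taken over a much smaller label set than the full list of 4-digit codes. The ranked alternates drawn from the Doc2Vec model are not shown here.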
Results The prototype was benchmarked on a manually annotated data set comprising 64,000 records. It produced a top-1 Per-Digit Macro F1-Score of 0.65 and a top-5 Per-Digit Macro F1-Score of 0.76, both of which are well within the published accuracy range for manual coding (44% to 89% inter-annotator agreement). ENENOC coded 30,000 job titles in 3 hours, or about 10,000 titles per hour.
Conclusion The ENENOC prototype is an ensemble encoder for the National Occupational Classification that delivers state-of-the-art accuracy with significant speed improvements over manual coding.