Abstract
Background/aim Recent studies have suggested machine learning as an alternative to supervised linear regression (SLR) for developing Land Use Regression models for air pollution exposure assessment. However, few studies have made direct comparisons. This study aimed to develop novel models using machine learning approaches and to compare their performance with that of SLR models, using an external dataset for validation.
Methods A set of novel Europe-wide models was developed to estimate 2010 annual means for NO2 and PM2.5, based on AIRBASE routine monitoring data. Satellite observations, chemical transport model estimates, land use and traffic data were used as predictor variables. The alternative algorithms included shrinkage techniques (lasso, elastic net, ridge), ensemble learning (bagging, boosting, random forest), support vector machines and a super-learner algorithm. In addition to 5-fold cross-validation, we performed external validation using data from the ESCAPE study to evaluate model performance. The novel models were compared with the previously developed models (SLR for both NO2 and PM2.5, with additional kriging on residuals in the PM2.5 models).
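The distinction between the two validation schemes described above can be illustrated with a minimal sketch. It is not the study's code: the data are synthetic, the model is a single-predictor least-squares line rather than any of the algorithms listed, R2 is assumed to be defined as 1 - SSE/SST, and all function and variable names are hypothetical. The external set is deliberately generated with a shifted intercept to show how an R2 computed on independent data can fall below the cross-validated R2.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(observed, predicted):
    """R2 as 1 - SSE/SST (an assumed definition, not taken from the study)."""
    mo = mean(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mo) ** 2 for o in observed)
    return 1 - sse / sst


def cross_validated_r2(xs, ys, k=5, seed=0):
    """Pool the held-out predictions from k folds, then compute R2 once."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    preds = [0.0] * len(xs)
    for fold in range(k):
        test = idx[fold::k]  # every k-th shuffled index forms one fold
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = a + b * xs[i]
    return r_squared(ys, preds)


def external_r2(train_x, train_y, ext_x, ext_y):
    """Fit on all training sites, then score on an independent dataset."""
    a, b = fit_line(train_x, train_y)
    return r_squared(ext_y, [a + b * x for x in ext_x])


# Synthetic stand-ins for monitoring sites and an external validation set.
rng = random.Random(42)
train_x = [rng.uniform(0, 10) for _ in range(200)]
train_y = [5 + 2 * x + rng.gauss(0, 3) for x in train_x]
ext_x = [rng.uniform(0, 10) for _ in range(100)]
ext_y = [8 + 2 * x + rng.gauss(0, 3) for x in ext_x]  # shifted intercept

cv_r2 = cross_validated_r2(train_x, train_y)
ext_r2 = external_r2(train_x, train_y, ext_x, ext_y)
print(f"5-fold CV R2: {cv_r2:.2f}, external R2: {ext_r2:.2f}")
```

Cross-validation reuses sites drawn from the same monitoring network as the training data, so it cannot reveal a systematic difference between that network and another set of sites. Scoring on a separately collected dataset, as in the second function, is what exposes such a difference.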
Results For NO2, random forest showed a moderate improvement in cross-validation, with an R2 of 0.66 compared with 0.58 for the conventional supervised linear regression model, but its external validation R2 was lower (0.46 compared with 0.50). The super-learner algorithm had the highest external validation R2 (0.51), which was less than 0.01 higher than that of the original supervised linear regression model.
For PM2.5, most of the machine learning methods performed similarly to or worse than the original supervised linear regression model. The super-learner algorithm had the highest cross-validation R2 (0.72), which was 0.02 higher than that of the supervised linear regression model. However, no machine learning algorithm performed better in external validation.
Conclusion Machine learning algorithms did not perform better than supervised linear regression in our Europe-wide datasets.