OP III – 4 Exposure assessment models for NO2 and PM2.5 in the ELAPSE study: a comparison of supervised linear regression and machine learning approaches
  1. Jie Chen1,
  2. Kees de Hoogh2,3,
  3. Maciek Strak1,
  4. Jules Kerckhoffs1,
  5. Roel Vermeulen1,
  6. Bert Brunekreef1,
  7. Gerard Hoek1
  1. 1Institute for Risk Assessment Sciences, Faculty of Veterinary Medicine, Utrecht, Netherlands
  2. 2Swiss Tropical and Public Health Institute, Basel, Switzerland
  3. 3University of Basel, Basel, Switzerland

Abstract

Background/aim Recent studies have suggested machine learning as an alternative to supervised linear regression (SLR) for developing Land Use Regression models for air pollution exposure assessment. However, few studies have made direct comparisons. This study aimed to develop novel models using machine learning approaches and to compare their performance with that of SLR models, using an external dataset for validation.

Methods A set of novel Europe-wide models was developed to estimate 2010 annual means for NO2 and PM2.5, based on AIRBASE routine monitoring data. Satellite observations, chemical transport model estimates, land use and traffic data were used as predictor variables. The alternative algorithms we used included shrinkage techniques (lasso, elastic net, ridge), ensemble learning (bagging, boosting, random forest), support vector machine and a super-learner algorithm. In addition to 5-fold cross-validation, we performed external validation using data from the ESCAPE study to evaluate model performance. The novel models were compared to the previously developed models (SLR for both NO2 and PM2.5, with additional kriging on residuals in the PM2.5 models).
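The two validation schemes named above can be illustrated with a minimal sketch. It is not the study's code: the data are synthetic, there is a single hypothetical predictor, and a k-nearest-neighbour regressor stands in for the machine learning algorithms actually used (which are not reproduced here). All function names are illustrative assumptions. The sketch only shows how a 5-fold cross-validation R2 and an external validation R2 are computed for two competing models.

```python
import random
from statistics import mean

def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    obs_mean = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def fit_linear(xs, ys):
    """Ordinary least squares with one predictor; returns a predict function."""
    x_mean, y_mean = mean(xs), mean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    intercept = y_mean - slope * x_mean
    return lambda x: intercept + slope * x

def fit_knn(xs, ys, k=5):
    """k-nearest-neighbour regression (assumed stand-in for a flexible learner)."""
    train = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
        return mean(y for _, y in nearest)
    return predict

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and split the indices into k nearly equal folds."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[i::k] for i in range(k)]

def cross_validated_r2(fit, xs, ys, k=5, seed=0):
    """Pool held-out predictions over k folds, then compute a single R2."""
    rng = random.Random(seed)
    predictions = [None] * len(xs)
    for fold in kfold_indices(len(xs), k, rng):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        for i in fold:
            predictions[i] = model(xs[i])
    return r2_score(ys, predictions)

def external_r2(fit, train_x, train_y, ext_x, ext_y):
    """Fit on all training sites, then score on an independent dataset."""
    model = fit(train_x, train_y)
    return r2_score(ext_y, [model(x) for x in ext_x])

def simulate(n, rng):
    """Synthetic sites: concentration rises linearly with one predictor plus noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [5.0 + 2.0 * x + rng.gauss(0, 3) for x in xs]
    return xs, ys

rng = random.Random(42)
train_x, train_y = simulate(200, rng)   # plays the role of routine monitoring data
ext_x, ext_y = simulate(100, rng)       # plays the role of the external dataset

for name, fit in [("linear", fit_linear), ("knn", fit_knn)]:
    cv = cross_validated_r2(fit, train_x, train_y)
    ext = external_r2(fit, train_x, train_y, ext_x, ext_y)
    print(f"{name}: CV R2 = {cv:.2f}, external R2 = {ext:.2f}")
```

In this toy setting both datasets come from the same simulated process, so the two R2 values should be close. With real monitoring networks the external dataset may differ in site types and measurement methods, which is why the two metrics can diverge.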

Results For NO2, random forest showed a moderate improvement in cross-validation (R2 of 0.66, compared to 0.58 for the conventional supervised linear regression model), but its external validation R2 was lower (0.46 compared to 0.50). The super-learner algorithm had the highest external validation R2 of 0.51, which was less than 0.01 higher than the original supervised linear regression model.

For PM2.5, most of the machine learning methods showed similar or worse performance compared to the original supervised linear regression model. The super-learner algorithm had the highest cross-validation R2 of 0.72, which was 0.02 higher than the supervised linear regression model. However, no machine learning algorithm showed better performance in external validation.

Conclusion Machine learning algorithms did not perform better than supervised linear regression in our Europe-wide datasets.
