
Predicting Air Quality in Beijing

Academic Project · UCLouvain LELEC2870 · December 2019

Regression models trained on 7 684 records of meteorological data from Beijing (2013–2017) to predict PM2.5 concentration. The project covers feature engineering, correlation and mutual-information selection, PCA extraction, and a benchmark of seven models evaluated with the Bootstrap 632 method.

Python · PyTorch · scikit-learn · Seaborn · 31.42 µg/m³ best error

Figure: MLP error against the number of neurons per hidden layer (line chart).

Figure: Mutual information between all input features and the PM2.5 output (heatmap).

7 684 training records (March 2013 – February 2017)
17 engineered features (after cyclic encoding)
7 models benchmarked (linear to deep neural nets)
31.42 µg/m³ best error (bagging trees, selected features)
Methodology

Five-stage ML pipeline

The project follows a rigorous end-to-end pipeline from raw sensor data to validated model selection, emphasising sound error estimation throughout.

01

Feature engineering

Time encoded cyclically via sin/cos (daily and yearly periods). Wind direction mapped to a wind-rose angle. All 17 features normalized with standard scaling.
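
A minimal sketch of the cyclic encoding, assuming a pandas DataFrame with a datetime column and a compass wind-direction column (the column names and the compass-to-angle mapping are illustrative, not taken from the report):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def add_cyclic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Encode daily/yearly time periods and wind direction as sin/cos pairs."""
    out = df.copy()
    t = pd.to_datetime(out["datetime"])

    # Daily period: hour of day mapped onto the unit circle.
    hour = t.dt.hour + t.dt.minute / 60.0
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24.0)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24.0)

    # Yearly period: day of year mapped onto the unit circle.
    day = t.dt.dayofyear
    out["doy_sin"] = np.sin(2 * np.pi * day / 365.25)
    out["doy_cos"] = np.cos(2 * np.pi * day / 365.25)

    # Wind direction: compass label -> wind-rose angle (degrees), then sin/cos.
    rose = {"N": 0, "NE": 45, "E": 90, "SE": 135, "S": 180, "SW": 225, "W": 270, "NW": 315}
    angle = np.deg2rad(out["wind_dir"].map(rose))
    out["wd_sin"] = np.sin(angle)
    out["wd_cos"] = np.cos(angle)
    return out

# Standard scaling of the final feature matrix (feature_cols lists the 17 columns).
# X = StandardScaler().fit_transform(add_cyclic_features(raw)[feature_cols])
```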

02

Feature selection

Correlation and mutual information used jointly to drop low-relevance features (station, rain, swd…) and redundant ones (temp, pressure, time). Final set: 7 features.
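
A hedged sketch of the relevance screening, assuming X is the standardized 17-column feature frame and y the PM2.5 target; the thresholds are illustrative, and the final cut was made by inspecting the scores together with the feature-to-feature mutual-information heatmap shown above:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def rank_features(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Score each feature by |Pearson correlation| and mutual information with PM2.5."""
    abs_corr = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    mi = pd.Series(mutual_info_regression(X, y, random_state=0), index=X.columns)
    return pd.DataFrame({"abs_corr": abs_corr, "mutual_info": mi}).sort_values(
        "mutual_info", ascending=False)

scores = rank_features(X, y)
# Keep features that look relevant on at least one criterion, then prune
# redundant ones (temp, pressure, time) by hand after checking pairwise MI.
keep = scores[(scores["abs_corr"] > 0.05) | (scores["mutual_info"] > 0.02)].index
X_selected = X[keep]
```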

03

Feature extraction (PCA)

PCA reduces the 17-feature set to principal components. Three components capture the dominant variance (error ≈ 55.5 µg/m³), but non-linear dependencies limit its usefulness here.
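
A minimal PCA sketch with scikit-learn, assuming X is the standardized full feature matrix; the three-component choice matches the report, the rest is boilerplate:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=3)                       # three components kept in the project
X_pca = pca.fit_transform(X)                    # X: standardized 17-feature matrix
print(pca.explained_variance_ratio_.cumsum())   # cumulative share of variance captured
```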

04

Error estimation

Bootstrap 632 method (MLxtend) used throughout — low bias and low variance, making it more reliable than simple k-fold. 10 splits yield σ = 0.255 µg/m³.
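
A sketch of that estimate with MLxtend's bootstrap_point632_score; the .632 method and the 10 splits match the report, while the RMSE scoring function and the estimator shown are assumptions:

```python
import numpy as np
from mlxtend.evaluate import bootstrap_point632_score
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Error in µg/m³, comparable across models."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

# X_sel, y: NumPy arrays of selected features and PM2.5 targets.
scores = bootstrap_point632_score(
    BaggingRegressor(random_state=0),   # any candidate estimator goes here
    X_sel, y,
    n_splits=10,                        # 10 bootstrap splits, as in the report
    method=".632",
    scoring_func=rmse,
    random_seed=0,
)
print(scores.mean(), scores.std())      # mean error and its spread (σ)
```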

05

Model selection

Seven models trained and cross-compared across all three feature sets. Bootstrap aggregating trees with selected features win at 31.42 µg/m³.
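
The cross-comparison can be organised as a plain grid over models and feature sets. A sketch follows, with hyperparameters roughly matching the results section and everything else (ensemble size, array names) illustrative; the PyTorch MLP is handled separately:

```python
import numpy as np
from mlxtend.evaluate import bootstrap_point632_score
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=0.61),
    "lasso": Lasso(alpha=0.01),
    "knn": KNeighborsRegressor(n_neighbors=4),
    "tree": DecisionTreeRegressor(max_depth=10),
    "bagging": BaggingRegressor(n_estimators=100, random_state=0),
}
feature_sets = {"full": X_full, "selected": X_sel, "pca": X_pca}  # NumPy arrays

results = {}
for m_name, model in models.items():
    for f_name, X in feature_sets.items():
        scores = bootstrap_point632_score(model, X, y, n_splits=10,
                                          method=".632", scoring_func=rmse)
        results[(m_name, f_name)] = scores.mean()
```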

Results

Seven models, three feature sets

Every model is evaluated with the Bootstrap 632 error (µg/m³) on the full feature set, the 7-feature selected set, and the PCA-reduced set. For each model, the lowest of the three errors marks its best feature set.

Bootstrap Aggregating Trees (Best)
Full 31.60 · Selected 31.42 · PCA 41.78

Chosen as the final model. Selected features outperform the full set thanks to reduced dimensionality. No overfitting observed up to depth 33.
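
Fitting the chosen model could look like the sketch below, assuming X_sel holds the 7 selected, standardized features; the ensemble size is illustrative, and the base learner is scikit-learn's default decision tree, left unpruned since no overfitting was seen up to depth 33:

```python
from sklearn.ensemble import BaggingRegressor

# Bootstrap aggregating over decision trees (the default base estimator).
final_model = BaggingRegressor(n_estimators=100, random_state=0)
final_model.fit(X_sel, y)

pm25_pred = final_model.predict(X_sel_new)  # new samples, preprocessed identically
```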

Multilayer Perceptron (Best)
Full 31.94 · Selected 37.52 · PCA 46.16

20 hidden layers · 100 neurons · 300 epochs. Full features win here — the network learns its own feature weighting. PyTorch implementation.
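
A hedged PyTorch sketch of an MLP with that shape; the depth, width, and epoch count come from the card above, while the optimizer, learning rate, and MSE loss are assumptions:

```python
import torch
import torch.nn as nn

def make_mlp(n_inputs: int, hidden_layers: int = 20, width: int = 100) -> nn.Sequential:
    """Fully connected regressor: `hidden_layers` hidden layers of `width` ReLU units."""
    layers, d = [], n_inputs
    for _ in range(hidden_layers):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, 1))  # single PM2.5 output
    return nn.Sequential(*layers)

model = make_mlp(n_inputs=17)       # full feature set performed best for the MLP
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer and lr
loss_fn = nn.MSELoss()

# X_train, y_train: float tensors of standardized features and PM2.5 targets.
for epoch in range(300):            # 300 epochs, as stated above
    optimizer.zero_grad()
    loss = loss_fn(model(X_train).squeeze(-1), y_train)
    loss.backward()
    optimizer.step()
```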

Regression Tree (Good)
Full 40.39 · Selected 40.23 · PCA 51.48

Best depth ≈ 10–11. PCA set overfits at depth 7. Full and selected features plateau without further overfitting.

K-Nearest Neighbour (Good)
Full 44.12 · Selected 38.51 · PCA 48.58

Selected features strongly outperform full set — high-dimensional Euclidean distance degrades neighbour quality. Best at K = 4.

Lasso (Baseline)
Full 44.06 · Selected 44.07 · PCA 53.92

L1 regularisation sets low-relevance weights to zero. Best at λ = 0.01. Redundant with linear regression given the large training set.

Ridge Regression (Baseline)
Full 44.15 · Selected 44.38 · PCA 53.91

Performance nearly constant for λ ∈ [0.01, 100], then underfits. Best at λ = 0.61.

Linear Regression (Baseline)
Full 44.35 · Selected 44.50 · PCA 54.10

Validates the feature selection — 7 selected features match the full-set error. PCA performs poorly due to non-linear dependencies.

Conclusion

Key findings

Ensemble methods dominate

Bootstrap aggregating trees cut the error by nearly 30 % relative to the linear baselines (31.42 vs 44.35 µg/m³), confirming that ensemble diversity is the single most impactful lever here.

Feature selection vs extraction

Mutual-information-based selection (7 features) matches or beats the full 17-feature set for all non-neural models. PCA consistently underperforms due to non-linear feature dependencies.

Neural networks need all features

The MLP achieves its lowest error with the full feature set — unlike tree-based models. The network learns its own relevance weighting, making manual selection counterproductive.