This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Data augmentation alters feature importance in XGBoost for CVD prediction

2025 · 0 citations · Scientific Reports · Open Access

Abstract

Machine learning models are powerful tools for cardiovascular disease (CVD) prediction, but their performance is often limited by dataset size and class imbalance. While data augmentation techniques can address these issues, their impact on model interpretability and on the relative importance of clinical predictors remains poorly understood. This study investigates how different data augmentation strategies affect the performance and feature importance hierarchy of an Extreme Gradient Boosting (XGBoost) model for CVD prediction. An ablation study was conducted on a public CVD dataset. Three XGBoost models were developed and compared: a baseline model trained on the original data, a model trained on data augmented with the Synthetic Minority Over-sampling Technique (SMOTE), and a model trained on data augmented with a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP). Model performance was evaluated using accuracy, F1-score, and AUC. Feature importance was quantified with the Gain metric and compared across models. All models demonstrated high predictive performance on the independent test set, with the SMOTE-augmented model achieving an accuracy and AUC of 1.0. Data augmentation fundamentally altered the model's feature importance ranking. In the baseline model, 'oldpeak' (Gain: 8.25) and 'slope' (Gain: 7.01) were the top predictors. In contrast, 'slope' became the single most dominant feature in both the SMOTE-augmented (Gain: 27.49) and WGAN-GP-augmented (Gain: 36.68) models. Data augmentation can therefore substantially reshape the predictive strategy of a high-performance machine learning model. For high-quality datasets, the primary effect of augmentation may be the re-prioritization of predictive features rather than a direct improvement in classification accuracy. These findings underscore the critical need to evaluate the impact of synthetic data on model interpretability before clinical application.
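
To make the SMOTE step named in the abstract more concrete, the following is a minimal sketch of its core idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours. The function name `smote_sketch`, its parameters, and the toy data are assumptions made for illustration only; the abstract does not describe the authors' implementation, and this is not it.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length feature vectors (lists of floats).
    Hypothetical helper for illustration; not the paper's code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base point.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class with two made-up features (not data from the paper).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6, k=2, seed=42)
```

Because every synthetic point is a convex combination of two real minority samples, it always lies inside the range spanned by the existing minority data. In the XGBoost Python package, Gain values of the kind quoted in the abstract can be read from a trained booster with `Booster.get_score(importance_type="gain")`; whether the authors used that exact call is not stated here.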

Topics

Artificial Intelligence in Healthcare and Education; Machine Learning in Healthcare; Artificial Intelligence in Healthcare
