This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Data augmentation alters feature importance in XGBoost for CVD prediction
Citations: 0
Authors: 4
Year: 2025
Abstract
Machine learning models are powerful tools for cardiovascular disease (CVD) prediction, but their performance is often limited by dataset size and class imbalance. While data augmentation techniques can address these issues, their impact on model interpretability and the relative importance of clinical predictors remains poorly understood. This study investigates how different data augmentation strategies affect the performance and feature importance hierarchy of an Extreme Gradient Boosting (XGBoost) model for CVD prediction. An ablation study was conducted on a public CVD dataset. Three XGBoost models were developed and compared: a baseline model trained on the original data, a model trained on data augmented with the Synthetic Minority Over-sampling Technique (SMOTE), and a model trained on data augmented with a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP). Model performance was evaluated using accuracy, F1-score, and AUC; feature importance was quantified and compared across models using the Gain metric. All models demonstrated high predictive performance on the independent test set, with the SMOTE-augmented model achieving an accuracy and AUC of 1.0. Data augmentation fundamentally altered the models' feature importance hierarchies: in the baseline model, 'oldpeak' (Gain: 8.25) and 'slope' (Gain: 7.01) were the top predictors, whereas 'slope' became the single most dominant feature in both the SMOTE-augmented (Gain: 27.49) and WGAN-GP-augmented (Gain: 36.68) models. Data augmentation can therefore significantly reshape the predictive strategy of a high-performance machine learning model. For high-quality datasets, the primary effect of augmentation may be the re-prioritization of predictive features rather than a direct improvement in classification accuracy. These findings underscore the critical need to evaluate the impact of synthetic data on model interpretability before clinical application.
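The abstract names SMOTE as one of the augmentation strategies studied. As a rough illustration only (not the authors' pipeline, whose code is in the full article), the core idea of SMOTE is to synthesize new minority-class samples by interpolating between an existing minority sample and one of its k nearest minority-class neighbours. A minimal NumPy sketch of that interpolation step, with a hypothetical helper name `smote_like_oversample` and a tiny toy dataset:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE-style sketch: generate n_new synthetic samples by
    interpolating between minority samples and their k nearest minority
    neighbours (Euclidean distance). Illustrative only."""
    rng = np.random.default_rng(rng)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from the chosen sample to all minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]        # k nearest neighbours (skip self)
        j = rng.choice(nn)
        lam = rng.random()                 # interpolation factor in [0, 1)
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(new)

# toy minority class: 5 samples, 2 features
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_syn = smote_like_oversample(X_min, n_new=10, rng=0)
print(X_syn.shape)  # (10, 2)
```

In a study like the one abstracted here, the augmented matrix would then be fed to XGBoost, and per-feature Gain values (e.g. via the booster's importance scores) compared against the baseline model to detect the re-prioritization of features the abstract reports.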
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,231 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,084 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,444 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,423 citations