This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Augmenting small tabular health data for training prognostic ensemble machine learning models using generative models

2025 · 1 citation · BMC Medical Informatics and Decision Making · Open Access

Abstract

Small datasets are common in health research, and machine learning models trained on them tend to generalize poorly. Data augmentation is one way to address this and is often used for imaging and time series data, but its potential benefits for tabular health data have not been evaluated. Augmentation increases the sample size and is seen as a form of regularization that increases the diversity of small datasets, so that models trained on them perform better on unseen data.

This study evaluates data augmentation using generative models on tabular health data and assesses the impact of diversity versus increased sample size.

Using 13 large health datasets, we performed a simulation to evaluate the impact of data augmentation on the prediction performance of gradient boosted decision tree models for binary classification, as measured by the area under the receiver operating characteristic curve (ROC-AUC). Four different synthetic data generation models were evaluated. We also built a generalized linear mixed effect model to assess variable importance for the model performance improvements from augmentation. We illustrate the proposed method on seven small real datasets as an application. Augmentation was also compared with resampling, which serves as a proxy for a larger dataset with minimal impact on diversity.

Augmentation improves prognostic performance for datasets that have higher cardinality categorical variables and lower baseline ROC-AUC. No specific generative model consistently outperformed the others. For the seven small application datasets, augmenting the existing data resulted in a relative increase in ROC-AUC between 4.31% (ROC-AUC from 0.71 to 0.75) and 43.23% (ROC-AUC from 0.51 to 0.73), with an average relative improvement of 15.55%, demonstrating the nontrivial impact of augmentation on small datasets (p = 0.0078). ROC-AUC with augmentation was higher than ROC-AUC with resampling only (p = 0.016). The diversity of augmented datasets was higher than the diversity of resampled datasets (p = 0.046).

This study demonstrates that data augmentation using generative models can markedly improve the predictive performance of machine learning models on tabular health data, but only for datasets that meet baseline data complexity and predictive performance criteria. Our mixed effect model identified the most influential dataset characteristics and can help end-users form more realistic expectations of augmentation performance on a new dataset. Furthermore, augmentation yielded larger gains on smaller datasets, which is consistent with the argument that the greater data diversity introduced by augmentation is beneficial.
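To make the reported metrics concrete, the following is a minimal sketch of how ROC-AUC and a relative improvement in ROC-AUC can be computed. It is not the authors' code: the function names `roc_auc` and `relative_improvement` and the toy labels and scores are assumptions introduced purely for illustration. The ROC-AUC is computed with the standard pairwise (Mann-Whitney) formulation, and the relative improvement is assumed to be the change in ROC-AUC divided by the baseline value.

```python
from itertools import product


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win
    (the Mann-Whitney U formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def relative_improvement(baseline_auc, augmented_auc):
    """Change in ROC-AUC expressed as a percentage of the baseline value (assumed definition)."""
    return 100.0 * (augmented_auc - baseline_auc) / baseline_auc


# Toy labels and scores, invented for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))

# Rounded endpoints quoted in the abstract for the largest reported gain.
print(round(relative_improvement(0.51, 0.73), 2))
```

Applied to the rounded endpoints 0.51 and 0.73, this definition gives roughly 43.1%, close to the reported 43.23%. The reported percentages were presumably computed from unrounded ROC-AUC values, so rounded endpoints need not reproduce them exactly.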

Topics

Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Sepsis Diagnosis and Treatment
