This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Identification of key cardiovascular disease predictive factors from the China Health and Retirement Longitudinal Study dataset using machine learning-based algorithms
Citations: 0
Authors: 15
Year: 2026
Abstract
Cardiovascular diseases (CVDs) are a leading cause of death but are largely preventable. Consequently, devising effective predictive models to identify high-risk factors, and subsequently to shape earlier interventional strategies, is of great importance. Machine learning algorithms have been identified as a tool for devising such models, as they offer greater flexibility and scalability than traditional approaches. This study aimed to identify predictive features for CVD risk by applying 5 machine learning algorithms to the China Health and Retirement Longitudinal Study (CHARLS) dataset. A total of 91,232 individuals from CHARLS, collected in 2011, 2012, 2015, and 2016, were randomly divided, in a 4:1 ratio, into training (72,985; 12,433 with CVD) and testing (18,247; 3,109 with CVD) datasets. The data were preprocessed by imputing missing values and standardizing features using StandardScaler, which was first fitted on the training dataset and then applied to both the training and testing sets. The resulting 61 features were fed into 5 machine learning algorithms: logistic regression, random forest, eXtreme gradient boosting, light gradient-boosting machine (LightGBM), and deep neural network. The optimal hyperparameters and classification thresholds were identified for each algorithm, and their predictive capabilities were assessed using receiver operating characteristic (ROC) curves. The top 15 predictive features from each algorithm were identified by Shapley Additive exPlanations (SHAP), followed by Venn diagrams and UpSet plots to determine which features were shared among all 5 algorithms. Of the 61 predictors, 9 were continuous (e.g., blood pressure, biomarker levels), 6 ordinal (e.g., education, self-rated health), and 46 binary (e.g., smoking status, comorbidity flags). LightGBM achieved the highest area under the ROC curve, 0.8002 (95% CI: 0.7917–0.8089), indicating the strongest predictive capability among the 5 algorithms.
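The split-then-standardize step described above can be sketched as follows. This is a minimal, standard-library illustration rather than the authors' code: the helper names (`split_4_to_1`, `fit_scaler`, `transform`), the random seed, and the synthetic two-feature data are all assumptions made for the example. The scaling statistics use the population standard deviation, which matches the default behavior of scikit-learn's StandardScaler, and they are computed from the training rows only so that no information from the test set leaks into preprocessing.

```python
import random
import statistics


def split_4_to_1(rows, seed=0):
    """Shuffle rows and split them 4:1 into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


def fit_scaler(train_rows):
    """Compute per-column mean and population std from training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column gets a divisor of 1.0 to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Standardize rows with statistics that were fitted elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Synthetic stand-in data (hypothetical): 100 rows, 2 numeric features.
rng = random.Random(42)
data = [[rng.gauss(60, 10), rng.gauss(130, 15)] for _ in range(100)]

train, test = split_4_to_1(data)
means, stds = fit_scaler(train)                 # fitted on training data only
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)      # reuses the training statistics
```

With integer division, a 4:1 split of 91,232 rows gives 72,985 training and 18,247 testing rows, which agrees with the counts reported in the abstract. The test set is scaled with the training means and standard deviations, so its columns will generally not have exactly zero mean or unit variance.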
Venn diagrams and UpSet plots showed that, of the top 15 predictive features for CVD risk identified by SHAP for each algorithm, 8 were shared across all 5 algorithms: hypertension, self-rated health, dyslipidemia, age, digestive disease, retirement status, chest pain, and lung disease. A risk prediction model for CVD was thus devised from 8 key factors that are clinically relevant and easily obtainable, providing a useful tool for screening high-risk populations and enabling early CVD prevention and treatment.
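The "shared across all algorithms" step amounts to intersecting each model's top-k feature set. The sketch below assumes each model's features are ranked by a single importance score (for example, mean absolute SHAP value, which the abstract does not specify). The function name `shared_top_features`, the model and feature names, and the scores are hypothetical placeholders, not results from the study.

```python
def shared_top_features(importance_by_model, k):
    """Return the features that appear in every model's top-k by score."""
    top_sets = []
    for scores in importance_by_model.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        top_sets.append(set(ranked[:k]))
    return set.intersection(*top_sets)


# Hypothetical importance scores for three placeholder models.
importance = {
    "model_a": {"x1": 0.90, "x2": 0.50, "x3": 0.10, "x4": 0.05},
    "model_b": {"x1": 0.40, "x2": 0.10, "x3": 0.60, "x4": 0.20},
    "model_c": {"x1": 0.70, "x2": 0.30, "x3": 0.35, "x4": 0.10},
}

shared = shared_top_features(importance, k=2)
print(shared)  # {'x1'}
```

In the study's setting, `k` would be 15 and the dictionary would hold one entry per algorithm; a Venn diagram or UpSet plot visualizes the same set intersections.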
Similar works
Biostatistical Analysis
1996 · 35,445 citations
UCI Machine Learning Repository
2007 · 24,290 citations
An introduction to ROC analysis
2005 · 20,586 citations
The use of the area under the ROC curve in the evaluation of machine learning algorithms
1997 · 7,096 citations
A method of comparing the areas under receiver operating characteristic curves derived from the same cases.
1983 · 7,061 citations