OpenAlex · Updated hourly · Last updated: 14.03.2026, 00:48

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Identification of key cardiovascular disease predictive factors from the China Health and Retirement Longitudinal Study dataset using machine learning-based algorithms

2026 · 0 citations · Journal of Translational Medicine · Open Access

Citations: 0 · Authors: 15 · Year: 2026

Abstract

Cardiovascular diseases (CVDs) are a leading cause of death but are largely preventable. Consequently, devising effective predictive models that identify high-risk factors, and thereby inform earlier intervention strategies, is of great importance. Machine learning algorithms are a promising tool for building such models because they offer greater flexibility and scalability than traditional approaches. This study aimed to identify predictive features for CVD risk by applying five machine learning algorithms to the China Health and Retirement Longitudinal Study (CHARLS) dataset. A total of 91,232 individuals from CHARLS, with data collected in 2011, 2012, 2015, and 2016, were randomly divided in a 4:1 ratio into training (72,985; 12,433 with CVD) and testing (18,247; 3,109 with CVD) datasets. The data were preprocessed by imputing missing values and standardizing features with StandardScaler, which was fitted on the training dataset and then applied to both the training and testing sets. The resulting 61 features were fed into five machine learning algorithms: logistic regression, random forest, eXtreme gradient boosting, light gradient-boosting machine (LightGBM), and deep neural network. The optimal hyperparameters and classification thresholds were identified for each algorithm, and predictive performance was assessed using receiver operating characteristic (ROC) curves. The top 15 predictive features from each algorithm were identified by SHapley Additive exPlanations (SHAP), and Venn diagrams and UpSet plots were then used to determine which features were shared among all five algorithms. Of the 61 predictors, 9 were continuous (e.g., blood pressure, biomarker levels), 6 were ordinal (e.g., education, self-rated health), and 46 were binary (e.g., smoking status, comorbidity flags). LightGBM achieved the highest area under the ROC curve, 0.8002 (95% CI: 0.7917–0.8089), indicating the strongest predictive performance among the five algorithms.
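The split and scaling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it uses only the Python standard library as a stand-in for StandardScaler, the helper names (`split_4_to_1`, `fit_scaler`, `apply_scaler`) are hypothetical, and the single feature is toy data rather than CHARLS data. The point it shows is that the mean and standard deviation are learned from the training rows only and then reused unchanged on the testing rows.

```python
import random
import statistics


def split_4_to_1(rows, seed=0):
    """Shuffle a copy of rows and split it 4:1 into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # floor of 80% goes to training
    return shuffled[:cut], shuffled[cut:]


def fit_scaler(train_values):
    """Learn mean and population standard deviation from training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, (std if std > 0 else 1.0)  # guard against constant features


def apply_scaler(values, mean, std):
    """Standardize values using statistics that were learned elsewhere."""
    return [(v - mean) / std for v in values]


# Hypothetical single feature (toy numbers, not CHARLS data).
feature = [float(v) for v in range(100, 200)]
train, test = split_4_to_1(feature)

mean, std = fit_scaler(train)                  # fitted on the training set only
train_scaled = apply_scaler(train, mean, std)  # applied to the training set
test_scaled = apply_scaler(test, mean, std)    # same statistics reused on the test set
```

Fitting on the training set alone keeps information about the testing rows out of the preprocessing step. With scikit-learn, the corresponding pattern would be to call `fit` on the training matrix and `transform` on both matrices. As a side note, a floor-based 4:1 cut of 91,232 rows gives 72,985 training rows, which agrees with the reported sizes, although the abstract does not say how the authors computed the cut.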
Venn diagrams and UpSet plots showed that, of the top 15 SHAP-ranked predictive features for CVD risk from each algorithm, 8 were shared across all five algorithms: hypertension, self-rated health, dyslipidemia, age, digestive disease, retirement status, chest pain, and lung disease. A CVD risk prediction model was devised from these 8 clinically relevant and easily obtainable factors, providing a useful tool for screening high-risk populations and enabling early CVD prevention and treatment.
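The feature-overlap step amounts to intersecting each algorithm's top-ranked feature set, which is the region a Venn diagram or UpSet plot shows as shared by all models. The sketch below assumes each model already has one importance score per feature (for example a mean absolute SHAP value). The model names, the scores, and the `placeholder_*` features are invented for illustration, and k = 3 stands in for the study's top 15.

```python
def top_k(importance, k):
    """Return the k feature names with the largest importance scores."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return set(ranked[:k])


def shared_features(importance_by_model, k):
    """Intersect the top-k feature sets of all models."""
    top_sets = [top_k(scores, k) for scores in importance_by_model.values()]
    return set.intersection(*top_sets)


# Hypothetical importance scores (illustrative values, not study results).
importance_by_model = {
    "model_a": {"age": 0.9, "hypertension": 0.8, "chest_pain": 0.5, "placeholder_x": 0.1},
    "model_b": {"hypertension": 0.7, "age": 0.6, "placeholder_x": 0.4, "chest_pain": 0.2},
    "model_c": {"age": 0.5, "placeholder_y": 0.45, "hypertension": 0.4, "chest_pain": 0.1},
}

shared = shared_features(importance_by_model, k=3)
print(sorted(shared))  # ['age', 'hypertension']
```

In this toy input, `chest_pain` ranks in the top three for only one model, so it drops out of the intersection. The same logic applied to five models with k = 15 would produce a shared set like the eight features listed in the abstract.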

Similar works