OpenAlex · Updated hourly · Last updated: 17.03.2026, 22:20

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Enhanced Heart disease prediction using LLM ranked feature selection, Dynamic custom Kernel

2026 · 0 citations · Machine Learning with Applications · Open Access

0 citations · 5 authors · Year 2026

Abstract

Heart disease is a major cause of death worldwide, accounting for millions of deaths each year. Detecting it at an early stage is therefore critical so that a treatment plan, including medication and counselling, can be started. Machine learning (ML) algorithms trained on large datasets have made it possible to predict heart disease more effectively. Traditional ML approaches capture statistical correlations but often lack explicit integration of clinical knowledge, which limits their usefulness in real-world scenarios. This paper investigates the use of a Large Language Model (LLM) combined with Retrieval-Augmented Generation (RAG) to derive clinically grounded feature relevance from medical guidelines. A curated corpus of medical guidelines and practice protocols from internationally recognized organizations was used to train the RAG pipeline. The features were ranked by the RAG-powered LLM, and the most important features were selected and used in a Support Vector Machine (SVM) with a custom kernel. A custom formulation combining linear and non-linear functions was explored as an auxiliary modeling component. This allows the model to retain the clinical importance of the features and linear transparency while capturing complex interactions through a polynomial function. The approach is evaluated on the UCI Heart Disease dataset, which includes data from Cleveland, Hungary, Switzerland, and the VA Medical Center in Long Beach. The study was conducted in two parts: one using the Cleveland data alone and one using the full dataset from all four sites. Integrating statistical learning with LLM-driven reasoning supports cardiovascular risk assessment in a clinically informed manner and helps to identify clinically relevant features for the learning process.
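
The abstract does not give the kernel formula, so the following is only a minimal sketch of one way a feature-weighted kernel with a linear and a polynomial part could be written. The function names, the mixing parameter `alpha`, and the parameters `gamma`, `coef0`, and `degree` are assumptions for illustration, not the authors' formulation. With non-negative weights, `alpha` between 0 and 1, `gamma > 0`, `coef0 >= 0`, and a positive integer `degree`, this form is a sum of valid kernels and therefore remains a valid kernel.

```python
from typing import List, Sequence


def weighted_dot(x: Sequence[float], z: Sequence[float],
                 weights: Sequence[float]) -> float:
    """Inner product in which each feature is scaled by its relevance weight."""
    return sum(w * a * b for w, a, b in zip(weights, x, z))


def hybrid_kernel(x: Sequence[float], z: Sequence[float],
                  weights: Sequence[float], alpha: float = 0.5,
                  gamma: float = 1.0, coef0: float = 1.0,
                  degree: int = 2) -> float:
    """Hypothetical kernel: alpha * linear + (1 - alpha) * polynomial.

    Both terms use the same weighted inner product, so the (assumed)
    LLM-derived weights influence the linear and the non-linear part.
    """
    if not (len(x) == len(z) == len(weights)):
        raise ValueError("x, z and weights must have the same length")
    s = weighted_dot(x, z, weights)
    return alpha * s + (1.0 - alpha) * (gamma * s + coef0) ** degree


def gram_matrix(X: Sequence[Sequence[float]], Z: Sequence[Sequence[float]],
                weights: Sequence[float], **params) -> List[List[float]]:
    """Kernel values for every pair (row of X, row of Z)."""
    return [[hybrid_kernel(x, z, weights, **params) for z in Z] for x in X]


if __name__ == "__main__":
    # Toy vectors and weights, chosen only to show the arithmetic.
    print(hybrid_kernel([1, 2], [3, 4], [0.5, 1.0]))  # 59.875
```

A Gram matrix built this way could be passed to an SVM implementation that supports precomputed or callable kernels; scikit-learn's `SVC`, for example, accepts `kernel="precomputed"` or a callable that returns the Gram matrix for two sample arrays.
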
On the Cleveland dataset, the model achieved an accuracy of 95%, an F1 score of 0.936, and an AUC-ROC of 0.973; because of the small size of the data, however, this was comparable to traditional models and to the kernel without weighting. On the combined dataset, using the entire UCI dataset, the model achieved an accuracy of 93.3%, an F1 score of 0.923, and an AUC-ROC of 0.961. Statistical testing showed that the weighted and unweighted kernels performed similarly, suggesting that the primary contribution arises from clinically guided feature selection rather than kernel weighting. Combining statistical methods with LLM reasoning improves both the effectiveness and the clarity of predictions. The overall process is a stepping stone toward more reliable AI systems that doctors can trust and use in their daily practice. This paper also includes a comparative study of logistic regression, decision tree, random forest, gradient boosting, and support vector machines with RBF, sigmoid, linear, and polynomial kernels.

• This paper applies a strategy that integrates a Large Language Model (LLM) with Retrieval-Augmented Generation (RAG) to create a clinically weighted and ranked list of features based on medical guidelines.
• The features were ranked by the RAG-powered LLM, and the most important features were selected and used in a Support Vector Machine (SVM) with a new custom kernel, which combines weighted features with linear components and polynomial functions for heart disease prediction.
• The approach is validated on the UCI Heart Disease dataset, which includes data from Cleveland, Hungary, Switzerland, and the VA Medical Center in Long Beach.
• The study was conducted in two parts: one using the Cleveland data alone and one using the full dataset from all four sites. Integrating statistical learning with LLM-driven reasoning improves both the accuracy and the interpretability of cardiovascular risk assessment.
• On the Cleveland dataset, the model achieved an accuracy of 95% and an AUC-ROC of 0.973, but owing to the size of the data this was comparable to traditional models and to the kernel without weighting. On the combined dataset, using the entire UCI dataset, the model outperformed the traditional models with an accuracy of 93.3%. Statistical testing showed that the performance differences between the weighted and unweighted kernels were not significant in most configurations; the proposed framework consistently maintained or modestly improved performance across datasets.
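
The page does not say which statistical test was used to compare the weighted and unweighted kernels. As an illustration only, the sketch below uses an exact paired sign-flip permutation test on per-fold scores, which is one common way to check whether two models evaluated on the same folds differ. The function name and the placeholder scores are assumptions and are not results from the paper.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_pvalue(scores_a: Sequence[float],
                            scores_b: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on the mean paired difference.

    Under the null hypothesis that both models perform equally, the sign of
    each per-fold difference is arbitrary, so all 2**n sign patterns are
    enumerated and compared with the observed mean difference.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        # Small tolerance guards against floating-point ties.
        if stat >= observed - 1e-12:
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


if __name__ == "__main__":
    # Placeholder per-fold accuracies (made up for illustration only).
    weighted = [0.92, 0.95, 0.93, 0.94, 0.91]
    unweighted = [0.92, 0.94, 0.94, 0.93, 0.91]
    print(paired_sign_flip_pvalue(weighted, unweighted))
```

A large p-value from a test like this would be consistent with the statement that the two kernels performed similarly. With only a handful of folds the test has little power, so it cannot by itself show that the kernels are equivalent.
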

Topics

Artificial Intelligence in Healthcare · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education