This is an overview page with metadata for this scientific work. The full article is available from the publisher.

Accuracy of commercial large language models in cardiovascular risk stratification: a vignette-based pilot study

2026 · 0 citations · European Heart Journal - Digital Health · Open Access

Abstract

Background: Cardiovascular disease (CVD) remains the leading cause of mortality worldwide. Effective primary prevention relies on accurate risk stratification, yet validated tools such as SCORE2 are frequently underused in clinical practice, with clinicians often relying on subjective, error-prone judgment. Large language models (LLMs) may facilitate automated extraction and classification of cardiovascular risk factors from free-text clinical narratives, potentially improving adherence to risk-based preventive strategies.

Objective: To assess the accuracy of two commercially available LLMs in classifying 10-year cardiovascular risk according to the European Society of Cardiology (ESC) prevention guidelines, based on extraction and interpretation of risk factors from Portuguese-language clinical vignettes.

Methods: A set of 30 synthetic clinical vignettes in Portuguese was created to reflect real-world primary prevention scenarios, stratified by age, sex, and risk category according to ESC guidelines. Content validity was confirmed by three board-certified cardiologists (Item-CVI = 1.00). Two LLMs (Gemini 2.0 Flash and ChatGPT-4o) processed the vignettes using standardized zero-shot prompts to extract relevant risk factors and classify patients into one of four risk categories (low, intermediate, high, and very high risk). Primary performance metrics included overall accuracy, sensitivity and specificity for detecting high/very-high risk individuals, and agreement with the reference classification measured by Cohen’s κ. Comparative performance was assessed using McNemar’s test.

Results: ChatGPT-4o achieved higher overall classification accuracy (66.7%; 95% CI: 48.8–80.8) than Gemini 2.0 Flash (56.7%; 95% CI: 39.2–72.6). Gemini exhibited superior sensitivity for high/very-high risk identification (86.7% vs 66.7%) but lower specificity (46.7% vs 80.0%). Cohen’s κ indicated moderate agreement with the reference for both models: Gemini κ = 0.41 (95% CI: 0.17–0.65) and ChatGPT-4o κ = 0.54 (95% CI: 0.30–0.77). Weighted κ values favored ChatGPT-4o (0.65 vs 0.51). No statistically significant difference in performance was observed between the models (McNemar’s p = 0.61).

Conclusions: Both LLMs demonstrated moderate accuracy in cardiovascular risk stratification using SCORE2. ChatGPT-4o provided higher overall accuracy and specificity, while Gemini 2.0 Flash offered greater sensitivity, favoring detection of high-risk cases. These preliminary findings suggest that LLMs may assist in automated cardiovascular risk assessment, but further refinement strategies and validation in real-world settings are warranted before integration into preventive care workflows.

[Figure: LLM confusion matrix]
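The accuracy, sensitivity, and specificity figures in the Results can be related to simple counts. The sketch below is an illustration under stated assumptions, not the authors' analysis code: the counts (20/30 and 17/30 correct; 15 high/very-high and 15 low/intermediate vignettes) are inferred from the reported percentages and are not given in the abstract, and the interval method is not stated either, although the reported confidence intervals are consistent with a Wilson score interval. The function names are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


def as_percent(value: float) -> float:
    """Convert a proportion to a percentage rounded to one decimal place."""
    return round(value * 100, 1)


# ASSUMPTION: counts inferred from the reported percentages, not from the paper.
chatgpt_ci = tuple(as_percent(x) for x in wilson_interval(20, 30))  # 66.7% accuracy
gemini_ci = tuple(as_percent(x) for x in wilson_interval(17, 30))   # 56.7% accuracy

# ASSUMPTION: 15 high/very-high and 15 low/intermediate reference cases.
gemini_sens, gemini_spec = sensitivity_specificity(tp=13, fn=2, tn=7, fp=8)
chatgpt_sens, chatgpt_spec = sensitivity_specificity(tp=10, fn=5, tn=12, fp=3)

print(chatgpt_ci)  # (48.8, 80.8)
print(gemini_ci)   # (39.2, 72.6)
print(as_percent(gemini_sens), as_percent(gemini_spec))    # 86.7 46.7
print(as_percent(chatgpt_sens), as_percent(chatgpt_spec))  # 66.7 80.0
```

With only 30 vignettes, each interval spans roughly 30 percentage points, which is in line with the non-significant McNemar result reported above.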

Topics

Artificial Intelligence in Healthcare and Education
Machine Learning in Healthcare
Cardiovascular Health and Risk Factors
