This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Benchmarking large language models for cardiovascular risk stratification using clinical vignettes
Citations: 0
Authors: 11
Year: 2025
Abstract
Large language models (LLMs) show promise for cardiovascular risk stratification, but their performance relative to clinical guidelines requires validation. We benchmarked eleven contemporary LLMs on 30 bilingual (Portuguese/English) outpatient vignettes, comparing their classifications against expert-adjudicated European Society of Cardiology (ESC) guideline categories using SCORE2. Models achieved near-perfect extraction of traditional risk factors (micro-F1 0.97–0.99) but only moderate agreement for three-class ESC risk categories (best weighted kappa 0.69, 95% CI 0.44–0.84). Ten of the eleven models systematically underestimated risk. The LLMs struggled with SCORE2 numeric computation: mean absolute error exceeded 5 percentage points in all but one model. Most models correctly identified guideline exceptions requiring alternative assessment beyond SCORE2 in more than 95% of cases. No significant performance differences between languages were found. While LLMs excel at structured data extraction and eligibility screening, their inconsistent risk stratification and poor numeric accuracy preclude autonomous clinical use and warrant further refinement.
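For readers unfamiliar with the agreement statistic quoted in the abstract, the following is a minimal sketch of a weighted Cohen's kappa for three ordered risk categories, together with a signed error that indicates under- or overestimation. It is not the authors' code: the abstract does not state whether linear or quadratic weights were used, so the `weights` parameter, the function names, the category labels, and the toy label lists are all illustrative assumptions.

```python
# Illustrative sketch only; not taken from the paper.
# Categories are encoded as ordered integers. The label names are an
# assumption used for illustration of a three-class ordinal scale.
CATEGORIES = ["low-to-moderate", "high", "very high"]


def weighted_kappa(reference, predicted, n_classes, weights="linear"):
    """Weighted Cohen's kappa for ordinal labels encoded as 0..n_classes-1."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    # Confusion matrix: rows are reference labels, columns are model labels.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for r, p in zip(reference, predicted):
        observed[r][p] += 1
    row = [sum(observed[i]) for i in range(n_classes)]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    power = 1 if weights == "linear" else 2  # otherwise quadratic weights
    disagreement_observed = 0.0
    disagreement_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (abs(i - j) / (n_classes - 1)) ** power
            disagreement_observed += w * observed[i][j]
            disagreement_expected += w * row[i] * col[j] / n
    if disagreement_expected == 0.0:
        # Both raters used a single identical class; kappa is undefined,
        # and this sketch reports perfect agreement by convention.
        return 1.0
    return 1.0 - disagreement_observed / disagreement_expected


def mean_signed_error(reference, predicted):
    """Negative values mean the model tends to assign a lower risk class."""
    return sum(p - r for r, p in zip(reference, predicted)) / len(reference)


# Hypothetical toy labels (made up for this example, not study data).
reference = [0, 0, 1, 1, 2, 2]
predicted = [0, 0, 0, 1, 1, 2]
kappa = weighted_kappa(reference, predicted, n_classes=len(CATEGORIES))
bias = mean_signed_error(reference, predicted)
```

With these toy labels the model assigns a lower class in two of six cases, so the signed error is negative and kappa falls below 1. Weighted kappa penalizes a two-class miss more than a one-class miss, which is why it suits ordered risk categories better than unweighted agreement.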
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,214 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,071 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,429 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,418 citations