This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Abstract 4369198: Performance of Large Language Models in Analyzing Common Hypertension Scenarios in Clinical Practice
Citations: 0 · Authors: 8 · Year: 2025
Abstract
Background/Objective: Hypertension is the most prevalent chronic disease in primary care and a leading cause of cardiovascular morbidity and mortality. Despite existing guidelines, therapeutic inertia and suboptimal control persist. Large language models (LLMs) offer a potentially valuable means of augmenting clinical decision-making, yet their reliability for guideline-driven tasks remains unverified. This study evaluated the accuracy and safety of hypertension management recommendations generated by three LLMs compared with expert responses.

Methods: Fifty-one clinical vignettes representing 17 core hypertension management concepts were constructed by hypertension experts. Each case was submitted to three LLMs (GPT-4, Gemini, and MedLM), and a hypertension expert wrote “gold standard” answers for comparison. Three blinded expert reviewers rated each response on a 4-point accuracy scale and a binary safety (safe/unsafe) scale, and attempted to identify the source of each response (LLM vs. expert). Ratings were analyzed using mean scores, percentages of accurate and safe responses, and inter-rater agreement.

Results: GPT-4 had the highest accuracy (83%) and safety (86%) scores among the LLMs but remained inferior to expert responses (92% accuracy, 93% safety). Gemini and MedLM performed significantly worse (accuracy: 64% and 35%; safety: 73% and 39%, respectively). GPT-4 generated the most guideline-concordant responses (46%) among the three LLMs (Gemini 35%, MedLM 14%), but remained lower than the experts’ responses (68%). Evaluators misidentified LLM responses as expert-written in 10–25% of cases, particularly with GPT-4. Inter-rater reliability for accuracy ratings was highest for expert-generated responses (ICC 0.81), with progressively lower agreement for GPT-4 (0.76), Gemini (0.70), and MedLM (0.68). A similar pattern was observed for safety and source-discrimination ratings; agreement was strongest for safety assessments and weakest for source discrimination.
Conclusion: Among the three tested LLMs, GPT-4 demonstrated the closest agreement with expert decisions, showing the greatest potential for supporting hypertension management. However, current LLM versions frequently produce inaccurate or unsafe recommendations and remain inferior to expert judgment. Human-in-the-loop supervision remains essential when deploying LLMs for clinical decision-making.