This is an overview page with metadata for this scientific publication. The full article is available from the publisher.
Assessing the Limitations of Large Language Models in Clinical Practice Guideline–Concordant Treatment Decision-Making on Real-World Data: Retrospective Study
Citations: 2
Authors: 16
Year: 2025
Abstract
Background: Studies have shown that large language models (LLMs) are promising in therapeutic decision-making, with findings comparable to those of medical experts, but these studies used highly curated patient data.

Objective: This study aimed to determine if LLMs can make guideline-concordant treatment decisions based on patient data as typically present in clinical practice (lengthy, unstructured medical text).

Methods: We conducted a retrospective study of 80 patients with severe aortic stenosis who were scheduled for either surgical (SAVR; n=24) or transcatheter aortic valve replacement (TAVR; n=56) by our institutional heart team in 2022. Various LLMs (BioGPT, GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o, LLaMA-2, Mistral, PaLM 2, and DeepSeek-R1) were queried using either anonymized original medical reports or manually generated case summaries to determine the most guideline-concordant treatment. We measured agreement with the heart team using Cohen κ coefficients, reliability using intraclass correlation coefficients (ICCs), and fairness using the frequency bias index (FBI; FBI >1 indicated bias toward TAVR).

Results: When presented with original medical reports, LLMs showed poor performance (Cohen κ coefficient: -0.47 to 0.22; ICC: 0.0-1.0; FBI: 0.95-1.51). The LLMs' performance improved substantially when case summaries were used as input and additional guideline knowledge was added to the prompt (Cohen κ coefficient: -0.02 to 0.63; ICC: 0.01-1.0; FBI: 0.46-1.23). Qualitative analysis revealed instances of hallucinations in all LLMs tested.

Conclusions: Even advanced LLMs require extensively curated input for informed treatment decisions. Unreliable responses, bias, and hallucinations pose significant health risks and highlight the need for caution in applying LLMs to real-world clinical decision-making.
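The abstract's agreement and fairness metrics can be sketched in a few lines of plain Python. This is an illustrative reconstruction, not the study's code or data: the labels below are invented, and the FBI definition used here (how often the model picks TAVR relative to the reference heart team) is an assumption based on the abstract's description that FBI >1 indicates bias toward TAVR.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum(ca[l] * cb[l] for l in labels) / n ** 2    # chance agreement
    return (p_o - p_e) / (1 - p_e)

def frequency_bias_index(pred, ref, target="TAVR"):
    """Assumed FBI: frequency of `target` in predictions vs. reference."""
    return pred.count(target) / ref.count(target)

# Hypothetical decisions for 10 cases (not the study's 80 patients).
heart_team = ["TAVR"] * 7 + ["SAVR"] * 3
llm        = ["TAVR"] * 8 + ["SAVR"] * 2

kappa = cohen_kappa(heart_team, llm)           # ~0.74: substantial agreement
fbi = frequency_bias_index(llm, heart_team)    # ~1.14: leans toward TAVR
```

With this toy data, κ ≈ 0.74 and FBI ≈ 1.14, i.e. the hypothetical model agrees with the heart team on most cases but selects TAVR slightly more often than the reference does.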
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,561 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,452 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,948 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,797 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Authors
Institutions
- Humboldt-Universität zu Berlin (DE)
- Deutsches Herzzentrum der Charité (DE)
- German Centre for Cardiovascular Research (DE)
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin (DE)
- Freie Universität Berlin (DE)
- Charité - Universitätsmedizin Berlin (DE)
- St Bartholomew's Hospital (GB)
- École Polytechnique Fédérale de Lausanne (CH)
- Technische Universität Berlin (DE)