OpenAlex · Updated hourly · Last updated: 18.03.2026, 15:45

This is an overview page with metadata for this scientific work. The full article is available from the publisher.

Evaluation of Large Language Models in the Clinical Management of Patients With Upper Gastrointestinal Bleeding: Insights From Real-World Patient Data

2025 · 0 Citations · Open Access

0 Citations · 8 Authors · Year: 2025

Abstract

Objective: Upper gastrointestinal bleeding (UGIB) is a life-threatening emergency requiring rapid risk assessment, and current scoring tools have limited accuracy. Large language models (LLMs) may support clinical decision-making, but their role in UGIB management is unclear. This study evaluated LLMs for patient risk classification, prediction of endoscopic findings, and alignment with routine clinical decision-making.

Methods: In this retrospective study, we analyzed electronic health records (EHRs) of 384 UGIB patients presenting to two referral centers in Karaj, Iran, between March and December 2024. Included cases underwent upper gastrointestinal endoscopy; incomplete records were excluded. Five LLMs (GPT-5, Llama 4, Gemini-2.5-Flash, DeepSeek R1, and Grok) were assessed using in-context learning for (i) risk classification, (ii) prediction of probable endoscopic findings, and (iii) clinical justification generation. Performance metrics included accuracy, precision, recall, and F1-score, compared against conventional machine learning models. Two gastroenterologists independently assessed justifications across seven domains: relevance, clarity, originality, completeness, specificity, correctness, and consistency.

Results: All LLMs outperformed conventional models (highest baseline accuracy 0.54). GPT-5 achieved the highest risk classification accuracy (0.66), followed by Llama 4 (0.64). Grok performed best in predicting endoscopic findings (0.32). Gastroenterologists noted variability in reasoning: GPT-5 and Grok provided the most complete justifications, though GPT-5 occasionally over-classified urgent cases. Llama 4 and Gemini-2.5-Flash were less specific, while DeepSeek R1 offered detailed patient summaries but lacked predictive outputs.
Conclusions: LLMs improved UGIB risk prediction and generated interpretive reasoning, but accuracy limitations, inconsistent reasoning, and occasional risk misclassification highlight the need for clinician oversight and prospective validation before clinical use.

Key Messages

What is already known on this topic: UGIB is a medical emergency requiring rapid risk stratification and timely management. LLMs are promising tools for clinical decision support, but their role in UGIB management remains unclear.

What this study adds: LLMs can improve risk prediction and interpretive reasoning in UGIB, but limitations in accuracy, inconsistent reasoning, and occasional misclassification highlight the need for clinician oversight and prospective validation.

How this study might affect research, practice, or policy: LLMs provide structured, human-readable explanations that could support clinical decision-making, potentially reducing unnecessary emergency endoscopies, improving care efficiency, and alleviating physician workload.
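The abstract reports accuracy, precision, recall, and F1-score for multi-class risk classification. The sketch below shows how such macro-averaged metrics are conventionally computed; it is a minimal illustration in plain Python, not the study's actual evaluation pipeline, and the risk labels and predictions shown are invented for demonstration.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for lbl in labels:
        # Per-class counts: true positives, false positives, false negatives.
        tp = sum(t == lbl and p == lbl for t, p in zip(y_true, y_pred))
        fp = sum(t != lbl and p == lbl for t, p in zip(y_true, y_pred))
        fn = sum(t == lbl and p != lbl for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    # Macro averaging weights every class equally, regardless of prevalence.
    return {"accuracy": accuracy,
            "precision": sum(precisions) / n,
            "recall": sum(recalls) / n,
            "f1": sum(f1s) / n}

# Hypothetical risk categories for six patients (not from the study data):
y_true = ["low", "high", "moderate", "high", "low", "moderate"]
y_pred = ["low", "high", "high", "high", "low", "moderate"]
m = classification_metrics(y_true, y_pred)
```

Macro averaging is a reasonable default here because urgent (high-risk) cases are typically a minority class, and plain accuracy alone would understate errors on them.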
