OpenAlex · Updated hourly · Last updated: 25.03.2026, 05:29

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

S1467 Artificial Intelligence in IBD: Comparing the Clinical Utility of Four Leading Large Language Models

2025 · 0 citations · The American Journal of Gastroenterology
Open full text at the publisher

0 citations · 7 authors · Year: 2025

Abstract

Introduction: Artificial intelligence is increasingly being integrated into clinical practice across medical specialties, including gastroenterology. Despite this growing adoption, questions remain about the accuracy and reliability of large language models (LLMs) when addressing complex subspecialty clinical scenarios, particularly in specialized fields like inflammatory bowel disease (IBD). This study aimed to evaluate the performance of 4 prominent LLMs—OpenAI's ChatGPT-4, Anthropic's Claude Sonnet 3.7, Google's Gemini 2.5 Flash, and OpenEvidence—in responding to IBD-related clinical questions. Our study compared their accuracy and reference utilization across both multiple-choice question (MCQ) and open-ended question formats. Methods: Forty-six IBD questions based on current guidelines and clinical knowledge were input into all 4 LLMs between May 15 and May 22, 2025. Each LLM was assessed in both MCQ and open-ended formats. Responses were graded for correctness, with “unable to answer” classified as incorrect. Reference usage was also recorded. Chi-square tests were used to compare accuracy and citation behavior across models. Results: OpenEvidence achieved the highest accuracy in the MCQ format (87.0%), followed by ChatGPT-4 and Claude 3.7 (84.8% each), and Gemini (82.6%) (P < .001). In the open-ended format, Gemini had the highest accuracy (69.6%), followed by OpenEvidence (60.9%), Claude 3.7 (56.5%), and ChatGPT-4 (54.3%) (P < .001). Reference citation varied markedly: OpenEvidence provided references in 100% of cases, while other LLMs cited sources inconsistently or not at all (P < .0001). All LLMs attempted to answer 100% of the questions, but varied significantly in accuracy and citation transparency. Conclusion: LLMs demonstrate wide variability in their ability to answer IBD-related clinical questions. The MCQ format improved performance across all models.
OpenEvidence consistently outperformed other models in both accuracy and citation use, likely due to its training on medical-specific content and its built-in ability to reference sources. These features may enhance the reliability and clinical usefulness of LLMs in gastroenterology. Further studies are needed to optimize their deployment in clinical and educational settings.
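The Methods state that chi-square tests compared accuracy across the four models. As an illustration only (not the authors' analysis code), the MCQ comparison can be sketched as a chi-square test of independence on a 4×2 contingency table, with correct/incorrect counts reconstructed from the stated percentages of 46 questions:

```python
# Minimal sketch of a Pearson chi-square test of independence,
# assuming counts reconstructed from the abstract's MCQ accuracies
# (e.g. 87.0% of 46 ≈ 40 correct). Illustrative, not the study's code.

def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: models; columns: (correct, incorrect) out of 46 MCQ items.
mcq = [
    [40, 6],   # OpenEvidence, 87.0%
    [39, 7],   # ChatGPT-4, 84.8%
    [39, 7],   # Claude 3.7, 84.8%
    [38, 8],   # Gemini 2.5 Flash, 82.6%
]

stat = chi_square_statistic(mcq)
df = (len(mcq) - 1) * (len(mcq[0]) - 1)  # 3 degrees of freedom
print(f"chi2 = {stat:.3f} on {df} df")   # critical value 7.815 at alpha = 0.05
```

The statistic is compared against the chi-square distribution with (rows − 1) × (columns − 1) degrees of freedom; a library routine such as a chi-square contingency test in a statistics package would also return the exact P-value.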



Topics

Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging · COVID-19 diagnosis using AI