This is an overview page with metadata for this scientific article. The full article is available from the publisher.
S1467 Artificial Intelligence in IBD: Comparing the Clinical Utility of Four Leading Large Language Models
Citations: 0
Authors: 7
Year: 2025
Abstract
Introduction: Artificial intelligence is increasingly being integrated into clinical practice across medical specialties, including gastroenterology. Despite this growing adoption, questions remain about the accuracy and reliability of large language models (LLMs) when addressing complex subspecialty clinical scenarios, particularly in specialized fields such as inflammatory bowel disease (IBD). This study aimed to evaluate the performance of 4 prominent LLMs—OpenAI's ChatGPT-4, Anthropic's Claude Sonnet 3.7, Google's Gemini 2.5 Flash, and OpenEvidence—in responding to IBD-related clinical questions. Our study compared their accuracy and reference utilization across both multiple-choice question (MCQ) and open-ended question formats. Methods: Forty-six IBD questions based on current guidelines and clinical knowledge were input into all 4 LLMs between May 15 and May 22, 2025. Each LLM was assessed in both MCQ and open-ended formats. Responses were graded for correctness, with "unable to answer" classified as incorrect. Reference usage was also recorded. Chi-square tests were used to compare accuracy and citation behavior across models. Results: OpenEvidence achieved the highest accuracy in the MCQ format (87.0%), followed by ChatGPT-4 and Claude 3.7 (84.8% each), and Gemini (82.6%) (P < .001). In the open-ended format, Gemini had the highest accuracy (69.6%), followed by OpenEvidence (60.9%), Claude 3.7 (56.5%), and ChatGPT-4 (54.3%) (P < .001). Reference citation varied markedly: OpenEvidence provided references in 100% of cases, whereas the other LLMs cited sources inconsistently or not at all (P < .0001). All LLMs attempted to answer 100% of the questions but varied significantly in accuracy and citation transparency. Conclusion: LLMs demonstrate wide variability in their ability to answer IBD-related clinical questions. The MCQ format improves performance across all models.
OpenEvidence consistently outperformed the other models in both accuracy and citation use, likely owing to its training on medical-specific content and its built-in ability to reference sources. These features may enhance the reliability and clinical usefulness of LLMs in gastroenterology. Further studies are needed to optimize their deployment in clinical and educational settings.
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,303 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,155 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,555 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,453 citations