This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Comparative Performance of Large Language Models on European Gastroenterology Board-Style Questions: Analysis of Reasoning Versus Non-Reasoning Architectures
0
Citations
8
Authors
2026
Year
Abstract
Background: While large language models (LLMs) have demonstrated proficiency in medical examinations, their comparative performance on European gastroenterology assessments remains underexplored, particularly regarding architectural differences between reasoning and non-reasoning models. This study benchmarks five state-of-the-art LLMs (DeepSeek-R1, ChatGPT-o1, ChatGPT-4o, Gemini-1.5-Pro, and Llama-3.1-405B; all versions from January 2025) on 203 board-style questions from validated ESEGH preparation materials.

Methods: Questions from two commercial ESEGH preparation banks were administered five times per model using standardized prompts. Accuracy, consistency, and domain-specific performance across clinical, diagnostic, and therapeutic questions were analyzed. Four practicing gastroenterologists answered the same questions under uniform conditions to establish a human performance baseline.

Results: ChatGPT-o1 achieved the highest overall accuracy at 84.0% (95% CI: 81.8-86.3), followed by ChatGPT-4o (81.7%), DeepSeek-R1 (79.0%), and Llama-3.1-405B (77.2%), while Gemini-1.5-Pro significantly underperformed at 68.5% accuracy (difference vs. ChatGPT-o1: 15.5 percentage points, 95% CI: 11.9 to 19.1, p < 0.01). Although all models exhibited high internal consistency (≥98.4% average agreement across repeated attempts, with 94.6-98.0% of questions answered identically in all five attempts), greater consistency did not necessarily correspond to higher accuracy. Domain-specific analysis revealed that diagnostic questions were answered most accurately, whereas clinical examination questions posed considerable challenges. Topic analysis showed the highest accuracy on small intestine disorders and the lowest on bariatric and pancreatic disorders. Notably, reasoning models, which employ explicit chain-of-thought strategies, outperformed their non-reasoning counterparts (81.5% vs. 75.8%, difference: 5.7 percentage points, 95% CI: 3.4 to 8.0, p < 0.001), particularly on therapy questions and complex bait-and-switch formats. Practicing gastroenterologists achieved substantially lower accuracy (mean: 50.9%, range: 37.9-69.0%) than all LLMs. All models exceeded the current ESEGH passing threshold of 61.5%, with the top four models surpassing it by 15.7-22.5 percentage points.

Conclusions: This benchmarking study demonstrates that current LLMs, particularly those with reasoning architectures, achieve high accuracy on European gastroenterology board-style questions. However, significant performance gaps in specific domains highlight limitations that must be addressed before clinical application. These findings provide a baseline for evaluating LLM capabilities in European medical contexts.
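As a quick plausibility check on the reported statistics, the minimal Python sketch below reproduces the two headline comparisons using a standard Wald interval for the difference of two independent proportions. The denominators are an assumption not stated in the abstract: 203 questions × 5 attempts = 1,015 trials per model, with 2 and 3 models pooled for the reasoning vs. non-reasoning comparison.

```python
import math

def diff_ci(p1: float, p2: float, n1: int, n2: int, z: float = 1.96):
    """Wald 95% CI for the difference of two independent proportions."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

n = 203 * 5  # assumed: 203 questions x 5 standardized attempts per model

# ChatGPT-o1 (84.0%) vs. Gemini-1.5-Pro (68.5%)
d, lo, hi = diff_ci(0.840, 0.685, n, n)
print(f"o1 vs Gemini: {d*100:.1f} pp (95% CI: {lo*100:.1f} to {hi*100:.1f})")

# Reasoning (81.5%, 2 models pooled) vs. non-reasoning (75.8%, 3 models pooled)
d, lo, hi = diff_ci(0.815, 0.758, 2 * n, 3 * n)
print(f"reasoning vs non-reasoning: {d*100:.1f} pp "
      f"(95% CI: {lo*100:.1f} to {hi*100:.1f})")
```

Under these assumed denominators, both intervals come out at 11.9 to 19.1 and 3.4 to 8.0 percentage points, matching the values reported in the abstract.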
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,551 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,443 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,942 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,792 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Authors
Institutions
- Hacettepe University (TR)
- University of Minnesota (US)
- Masaryk Memorial Cancer Institute (CZ)
- University of Minnesota Medical Center (US)
- Palacký University Olomouc (CZ)
- Masaryk University (CZ)
- University Hospital Brno (CZ)
- University Hospital Olomouc (CZ)
- University of Basel (CH)
- IRCCS Humanitas Research Hospital (IT)
- Humanitas University (IT)