This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Comparative Performance of Large Language Models on European Gastroenterology Board-Style Questions: Analysis of Reasoning Versus Non-Reasoning Architectures
0
Citations
8
Authors
2026
Year
Abstract
Background: While large language models (LLMs) have demonstrated proficiency in medical examinations, their comparative performance on European gastroenterology assessments remains underexplored, particularly regarding architectural differences between reasoning and non-reasoning models. This study benchmarks five state-of-the-art LLMs (DeepSeek-R1, ChatGPT-o1, ChatGPT-4o, Gemini-1.5-Pro, and Llama-3.1-405B; all versions from January 2025) on 203 board-style questions from validated ESEGH preparation materials.

Methods: Questions from two commercial ESEGH preparation banks were administered five times per model using standardized prompts. Accuracy, consistency, and domain-specific performance across clinical, diagnostic, and therapeutic questions were analyzed. Four practicing gastroenterologists answered the same questions under uniform conditions to establish a human performance baseline.

Results: ChatGPT-o1 achieved the highest overall accuracy at 84.0% (95% CI: 81.8-86.3), followed by ChatGPT-4o (81.7%), DeepSeek-R1 (79.0%), and Llama-3.1-405B (77.2%), while Gemini-1.5-Pro significantly underperformed at 68.5% accuracy (difference vs. ChatGPT-o1: 15.5 percentage points, 95% CI: 11.9 to 19.1, p < 0.01). Although all models exhibited high internal consistency (≥98.4% average agreement across repeated attempts, with 94.6-98.0% of questions answered identically in all five attempts), greater consistency did not necessarily correspond to higher accuracy. Domain-specific analysis revealed that diagnostic questions were answered most accurately, whereas clinical examination questions posed considerable challenges. Topic analysis showed the highest accuracy on small intestine disorders and the lowest on bariatric and pancreatic disorders. Notably, reasoning models, which employ explicit chain-of-thought strategies, outperformed their non-reasoning counterparts (81.5% vs. 75.8%, difference: 5.7 percentage points, 95% CI: 3.4 to 8.0, p < 0.001), particularly on therapy questions and complex bait-and-switch formats. Practicing gastroenterologists achieved substantially lower accuracy (mean: 50.9%, range: 37.9-69.0%) than all LLMs. All models exceeded the current ESEGH passing threshold of 61.5%, with the top four models surpassing it by 15.7-22.5 percentage points.

Conclusions: This benchmarking study demonstrates that current LLMs, particularly those with reasoning architectures, achieve high accuracy on European gastroenterology board-style questions. However, significant performance gaps in specific domains highlight limitations that must be addressed before clinical application. These findings provide a baseline for evaluating LLM capabilities in European medical contexts.
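As a quick plausibility check on the reported statistics, the minimal Python sketch below reproduces the two headline comparisons using a standard Wald interval for the difference of two independent proportions. The denominators are an assumption not stated in the abstract: 203 questions × 5 attempts = 1,015 trials per model, with 2 and 3 models pooled for the reasoning vs. non-reasoning comparison.

```python
import math

def diff_ci(p1: float, p2: float, n1: int, n2: int, z: float = 1.96):
    """Wald 95% CI for the difference of two independent proportions."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

n = 203 * 5  # assumed: 203 questions x 5 standardized attempts per model

# ChatGPT-o1 (84.0%) vs. Gemini-1.5-Pro (68.5%)
d, lo, hi = diff_ci(0.840, 0.685, n, n)
print(f"o1 vs Gemini: {d*100:.1f} pp (95% CI: {lo*100:.1f} to {hi*100:.1f})")

# Reasoning (81.5%, 2 models pooled) vs. non-reasoning (75.8%, 3 models pooled)
d, lo, hi = diff_ci(0.815, 0.758, 2 * n, 3 * n)
print(f"reasoning vs non-reasoning: {d*100:.1f} pp "
      f"(95% CI: {lo*100:.1f} to {hi*100:.1f})")
```

Under these assumed denominators, both intervals come out at 11.9 to 19.1 and 3.4 to 8.0 percentage points, matching the values reported in the abstract.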
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,551 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,443 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,942 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,792 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Authors
Institutions
- Hacettepe University (TR)
- University of Minnesota (US)
- Masaryk Memorial Cancer Institute (CZ)
- University of Minnesota Medical Center (US)
- Palacký University Olomouc (CZ)
- Masaryk University (CZ)
- University Hospital Brno (CZ)
- University Hospital Olomouc (CZ)
- University of Basel (CH)
- IRCCS Humanitas Research Hospital (IT)
- Humanitas University (IT)