OpenAlex · Updated hourly · Last updated: 18.03.2026, 23:05

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Performance of Large Language Models in Metabolic Bariatric Surgery: a Comparative Study

2025 · 0 citations · 11 authors · Obesity Surgery · Open Access

Abstract

The rapid integration of Large Language Models (LLMs) into healthcare necessitates a rigorous evaluation of their performance in specialized medical fields. In metabolic bariatric surgery (MBS), LLMs have the potential to revolutionize education and clinical support, yet their accuracy and reliability are not well-established. This study provides a critical assessment of the capabilities of current LLMs in the context of MBS.

This cross-sectional validation study assessed the performance of six LLMs (ChatGPT-3.5, ChatGPT-4o, Gemini, Copilot, GROK, and DeepSeek) in answering 100 evidence-based binary and multiple-choice questions related to MBS. Questions were constructed from international guidelines and categorized into six thematic domains. Expert consensus answers served as the reference standard, with inter-rater reliability measured using Fleiss’ κ. Model outputs were scored for accuracy. Comparisons across LLMs were first assessed using an overall test for differences between multiple related groups. Pairwise comparisons were then conducted between LLMs to identify specific differences in performance.

Across the dataset, the mean number of correct LLM responses per question was 3.9 (SD = 1.8). ChatGPT-4o achieved the highest accuracy (66.0%), while DeepSeek recorded the lowest (60.0%). Accuracy varied across domains, highest for indications/contraindications (78.7%) and complications/management (68.0%), and lowest for preoperative preparation (52.0%) and postoperative care (58.4%). Binary questions yielded higher accuracy (69.1%) than multiple-choice questions (62.0%). Inter-expert reliability was substantial (κ = 0.742, 95% CI: 0.71–0.77). Agreement between LLMs and experts ranged from fair (DeepSeek κ = 0.349) to moderate (ChatGPT-4o κ = 0.446). No significant accuracy differences were detected across models (Friedman test, p = 0.662).

LLMs represent a promising, yet imperfect, adjunct in MBS education. Their utility is currently limited by inconsistencies in accuracy, particularly in areas requiring nuanced clinical judgment. While these models can supplement traditional learning resources, they are not yet a substitute for expert clinical guidance. This study underscores the need for continued refinement and validation of LLMs to ensure their safe and effective integration into clinical practice. LLMs show moderate accuracy in bariatric surgery education, strongest in guideline-based domains. Newer models (ChatGPT-4o, Gemini, Copilot) performed slightly better, but gains were modest. Accuracy was higher for binary than multiple-choice questions.
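The analysis described in the abstract (per-model accuracy over the same 100 questions, compared with an overall Friedman test across the six related samples) can be sketched as follows. This is a minimal illustration with synthetic placeholder data, not the study's actual scoring sheet or results; model names are taken from the abstract.

```python
# Sketch of the model-comparison step: six related correctness vectors
# (one per model, over the same 100 questions), compared with a Friedman
# test. The 0/1 data below are randomly generated placeholders.
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
models = ["ChatGPT-3.5", "ChatGPT-4o", "Gemini", "Copilot", "GROK", "DeepSeek"]

# 100 questions x 6 models: 1 = correct answer, 0 = incorrect (synthetic)
scores = rng.integers(0, 2, size=(100, len(models)))

# Per-model accuracy, as reported in the study's results section
for name, col in zip(models, scores.T):
    print(f"{name}: {col.mean():.1%}")

# Overall test for differences between the six related samples
stat, p = friedmanchisquare(*scores.T)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.3f}")
```

With real data, a non-significant p-value (as in the study, p = 0.662) would mean the observed accuracy gaps between models are consistent with chance, so pairwise follow-up comparisons would not be expected to reveal robust differences.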
