This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
Performance of Large Language Models in Metabolic Bariatric Surgery: a Comparative Study
0
Citations
11
Authors
2025
Year
Abstract
The rapid integration of Large Language Models (LLMs) into healthcare necessitates rigorous evaluation of their performance in specialized medical fields. In metabolic bariatric surgery (MBS), LLMs have the potential to revolutionize education and clinical support, yet their accuracy and reliability are not well established. This study provides a critical assessment of the capabilities of current LLMs in the context of MBS.

This cross-sectional validation study assessed the performance of six LLMs (ChatGPT-3.5, ChatGPT-4o, Gemini, Copilot, GROK, and DeepSeek) in answering 100 evidence-based binary and multiple-choice questions related to MBS. Questions were constructed from international guidelines and categorized into six thematic domains. Expert consensus answers served as the reference standard, with inter-rater reliability measured using Fleiss' κ. Model outputs were scored for accuracy. Differences across LLMs were first assessed with an overall test for multiple related groups, followed by pairwise comparisons between LLMs to identify specific differences in performance.

Across the dataset, the mean number of correct LLM responses per question was 3.9 (SD = 1.8). ChatGPT-4o achieved the highest accuracy (66.0%), while DeepSeek recorded the lowest (60.0%). Accuracy varied across domains: highest for indications/contraindications (78.7%) and complications/management (68.0%), and lowest for preoperative preparation (52.0%) and postoperative care (58.4%). Binary questions yielded higher accuracy (69.1%) than multiple-choice questions (62.0%). Inter-expert reliability was substantial (κ = 0.742, 95% CI: 0.71–0.77). Agreement between LLMs and experts ranged from fair (DeepSeek, κ = 0.349) to moderate (ChatGPT-4o, κ = 0.446). No significant accuracy differences were detected across models (Friedman test, p = 0.662).

LLMs represent a promising, yet imperfect, adjunct in MBS education. Their utility is currently limited by inconsistencies in accuracy, particularly in areas requiring nuanced clinical judgment. While these models can supplement traditional learning resources, they are not yet a substitute for expert clinical guidance. This study underscores the need for continued refinement and validation of LLMs to ensure their safe and effective integration into clinical practice. In summary: LLMs show moderate accuracy in bariatric surgery education, strongest in guideline-based domains; newer models (ChatGPT-4o, Gemini, Copilot) performed slightly better, but gains were modest; and accuracy was higher for binary than for multiple-choice questions.
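The study's inter-rater reliability statistic, Fleiss' κ, measures chance-corrected agreement among a fixed number of raters assigning items to categories. As an illustration only (not the authors' code, and using made-up rating counts rather than the study's data), a minimal pure-Python sketch of the statistic looks like this:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of rating counts.

    ratings: one row per item; each row holds the number of raters who
    assigned that item to each category. Every row must sum to the same
    number of raters n. Returns kappa in [-1, 1], where 1 is perfect
    agreement and 0 is chance-level agreement.
    """
    N = len(ratings)          # number of items
    n = sum(ratings[0])       # raters per item
    k = len(ratings[0])       # number of categories

    # Mean per-item agreement: fraction of concordant rater pairs.
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings
    ) / N

    # Expected agreement by chance, from the category marginals.
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)

    return (P_bar - P_e) / (1 - P_e)


# Hypothetical example: 3 items, 3 raters, 2 categories (correct/incorrect).
example = [[3, 0], [0, 3], [3, 0]]
print(fleiss_kappa(example))  # perfect agreement -> 1.0
```

A κ of 0.742, as reported for the expert panel, falls in the conventional "substantial agreement" band (0.61–0.80); the LLM–expert values of 0.349 and 0.446 fall in the "fair" and "moderate" bands, respectively.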
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,250 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,109 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,482 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,434 citations