This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Accuracy and Knowledge Base Evaluation of ChatGPT-4o, Gemini-2.0-Flash, and DeepSeek-V3 in Metabolic and Bariatric Surgery: an Expert-Rated Blinded Study
Citations: 0
Authors: 97
Year: 2026
Abstract
Large language models (LLMs) are increasingly applied in medicine; however, their accuracy in guideline-driven, high-stakes specialties, such as metabolic and bariatric surgery (MBS), remains uncertain. This study evaluates the performance of ChatGPT-4o, Gemini 2.0 Flash, and DeepSeek-V3 in generating guideline-concordant responses to MBS clinical questions.

Thirty standardized, guideline-based MBS questions were presented to each model. Responses were randomized in order, anonymized (blinded as Model A/B/C), and evaluated by 93 MBS experts using a validated 0–3 scale (0 = inaccurate; 3 = fully guideline-concordant). A repeated-measures ANOVA with Bonferroni correction tested model differences; reliability was assessed with Cronbach’s α and intraclass correlation coefficients (ICC).

DeepSeek-V3 achieved the highest mean score (2.44 ± 0.40), followed by ChatGPT-4o (1.79 ± 0.46) and Gemini 2.0 Flash (1.63 ± 0.47) (p < 0.001). Fully guideline-concordant ratings (score = 3) were most frequent for DeepSeek-V3 (80%) vs. ChatGPT-4o (0%) and Gemini 2.0 Flash (3.3%). Internal consistency was excellent (α > 0.90), and inter-rater reliability was strong (ICC > 0.88). When mapped against the QUEST evaluation framework, the study addressed Quality and Understanding but did not fully capture the Expression, Safety, or Trust dimensions. DeepSeek-V3 outperformed ChatGPT-4o and Gemini 2.0 Flash in generating guideline-concordant responses in MBS. These results highlight the need for ongoing, domain-focused validation before clinical use.

This is the first randomized, blinded evaluation comparing ChatGPT-4o, Gemini 2.0 Flash, and DeepSeek-V3 in metabolic and bariatric surgery (MBS). DeepSeek-V3 achieved the highest accuracy, with 80% of responses rated fully guideline-concordant, surpassing ChatGPT-4o and Gemini 2.0 Flash. Expert agreement was excellent (Cronbach’s α > 0.90; ICC > 0.88), reinforcing the reliability of the scoring. The study partially aligns with the QUEST framework: Quality and Understanding were addressed; Expression, Safety, and Trust require further evaluation. The findings underscore the need for domain-specific validation of LLMs before clinical integration in MBS.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,380 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,243 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,671 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,496 citations
Authors
- Mohamed Hany
- Mohamed H. Zidan
- Chetan Parmar
- Shahab Shahabi Shahmiri
- Hashem Altabbaa
- Ahmed El-Shamarka
- Ahmed Amgad
- Islam Abdelkhalek
- Abdullah A. Assal
- Marwan Emad Abdou
- Mohammad Kermansaravi
- MBS Global Expert Collaborative Group
- Abdelrahman Nimeri
- Adel Abou-Mrad
- Ahmed Abokhozima
- Ala Wafa
- Amir Davarpanah Jazi
- André Lázaro
- Andrea Schroeder
- Andrew G Robertson
- Angelo Iossa
- Anıl Ergin
- Anna Casajoana
- Anwar Ashraf Abouelnasr
- Aparna Govil Bhasker
- Ashraf Haddad
- Asim Shabbir
- Benjamin Clapp
- Carlos Augusto Scussel Madalosso
- Carlos Padrón
- Cem Emir Güldoğan
- Christine Stier
- Cüneyt KIRKIL
- Daniel Moritz Felsenreich
- Richa Jaiswal
- Ebrahim Aghajani
- Estuardo Behrens
- Farah A. Husain
- Farnaz Rahimi
- Ghulam Siddiq
- Giovanni Lezoche
- Heykel Mebarek
- Hosam Mohamed Mostafa Elghadban
- Ivaylo Tzvetkov
- Karl Peter Rheinwalt
- Kazunori Kasama
- Levon N. Grigoryan
- Maria Antonieta Barrera
- Mariano Palermo
- Masoud Rezvani
- M. Di Paola
- Michael Talbot
- Michel Gagner
- Michel Vix
- Miguel-A Carbajo
- Mohamad Hayssam ElFawal
- Mohamed Ibrahim Bahnasy
- Mohamed Mokhtar Arafat
- Mohamed Ammar
- Mousa Khoursheed
- Natán Zundel
- Nikolaos Pararas
- Nuru Bayramov
- Otto Montoya
- Panagiotis Laïnas
- Paolo Gentileschi
- Patrick Noel
- Paulina Salminen
- Piotr Major
- Ramen Goel
- Rob Snoekx
- Rodolfo Rodolfo
- Rodrigue Chemaly
- Rudolf Weiner
- Rui José Silva Ribeiro
- Ruth Blackham
- Salvatore Tolone
- Samer G. Mattar
- Sara Gaafar Ibnauf Suliman
- Sergio Carandina
- Sergio Aparicio
- Sergio Verboonen
- Silvana Leanza
- Silvia Leite
- Sjaak Pouwels
- Sonja Chiappetta
- Stefano Olmi
- Suhaib Ahmad
- Tadeja Pintar
- Tarek Hassab
- Tigran Poghosyan
- Tuna BILECIK
- Valdemir José Alegre Salles
- Vasileios Charalampakis
- Wah Yang
- Yannick Nijs
- Yves Borbély