OpenAlex · Updated hourly · Last updated: 2026-04-03, 07:50

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Accuracy and Knowledge Base Evaluation of ChatGPT-4o, Gemini-2.0-Flash, and DeepSeek-V3 in Metabolic and Bariatric Surgery: an Expert-Rated Blinded Study

2026 · 0 citations · Obesity Surgery · Open Access
Open full text at the publisher

0 citations · 97 authors · Year: 2026

Abstract

Large language models (LLMs) are increasingly applied in medicine; however, their accuracy in guideline-driven, high-stakes specialties, such as metabolic and bariatric surgery (MBS), remains uncertain. This study evaluates the performance of ChatGPT-4o, Gemini 2.0 Flash, and DeepSeek-V3 in generating guideline-concordant responses to MBS clinical questions. Thirty standardized, guideline-based MBS questions were presented to each model. Responses were randomized in order, anonymized (blinded as Model A/B/C), and evaluated by 93 MBS experts using a validated 0–3 scale (0 = inaccurate; 3 = fully guideline-concordant). A repeated-measures ANOVA with Bonferroni correction tested model differences; reliability was assessed with Cronbach's α and intraclass correlation coefficients (ICC). DeepSeek-V3 achieved the highest mean score (2.44 ± 0.40), followed by ChatGPT-4o (1.79 ± 0.46) and Gemini 2.0 Flash (1.63 ± 0.47) (p < 0.001). Fully guideline-concordant ratings (score = 3) were most frequent for DeepSeek (80%) vs. ChatGPT (0%) and Gemini (3.3%). Internal consistency was excellent (α > 0.90), and inter-rater reliability was strong (ICC > 0.88). When mapped against the QUEST evaluation framework, the study addressed Quality and Understanding but did not fully capture Expression, Safety, or Trust dimensions. DeepSeek-V3 outperformed ChatGPT-4o and Gemini 2.0 Flash in generating guideline-concordant responses in MBS. These results highlight the need for ongoing, domain-focused validation before clinical use.

This is the first randomized, blinded evaluation comparing ChatGPT-4o, Gemini 2.0 Flash, and DeepSeek-V3 in metabolic and bariatric surgery (MBS). DeepSeek-V3 achieved the highest accuracy, with 80% of responses rated fully guideline-concordant, surpassing ChatGPT-4o and Gemini 2.0 Flash. Expert agreement was excellent (Cronbach's α > 0.90; ICC > 0.88), reinforcing the reliability of scoring. The study partially aligns with the QUEST framework: Quality and Understanding were addressed; Expression, Safety, and Trust require further evaluation. Findings underscore the need for domain-specific validation of LLMs before clinical integration in MBS.
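The statistical workflow described in the abstract (one-way repeated-measures ANOVA across the three models, followed by Bonferroni-corrected pairwise comparisons) can be sketched in Python. The simulated scores below are illustrative stand-ins drawn to match the reported means and SDs; they are not the study's data, and the function names are this sketch's own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_q = 30  # 30 standardized guideline-based questions
models = ["DeepSeek-V3", "ChatGPT-4o", "Gemini 2.0 Flash"]
# Hypothetical per-question mean expert scores on the 0-3 scale, drawn to
# match the reported means/SDs (2.44±0.40, 1.79±0.46, 1.63±0.47).
scores = np.column_stack([
    np.clip(rng.normal(m, s, n_q), 0, 3)
    for m, s in [(2.44, 0.40), (1.79, 0.46), (1.63, 0.47)]
])

def rm_anova(data):
    """One-way repeated-measures ANOVA: rows = questions, cols = models."""
    n, k = data.shape
    grand = data.mean()
    ss_cond = n * ((data.mean(axis=0) - grand) ** 2).sum()  # between models
    ss_subj = k * ((data.mean(axis=1) - grand) ** 2).sum()  # between questions
    ss_err = ((data - grand) ** 2).sum() - ss_cond - ss_subj
    df_c, df_e = k - 1, (n - 1) * (k - 1)
    f_stat = (ss_cond / df_c) / (ss_err / df_e)
    return f_stat, stats.f.sf(f_stat, df_c, df_e)

f_stat, p_val = rm_anova(scores)
print(f"RM-ANOVA: F = {f_stat:.1f}, p = {p_val:.2g}")

# Bonferroni-corrected post-hoc paired t-tests (3 pairwise comparisons)
pairs = [(0, 1), (0, 2), (1, 2)]
for a, b in pairs:
    p_raw = stats.ttest_rel(scores[:, a], scores[:, b]).pvalue
    p_adj = min(p_raw * len(pairs), 1.0)
    print(f"{models[a]} vs {models[b]}: adjusted p = {p_adj:.2g}")
```

With effect sizes this large (a ~0.8-point gap on a 0–3 scale), the omnibus test is significant at p < 0.001, consistent with the abstract's reported result.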


Topics

Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging · Explainable Artificial Intelligence (XAI)