OpenAlex · Updated hourly · Last updated: 25 Mar 2026, 03:39

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Performance of ChatGPT-4o, Gemini 2.0 Pro, and DeepSeek-V3 in Patient-Facing Information on Chest Wall Deformities: A Comparative Evaluation of Accuracy, Reliability, and Reproducibility

2026 · 0 citations · Diagnostics · Open Access

Citations: 0 · Authors: 4 · Year: 2026

Abstract

<b>Background</b>: Large language models (LLMs) such as DeepSeek-V3, Google Gemini 2.0 Pro, and ChatGPT-4o are increasingly used by patients seeking online medical information. However, their accuracy, reliability, and reproducibility in patient-facing content related to chest wall deformities (CWD) remain unclear. This study aimed to compare the performance of three contemporary LLMs in generating information on pectus excavatum, pectus carinatum, and related thoracic deformities. <b>Methods</b>: Eighty patient-facing questions were developed across eight thematic domains and independently submitted to each model using newly created accounts over two consecutive days. Accuracy was assessed using a validated four-point rubric by blinded physiatrists, and reproducibility was evaluated using agreement metrics and weighted Cohen's kappa. <b>Results</b>: ChatGPT-4o achieved the highest overall accuracy (median score: 1.20), the greatest proportion of fully accurate responses, and the lowest hallucination rate (5.0%). Gemini showed intermediate accuracy, while DeepSeek-V3 demonstrated the lowest accuracy and highest hallucination rate (11.25%). Across all models, general-information and quality-of-life domains had the best performance, whereas treatment-related questions showed the most errors. Reproducibility was highest for ChatGPT-4o (weighted κ = almost perfect), followed by Gemini and DeepSeek-V3. Inter-rater reliability was substantial (Fleiss' κ = 0.69). <b>Conclusions</b>: Contemporary LLMs can generate largely accurate and reproducible patient-facing information on CWD, with ChatGPT-4o showing the strongest overall performance. This study provides the first domain-specific comparative evaluation of LLMs in CWD and integrates reproducibility analysis alongside accuracy and reliability assessment. While these tools may support patient education, treatment-related responses require caution, and LLMs should be used as adjuncts rather than substitutes for clinical counseling.
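The reproducibility analysis described above rests on weighted Cohen's kappa, which scores agreement between two ratings of the same items while crediting near-misses on an ordinal scale. As a minimal sketch (the rating data below are illustrative, not the study's), a linear-weighted Cohen's kappa can be computed in pure Python:

```python
from collections import Counter

def weighted_kappa(r1, r2, categories):
    """Linear-weighted Cohen's kappa between two ratings of the same items.

    r1, r2     -- equal-length lists of ordinal ratings
    categories -- the ordered rating scale (e.g. the study's 1-4 rubric)
    """
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(r1)
    obs = Counter(zip(r1, r2))          # observed joint rating counts
    m1, m2 = Counter(r1), Counter(r2)   # marginal counts per rater/run
    num = den = 0.0
    for a in categories:
        for b in categories:
            w = abs(idx[a] - idx[b]) / (k - 1)    # linear disagreement weight
            num += w * obs[(a, b)] / n            # observed weighted disagreement
            den += w * (m1[a] / n) * (m2[b] / n)  # chance-expected disagreement
    return 1.0 - num / den

# Hypothetical day-1 vs. day-2 ratings on a 1-4 rubric
day1 = [1, 1, 2, 1, 3, 1, 2, 4]
day2 = [1, 1, 2, 2, 3, 1, 2, 4]
kappa = weighted_kappa(day1, day2, categories=[1, 2, 3, 4])
```

Identical ratings yield κ = 1 (perfect reproducibility); a κ near 0 means the day-to-day agreement is no better than chance, which is why the study reports per-model κ alongside raw accuracy.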
