OpenAlex · Updated hourly · Last updated: 23.03.2026, 04:53

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

How Well Does ChatGPT-4o Reason? Expert Evaluation of Diagnostic and Therapeutic Performance in Hand Surgery

2025 · 0 citations · Journal of Clinical Medicine · Open Access

Citations: 0 · Authors: 6 · Year: 2025

Abstract

Background: The application of large language models (LLMs) in surgical decision-making is rapidly expanding, yet their potential in hand and peripheral nerve surgery remains largely unexplored. This study assessed the diagnostic and therapeutic performance of an LLM (ChatGPT-4o) in scenarios characterized by multiple valid management strategies and no expert consensus.

Methods: Three representative cases, thumb carpometacarpal (CMC I) arthritis, scaphoid nonunion, and carpal tunnel syndrome (CTS), were developed to reflect frequent conditions in hand surgery with competing but accepted treatment options. Each case was submitted to ChatGPT-4o using a standardized prompt. The LLM-generated responses were rated by 52 participants (34 board-certified hand surgeons and 18 residents) on diagnostic accuracy, clinical relevance, and completeness. Readability indices, including the Flesch-Kincaid Grade Level, were analyzed to assess appropriateness for a medical audience.

Results: ChatGPT-4o demonstrated coherent but limited diagnostic accuracy (mean ± SD: 2.9 ± 1.2), moderate clinical relevance (3.5 ± 1.0), and similar completeness (3.4 ± 1.1). Performance was strongest in the standardized scenario (CTS) and weakest in the scenario requiring individualized reasoning (CMC I arthritis). No significant differences were observed between experts and residents (p > 0.05). In higher-level reasoning, ChatGPT-4o likewise performed best in CTS and worst in CMC I arthritis. The readability analysis confirmed professional-level language (mean Flesch-Kincaid Grade Level: 16.4).

Conclusions: ChatGPT-4o shows promise as a supportive tool for diagnostic reasoning and surgical education, particularly where standardized frameworks exist. Its limitations in ambiguous scenarios highlight the ongoing need for expert oversight. Future LLM development should emphasize specialty-specific training and context-aware reasoning to enhance the role of these models in surgical decision support.
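The abstract reports a Flesch-Kincaid Grade Level but does not say how it was computed. For orientation, the standard formula is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The Python sketch below applies that formula using a naive vowel-group syllable counter. The function names and the syllable heuristic are assumptions made for illustration, not the authors' tooling, so its scores will differ somewhat from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): count vowel groups and drop one
    for a silent trailing 'e' (but not for '-le' endings)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    # Sentences are split on runs of terminal punctuation (simplification).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    sample = "Each case was submitted to ChatGPT-4o using a standardized prompt."
    print(f"Estimated grade level: {flesch_kincaid_grade(sample):.1f}")
```

A score near 16, as reported in the abstract, corresponds roughly to text written for readers with a university-level education.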

Topics

Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Radiomics and Machine Learning in Medical Imaging