OpenAlex · Updated hourly · Last updated: 02.05.2026, 02:17

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

From algorithms to operating room: can large language models master China’s attending anesthesiology exam? A cross-sectional evaluation

2025 · 3 citations · International Journal of Surgery · Open Access

3

Citations

7

Authors

2025

Year

Abstract

OBJECTIVE: The performance of large language models (LLMs) in complex clinical reasoning tasks is not well established. This study compares ChatGPT (GPT-3.5 and GPT-4) and DeepSeek (DeepSeek-V3 and DeepSeek-R1) on the Chinese anesthesiology attending physician examination (CAAPE), aiming to set artificial intelligence (AI) benchmarks in medical assessments and enhance AI-driven medical education. METHODS: This cross-sectional study assessed 4 iterations of 2 major LLMs on the 2025 CAAPE question bank (5647 questions). Testing employed diverse querying strategies and languages, with subgroup analyses by subspecialty, knowledge type, and question format. The focus was on LLM performance in clinical and logical reasoning tasks, measuring accuracy, error types, and response times. RESULTS: DeepSeek-R1 (70.6-73.4%) and GPT-4 (68.6-70.3%) outperformed DeepSeek-V3 (53.1-55.5%) and GPT-3.5 (52.2-55.7%) across all strategies. A system role (SR) improved performance, while joint response degraded it. DeepSeek-R1 outperformed GPT-4 in complex subspecialties, reaching peak accuracy (73.4%) under SR combined with initial response. Generative Pre-trained Transformer (GPT) models performed better with English than with Chinese queries. All models excelled in basic knowledge and Type A1 questions but struggled with clinical scenarios and advanced reasoning. Despite DeepSeek-R1's stronger performance, its response time was longer. Errors were primarily logical and informational (over 70%), with more than half being high-risk clinical errors. CONCLUSION: LLMs show promise in complex clinical reasoning but risk critical errors in high-risk settings. While useful for education and decision support, their error potential must be carefully assessed in high-stakes environments.
