OpenAlex · Updated hourly · Last updated: 25 Apr 2026, 03:24

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comparative Accuracy Assessment of Large Language Models in Cardiothoracic Anesthesia: A Performance Analysis of Claude and ChatGPT-4 on Subspecialty Board-Style Questions

2025 · 0 citations · Cureus · Open Access
Open full text at publisher

Citations: 0 · Authors: 3 · Year: 2025

Abstract

Background: The integration of artificial intelligence (AI) into healthcare has accelerated rapidly since the public release of ChatGPT (OpenAI, San Francisco, California, United States) in 2022. While large language models (LLMs) have demonstrated proficiency in general medical knowledge and on licensing examinations, their performance in specialized medical subspecialties remains largely unexplored.

Objective: To compare the accuracy of two prominent LLMs, Claude (Anthropic PBC, San Francisco, California, United States) and ChatGPT-4, in answering cardiothoracic anesthesia board-style questions, and to evaluate their potential for clinical decision support in this subspecialty.

Methods: We developed a Python-based framework to systematically evaluate LLM performance on 100 custom multiple-choice questions covering cardiothoracic anesthesia topics, including arrhythmia management, electrophysiology procedures, pacemaker programming, and perioperative complications. Questions were presented to both Claude and ChatGPT-4 via their respective application programming interfaces (APIs), and responses were compared against expert-validated correct answers. The primary outcome was the overall accuracy percentage for each model.

Results: Claude achieved 32% accuracy (32/100 questions), while ChatGPT-4 achieved 23% accuracy (23/100 questions), a difference of nine percentage points (p < 0.05). Both models performed below the threshold typically considered acceptable for clinical decision-making (≥80%). Performance varied across question domains, with both models showing marked difficulty on questions requiring complex electrophysiological reasoning and visual data interpretation (e.g., ECG- and imaging-based cases).

Conclusions: Current LLMs demonstrate limited accuracy in subspecialty-level cardiothoracic anesthesia knowledge, highlighting the need for specialized training datasets and model refinement before clinical implementation. These findings underscore the importance of subspecialty-specific validation before deploying AI tools in specialized medical domains.
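The Methods describe a Python framework that presents each multiple-choice question to both models over their APIs and scores replies against an expert-validated key. A minimal sketch of such a harness follows; the `MCQ` structure, the answer-letter parsing, and the stub standing in for a real API call are assumptions for illustration, not the authors' actual code.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MCQ:
    """One board-style multiple-choice question."""
    stem: str
    options: dict   # option letter -> option text
    answer: str     # expert-validated correct letter


def extract_letter(reply: str) -> Optional[str]:
    """Pull the first standalone option letter (A-E) from a model reply."""
    m = re.search(r"\b([A-E])\b", reply)
    return m.group(1) if m else None


def score(questions: list, query_model: Callable[[str], str]) -> float:
    """Ask each question via query_model and return fractional accuracy."""
    correct = 0
    for q in questions:
        prompt = q.stem + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in sorted(q.options.items())
        )
        if extract_letter(query_model(prompt)) == q.answer:
            correct += 1
    return correct / len(questions)


# In the study, query_model would wrap an Anthropic or OpenAI API client;
# here a stub that always answers "A" stands in, so the sketch runs offline.
if __name__ == "__main__":
    qs = [
        MCQ("Which pacing mode ...?", {"A": "DDD", "B": "VVI", "C": "AAI"}, "A"),
        MCQ("First-line agent for ...?", {"A": "x", "B": "y", "C": "z"}, "B"),
    ]
    print(f"Accuracy: {score(qs, lambda prompt: 'A'):.0%}")  # prints "Accuracy: 50%"
```

Parsing a single option letter out of free-form model text is the fragile step in any such harness; the regex above is one simple choice, and a production evaluation would constrain the models to answer with a letter only.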


Topics

Artificial Intelligence in Healthcare and Education · Cardiac, Anesthesia and Surgical Outcomes · Hip and Femur Fractures