This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Comparative Accuracy Assessment of Large Language Models in Cardiothoracic Anesthesia: A Performance Analysis of Claude and ChatGPT-4 on Subspecialty Board-Style Questions
0
Citations
3
Authors
2025
Year
Abstract
Background
The integration of artificial intelligence (AI) into healthcare has accelerated rapidly since the public release of ChatGPT (OpenAI, San Francisco, California, United States) in 2022. While large language models (LLMs) have demonstrated proficiency in general medical knowledge and on licensing examinations, their performance in specialized medical subspecialties remains largely unexplored.

Objective
The objective of this study was to compare the accuracy of two prominent LLMs, Claude (Anthropic PBC, San Francisco, California, United States) and ChatGPT-4, in answering cardiothoracic anesthesia board-style questions, and to evaluate their potential for clinical decision support in this subspecialty.

Methods
We developed a Python-based framework to systematically evaluate LLM performance on 100 custom multiple-choice questions covering cardiothoracic anesthesia topics, including arrhythmia management, electrophysiology procedures, pacemaker programming, and perioperative complications. Questions were presented to both Claude and ChatGPT-4 via their respective application programming interfaces (APIs), and responses were compared against expert-validated correct answers. The primary outcome was each model's overall accuracy.

Results
Claude achieved 32% accuracy (32/100 questions), while ChatGPT-4 achieved 23% accuracy (23/100 questions), a difference of nine percentage points (p < 0.05). Both models performed below the threshold typically considered acceptable for clinical decision-making (≥80%). Performance varied across question domains, with both models showing marked difficulty on questions requiring complex electrophysiological reasoning or visual data interpretation (e.g., ECG- and imaging-based cases).

Conclusions
Current LLMs demonstrate limited accuracy at the subspecialty level in cardiothoracic anesthesia, highlighting the need for specialized training datasets and model refinement before clinical implementation. These findings underscore the importance of subspecialty-specific validation before deploying AI tools in specialized medical domains.
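The Methods describe a Python-based framework that presents each multiple-choice question to both models via their APIs and scores the replies against expert-validated keys. A minimal sketch of such a harness might look like the following; the `Question` type, the answer-extraction heuristic, and the stubbed `ask_model` callable are illustrative assumptions, not the authors' actual code (which queried the live Anthropic and OpenAI APIs):

```python
# Sketch of a board-question evaluation harness (assumed structure, not the
# study's code). A real run would replace the ask_model stub with API calls.
from dataclasses import dataclass


@dataclass
class Question:
    stem: str
    choices: dict   # e.g. {"A": "amiodarone", "B": "cardioversion"}
    correct: str    # expert-validated answer key, e.g. "B"


def extract_choice(reply: str, choices: dict):
    """Return the first answer key (e.g. 'A'-'D') found in the model's reply."""
    for token in reply.replace(":", " ").replace(".", " ").split():
        if token.upper() in choices:
            return token.upper()
    return None  # no recognizable choice; scored as incorrect


def score(questions, ask_model):
    """Accuracy (0-1) over the question set; ask_model(q) wraps one API call."""
    hits = sum(extract_choice(ask_model(q), q.choices) == q.correct
               for q in questions)
    return hits / len(questions)


# Stubbed usage: a "model" that always answers B gets 1 of these 2 correct.
demo = [
    Question("Best next step in torsades?",  {"A": "x", "B": "y"}, "B"),
    Question("Most likely pacing mode?",     {"A": "x", "B": "y"}, "A"),
]
print(score(demo, lambda q: "Answer: B"))  # 0.5
```

Reporting only the share of matching keys mirrors the paper's primary outcome, an overall accuracy percentage per model.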
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,521 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,412 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,891 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,575 citations