This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Large language model performance in clinical cardiology multiple choice questions; has reasoning improved performance?
Citations: 0
Authors: 3
Year: 2026
Abstract
Introduction: Large language models (LLMs) have garnered significant attention across medicine; however, many have struggled with more nuanced clinical challenges. Reasoning models such as GPT-o1 and DeepSeek R1 leverage reinforcement learning and chain-of-thought methodologies, which may improve performance on complex cognitive tasks. Benchmarking these models is challenging, however: although the underlying reasoning is often displayed, the reasoning process used to arrive at an answer is not always clear.

Purpose: This study compares the ability of GPT-4o, GPT-4.5, GPT-o1, DeepSeek, and DeepSeek R1 to accurately answer cardiology multiple choice questions (MCQs) from a commonly used UK cardiology textbook.

Methods: This is a cross-sectional in-silico benchmarking study. A question corpus of 236 text-only questions and 26 image-based questions was drawn from a popular UK, board-level cardiology textbook. Video questions were excluded, and image-based questions were excluded for the DeepSeek models owing to their lack of image-interpretation capabilities. Each model was presented with identical zero-shot prompts, and responses were scored against the textbook answers. In addition, two UK consultant cardiologists reviewed questions answered incorrectly by the majority of models to examine performance deficits. To ensure consistency, reviewed questions were either accepted (in line with accepted guidance and evidence) or rejected and removed from the evaluation. Revised Bloom's Taxonomy was applied to classify all questions.

Results: All models completed the 236 text MCQs (answers A-E); the additional 26 image MCQs were tested with the GPT models only. Owing to evidence updated since the source's publication, five text questions and one image question were removed. Text-only accuracy ranged from 77.5% (GPT-4o) to 82.3% (GPT-o1). Image-based MCQ accuracy ranged from 38% (GPT-4.5) to 53% (GPT-4o). When stratified by Bloom's Taxonomy, performance across all models on knowledge recall (90.0%) was significantly higher than on understanding (68.6%), application (76.2%), analysis (78.9%), and evaluation (75.9%); owing to the nature of MCQs, no questions related to creation. Narrative review by the two consultant cardiologists noted one episode of hallucination, one instance of reasoning unrelated to the answer given, and poor performance on double-negative questions.

Conclusions: Modest performance gains were observed across GPT-4o, GPT-4.5, and GPT-o1, though not between DeepSeek and DeepSeek R1; the reasoning models GPT-o1 and DeepSeek R1 did not significantly outperform their non-reasoning counterparts. Existing literature often benchmarks LLM performance against standardised examination question sets; however, this poses a significant challenge when models are iterative and dynamic. Possible solutions include the use of very large question corpora, physician-graded testing, or real-world comparator testing.
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,214 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,071 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,429 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,418 citations