OpenAlex · Updated hourly · Last updated: 13.03.2026, 07:22

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Large language model performance in clinical cardiology multiple choice questions; has reasoning improved performance?

2026 · 0 citations · European Heart Journal - Digital Health · Open Access
Open full text at publisher

Citations: 0 · Authors: 3 · Year: 2026

Abstract

Introduction: Large language models (LLMs) have garnered significant attention in applications throughout medicine; however, many have struggled with more nuanced clinical challenges. Reasoning models such as GPT-o1 and DeepSeek R1 leverage reinforcement learning and chain-of-thought methodologies, which potentially improve performance in complex cognitive tasks. Benchmarking these models is challenging, however, and whilst the underlying reasoning is often displayed, the reasoning process used to arrive at an answer is not always clear.

Purpose: This study aims to compare the ability of GPT-4o, GPT-4.5, GPT-o1, DeepSeek, and DeepSeek R1 to accurately answer cardiology multiple choice questions (MCQs) from a commonly used UK cardiology textbook.

Methods: This is a cross-sectional in-silico benchmarking study. A question corpus of 236 text-only questions and 26 image-based questions was drawn from a popular UK-based, board-level cardiology textbook. Video questions were excluded. Image-based queries were excluded for the DeepSeek models due to their lack of image interpretation capabilities. Each model was presented with identical zero-shot prompts, and responses were scored against the textbook answers. Additionally, two UK consultant cardiologists reviewed questions answered incorrectly by the majority of models to examine performance deficits. To ensure consistency, reviewed questions were either accepted (in line with accepted guidance and evidence) or rejected and removed from the evaluation. Revised Bloom's Taxonomy was applied to classify all questions.

Results: All models completed the 236 text MCQs (answers A-E); the additional 26 image MCQs were tested with the GPT models only. Due to evidence updated since the source's publication, five text questions and one image question were removed. Text-only accuracy ranged from 77.5% (GPT-4o) to 82.3% (GPT-o1). Image-based MCQ accuracy ranged from 38% (GPT-4.5) to 53% (GPT-4o). When divided by Bloom's Taxonomy, performance across all models in knowledge recall (90.0%) was significantly higher than in understanding (68.6%), application (76.2%), analysis (78.9%), and evaluation (75.9%); due to the nature of MCQs, no questions related to creation. Narrative review by the two consultant cardiologists noted one episode of hallucination, one instance of reasoning unrelated to the given answer, and poor performance on double-negative questions.

Conclusions: Modest performance gains were observed across GPT-4o, GPT-4.5, and GPT-o1, though not between DeepSeek and DeepSeek R1. The reasoning models GPT-o1 and DeepSeek R1 did not significantly outperform their non-reasoning counterparts. Existing literature often benchmarks LLM performance against standardised examination question sets; however, this poses a significant challenge when models are iterative and dynamic. Possible solutions include the use of very large question corpora, physician-graded testing, or real-world comparator testing.
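The scoring step described in the Methods (comparing zero-shot model answers to textbook keys, then breaking accuracy down by Bloom's Taxonomy level) can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the authors' actual code; the function name, data layout, and the mini example run are all hypothetical.

```python
from collections import defaultdict

def score_mcq_run(results):
    """Compute overall and per-Bloom-level accuracy from a list of
    (bloom_level, model_answer, correct_answer) tuples.

    Hypothetical helper illustrating the scoring step only."""
    total = correct = 0
    by_level = defaultdict(lambda: [0, 0])  # level -> [correct, attempted]
    for level, given, expected in results:
        total += 1
        by_level[level][1] += 1
        # Case-insensitive comparison of the single answer letter (A-E)
        if given.strip().upper() == expected.strip().upper():
            correct += 1
            by_level[level][0] += 1
    overall = correct / total if total else 0.0
    per_level = {lvl: c / n for lvl, (c, n) in by_level.items()}
    return overall, per_level

# Hypothetical mini-run (not the study's data):
run = [
    ("recall", "A", "A"),
    ("recall", "B", "B"),
    ("application", "C", "D"),
    ("analysis", "E", "E"),
]
overall, per_level = score_mcq_run(run)
# overall = 0.75; per_level["application"] = 0.0
```

In the study's setting, `results` would hold one entry per question per model, letting the same pass produce both the headline accuracy and the Bloom-level breakdown reported in the Results.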

Similar works

Authors

Institutions

Topics

Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)