This is an overview page with metadata for this scientific article. The full article is available from the publisher.
How well do large language models interpret ECGs? A comparative benchmark of ChatGPT-4o, Claude, and Gemini using standardised teaching cases
0
Citations
6
Authors
2026
Year
Abstract
Background
Large language models (LLMs) such as ChatGPT (OpenAI), Claude (Anthropic), and Gemini (Google DeepMind) are increasingly explored for clinical decision support. However, their ability to accurately interpret electrocardiograms (ECGs), a critical diagnostic tool in cardiology, remains underexamined. This study compared the diagnostic performance of three leading LLMs in interpreting core ECG parameters using validated teaching cases.

Aim
To benchmark and compare the diagnostic accuracy, consistency, and parameter-specific performance of ChatGPT, Claude, and Gemini across a standardised set of ECG interpretation prompts.

Methods
Seventy ECG images were sampled from six volumes of "Podrid's Real World ECGs", encompassing rhythm, rate, axis, intervals, and morphological abnormalities. Each ECG case was paired with a standardised prompt and submitted to each LLM in a new chat instance. LLM responses were scored against the corresponding textbook diagnosis using a standard 7-point rubric (1–7), with two independent reviewers blinded to model identity. Discrepancies were resolved by consensus. Repeated-measures ANOVA tested differences in overall performance, per textbook, and across ECG features. Graphical comparisons were generated in Excel.

Results
Overall, there were no significant performance differences across models: Gemini (6.09 ± 0.33), ChatGPT (6.04 ± 0.33), and Claude (5.41 ± 0.39); p = 0.08. Subgroup analysis showed ChatGPT significantly outperformed Gemini (MD: 2.4; p = 0.0355) and Claude (MD: 4.2; p = 0.0012) in Book 6. Performance varied significantly by ECG feature (F(6, 1242) = 58.03, p < 0.001), with no interaction between model and feature (F(12, 1242) = 0.52, p = 0.639), indicating shared weaknesses in features such as axis determination and bundle branch block interpretation. Standard deviations often exceeded 0.5, suggesting response variability.
Conclusion
Current general-purpose LLMs demonstrate moderate but variable accuracy in ECG interpretation, with no model achieving consistently superior performance. Their shared struggles with specific ECG parameters highlight a need for cardiology-specific model fine-tuning. Until such refinement occurs, LLMs should be used cautiously for ECG interpretation in educational or clinical settings.
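The abstract reports repeated-measures ANOVA results (e.g. F(6, 1242) = 58.03) comparing rubric scores across models and features. As a minimal illustration of that analysis, the sketch below computes the one-way repeated-measures F statistic for a model effect from a cases-by-models score matrix; the function name and toy data are assumptions for illustration, not the authors' code or data.

```python
import numpy as np

def rm_anova_f(scores):
    """One-way repeated-measures ANOVA F statistic for the model effect.

    scores: (n_cases, k_models) array of rubric scores (1-7),
    where each row is one ECG case rated once per model.
    """
    n, k = scores.shape
    grand = scores.mean()
    # Between-model sum of squares (the effect of interest).
    ss_models = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    # Between-case sum of squares (removed, since cases are repeated).
    ss_cases = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_total = ((scores - grand) ** 2).sum()
    ss_error = ss_total - ss_models - ss_cases
    df_models = k - 1
    df_error = (n - 1) * (k - 1)
    return (ss_models / df_models) / (ss_error / df_error)

# Hypothetical toy scores: 4 ECG cases rated for 3 models.
toy = np.array([[7, 5, 3],
                [6, 4, 2],
                [7, 5, 2],
                [6, 4, 3]], dtype=float)
print(rm_anova_f(toy))  # large F: model means differ strongly
```

In practice one would compare the resulting F against the F distribution with (k-1, (n-1)(k-1)) degrees of freedom (e.g. via `scipy.stats.f.sf`) to obtain the p-values reported in the abstract.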
Similar works
A Real-Time QRS Detection Algorithm
1985 · 7,624 citations
An Overview of Heart Rate Variability Metrics and Norms
2017 · 6,389 citations
Power Spectrum Analysis of Heart Rate Fluctuation: A Quantitative Probe of Beat-to-Beat Cardiovascular Control
1981 · 5,055 citations
The impact of the MIT-BIH Arrhythmia Database
2001 · 4,502 citations
Decreased heart rate variability and its association with increased mortality after acute myocardial infarction
1987 · 3,989 citations