OpenAlex · Updated hourly · Last updated: 20.03.2026, 22:47

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

How well do large language models interpret ECGs? A comparative benchmark of ChatGPT-4o, Claude, and Gemini using standardised teaching cases

2026 · 0 citations · European Heart Journal - Digital Health · Open Access

0 citations · 6 authors · Year: 2026

Abstract

Background: Large language models (LLMs) such as ChatGPT (OpenAI), Claude (Anthropic), and Gemini (Google DeepMind) are increasingly explored for clinical decision support. However, their ability to accurately interpret electrocardiograms (ECGs), a critical diagnostic tool in cardiology, remains underexamined. This study compared the diagnostic performance of three leading LLMs in interpreting core ECG parameters using validated teaching cases.

Aim: To benchmark and compare the diagnostic accuracy, consistency, and parameter-specific performance of ChatGPT, Claude, and Gemini across a standardised set of ECG interpretation prompts.

Methods: Seventy ECG images were sampled from six volumes of "Podrid's Real World ECGs", encompassing rhythm, rate, axis, intervals, and morphological abnormalities. Each ECG case was paired with a standardised prompt and submitted to each LLM in a new chat instance. LLM responses were scored against the corresponding textbook diagnosis using a standard 7-point rubric (1–7) by two independent reviewers blinded to model identity. Discrepancies were resolved by consensus. Repeated-measures ANOVA tested differences in overall performance, per textbook, and across ECG features. Graphical comparisons were generated in Excel.

Results: Overall, there were no significant performance differences across models: Gemini (6.09 ± 0.33), ChatGPT (6.04 ± 0.33), and Claude (5.41 ± 0.39); p = 0.08. Subgroup analysis showed ChatGPT significantly outperformed Gemini (MD: 2.4; p = 0.0355) and Claude (MD: 4.2; p = 0.0012) in Book 6. Performance varied significantly by ECG feature (F(6,1242) = 58.03, p < 0.001), with no interaction between model and feature (F(12,1242) = 0.52, p = 0.639), indicating shared weaknesses in features such as axis determination and bundle branch block interpretation. Standard deviations often exceeded 0.5, suggesting response variability.

Conclusion: Current general-purpose LLMs demonstrate moderate but variable accuracy in ECG interpretation, with no model achieving consistently superior performance. Their shared struggles with specific ECG parameters highlight a need for cardiology-specific model fine-tuning. Until such refinement occurs, LLMs should be used cautiously for ECG interpretation in educational or clinical settings.
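The between-model comparison in the abstract rests on a one-way repeated-measures ANOVA, with the three models as the within-case factor and each ECG case as a "subject". A minimal sketch of that computation is shown below; the scores are illustrative placeholders on the 7-point rubric, not the study's actual data.

```python
def rm_anova(scores):
    """One-way repeated-measures ANOVA.

    scores: list of per-case score tuples, one column per model
            (within-subject condition), one row per ECG case (subject).
    Returns (F, df_treatment, df_error).
    """
    n = len(scores)        # number of cases (subjects)
    k = len(scores[0])     # number of models (conditions)
    grand = sum(sum(row) for row in scores) / (n * k)
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    row_means = [sum(row) / k for row in scores]

    # Partition total variability: treatment (models), subjects (cases), error
    ss_treat = n * sum((m - grand) ** 2 for m in col_means)
    ss_subj = k * sum((m - grand) ** 2 for m in row_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_treat - ss_subj

    df_treat = k - 1
    df_error = (n - 1) * (k - 1)
    return (ss_treat / df_treat) / (ss_error / df_error), df_treat, df_error


# Hypothetical rubric scores (ChatGPT, Gemini, Claude) for five ECG cases
cases = [(6, 6, 5), (7, 6, 5), (6, 7, 6), (5, 6, 4), (7, 7, 6)]
F, df1, df2 = rm_anova(cases)
print(f"F({df1},{df2}) = {F:.2f}")
```

Removing the between-case variability (`ss_subj`) from the error term is what distinguishes this from an ordinary one-way ANOVA: each model is scored on the same 70 cases, so case-to-case difficulty differences should not inflate the error estimate.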
