This is an overview page with metadata for this scientific work. The full article is available from the publisher.
AI vs the heart: can chatGPT-4o reliably detect cardiomegaly on chest X-Rays?
Citations: 0 · Authors: 7 · Year: 2026
Abstract
Background
Large language models (LLMs) have shown promise in clinical decision support, yet their image-interpretation capabilities remain under-explored. Emerging use in clinical environments without validation is of concern, given the potential to harm patients by providing inaccurate information to overly trusting clinicians [1]. This pilot study assessed ChatGPT-4o's ability to detect the presence or absence of cardiomegaly in an established chest X-ray (CXR) database, to gauge its current potential for clinical use.

Method
ChatGPT-4o was prompted to provide binary "yes/no" answers for a diagnosis of cardiomegaly on cases drawn from the National Institutes of Health (NIH) open-access chest X-ray database: 50 confirmed cases of cardiomegaly, 25 confirmed normal cases, and 25 cases with diagnoses other than cardiomegaly [2]. Statistical analysis of inter-rater agreement, along with McNemar's test of paired proportions, tests of significant difference, and bias assessment (i.e., propensity to over- or under-call a diagnosis), was completed using an institutional-access version of StataMP (StataCorp. 2025. Stata Statistical Software: Release 19. College Station, TX: StataCorp LLC).

Results
Inter-rater agreement (Cohen's kappa) for cases with a confirmed diagnosis of cardiomegaly (n = 50) was 0.48. McNemar's chi-squared tests across the asymptotic, marginal-homogeneity (Stuart-Maxwell, Bickenboller, no diagonals), and linear-trend assessments were 26.0 (p < 0.05), 26.0 (p < 0.05), 35.14 (p < 0.05), 26.0 (p < 0.05), and 27.0 (p < 0.05), respectively, demonstrating that ChatGPT's classifications were not symmetric with the reference rating, that the marginal distributions differed significantly, and that there was significant concern for misclassification bias with diagnosis over-calling. Inter-rater agreement (Cohen's kappa) for cases without cardiomegaly (i.e., normal CXR, n = 25) was 0.76. The same McNemar's tests were 6.0 (p = 0.014), 6.0 (p = 0.014), 6.82 (p = 0.10), 6.0 (p = 0.014), and 5.0 (p = 0.31), demonstrating more moderate, but still significant, disagreement between ChatGPT and the reference, with significant over-calling. For cases with a diagnosis other than cardiomegaly (n = 25), inter-rater agreement was 0.68. The same McNemar's tests were 8.00 (p < 0.01), 8.0 (p < 0.01), 9.52 (p < 0.01), and 8.00 (p < 0.05), again demonstrating significant variation and differences in marginal distribution, with over-prediction of the presence of cardiomegaly.

Conclusion
This pilot study highlights significant limitations in the current ChatGPT-4o model's ability to diagnose a relatively straightforward pathology, with a significant propensity to over-call a diagnosis of cardiomegaly. These findings raise important concerns about how much faith clinicians can place in LLMs, highlighting that diagnoses and decision-making should remain firmly in clinicians' hands – for now, at least.
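The two core agreement statistics used in the study, Cohen's kappa and McNemar's asymptotic chi-squared test, can be sketched for a binary rater-vs-reference comparison in a few lines of Python. This is an illustrative sketch only; the study's actual analysis was run in Stata, and the 2x2 counts used below are hypothetical example data, not the paper's results.

```python
# Illustrative sketch (not the study's analysis code): Cohen's kappa and the
# asymptotic McNemar chi-squared statistic for a 2x2 rater-vs-reference table.
# Table layout: [[a, b], [c, d]] with rows = reference (yes/no) and
# columns = model (yes/no), so b and c are the discordant cells.

def cohens_kappa(a, b, c, d):
    """Cohen's kappa: chance-corrected agreement between two binary raters."""
    n = a + b + c + d
    p_obs = (a + d) / n  # observed agreement (both say yes, or both say no)
    # Expected agreement if the two raters were independent, from the marginals
    p_exp = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    return (p_obs - p_exp) / (1 - p_exp)

def mcnemar_chi2(b, c):
    """Asymptotic McNemar chi-squared, computed from the discordant cells only.
    A large value with b >> c indicates the model systematically over-calls."""
    return (b - c) ** 2 / (b + c)

# Hypothetical counts: 40 true positives, 10 misses, 3 over-calls, 22 true negatives
print(round(cohens_kappa(40, 10, 3, 22), 3))  # kappa on the example table
print(round(mcnemar_chi2(10, 3), 3))          # chi-squared on the discordant cells
```

Note that McNemar's test ignores the concordant cells entirely: it asks only whether the two kinds of disagreement (misses vs. over-calls) are balanced, which is why the study can report near-substantial kappa alongside significant over-calling bias.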
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,260 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,116 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,493 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,438 citations