This is an overview page with metadata for this scientific work. The full article is available from the publisher.
AI vs the heart: can chatGPT-4o reliably detect cardiomegaly on chest X-Rays?
Citations: 0 · Authors: 7 · Year: 2026
Abstract
Background
Large language models (LLMs) have shown promise in clinical decision support, yet their image-interpretation capabilities remain under-explored. Emerging use in clinical environments without validation is of concern, given the potential to harm patients by providing inaccurate information to overly trusting clinicians [1]. This pilot study assessed ChatGPT-4o's ability to detect the presence or absence of cardiomegaly in an established chest X-ray (CXR) database, to gauge its current potential for clinical use.

Method
ChatGPT-4o was prompted to provide binary "yes/no" answers for a diagnosis of cardiomegaly on cases drawn from the National Institutes of Health (NIH) open-access chest X-ray database: 50 confirmed cases of cardiomegaly, 25 confirmed normal cases, and 25 cases with diagnoses other than cardiomegaly [2]. Statistical analysis of inter-rater agreement, along with McNemar's test of paired proportions, tests of significant difference, and bias assessment (i.e., propensity to over- or under-call a diagnosis), was completed using an institutional-access version of StataMP (StataCorp. 2025. Stata Statistical Software: Release 19. College Station, TX: StataCorp LLC).

Results
Inter-rater agreement (Cohen's kappa) for cases with a confirmed diagnosis of cardiomegaly (n = 50) was 0.48. McNemar's chi-squared tests across the asymptotic, marginal-homogeneity (Stuart-Maxwell, Bickenboller, no diagonals), and linear-trend assessments were 26.0 (p < 0.05), 26.0 (p < 0.05), 35.14 (p < 0.05), 26.0 (p < 0.05), and 27.0 (p < 0.05), respectively, demonstrating that ChatGPT's classifications were not symmetric with the reference rating, that the marginal distributions differed significantly, and that there was significant concern for misclassification bias with diagnosis over-calling. Inter-rater agreement (Cohen's kappa) for cases without cardiomegaly (i.e., normal CXR, n = 25) was 0.76. The same McNemar's tests were 6.0 (p = 0.014), 6.0 (p = 0.014), 6.82 (p = 0.10), 6.0 (p = 0.014), and 5.0 (p = 0.31), demonstrating more moderate, but still significant, disagreement between ChatGPT and the reference, with significant over-calling. For cases with a diagnosis other than cardiomegaly (n = 25), inter-rater agreement was 0.68. The same McNemar's tests were 8.00 (p < 0.01), 8.0 (p < 0.01), 9.52 (p < 0.01), and 8.00 (p < 0.05), again demonstrating significant variation and differences in marginal distribution, with over-prediction of the presence of cardiomegaly.

Conclusion
This pilot study highlights significant limitations in the current ChatGPT-4o model's ability to diagnose a relatively straightforward pathology, with a significant propensity to over-call a diagnosis of cardiomegaly. These findings raise important concerns about how much faith clinicians can place in LLMs, highlighting that diagnoses and decision-making should remain firmly in clinicians' hands – for now, at least.
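The two core agreement statistics used in the study, Cohen's kappa and McNemar's asymptotic chi-squared test, can be sketched for a binary rater-vs-reference comparison in a few lines of Python. This is an illustrative sketch only; the study's actual analysis was run in Stata, and the 2x2 counts used below are hypothetical example data, not the paper's results.

```python
# Illustrative sketch (not the study's analysis code): Cohen's kappa and the
# asymptotic McNemar chi-squared statistic for a 2x2 rater-vs-reference table.
# Table layout: [[a, b], [c, d]] with rows = reference (yes/no) and
# columns = model (yes/no), so b and c are the discordant cells.

def cohens_kappa(a, b, c, d):
    """Cohen's kappa: chance-corrected agreement between two binary raters."""
    n = a + b + c + d
    p_obs = (a + d) / n  # observed agreement (both say yes, or both say no)
    # Expected agreement if the two raters were independent, from the marginals
    p_exp = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    return (p_obs - p_exp) / (1 - p_exp)

def mcnemar_chi2(b, c):
    """Asymptotic McNemar chi-squared, computed from the discordant cells only.
    A large value with b >> c indicates the model systematically over-calls."""
    return (b - c) ** 2 / (b + c)

# Hypothetical counts: 40 true positives, 10 misses, 3 over-calls, 22 true negatives
print(round(cohens_kappa(40, 10, 3, 22), 3))  # kappa on the example table
print(round(mcnemar_chi2(10, 3), 3))          # chi-squared on the discordant cells
```

Note that McNemar's test ignores the concordant cells entirely: it asks only whether the two kinds of disagreement (misses vs. over-calls) are balanced, which is why the study can report near-substantial kappa alongside significant over-calling bias.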
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,260 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,116 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,493 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,438 citations