OpenAlex · Updated hourly · Last updated: 21.03.2026, 05:12

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluating the Accuracy and Reliability of AI Conversational Agents in Patient Education on Cardiovascular Imaging: An Observational Comparative Study of ChatGPT o1, ChatGPT 4o, and Deepseek (Preprint)

2026 · 0 citations
Open full text at the publisher

0 citations · 9 authors · 2026

Abstract

<sec> <title>BACKGROUND</title> Large language models (LLMs) are increasingly used to support digital health communication, yet their reliability in patient-facing cardiovascular imaging education remains uncertain. Cardiovascular imaging involves complex terminology and procedural details that many patients struggle to understand, creating a need for accurate, clear, and reassuring explanations. While prior evaluations of conversational AI have focused primarily on diagnostic reasoning or clinician-oriented tasks, few studies have systematically compared contemporary LLMs in their ability to communicate effectively with patients. </sec> <sec> <title>OBJECTIVE</title> To compare the accuracy, clarity, completeness, and patient-centered communication quality of responses generated by three state-of-the-art conversational agents, (i) DeepSeek, (ii) ChatGPT o1, and (iii) ChatGPT 4o, when addressing real-world patient questions about cardiovascular imaging. </sec> <sec> <title>METHODS</title> A prospective methodological evaluation was conducted using 84 unique patient-centered questions curated from authoritative cardiovascular information sources and online patient forums. Each question was independently submitted to DeepSeek, ChatGPT o1, and ChatGPT 4o in isolated sessions to avoid contextual contamination. Two cardiovascular radiologists scored each response across four domains: Accuracy, Clarity/Appropriateness, Completeness, and User Engagement/Reassurance, using a standardized 3-point rubric (total score range 4–12). Discrepancies were resolved through predefined adjudication procedures. Descriptive statistics were computed, and group differences were analyzed using one-way ANOVA or Kruskal–Wallis testing as appropriate. Categorical distributions were compared using Pearson’s chi-square test. Statistical significance was defined as α = 0.05 with star-notation thresholds (*p &lt; 0.05, **p &lt; 0.01, ***p &lt; 0.001). 
</sec> <sec> <title>RESULTS</title> Across the 84 patient questions, all three models produced largely accurate and complete responses. Mean Accuracy scores were similarly high for DeepSeek (2.82/3), ChatGPT o1 (2.85/3), and ChatGPT 4o (2.82/3). Clarity scores were also comparable for DeepSeek (2.90/3), o1 (2.82/3), and 4o (2.77/3). Completeness showed the same pattern, with scores of 2.79/3, 2.83/3, and 2.83/3, respectively. The only meaningful difference appeared in User Engagement and Reassurance. DeepSeek averaged 2.96/3 and ChatGPT o1 2.99/3, whereas ChatGPT 4o scored markedly lower at 2.54/3. Categorical analysis showed that “good” engagement ratings were assigned to 81/84 DeepSeek responses (96.4%), 83/84 ChatGPT o1 responses (98.8%), but only 45/84 ChatGPT 4o responses (53.6%) with p &lt; 0.001. Total composite scores reflected this pattern: DeepSeek averaged 11.48/12, ChatGPT o1 11.49/12, and ChatGPT 4o 10.96/12. No significant differences were observed for Accuracy (p = 0.325), Clarity (p = 0.119), or Completeness (p = 0.653), and no unsafe statements were identified in any model. Here we define unsafe statements as content that could plausibly lead to patient harm through misinformation, inappropriate reassurance, or deviation from standard cardiovascular imaging practices. </sec> <sec> <title>CONCLUSIONS</title> DeepSeek and ChatGPT o1 consistently delivered accurate, clear, and patient-centered explanations of cardiovascular imaging questions, whereas ChatGPT 4o, despite comparable technical accuracy, provided less engaging and reassuring communication. These findings suggest that affective qualities rather than factual correctness represent the main differentiator among current LLMs in patient-education tasks. As conversational agents become integrated into cardiovascular imaging workflows, attention to communication tone, emotional support, and health-literacy alignment will be essential to ensure safe and effective patient use. </sec>
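The abstract's analysis plan can be sketched in a few lines of SciPy: a Kruskal–Wallis test across the three models' rubric scores, and a Pearson chi-square test on the categorical engagement ratings. The contingency counts below (81/84, 83/84, 45/84 "good" ratings) come from the abstract; the per-response rubric scores are synthetic stand-ins, since the study's raw data is not shown here.

```python
# Sketch of the statistical comparisons described in the abstract.
# The chi-square table uses counts reported in the abstract; the
# rubric scores are synthetic placeholders, NOT the study's data.
from scipy.stats import kruskal, chi2_contingency

# Synthetic 3-point rubric scores for one domain (illustration only).
deepseek = [3, 3, 2, 3, 3, 2, 3, 3]
gpt_o1   = [3, 3, 3, 2, 3, 3, 3, 2]
gpt_4o   = [3, 2, 3, 3, 2, 3, 2, 3]

# Kruskal-Wallis: nonparametric comparison of the three score groups.
h_stat, p_kw = kruskal(deepseek, gpt_o1, gpt_4o)
print(f"Kruskal-Wallis: H={h_stat:.3f}, p={p_kw:.3f}")

# Engagement ratings ("good" vs. not) out of 84 responses per model,
# as reported in the abstract.
table = [
    [81, 3],   # DeepSeek
    [83, 1],   # ChatGPT o1
    [45, 39],  # ChatGPT 4o
]
chi2, p_chi, dof, expected = chi2_contingency(table)
print(f"Chi-square: chi2={chi2:.1f}, dof={dof}, p={p_chi:.2e}")
```

With the reported engagement counts, the chi-square test yields p well below 0.001, consistent with the abstract's claim of a significant difference driven by ChatGPT 4o's lower engagement ratings.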



Topics

Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare