OpenAlex · Updated hourly · Last updated: 3 May 2026, 22:18

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Balancing Accuracy and Readability: Comparative Evaluation of AI Chatbots for Patient Education on Rotator Cuff Tears

2025 · 3 citations · Healthcare · Open Access

Citations: 3 · Authors: 5 · Year: 2025

Abstract

BACKGROUND/OBJECTIVES: Rotator cuff (RC) tears are a leading cause of shoulder pain and disability. Artificial intelligence (AI)-based chatbots are increasingly used in healthcare for diagnostic support and patient education, but the reliability, quality, and readability of their outputs remain uncertain. International guidelines (AMA, NIH, European health communication frameworks) recommend that patient materials be written at a 6th to 8th grade reading level, yet most online and AI-generated content exceeds this threshold.

METHODS: We compared the responses of three AI chatbots, ChatGPT-4o (OpenAI), Gemini 1.5 Flash (Google), and DeepSeek-V3 (DeepSeek AI), to 20 frequently asked patient questions about RC tears. Four orthopedic surgeons independently rated reliability and usefulness (7-point Likert scale) and overall quality (5-point Global Quality Scale). Readability was assessed using six validated indices. Statistical analysis included Kruskal-Wallis tests and ANOVA with Bonferroni correction; inter-rater agreement was measured using intraclass correlation coefficients (ICCs).

RESULTS: Inter-rater reliability was good to excellent (ICC 0.726 to 0.900). Gemini 1.5 Flash achieved the highest reliability and quality. ChatGPT-4o performed comparably but scored slightly lower on diagnostic content. DeepSeek-V3 consistently scored lowest in reliability and quality but produced the most readable text (Flesch-Kincaid Grade Level, FKGL ≈ 6.5, within the 6th to 8th grade target). None of the models reached a Flesch Reading Ease (FRE) score above 60, indicating that even the most readable outputs remained more complex than plain-language standards.

CONCLUSIONS: Gemini 1.5 Flash and ChatGPT-4o generated more accurate and higher-quality responses, whereas DeepSeek-V3 provided more accessible content. No single model fully balanced accuracy and readability.

CLINICAL IMPLICATIONS: Hybrid use of AI platforms, combining high-accuracy models with more readable outputs under clinician oversight, may optimize patient education by ensuring both accuracy and accessibility. Future work should assess real-world comprehension and address the legal, ethical, and generalizability challenges of AI-driven patient education.
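
For readers unfamiliar with the two readability indices named in the abstract, the following is a minimal sketch of the standard published FRE and FKGL formulas. It is not the authors' pipeline: the abstract does not say which tool computed the indices, and the `rough_counts` helper (a crude vowel-group syllable heuristic) is a hypothetical illustration that will not match dedicated readability software exactly.

```python
import re


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula (approximate US school grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def rough_counts(text: str) -> tuple[int, int, int]:
    """Hypothetical helper: approximate (words, sentences, syllables).

    Syllables are estimated as runs of vowels per word, which is only a
    rough heuristic. The text must contain at least one word.
    """
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    tokens = re.findall(r"[A-Za-z']+", text)
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", token.lower()))) for token in tokens
    )
    return len(tokens), sentences, syllables


if __name__ == "__main__":
    sample = "A rotator cuff tear is a rip in the shoulder tendons. It can cause pain."
    w, s, syl = rough_counts(sample)
    print("FRE :", round(flesch_reading_ease(w, s, syl), 1))
    print("FKGL:", round(flesch_kincaid_grade(w, s, syl), 1))
```

Both formulas depend only on average sentence length and average syllables per word, so shorter sentences and shorter words raise FRE and lower FKGL. Because the two indices weight those averages differently, a text can fall within a grade-level target on FKGL while still scoring below 60 on FRE, which is the pattern the abstract reports for the most readable model.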

Similar works