This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Evaluating the Quality of Health Information Generated by Generative AI: The Case of Chronic Disease Management in China (Preprint)
Citations: 0
Authors: 9
Year: 2025
Abstract
<sec> <title>BACKGROUND</title> In recent years, the rapid development of generative artificial intelligence (AI) in China has led to a growing number of AI platforms being applied in healthcare, particularly to assist communication with patients managing chronic conditions. Tools such as DeepSeek, Kimi, ChatGPT, and Wenxin Yiyan have shown significant potential for supporting patient education, disease explanation, and decision-making. However, the quality of their generated content remains uneven, and inconsistencies in medical accuracy, clarity, and relevance may pose risks to patient health. A systematic evaluation of these tools is urgently needed to inform safe and effective use. </sec>
<sec> <title>OBJECTIVE</title> This study aims to assess the quality of health information generated by four generative AI tools commonly used in China (DeepSeek, Kimi, ChatGPT, and Wenxin Yiyan) in the context of chronic disease communication. The focus is on evaluating their ability to provide accurate, clear, and empathetic responses across a range of content types, including explanations of medical terminology, disease conditions, etiologies, treatment options, and medical costs. </sec>
<sec> <title>METHODS</title> We conducted a cross-sectional study in two parts. First, we evaluated each AI system's performance on multiple-choice knowledge questions derived from validated instruments for ten chronic diseases. Each item was manually entered into the AI systems, and responses were scored for accuracy; as a benchmark, domain-specific physicians completed the same questionnaires. Second, we conducted semi-structured interviews with 50 patients across the ten disease categories to collect 108 real-world, patient-centered questions. These were submitted to the AI systems, and the resulting responses were independently evaluated by ten physicians against three criteria: medical accuracy, linguistic clarity, and emotional empathy. The physicians also participated in follow-up interviews comparing overall strengths and limitations across platforms. </sec>
<sec> <title>RESULTS</title> ChatGPT achieved the highest accuracy in the objective knowledge assessment, correctly answering 216 of 230 questions; Wenxin Yiyan followed with 203 correct responses, Kimi with 201, and DeepSeek with 196. All AI systems scored below the physician baseline (228/230). In the subjective evaluation, ChatGPT again received the most favorable ratings, particularly for completeness and structure. Kimi and Wenxin Yiyan showed mixed results, often performing well on symptom descriptions but less reliably on treatment plans or rare conditions. DeepSeek produced the most inconsistent and verbose responses. Across categories, common weaknesses included a lack of personalized guidance, limited empathetic tone, and outdated or imprecise cost estimates. </sec>
<sec> <title>CONCLUSIONS</title> Generative AI tools show promise in supporting health communication for chronic disease management in China, but their outputs remain inconsistent in accuracy, clarity, and emotional appropriateness. While ChatGPT and Wenxin Yiyan were generally more reliable, no system consistently matched physician-level performance, especially in nuanced or emotionally sensitive domains. These findings highlight the importance of continued monitoring, culturally relevant evaluation, and iterative improvement to ensure that generative AI can serve as a trustworthy complement to professional healthcare services. </sec>
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,418 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,288 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,726 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,516 citations