This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Cross-sectional comparative evaluation of US- and China-developed large language models for bilingual coronary heart disease patient education
Citations: 0
Authors: 7
Year: 2025
Abstract
Background: Patient education for coronary heart disease (CHD) is increasingly facilitated by large language models (LLMs). However, it remains unclear whether a model's origin (United States vs. China) affects its performance in providing CHD-related patient education. This study aimed to systematically compare six mainstream LLMs responding to common CHD-related patient questions posed in both English and Chinese.

Methods: Between 1 and 15 February 2025, we posed 30 clinician-validated CHD questions, extracted from outpatient records, to six large language models: GPT-4o, OpenAI o1, and Gemini 1.5 (United States), and DeepSeek-R1, ERNIE Bot 3.5, and Doubao (China). Each prompt was asked in both English and Chinese. Three blinded cardiologists rated every answer for accuracy, comprehensiveness, understandability, and empathy on a 4-point Likert scale. Ratings were analysed using cumulative-link mixed models (CLMMs) with a logit link function, including fixed effects for Model, Language, and Dimension, as well as their interactions, and random intercepts for Question and Rater. Type III likelihood-ratio χ² tests assessed main effects and interactions, followed by Holm-adjusted pairwise contrasts. Inter-rater agreement was quantified with Fleiss' κ.

Results: The three cardiologists independently rated 360 bilingual responses with high inter-rater reliability (Fleiss' κ = 0.821). The cumulative-link mixed models showed significant main effects of Model and Dimension, a Model × Language interaction, and a Model × Language × Dimension interaction. OpenAI o1 achieved the highest odds of superior ratings versus GPT-4o (OR 4.45, 95% CI 3.01–6.57, P < 0.001), followed by DeepSeek-R1 (OR 1.32, 95% CI 0.97–1.78, P = 0.038). Language-stratified contrasts showed that Chinese prompts increased comprehensiveness (OR 1.48, 95% CI 1.01–2.17, P = 0.045) and empathy (OR 2.14, 95% CI 1.47–3.11, P < 0.001) but reduced understandability (OR 0.64, 95% CI 0.42–0.98, P = 0.042). Gemini 1.5 excelled in Chinese (OR 3.55, 95% CI 2.35–5.38, P < 0.001), whereas DeepSeek-R1 favored English (OR 0.64, 95% CI 0.41–0.99, P = 0.046) and Doubao favored Chinese (OR 1.64, 95% CI 1.08–2.49, P = 0.020).

Conclusions: Model performance was strongly modulated by prompt language and by evaluation dimension. Our benchmark offers practical guidance for clinicians, patients, and health-information providers choosing LLMs for bilingual patient education.
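For readers who want to reproduce the agreement statistic, the sketch below shows how Fleiss' κ is computed from rater-level scores with statsmodels. The ratings array is a hypothetical stand-in (the study's raw data are not shown on this page); it assumes three raters scoring responses on the study's 4-point Likert scale.

    # Minimal sketch: Fleiss' kappa for three raters on a 4-point Likert scale.
    # The ratings below are hypothetical example data, not the study's data.
    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # rows = rated responses, columns = the three blinded raters,
    # values = Likert scores 1-4
    ratings = np.array([
        [4, 4, 3],
        [2, 2, 2],
        [3, 4, 3],
        [1, 2, 1],
        [4, 4, 4],
        [3, 3, 2],
    ])

    # Collapse rater labels into a responses x categories count table,
    # then compute Fleiss' kappa on that table.
    table, _categories = aggregate_raters(ratings)
    print(f"Fleiss' kappa = {fleiss_kappa(table, method='fleiss'):.3f}")

A κ above 0.8, as reported here (0.821), is conventionally read as almost perfect agreement on the Landis and Koch scale.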
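Because the CLMMs use a logit link, each reported odds ratio is exp(β) for a fitted contrast, with a Wald-type 95% CI of exp(β ± 1.96 × SE). Below is a small sketch of that back-transformation; beta and se are hypothetical values chosen so the output matches the OpenAI o1 vs. GPT-4o contrast reported above.

    # Sketch: log-odds estimate from a cumulative-link (logit) model
    # converted to an odds ratio with a 95% Wald confidence interval.
    # beta and se are hypothetical stand-ins for a fitted Model contrast.
    import math

    beta, se = 1.493, 0.199            # log-odds estimate and its standard error
    odds_ratio = math.exp(beta)
    ci_low = math.exp(beta - 1.96 * se)
    ci_high = math.exp(beta + 1.96 * se)
    print(f"OR = {odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
    # prints: OR = 4.45 (95% CI 3.01-6.57)

The Holm-adjusted pairwise contrasts described in the methods can be obtained by correcting the contrast P values, for example with statsmodels' multipletests(pvals, method="holm").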
Similar works
Fundamental Considerations in Language Testing
1991 · 4,401 citations
Interpretative Phenomenological Analysis
2020 · 4,054 citations
Implicit memory: History and current status.
1987 · 2,905 citations
Recognizing: The judgment of previous occurrence.
1980 · 2,676 citations
Category Interference in Translation and Picture Naming: Evidence for Asymmetric Connections Between Bilingual Memory Representations
1994 · 2,563 citations