This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Cross-sectional comparative evaluation of US- and China-developed large language models for bilingual coronary heart disease patient education
Citations: 0
Authors: 7
Year: 2025
Abstract
Background: Patient education for coronary heart disease (CHD) is increasingly facilitated by large language models (LLMs). However, it remains unclear whether a model's origin (United States vs. China) affects its performance in providing CHD-related patient education. This study aimed to systematically compare six mainstream LLMs responding to common CHD-related patient questions posed in both English and Chinese.

Methods: Between 1 and 15 February 2025, we posed 30 clinician-validated CHD questions, extracted from outpatient records, to six large language models: GPT-4o, OpenAI o1, and Gemini 1.5 (United States), and DeepSeek-R1, ERNIE Bot 3.5, and Doubao (China). Each prompt was asked in both English and Chinese. Three blinded cardiologists rated every answer for accuracy, comprehensiveness, understandability, and empathy on a 4-point Likert scale. Ratings were analysed using cumulative-link mixed models (CLMMs) with a logit link function, including fixed effects for Model, Language, and Dimension, as well as their interactions, and random intercepts for Question and Rater. Type III likelihood-ratio χ² tests assessed main effects and interactions, followed by Holm-adjusted pairwise contrasts. Inter-rater agreement was quantified with Fleiss' κ.

Results: The three cardiologists independently rated 360 bilingual responses with high inter-rater reliability (Fleiss' κ = 0.821). The cumulative-link mixed models showed significant main effects of Model and Dimension, a Model × Language interaction, and a Model × Language × Dimension interaction. OpenAI o1 achieved the highest odds of superior ratings versus GPT-4o (OR 4.45, 95% CI 3.01–6.57, P < 0.001), followed by DeepSeek-R1 (OR 1.32, 95% CI 0.97–1.78, P = 0.038). Language-stratified contrasts showed that Chinese prompts increased comprehensiveness (OR 1.48, 95% CI 1.01–2.17, P = 0.045) and empathy (OR 2.14, 95% CI 1.47–3.11, P < 0.001) but reduced understandability (OR 0.64, 95% CI 0.42–0.98, P = 0.042). Gemini 1.5 excelled in Chinese (OR 3.55, 95% CI 2.35–5.38, P < 0.001), whereas DeepSeek-R1 favored English (OR 0.64, 95% CI 0.41–0.99, P = 0.046) and Doubao favored Chinese (OR 1.64, 95% CI 1.08–2.49, P = 0.020).

Conclusions: Model performance was strongly modulated by prompt language and by evaluation dimension. Our benchmark offers practical guidance for clinicians, patients, and health-information providers choosing LLMs for bilingual patient education.
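For readers who want to reproduce the agreement statistic, the sketch below shows how Fleiss' κ is computed from rater-level scores with statsmodels. The ratings array is a hypothetical stand-in (the study's raw data are not shown on this page); it assumes three raters scoring responses on the study's 4-point Likert scale.

    # Minimal sketch: Fleiss' kappa for three raters on a 4-point Likert scale.
    # The ratings below are hypothetical example data, not the study's data.
    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # rows = rated responses, columns = the three blinded raters,
    # values = Likert scores 1-4
    ratings = np.array([
        [4, 4, 3],
        [2, 2, 2],
        [3, 4, 3],
        [1, 2, 1],
        [4, 4, 4],
        [3, 3, 2],
    ])

    # Collapse rater labels into a responses x categories count table,
    # then compute Fleiss' kappa on that table.
    table, _categories = aggregate_raters(ratings)
    print(f"Fleiss' kappa = {fleiss_kappa(table, method='fleiss'):.3f}")

A κ above 0.8, as reported here (0.821), is conventionally read as almost perfect agreement on the Landis and Koch scale.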
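Because the CLMMs use a logit link, each reported odds ratio is exp(β) for a fitted contrast, with a Wald-type 95% CI of exp(β ± 1.96 × SE). Below is a small sketch of that back-transformation; beta and se are hypothetical values chosen so the output matches the OpenAI o1 vs. GPT-4o contrast reported above.

    # Sketch: log-odds estimate from a cumulative-link (logit) model
    # converted to an odds ratio with a 95% Wald confidence interval.
    # beta and se are hypothetical stand-ins for a fitted Model contrast.
    import math

    beta, se = 1.493, 0.199            # log-odds estimate and its standard error
    odds_ratio = math.exp(beta)
    ci_low = math.exp(beta - 1.96 * se)
    ci_high = math.exp(beta + 1.96 * se)
    print(f"OR = {odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
    # prints: OR = 4.45 (95% CI 3.01-6.57)

The Holm-adjusted pairwise contrasts described in the methods can be obtained by correcting the contrast P values, for example with statsmodels' multipletests(pvals, method="holm").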
Similar works
Fundamental Considerations in Language Testing
1991 · 4,401 citations
Interpretative Phenomenological Analysis
2020 · 4,054 citations
Implicit memory: History and current status.
1987 · 2,905 citations
Recognizing: The judgment of previous occurrence.
1980 · 2,676 citations
Category Interference in Translation and Picture Naming: Evidence for Asymmetric Connections Between Bilingual Memory Representations
1994 · 2,563 citations