This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Evaluation of large language models on pediatric asthma: a comparative study of Claude 3 Opus, Gemini 2.0, ChatGPT-4o, and DeepSeek—a cross-sectional questionnaire study
0
Citations
7
Authors
2026
Year
Abstract
Artificial intelligence (AI) has shown potential for enhancing medical practice and improving patient outcomes. However, the efficacy and linguistic accessibility of large language models (LLMs) in pediatric asthma management remain underexplored. This study evaluated the performance of four LLMs in generating clinical information within this domain. We administered 15 guideline-based pediatric asthma inquiries to ChatGPT-4o, Claude 3 Opus, Gemini 2.0, and DeepSeek. Anonymized responses were independently evaluated by three board-certified pediatric pulmonologists using the DISCERN instrument (score range 16–80). Readability was assessed using six standard indices. Inter-rater reliability was measured with intraclass correlation coefficients (ICC). Statistical analysis included repeated-measures ANOVA and post-hoc comparisons with effect-size reporting. No significant difference was found in the overall quality of health information (DISCERN scores) among the four LLMs (F(3,56) = 0.144, p = .933, η² = 0.008), with all mean scores clustered within a narrow “fair-to-good” range (50.3–51.9). However, significant differences were observed in readability: ChatGPT-4o generated significantly more comprehensible text than DeepSeek (FRE mean difference = 12.41, p = .005, Cohen’s d = 1.28), while DeepSeek performed significantly worse than all other models (all p < .05). Inter-rater reliability was high (ICC range: 0.849–0.901, all p < .001). Critically, the mean readability level of all outputs (FKGL: 13.2–14.9) far exceeded the recommended reading accessibility level for patient materials. While current LLMs can provide generally accurate information on pediatric asthma, their outputs exhibit significant limitations in readability for patient-facing use. ChatGPT-4o shows relative advantages in comprehensibility, yet none meet recommended health-literacy standards.
These findings underscore that AI should serve as a supplementary decision‑support tool under clinician supervision, not as a substitute for professional medical advice. Future work should prioritize the integration of adaptive text‑simplification features, validate AI‑generated content in real‑world clinical and caregiver settings, and expand evaluations to include emerging models and diverse chronic disease contexts.
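The readability results above are stated in terms of the Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL). As a minimal sketch of how such indices are computed from the standard published formulas (the study's actual tooling is not specified; the vowel-group syllable heuristic below is a simplifying assumption):

```python
import re

def count_syllables(word: str) -> int:
    """Naive syllable estimate: count contiguous vowel groups, minimum 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str) -> tuple[float, float]:
    """Return (FRE, FKGL) for an English text.

    FRE  = 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)
    FKGL = 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / max(1, len(sentences))  # mean words per sentence
    spw = syllables / max(1, len(words))       # mean syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl
```

Higher FRE means easier text, while FKGL approximates a U.S. school grade level; the reported FKGL of 13.2–14.9 therefore corresponds to college-level prose, well above the sixth-to-eighth-grade level often recommended for patient-facing materials.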
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,260 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,116 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,493 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,438 citations