This is an overview page with metadata for this scientific work. The full article is available from the publisher.
S2997 Health Literacy in the Age of Artificial Intelligence: Readability of LLM-Generated Materials for Patients With IBD
Citations: 0
Authors: 17
Year: 2025
Abstract
Introduction: Large language models (LLMs) are increasingly applied in healthcare, including to generate educational materials for complicated conditions such as inflammatory bowel disease (IBD). It is important to assess whether such LLM-generated materials align with established health literacy standards, as recommended by the National Institutes of Health (NIH) and the Agency for Healthcare Research and Quality. This study assesses the readability of LLM-generated materials for patients with IBD and evaluates whether these emerging technologies can consistently deliver accessible health information.

Methods: Five LLMs (ChatGPT 4o, Gemini 2.5, Claude 4.0, Doximity GPT, and OpenEvidence®) were given 3 prompts: Prompt 1 ("What is [condition]?"), Prompt 2 ("I am a patient that was just diagnosed with [condition]. Explain that to me in simple terms"), and Prompt 3 ("Explain [condition] to a patient at a 6th grade reading level or below"). The conditions were IBD, Crohn's disease, and ulcerative colitis. Each prompt was repeated 3 times for every condition on each LLM, for a total of 135 outputs (5 models × 3 prompts × 3 conditions × 3 repetitions). Readability was assessed with the Simple Measure of Gobbledygook (SMOG) index via the Sydney Health Literacy Lab Health Literacy Editor. The Shapiro-Wilk test was used to assess normality, followed by appropriate statistical analysis of within-model differences across prompts and between-model differences for the same prompt, with post-hoc analysis; a P-value of < 0.05 was considered statistically significant.

Results: Across all 5 LLMs, both within-model and between-model differences were statistically significant for each prompt (P < 0.001), with readability improving from Prompt 1 to Prompt 3. Although ChatGPT averaged better readability than the other LLMs on Prompts 1 and 2, and Claude on Prompt 3, none consistently produced materials at or below a 6th grade reading level. OpenEvidence® had the most complex outputs, consistently averaging the highest SMOG scores across prompts.

Conclusion: This study illustrates that, in their current versions, the 5 tested LLMs generate outputs for patients with IBD at a higher reading level than recommended for healthcare materials, even when explicitly prompted to write at a 6th grade level. Further training of LLMs is needed to ensure they consistently provide accessible health information to patients.
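For context, the SMOG index maps the density of polysyllabic words in a text to a U.S. reading grade level. The Python sketch below is a minimal, illustrative implementation of the published SMOG formula only; it is not the Sydney Health Literacy Lab editor used in the study, and its syllable counter is a rough vowel-group heuristic rather than a dictionary-based one.

import math
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels.
    # Real readability tools use dictionaries or better rules.
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def smog_grade(text: str) -> float:
    # SMOG grade = 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / max(len(sentences), 1)) + 3.1291

# A score above roughly 6 exceeds the 6th grade target cited in the abstract.
print(round(smog_grade("Inflammatory bowel disease causes chronic intestinal inflammation."), 1))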
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,422 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,300 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,734 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,519 citations