OpenAlex · Updated hourly · Last updated: 12.04.2026, 05:29

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

ASSESSING READABILITY OF LLM-GENERATED EDUCATIONAL MATERIALS FOR IBD PATIENTS

2026 · 0 citations · 15 authors · Inflammatory Bowel Diseases

Abstract

BACKGROUND: Large language models (LLMs) are increasingly applied in healthcare, including to generate patient-friendly educational materials for complex conditions such as inflammatory bowel disease (IBD). It is important to assess whether such LLM-generated materials align with established health literacy standards, as recommended by the National Institutes of Health (NIH) and the Agency for Healthcare Research and Quality (AHRQ). This study evaluates the readability of LLM-generated patient materials for IBD and whether these emerging technologies can consistently deliver accessible health information.

METHODS: Five LLMs (ChatGPT 4o, Gemini 2.5, Claude 4.0, Doximity GPT, and OpenEvidence) were given three prompts: Prompt 1 (“What is [condition]?”), Prompt 2 (“I am a patient that was just diagnosed with [condition]. Explain that to me in simple terms”), and Prompt 3 (“Explain [condition] to a patient at a 6th grade reading level or below”). The conditions were IBD, Crohn’s disease, and ulcerative colitis. Each prompt was repeated three times for every condition on each LLM, for a total of 135 outputs. Readability was assessed using the Simple Measure of Gobbledygook (SMOG) index via the Sydney Health Literacy Lab (SHeLL) Health Literacy Editor. The Shapiro-Wilk test was used to assess normality, followed by appropriate statistical analysis of within-model differences across prompts and between-model differences for the same prompt, with post-hoc analysis. A p-value of < 0.05 was considered statistically significant.

RESULTS: Across all five LLMs, both within-model and between-model differences were statistically significant for each prompt (p < 0.001), with readability improving from Prompt 1 to Prompt 3. Although ChatGPT averaged higher readability than the other LLMs on Prompts 1 and 2, and Claude on Prompt 3, none consistently produced materials below a 6th grade reading level. OpenEvidence had the most complex outputs, consistently averaging the highest SMOG scores across prompts.

DISCUSSION: In their current configurations, the five evaluated LLMs consistently produced outputs that exceeded the readability levels recommended by the NIH and AHRQ for healthcare materials. This limitation persisted even when the models were explicitly instructed to simplify their language or to respond from a patient-oriented perspective. These findings underscore the need for further refinement and training of LLMs to ensure the consistent delivery of accessible, patient-appropriate health information for individuals with IBD.
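For context on the readability metric: the SMOG index estimates a reading grade level from the number of polysyllabic words (three or more syllables) per 30 sentences. The study scored outputs with the SHeLL Health Literacy Editor; the sketch below only illustrates the underlying published formula, using a crude vowel-group heuristic for syllable counting, which is an assumption for illustration and not the editor's actual algorithm.

```python
import math
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count runs of consecutive vowels (assumption,
    not the SHeLL editor's method)."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    # A trailing silent 'e' usually does not add a syllable.
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)

def smog_grade(text: str) -> float:
    """SMOG index: 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        raise ValueError("text contains no sentences")
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291
```

A 6th grade target, as in Prompt 3, corresponds to a SMOG grade of roughly 6 or below; higher scores indicate text that demands more years of schooling to comprehend.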
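The abstract names only the Shapiro-Wilk normality check and does not specify the subsequent omnibus or post-hoc tests. The sketch below shows one plausible pipeline, assuming one-way ANOVA with pairwise t-tests when the data look normal, and Kruskal-Wallis with pairwise Mann-Whitney U tests otherwise, with Bonferroni correction for the post-hoc comparisons; the study's actual test choices may differ.

```python
from itertools import combinations
from scipy import stats

def compare_groups(groups: dict[str, list[float]], alpha: float = 0.05):
    """Normality check, omnibus test, and Bonferroni-corrected post-hoc tests.

    `groups` maps a label (e.g. an LLM name or a prompt) to its SMOG scores.
    """
    samples = list(groups.values())
    # Shapiro-Wilk on each group, as described in the abstract.
    normal = all(stats.shapiro(s).pvalue > alpha for s in samples)

    if normal:
        omnibus = stats.f_oneway(*samples)    # one-way ANOVA
        pairwise = stats.ttest_ind
    else:
        omnibus = stats.kruskal(*samples)     # non-parametric alternative
        pairwise = stats.mannwhitneyu

    pairs = list(combinations(groups, 2))
    posthoc = {
        (a, b): min(pairwise(groups[a], groups[b]).pvalue * len(pairs), 1.0)
        for a, b in pairs                     # Bonferroni correction
    }
    return omnibus, posthoc
```

With 135 outputs split across five models and three prompts, each group in a between-model comparison for one prompt would hold nine scores (3 conditions × 3 repetitions), which is small enough that the non-parametric branch is a likely outcome in practice.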
