OpenAlex · Updated hourly · Last updated: 13.04.2026, 07:09

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

S2958 Benchmarking the Readability of Large Language Model-Generated Patient Materials in Gastroenterology and Hepatology

2025 · 0 citations · The American Journal of Gastroenterology

0 citations · 17 authors · Year: 2025

Abstract

Introduction: The accessibility of large language model (LLM)-generated materials has increased dramatically over the past 2 years. Patients and practitioners alike can use them to create educational materials in clinic and on the wards. Given the nation’s health literacy gap, the Agency for Healthcare Research and Quality recommends that all healthcare materials be written at a sixth grade reading level or below. However, there is little evidence as to whether LLM-generated materials meet this benchmark, especially materials targeting patients with digestive or liver conditions. This project analyzes the readability of LLM-generated materials in the field of gastroenterology and hepatology across multiple LLMs.

Methods: ChatGPT 4o, Gemini 2.5 Pro, Claude Sonnet 4, OpenEvidence®, and Doximity GPT were each given 3 prompts: Prompt 1 (“What is [condition]?”), Prompt 2 (“I am a patient that was just diagnosed with [condition]. Explain that to me in simple terms.”), and Prompt 3 (“Explain [condition] to a patient at a 6th grade reading level or below.”). Each LLM was asked the same prompt 3 times for 12 common gastrointestinal and liver conditions, yielding a total of 540 outputs. The readability of the outputs was assessed with the Sydney Health Literacy Lab Health Literacy Editor tool, which provides the validated SMOG index for assessing the readability of healthcare materials. Analysis involved normality testing, followed by significance testing for within-model comparisons across prompts and between-model comparisons for each prompt, with P < 0.05 deemed statistically significant.

Results: There were statistically significant between-model differences for every prompt (P < 0.001), and every model showed statistically significant improvement in readability from Prompt 1 to Prompt 3 (P < 0.001). However, no LLM consistently produced material at a 6th grade reading level for any prompt variation.
Claude had the lowest average SMOG index for Prompt 3 (7.5 ± 0.8), followed by ChatGPT (7.8 ± 1.0) and Gemini (8.2 ± 0.7).

Conclusion: This project is one of the first to analyze the readability of LLM-generated materials in the field of gastroenterology and hepatology. At this time, none of the leading LLMs consistently met the goal of providing materials at a 6th grade reading level or below, but using more specific prompts may aid in reaching this goal. We recommend cautious but deliberate use of LLMs in the field of gastroenterology and hepatology.
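The SMOG index used in the Methods is a published readability formula (McLaughlin, 1969): grade ≈ 1.0430 × √(polysyllables × 30 / sentences) + 3.1291, where polysyllables are words of three or more syllables. As a rough illustration only (the study used the Sydney Health Literacy Lab editor, not this code), a minimal Python sketch with a naive vowel-group syllable counter could look like:

```python
import math
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count groups of consecutive vowels.
    # Real SMOG tools use pronunciation dictionaries instead.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def smog_index(text: str) -> float:
    # Crude sentence split on ., !, ? terminators.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # Polysyllables: words with 3 or more syllables.
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    # McLaughlin (1969) SMOG grade formula.
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291
```

A text with no polysyllabic words scores the formula's floor of about 3.1 (e.g. `smog_index("The cat sat.")` ≈ 3.1291); each additional polysyllable per sentence pushes the estimated grade level upward.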
