This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Evaluation of AI-Generated Personalized Patient Education Handouts for Stress Urinary Incontinence: Readability, Quality, and Actionability
Citations: 0
Authors: 3
Year: 2026
Abstract
INTRODUCTION: Stress urinary incontinence (SUI) is a prevalent condition that impacts quality of life and treatment decisions. Patient education is central to informed treatment choices, yet most written educational resources use language above recommended literacy levels, limiting accessibility. Large language models (LLMs) offer a novel opportunity to generate tailored, accessible education. We evaluated the clarity, quality, and clinical utility of artificial intelligence (AI)-generated patient education materials for SUI.

OBJECTIVE: To evaluate whether LLMs can generate patient education handouts, tailored to individual medical histories, that are more understandable and accessible than standard patient education materials.

METHODS: Five standardized SUI patient profiles were entered into five LLMs (ChatGPT-4o, Claude Sonnet 4, Gemini 2.5 Flash, Grok 3, DeepSeek-V3). Each model generated handouts without any instruction regarding literacy level (baseline) and with an instruction to write at a sixth-grade reading level. Handouts were evaluated using PEMAT (understandability, actionability), DISCERN (information quality), automated readability indices (Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index, SMOG Index, Coleman-Liau Index), and structured expert review (accuracy, safety, appropriateness, actionability, effectiveness). Of note, lower scores indicate better readability on all readability indices except Flesch Reading Ease. International Urogynecological Association (IUGA) handouts served as the comparator. Bonferroni-adjusted post hoc tests determined statistical significance using IBM SPSS Statistics version 31.

RESULTS: IUGA handouts had the highest PEMAT understandability (85.0%). Each baseline LLM scored lower than IUGA (ChatGPT 69.23%, DeepSeek 72.31%, Grok 70.77%; all p<0.001), but sixth-grade prompts improved scores, and none of the sixth-grade models differed significantly from IUGA. Actionability was approximately 60% across all groups, including IUGA. DISCERN scores were highest for IUGA (3.69) and significantly lower for all LLMs (2.84–3.24; all p<0.001). Readability improved significantly with sixth-grade prompts across multiple indices compared with IUGA (all p<0.001). Experts rated IUGA highest (24.3/25), with LLMs scoring 19.6–21.2; no significant difference was found between IUGA and Claude-6, DeepSeek-6, or Grok-6. Experts rated all baseline LLMs, ChatGPT-6, and Gemini-6 significantly worse than IUGA (p<0.05). See Table 1.

CONCLUSIONS: Baseline LLMs underperformed IUGA in understandability and information quality, but sixth-grade prompts eliminated the understandability gap and improved readability across multiple indices. Despite comparable readability and safety, AI-generated outputs remained lower in information quality, underscoring the need for human oversight to ensure clinical accuracy and completeness.
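Two of the automated indices named in METHODS have simple published formulas: Flesch Reading Ease (higher = easier) and Flesch-Kincaid Grade Level (lower = easier). The sketch below is a minimal illustration of those two formulas, not the study's actual scoring pipeline; the regex-based syllable counter is a naive heuristic (production tools such as dedicated readability libraries use dictionaries or better rules).

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count vowel groups, subtract a trailing silent "e".
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return {
        # Flesch Reading Ease: higher scores mean easier text.
        "fre": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        # Flesch-Kincaid Grade Level: lower scores mean easier text.
        "fkgl": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }
```

For example, a handout written in short monosyllabic sentences scores a high Reading Ease and a low Grade Level, while dense clinical prose does the opposite, which is the behavior the sixth-grade prompting in this study targets.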
Related Works
The standardisation of terminology of lower urinary tract function: Report from the standardisation sub‐committee of the International Continence Society
2002 · 8,233 citations
The Female Sexual Function Index (FSFI): A Multidimensional Self-Report Instrument for the Assessment of Female Sexual Function
2000 · 6,409 citations
The standardization of terminology of female pelvic organ prolapse and pelvic floor dysfunction
1996 · 4,236 citations
The American Urological Association Symptom Index for Benign Prostatic Hyperplasia
1992 · 3,409 citations
Epidemiology of surgically managed pelvic organ prolapse and urinary incontinence
1997 · 3,337 citations