OpenAlex · Updated hourly · Last updated: 10.05.2026, 03:46

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluation of AI Chatbot Responses to a Standardized Patient Query on Myelin Oligodendrocyte Glycoprotein Antibody–Associated Disease: Cross-Sectional Content Analysis

2026 · 0 citations · JMIR Medical Informatics · Open Access

Citations: 0 · Authors: 9 · Year: 2026

Abstract

Background: Large language model-based chatbots are increasingly used by the public to access medical information. Although these tools can improve access and convenience, their quality, clarity, and transparency remain uncertain for rare and diagnostically complex neurological conditions, such as myelin oligodendrocyte glycoprotein antibody-associated disease (MOGAD).

Objective: This study aimed to evaluate the scientific quality, understandability, citation transparency, and readability of responses generated by widely used artificial intelligence chatbot platforms to a standardized, patient-centered query on MOGAD.

Methods: We conducted a cross-sectional content analysis using the query, "What is MOGAD, and how is MOGAD treated?" Ten widely accessible chatbot platforms were queried once on the same day in new sessions. Responses were anonymized and independently evaluated by 7 blinded neurologists using DISCERN (treatment-related scientific quality), the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P; understandability), and the Web Resource Rating (WRR; citation transparency). Readability was assessed using the Flesch-Kincaid Grade Level (FKGL) and the Coleman-Liau Index, and word count was recorded. Platforms were compared by functional orientation and access model. Mann-Whitney U and Kruskal-Wallis tests with Dunn post hoc tests were used. Interrater reliability was assessed using intraclass correlation coefficients.

Results: Significant differences were observed across platforms for DISCERN, PEMAT-P, and WRR scores (all P<.001). Search-focused platforms achieved higher understandability than conversation-focused platforms (median PEMAT-P 52.6, IQR 47.4-54 vs 46.7, IQR 42-47.3; P=.04), whereas conversation-focused platforms had higher WRR scores (median 26.8, IQR 19.6-26.8 vs 19.6, IQR 19.6-25.9; P=.001). DISCERN scores did not differ significantly by functional orientation (P=.11). Paid-access platforms outperformed free-access platforms in DISCERN (median 42, IQR 36-45 vs 33, IQR 23.8-41.3; P<.001), PEMAT-P (median 52.6, IQR 46-54 vs 46, IQR 26.3-47.4; P=.002), and WRR (median 26.8, IQR 23.2-26.8 vs 10.7, IQR 3.57-19.6; P<.001). However, no statistically significant differences were observed between paid and free platforms in response length (median word count 336, IQR 271-369 vs 206, IQR 116-294; P=.11) or readability metrics. FKGL scores were comparable between paid and free outputs (median 17.54, IQR 16.6-18.4 vs 17.56, IQR 16.5-17.6; P=.61), and Coleman-Liau Index values similarly showed no significant difference by access model (median 21.30, IQR 20.6-22.3 vs 21.71, IQR 20.9-22.1; P=.91). Readability remained limited: all outputs exceeded recommended public health readability thresholds (FKGL≥8). High interrater agreement was observed (intraclass correlation coefficient=0.902 for DISCERN, 0.887 for WRR, and 0.838 for PEMAT-P).

Conclusions: Artificial intelligence chatbot responses to a patient-centered MOGAD query varied substantially in scientific quality, understandability, transparency, and readability. Search-focused systems were more understandable, whereas conversation-focused systems showed greater citation transparency. Paid-access platforms achieved higher quality and transparency scores, without differences in readability or response length. All outputs exceeded recommended public health readability thresholds. These findings highlight the need for context-sensitive evaluation of chatbot outputs in rare and clinically complex conditions such as MOGAD.
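
The two readability indices named in the Methods follow standard published formulas: FKGL = 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59, and Coleman-Liau Index = 0.0588L − 0.296S − 15.8, where L is letters per 100 words and S is sentences per 100 words. The Python sketch below is a minimal illustration of these formulas, not the authors' analysis pipeline; the tokenization and syllable counter are naive heuristics, so scores will deviate slightly from dedicated readability tools.

import re

def count_syllables(word):
    # Crude heuristic: one syllable per run of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    n_sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    n_letters = sum(len(w) for w in words)
    # Flesch-Kincaid Grade Level
    fkgl = 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
    # Coleman-Liau Index (letters and sentences per 100 words)
    letters_per_100 = 100.0 * n_letters / n_words
    sents_per_100 = 100.0 * n_sentences / n_words
    cli = 0.0588 * letters_per_100 - 0.296 * sents_per_100 - 15.8
    return fkgl, cli

fkgl, cli = readability("MOGAD is a rare demyelinating disorder of the central nervous system.")
print(f"FKGL {fkgl:.2f}, Coleman-Liau {cli:.2f}")

The group comparisons in the Results map onto SciPy's standard nonparametric tests (Mann-Whitney U for two groups, Kruskal-Wallis for three or more; Dunn post hoc tests live in separate packages such as scikit-posthocs). A minimal sketch, using made-up placeholder scores that are purely illustrative and not the study's data:

from scipy.stats import kruskal, mannwhitneyu

# Illustrative placeholder scores only; NOT values from the study.
paid_discern = [42, 45, 36, 44, 40]
free_discern = [33, 24, 41, 28, 35]

u_stat, p_two_groups = mannwhitneyu(paid_discern, free_discern, alternative="two-sided")
print(f"Mann-Whitney U={u_stat}, P={p_two_groups:.3f}")

h_stat, p_multi = kruskal([42, 45, 36], [33, 24, 41], [30, 39, 35])
print(f"Kruskal-Wallis H={h_stat:.2f}, P={p_multi:.3f}")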
