This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Multidimensional Evaluation of AI Chatbot Responses to a Standardized Patient Query on MOGAD: A Blinded Expert Analysis (Preprint)
Citations: 0
Authors: 9
Year: 2025
Abstract
BACKGROUND
Large language model-based chatbots are increasingly used by the public to access medical information. While these tools offer considerable potential in terms of accessibility and scalability, their accuracy, transparency, and clarity remain insufficiently evaluated for rare and diagnostically complex conditions such as myelin oligodendrocyte glycoprotein antibody-associated disease (MOGAD).

OBJECTIVE
This study aimed to evaluate the quality, comprehensibility, transparency, and readability of responses generated by widely used AI chatbot platforms in response to a standardized, patient-centered question about MOGAD.

METHODS
We conducted a cross-sectional content analysis using the query: “What is MOGAD, and how is MOGAD treated?” Ten widely used chatbot platforms were selected to reflect diversity in architecture, access model, and functional design. Responses were collected on the same day, anonymized, and independently evaluated by seven blinded neurologists. Validated instruments were used, including DISCERN (treatment quality), PEMAT-P (understandability), Web Resource Rating (WRR; citation transparency), and two readability metrics: Flesch–Kincaid Grade Level (FKGL) and Coleman–Liau Index (CLI). Chatbots were also compared by access type (free vs paid) and functional focus (conversation-based vs search-based). Inter-rater reliability was assessed using intraclass correlation coefficients (ICCs).

RESULTS
Significant differences were observed across platforms in DISCERN, PEMAT-P, and WRR scores (all p < 0.001). Paid chatbots demonstrated higher treatment quality (p = 0.020) and citation transparency (p = 0.001) compared to free versions. Search-based models produced more understandable responses than conversation-based ones (p = 0.035). However, none of the chatbot responses achieved the recommended readability threshold for public-facing health communication (FKGL < 8). Inter-rater agreement was excellent across all expert-rated measures (ICC ≥ 0.838).

CONCLUSIONS
AI chatbot responses to patient queries about MOGAD vary widely in quality, clarity, and transparency. These findings highlight the need for structured benchmarking, transparent evaluation frameworks, and thoughtful oversight in the use of generative AI tools for digital health communication, particularly in the context of rare and clinically complex diseases.
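The two readability metrics used in the study are simple closed-form formulas over word, sentence, syllable, and letter counts. As a minimal sketch of how such scores can be computed (the study does not specify its tooling, and the regex-based tokenizer and vowel-group syllable counter below are rough assumptions; production tools use dictionaries or dedicated libraries):

```python
import re

def _count_syllables(word: str) -> int:
    # Crude approximation: count runs of vowels; real counters use pronunciation dictionaries.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(_count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

def cli(text: str) -> float:
    """Coleman-Liau Index:
    0.0588 * L - 0.296 * S - 15.8,
    where L = letters per 100 words and S = sentences per 100 words."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    letters = sum(len(w) for w in words)
    return 0.0588 * (100 * letters / len(words)) - 0.296 * (100 * sentences / len(words)) - 15.8
```

Under this scoring, the FKGL < 8 threshold cited above corresponds roughly to a US eighth-grade reading level; dense multisyllabic clinical prose pushes both scores well above it.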
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,635 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,543 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,051 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,844 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations