OpenAlex · Updated hourly · Last updated: 17.03.2026, 03:04

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comparative evaluation of five AI chatbots in pediatric ophthalmology: A multidomain expert-based appraisal

2025 · 0 citations · AJO International · Open Access

Citations: 0
Authors: 8
Year: 2025

Abstract

• Evaluated 5 AI chatbots for pediatric eye care using 16 expert-rated queries
• ChatGPT led in accuracy and educational value; Perplexity AI in safety
• Google Gemini and Grok scored well for clarity; DeepSeek showed variability
• Expert inter-rater agreement was low, underscoring subjective interpretation
• Study emphasizes need for pediatric-specific, clinically validated chatbots

To evaluate and compare the performance of five AI chatbots—ChatGPT 3.5 (OpenAI), Google Gemini, Grok (xAI), DeepSeek, and Perplexity AI—in delivering accurate, clear, educational, and safe responses to pediatric ophthalmology-related queries.

Sixteen standardized caregiver-facing questions were posed to each chatbot in separate fresh sessions. Five pediatric ophthalmologists independently rated the responses across four domains—Accuracy, Clarity, Educational Value, and Safety—using a 5-point Likert scale (1–5). This produced 400 ratings per domain (5 chatbots × 16 questions × 5 raters). Inter-rater reliability was assessed using ICC(2,1), ICC(2,5), quadratic-weighted Fleiss’ κ, Gwet’s AC1, and percent agreement. Between-chatbot comparisons were analyzed with cumulative-link mixed models (CLMMs), reporting odds ratios (OR) with 95% confidence intervals. Post-hoc pairwise contrasts were corrected using Holm adjustment.

ChatGPT achieved the highest scores for Accuracy, while Google Gemini and Grok (xAI) showed modest advantages in Clarity and Educational Value. Safety ratings were similar across platforms and clustered at “adequate,” with limited probability of top scores. CLMM analyses confirmed significant between-chatbot differences in Accuracy, Clarity, and Educational Value, but not Safety. Inter-rater reliability was poor-to-fair for single raters [ICC(2,1) = 0.08–0.24], improving to moderate when averaging across all five raters [ICC(2,5) = 0.29–0.61]. Weighted Fleiss’ κ indicated only slight agreement (0.14), but Gwet’s AC1 (0.86) and high percent agreement (94%) suggested stronger underlying consensus.

Performance of AI chatbots varied across domains: ChatGPT led in Accuracy, Gemini and Grok in Clarity and Educational Value, while no system excelled in Safety. Low agreement reflects the difficulty of scoring nuanced AI-generated responses rather than a lack of expert consensus. These findings support the potential of AI chatbots as educational adjuncts in pediatric ophthalmology, while underscoring the need for expert oversight, standardized rubrics, and domain-specific fine-tuning to improve reliability and safety.
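The reported jump from single-rater to five-rater reliability is what the Spearman-Brown relation between single-measure and average-measure intraclass correlations predicts. As a quick illustrative check (not from the paper), with k = 5 raters:

\[
\mathrm{ICC}(2,k) = \frac{k\,\mathrm{ICC}(2,1)}{1 + (k-1)\,\mathrm{ICC}(2,1)},
\qquad
\frac{5 \times 0.24}{1 + 4 \times 0.24} = \frac{1.20}{1.96} \approx 0.61,
\qquad
\frac{5 \times 0.08}{1 + 4 \times 0.08} \approx 0.30 .
\]

This reproduces the reported upper bound (0.61) exactly; the small gap at the lower bound (0.30 vs. 0.29) is consistent with rounding of the single-rater value.

For readers who want to reproduce this kind of analysis, the sketch below shows one way the ICC estimates could be computed in Python with pingouin. It is not the authors' code: the scores are synthetic and the column names hypothetical, standing in for the 5 chatbots × 16 questions × 5 raters design described above.

```python
# Illustrative sketch, not the authors' code: ICC(2,1) and ICC(2,k) for one
# rating domain, using synthetic placeholder scores on the study's design.
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
chatbots = ["ChatGPT", "Gemini", "Grok", "DeepSeek", "Perplexity"]
rows = [
    {"item": f"{bot}_Q{q:02d}",         # rated target: chatbot-question pair
     "rater": f"R{r}",                  # five expert raters
     "score": int(rng.integers(1, 6))}  # synthetic 1-5 Likert rating
    for bot in chatbots for q in range(1, 17) for r in range(1, 6)
]
df = pd.DataFrame(rows)                 # 400 ratings in long format

icc = pg.intraclass_corr(data=df, targets="item", raters="rater",
                         ratings="score")
# ICC2 = single random rater, ICC2k = average of k random raters
print(icc.set_index("Type").loc[["ICC2", "ICC2k"], ["ICC", "CI95%"]])
```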

Related works