This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Comparative evaluation of five AI chatbots in pediatric ophthalmology: A multidomain expert-based appraisal
Citations: 0
Authors: 8
Year: 2025
Abstract
Highlights:
- Evaluated 5 AI chatbots for pediatric eye care using 16 expert-rated queries
- ChatGPT led in accuracy and educational value; Perplexity AI in safety
- Google Gemini and Grok scored well for clarity; DeepSeek showed variability
- Expert inter-rater agreement was low, underscoring subjective interpretation
- Study emphasizes need for pediatric-specific, clinically validated chatbots

Purpose: To evaluate and compare the performance of five AI chatbots—ChatGPT 3.5 (OpenAI), Google Gemini, Grok (xAI), DeepSeek, and Perplexity AI—in delivering accurate, clear, educational, and safe responses to pediatric ophthalmology-related queries.

Methods: Sixteen standardized caregiver-facing questions were posed to each chatbot in separate fresh sessions. Five pediatric ophthalmologists independently rated the responses across four domains—Accuracy, Clarity, Educational Value, and Safety—using a 5-point Likert scale (1–5), yielding 400 ratings per domain (5 chatbots × 16 questions × 5 raters). Inter-rater reliability was assessed using ICC(2,1), ICC(2,5), quadratic-weighted Fleiss' κ, Gwet's AC1, and percent agreement. Between-chatbot comparisons were analyzed with cumulative-link mixed models (CLMMs), reporting odds ratios (OR) with 95% confidence intervals; post-hoc pairwise contrasts were corrected using Holm adjustment.

Results: ChatGPT achieved the highest scores for Accuracy, while Google Gemini and Grok (xAI) showed modest advantages in Clarity and Educational Value. Safety ratings were similar across platforms and clustered at "adequate," with limited probability of top scores. CLMM analyses confirmed significant between-chatbot differences in Accuracy, Clarity, and Educational Value, but not Safety. Inter-rater reliability was poor-to-fair for single raters [ICC(2,1) = 0.08–0.24], improving to moderate when averaging across all five raters [ICC(2,5) = 0.29–0.61].
Weighted Fleiss' κ indicated only slight agreement (0.14), but Gwet's AC1 (0.86) and high percent agreement (94%) suggested stronger underlying consensus.

Conclusions: Performance of the AI chatbots varied across domains: ChatGPT led in Accuracy, and Gemini and Grok in Clarity and Educational Value, while no system excelled in Safety. Low agreement reflects the difficulty of scoring nuanced AI-generated responses rather than a lack of expert consensus. These findings support the potential of AI chatbots as educational adjuncts in pediatric ophthalmology, while underscoring the need for expert oversight, standardized rubrics, and domain-specific fine-tuning to improve reliability and safety.
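The ICC(2,1) and ICC(2,5) values reported above come from the standard two-way random-effects decomposition of Shrout & Fleiss (1979). A minimal sketch of that computation, using the classic Shrout & Fleiss worked example rather than this study's (unpublished) rating matrix:

```python
import numpy as np

def icc2(X):
    """Two-way random-effects ICCs (Shrout & Fleiss): ICC(2,1) for a single
    rater and ICC(2,k) for the mean of k raters, from an n-subjects x
    k-raters score matrix with no missing values."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grand = X.mean()
    # Two-way ANOVA decomposition: subjects (rows), raters (columns), residual.
    ms_r = k * np.sum((X.mean(axis=1) - grand) ** 2) / (n - 1)   # subjects
    ms_c = n * np.sum((X.mean(axis=0) - grand) ** 2) / (k - 1)   # raters
    sse = np.sum((X - grand) ** 2) - (n - 1) * ms_r - (k - 1) * ms_c
    ms_e = sse / ((n - 1) * (k - 1))
    icc21 = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
    icc2k = (ms_r - ms_e) / (ms_r + (ms_c - ms_e) / n)
    return icc21, icc2k

# Classic Shrout & Fleiss (1979) example: 6 subjects rated by 4 judges.
scores = [[9, 2, 5, 8],
          [6, 1, 3, 2],
          [8, 4, 6, 8],
          [7, 1, 2, 6],
          [10, 5, 6, 9],
          [6, 2, 4, 7]]
single, average = icc2(scores)
print(round(single, 2), round(average, 2))  # 0.29 0.62
```

As in the abstract, averaging across raters (ICC(2,k)) gives a markedly higher reliability than the single-rater form, since rater noise averages out.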
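The gap the abstract notes between Fleiss' κ and Gwet's AC1 arises because the two statistics correct the same observed (percent) agreement with different chance models; AC1 is less sensitive to the skewed category distribution that Likert safety ratings produce. A sketch of the multi-rater AC1 (Gwet, 2008 formulation), assuming complete ratings and illustrative toy data, not the study's actual scores:

```python
import numpy as np

def gwet_ac1(ratings, categories):
    """Gwet's AC1 for multiple raters, assuming every subject receives the
    same number of ratings. `ratings` is a list of per-subject rating lists;
    `categories` is the full label set (here, a 1-5 Likert scale)."""
    q = len(categories)
    r = len(ratings[0])
    # counts[i, k]: number of raters assigning category k to subject i.
    counts = np.array([[row.count(c) for c in categories] for row in ratings],
                      dtype=float)
    # Observed agreement: chance that two randomly chosen raters of the
    # same subject agree, averaged over subjects (= pairwise percent agreement).
    pa = np.mean(np.sum(counts * (counts - 1), axis=1) / (r * (r - 1)))
    # Chance agreement under Gwet's model, from mean category propensities.
    pi = counts.mean(axis=0) / r
    pe = np.sum(pi * (1 - pi)) / (q - 1)
    return pa, (pa - pe) / (1 - pe)

# Toy example: 3 responses, each scored by 5 raters on a 1-5 Likert scale.
likert = [1, 2, 3, 4, 5]
scores = [[4, 4, 4, 4, 5],
          [4, 4, 4, 4, 4],
          [4, 4, 5, 5, 4]]
pa, ac1 = gwet_ac1(scores, likert)
print(round(pa, 3), round(ac1, 3))  # 0.667 0.638
```

Because the ratings cluster in few categories, the chance-agreement term stays small and AC1 tracks the raw percent agreement closely, mirroring the study's 0.86 AC1 alongside 94% agreement.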
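The Holm adjustment applied to the post-hoc pairwise CLMM contrasts is a simple step-down procedure; a minimal sketch with illustrative p-values:

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values: sort ascending, multiply the i-th
    smallest (0-based) by (m - i), enforce monotonicity, cap at 1."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

# Illustrative raw p-values for three pairwise contrasts (not study data).
print([round(p, 3) for p in holm_adjust([0.04, 0.01, 0.02])])  # [0.04, 0.03, 0.04]
```

Holm controls the family-wise error rate like Bonferroni but is uniformly more powerful, which is why it is the usual default for a small family of pairwise chatbot contrasts.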
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,245 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,102 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,468 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,429 citations
Authors
Institutions
- Central Rice Research Institute (IN)
- Sambalpur University (IN)
- Post Graduate Institute of Medical Education and Research (IN)
- Maulana Azad Medical College (IN)
- Dr. Ram Manohar Lohia Institute of Medical Sciences (IN)
- Bai Jerbai Wadia Hospital for Children (IN)
- Apollo Hospitals (IN)
- Apollo Hospitals (IN)
- Apollo Hospitals (IN)