OpenAlex · Updated hourly · Last updated: 15 March 2026, 09:48

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluating the Performance of AI Large Language Models in Detecting Pediatric Medication Errors Across Languages: A Comparative Study

2025 · 0 citations · Journal of Clinical Medicine · Open Access
Citations: 0 · Authors: 6 · Year: 2025

Abstract

Objectives: This study aimed to evaluate the performance of four AI models (GPT-5, GPT-4, Microsoft Copilot, and Google Gemini) in detecting medication errors in pediatric case scenarios.

Methods: A total of 60 pediatric cases were analyzed for the presence of medication errors; half of the cases contained errors. The cases covered four therapeutic systems (respiratory, endocrine, neurology, and infectious). Each model was presented with the cases in both English and Arabic using a unified prompt. The responses of each model were used to calculate performance metrics, including accuracy, sensitivity, specificity, and reproducibility. Analysis was carried out using SPSS version 22.

Results: Microsoft Copilot demonstrated relatively higher accuracy (86.7% in English, 85.0% in Arabic) than the other models in this dataset, followed by GPT-5 (81.7% in English, 75.0% in Arabic). GPT-4 and Google Gemini had lower accuracy, with Gemini having the lowest accuracy in both languages (76.7% in English and 73.3% in Arabic). Microsoft Copilot showed comparatively higher sensitivity and specificity, particularly in respiratory and infectious disease cases. Accuracy in Arabic was lower than in English for the majority of models. Microsoft Copilot exhibited relatively higher reproducibility and inter-run agreement (Cohen's Kappa = 0.836 in English, 0.815 in Arabic; p < 0.001 for both), while Gemini showed the lowest reproducibility. For inter-language agreement, Copilot showed the highest Cohen's Kappa between its English and Arabic responses (0.701, p < 0.001).

Conclusions: In our evaluation, Microsoft Copilot demonstrated relatively higher performance in pediatric medication error detection than the other AI models. The lower performance in Arabic points to a need for improved multilingual training to support equitable AI assistance across languages. This study highlights the importance of human oversight and domain-based training for AI tools in pediatric pharmacotherapy.
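
The metrics named in the abstract (accuracy, sensitivity, specificity, and Cohen's Kappa for inter-run agreement) can be illustrated with a minimal Python sketch. This is not the authors' analysis, which the abstract says was carried out in SPSS; the function names and the ten-case toy data below are hypothetical and chosen only to show how such figures are typically computed from binary "error present / no error" labels.

```python
from typing import Dict, Sequence


def detection_metrics(truth: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity for binary labels (1 = error present)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)  # errors correctly flagged
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)  # clean cases correctly passed
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)  # missed errors
    return {
        "accuracy": (tp + tn) / len(truth),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's Kappa between two sets of binary ratings of the same cases."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    rate_a = sum(a) / n  # share of cases rated 1 in the first set
    rate_b = sum(b) / n  # share of cases rated 1 in the second set
    # Agreement expected by chance if the two sets were independent.
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: ten cases, half containing an error (not study data).
truth = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
run_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]  # first run of a hypothetical model
run_2 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]  # repeated run of the same model

metrics_run_1 = detection_metrics(truth, run_1)
kappa_between_runs = cohen_kappa(run_1, run_2)
print(metrics_run_1)
print(kappa_between_runs)
```

The same `cohen_kappa` function would apply to inter-language agreement by passing a model's English and Arabic responses for the same cases instead of two repeated runs.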

Similar works