
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

The accuracy of AI-assisted chatbots on the annual assessment test for emergency medicine residents

2024 · 13 citations · 5 authors · Journal of Medicine Surgery and Public Health · Open Access

Abstract

The incorporation of natural language processing (NLP) models into medical education has accelerated with the introduction of ChatGPT. This study aimed to evaluate and compare the performance of ChatGPT-3.5, Bard, and residents on the annual assessment test for emergency medicine (EM) residents. A total of 90 questions covering 18 distinct topics within the field of EM were administered to residents. The same questions, including those containing images, were posed to ChatGPT-3.5 and Bard. The percentage of correct answers was calculated and represented in a histogram showing the distribution of test scores across bins. Questions were categorized as either medical knowledge or clinical reasoning to further assess the chatbots' performance by question type. ChatGPT-3.5 achieved 60% accuracy, ranking 10th, while Bard achieved 55.5% accuracy, ranking 21st among the 46 residents. ChatGPT-3.5 performed better in 16 of the 18 subtopics; Bard outperformed ChatGPT-3.5 only on cardiovascular and pulmonary emergencies, which accounted for 24.4% of the questions. Analysis by question type revealed higher accuracy for ChatGPT-3.5 on medical knowledge questions (66%) than Bard (56%), while Bard performed better on clinical reasoning questions (55%) than ChatGPT-3.5 (52.5%). Despite lacking access to ECG images, both models answered several questions by providing text-based ECG interpretations. ChatGPT-3.5 and Bard demonstrated impressive performance on the task of medical question answering. On the other hand, the issues with ECG interpretation raise questions about the reliability of these models. Our findings highlight the importance of verifying the outputs generated by these models.
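As an illustrative aside, a minimal Python sketch of the accuracy and score-binning computations the abstract describes; this is not code from the paper, and the per-question answer flags, resident scores, and bin width are hypothetical assumptions.

```python
# Minimal sketch (not from the paper): percentage accuracy and score binning
# as described in the abstract. Answer flags, resident scores, and the bin
# width are hypothetical placeholders.
from collections import Counter

TOTAL_QUESTIONS = 90  # 90 questions across 18 EM subtopics

def accuracy(correct_flags):
    """Percentage of correct answers over all questions asked."""
    return 100 * sum(correct_flags) / len(correct_flags)

# Hypothetical per-question results (True = correct answer).
chatgpt_flags = [True] * 54 + [False] * 36  # 54/90 -> 60.0% (matches reported 60%)
bard_flags = [True] * 50 + [False] * 40     # 50/90 -> 55.6% (paper reports 55.5%)
assert len(chatgpt_flags) == TOTAL_QUESTIONS == len(bard_flags)

print(f"ChatGPT-3.5: {accuracy(chatgpt_flags):.1f}%")
print(f"Bard:        {accuracy(bard_flags):.1f}%")

# Distribution of resident test scores across bins (bin width of 10 assumed).
resident_scores = [62, 58, 71, 45, 66, 53]  # hypothetical resident percentages
bins = Counter((score // 10) * 10 for score in resident_scores)
for low in sorted(bins):
    print(f"{low}-{low + 9}: {'#' * bins[low]}")
```

Integer-division binning with width 10 is just one simple choice; the abstract does not specify the bin boundaries actually used.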
