This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Evaluating artificial intelligence chatbot performance on board-level geriatrics questions
Citations: 0
Authors: 2
Year: 2026
Abstract
Artificial intelligence (AI) language models are increasingly being explored as tools to support medical education and clinical care. Evaluating their performance on valid and reliable assessments such as board certification exams may provide insight into their potential integration into real-world medical settings. This study evaluated the accuracy, consistency, and difficulty-rating ability of four advanced AI models on board-level geriatrics questions. Four AI models (Grok-3, ChatGPT-4o, Microsoft Copilot, and Google Gemini 2.0 Flash) were tested on 300 text-based multiple-choice questions from the BoardVitals geriatrics certification question bank. The questions were equally divided into easy, medium, and hard categories. Each model was asked to classify each question's difficulty and to answer each question twice. Model responses were evaluated for accuracy, consistency between attempts, quality of explanations, and alignment with the difficulty ratings predefined by BoardVitals. GPT-4o demonstrated the highest overall accuracy (85.3%), followed by Grok-3 (82.0%), Copilot (78.7%), and Gemini (74.0%). All models performed best on easy questions, with accuracy decreasing as difficulty increased (p < 0.001). GPT-4o exhibited the highest consistency (96.3%), followed by Grok-3 (95.0%), Copilot (90.7%), and Gemini (81.3%). While their overall performance surpassed the average success rates of human users in the database, the agreement between model-assigned and reference difficulty ratings was moderate (mean κ = 0.41). GPT-4o received the highest mean quality score (4.68 ± 0.84), followed by Grok-3 (4.59 ± 0.98), Copilot (4.30 ± 1.07), and Gemini (3.88 ± 1.53). Advanced AI models demonstrate strong performance on geriatrics board-level content, suggesting potential applications as educational support tools. However, performance on multiple-choice examinations does not equate to clinical utility. Significant limitations include struggles with complex scenarios, difficulty in metacognitive assessment of question complexity, and variable explanation quality. These findings emphasize that AI integration into geriatric education and practice requires careful human oversight, explicit acknowledgment of limitations, and continued validation in diverse real-world contexts.
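The metrics named in the abstract (per-question accuracy, consistency between the two attempts, and Cohen's kappa for agreement between model-assigned and reference difficulty ratings) are standard and straightforward to compute. The Python sketch below is illustrative only: the function names and toy data are assumptions for clarity, not the study's actual evaluation pipeline.

```python
# Minimal sketch of the evaluation metrics described in the abstract.
# All data and names here are hypothetical; the study's pipeline is not public.
from collections import Counter

def accuracy(answers, answer_key):
    """Fraction of questions answered correctly."""
    return sum(a == k for a, k in zip(answers, answer_key)) / len(answer_key)

def consistency(first_attempt, second_attempt):
    """Fraction of questions answered identically on both attempts."""
    return sum(a == b for a, b in zip(first_attempt, second_attempt)) / len(first_attempt)

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two sets of categorical ratings."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy data: six questions with difficulty labels easy/medium/hard.
model_difficulty = ["easy", "easy", "medium", "hard", "medium", "easy"]
reference_difficulty = ["easy", "medium", "medium", "hard", "hard", "easy"]
print(f"kappa = {cohens_kappa(model_difficulty, reference_difficulty):.2f}")  # kappa = 0.50
```

On the commonly used Landis and Koch scale, κ values between 0.41 and 0.60 are conventionally read as moderate agreement, consistent with the abstract's characterization of the mean κ = 0.41.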
Related works
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
An Experiment in Linguistic Synthesis with a Fuzzy Logic Controller
1999 · 5,633 citations
An experiment in linguistic synthesis with a fuzzy logic controller
1975 · 5,583 citations
A Framework for Representing Knowledge
1988 · 4,551 citations
Opinion Paper: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy
2023 · 3,431 citations