This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Validity of Reasoning Generative Artificial Intelligence Models in Evaluating Japanese Objective Structured Clinical Examinations: A Preliminary Comparative Study with Clinical Educators (Preprint)
Citations: 0
Authors: 8
Year: 2026
Abstract
<sec> <title>BACKGROUND</title> Medical interview training is a cornerstone of clinical education but faces resource limitations in both implementation and evaluation. While Generative Artificial Intelligence (GAI) offers a potential solution for assessment, it remains unclear whether reasoning models improve evaluation validity, particularly within the linguistic context of the Japanese language. </sec> <sec> <title>OBJECTIVE</title> To evaluate the validity of state-of-the-art GAI models in Japanese medical interview training, we assessed their scoring patterns and agreement with human clinical educators. </sec> <sec> <title>METHODS</title> This preliminary comparative study was conducted at a medical university in Japan using text data derived from medical interview training, including both chatbot-based and traditional styles, with postgraduate year 1 and 2 residents. Two blinded human clinical educators independently evaluated the transcripts and then reached a consensus score through discussion; this consensus score served as the reference standard. Two GAI models, GPT-5.2 Thinking and Gemini 3.0 Pro, independently evaluated the same transcripts. All evaluations used a standardized 6-domain Objective Structured Clinical Examination rubric (patient care, history taking, physical examination, accuracy and organization of clinical information, clinical reasoning, and management) scored on a 1–6 Likert scale, where 1 indicates inferior and 6 excellent performance. We compared mean evaluation scores using the Wilcoxon signed-rank test and assessed inter-rater reliability between the GAI models and the clinical educators using Intraclass Correlation Coefficients (ICCs). </sec> <sec> <title>RESULTS</title> Clinical educators and both GAI models rated all 40 transcripts from the 20 included residents. Clinical educators assigned the highest overall mean scores (5.18, 95% CI 5.06-5.30).
Compared with the clinical educators, both GAI models demonstrated significant score deflation: GPT-5.2 Thinking assigned the lowest overall score (3.68, 95% CI 3.62-3.72; P<.001), followed by Gemini 3.0 Pro (4.09, 95% CI 3.97-4.21; P<.001). The discrepancy was most pronounced in the management domain, where GPT-5.2 Thinking assigned 2.93 (95% CI 2.79-3.06) compared with the clinical educators' 5.20 (95% CI 4.91-5.49). Agreement between the GAI models and the human raters was poor across all domains, with overall ICCs of 0.04 (95% CI 0.00-0.09) for GPT-5.2 Thinking and 0.22 (95% CI 0.10-0.35) for Gemini 3.0 Pro. </sec> <sec> <title>CONCLUSIONS</title> Unlike previous iterations of GAI, which tended to overestimate student performance, GPT-5.2 Thinking and Gemini 3.0 Pro graded more strictly than human experts. Given the significant score discrepancies and poor inter-rater agreement, these models currently lack the validity to serve as standalone summative evaluators for Japanese Objective Structured Clinical Examinations, although their rigorous detection of deficiencies may offer value for formative feedback. </sec> <sec> <title>CLINICALTRIAL</title> Trial Registration: UMIN-CTR UMIN000053747; https://center6.umin.ac.jp/cgi-open-bin/ctr_e/ctr_view.cgi?recptno=R000061336. </sec>
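The statistical workflow described in METHODS (a Wilcoxon signed-rank test on paired scores and an ICC for human–model agreement) can be sketched as follows. This is a minimal illustration on synthetic data: the scores, the deflation pattern, and all variable names below are assumptions for demonstration, not the study's data, and the ICC form shown is ICC(2,1) (two-way random effects, absolute agreement, single rater).

```python
# Illustrative sketch of the abstract's analysis on synthetic ratings.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n = 40  # number of transcripts (matches the study's dataset size)

# Hypothetical 1-6 Likert scores: educators rate high, the model deflates.
human = rng.integers(4, 7, size=n).astype(float)
model = np.clip(human - rng.integers(1, 3, size=n), 1, 6).astype(float)

# Paired comparison of scores (Wilcoxon signed-rank test).
stat, p = wilcoxon(human, model)

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` is an (n subjects) x (k raters) array."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)  # per-subject means
    col_means = ratings.mean(axis=0)  # per-rater means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)  # between-subjects
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)  # between-raters
    sse = ((ratings - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))  # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

icc = icc2_1(np.column_stack([human, model]))
print(f"Wilcoxon p = {p:.4g}, ICC(2,1) = {icc:.2f}")
```

Because absolute-agreement ICC penalizes systematic mean differences, a model that scores consistently lower than humans yields a low ICC even when its rank ordering of transcripts is similar, which mirrors the poor agreement reported in RESULTS.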
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,287 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,140 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,534 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,450 citations