This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Validity of Reasoning Generative Artificial Intelligence Models in Evaluating Japanese Objective Structured Clinical Examinations: A Preliminary Comparative Study with Clinical Educators (Preprint)
Citations: 0
Authors: 8
Year: 2026
Abstract
<sec> <title>BACKGROUND</title> Medical interview training is a cornerstone of clinical education but faces resource limitations in both implementation and evaluation. While Generative Artificial Intelligence (GAI) offers a potential solution for assessment, it remains unclear whether reasoning models improve evaluation validity, particularly within the linguistic context of the Japanese language. </sec> <sec> <title>OBJECTIVE</title> To evaluate the validity of state-of-the-art GAI models in Japanese medical interview training, we assessed their scoring patterns and agreement with human clinical educators. </sec> <sec> <title>METHODS</title> This preliminary comparative study was conducted at a medical university in Japan using text data derived from medical interview training, including both chatbot-based and traditional styles, with postgraduate year 1 and 2 residents. Two blinded human clinical educators independently evaluated the transcripts and then reached a consensus score through discussion; this consensus score served as the reference standard. Two GAI models, GPT-5.2 Thinking and Gemini 3.0 Pro, independently evaluated the same transcripts. All evaluations used a standardized 6-domain Objective Structured Clinical Examination rubric (patient care, history taking, physical examination, accuracy and organization of clinical information, clinical reasoning, and management) scored on a 1–6 Likert scale, where 1 indicates inferior and 6 excellent performance. We compared mean evaluation scores using the Wilcoxon signed-rank test and assessed inter-rater reliability between the GAI models and the clinical educators using Intraclass Correlation Coefficients (ICCs). </sec> <sec> <title>RESULTS</title> Clinical educators and both GAI models rated all 40 transcripts from the 20 included residents. Clinical educators assigned the highest overall mean scores (5.18, 95% CI 5.06-5.30).
Compared with the clinical educators, both GAI models demonstrated significant score deflation: GPT-5.2 Thinking assigned the lowest overall score (3.68, 95% CI 3.62-3.72; P<.001), followed by Gemini 3.0 Pro (4.09, 95% CI 3.97-4.21; P<.001). The discrepancy was most pronounced in the management domain, where GPT-5.2 Thinking assigned 2.93 (95% CI 2.79-3.06) compared with the clinical educators' 5.20 (95% CI 4.91-5.49). Agreement between the GAI models and the human raters was poor across all domains, with overall ICCs of 0.04 (95% CI 0.00-0.09) for GPT-5.2 Thinking and 0.22 (95% CI 0.10-0.35) for Gemini 3.0 Pro. </sec> <sec> <title>CONCLUSIONS</title> Unlike previous iterations of GAI, which tended to overestimate student performance, GPT-5.2 Thinking and Gemini 3.0 Pro graded more strictly than human experts. Given the significant score discrepancies and poor inter-rater agreement, these models currently lack the validity to serve as standalone summative evaluators for Japanese Objective Structured Clinical Examinations, although their rigorous detection of deficiencies may offer value for formative feedback. </sec> <sec> <title>CLINICALTRIAL</title> Trial Registration: UMIN-CTR UMIN000053747; https://center6.umin.ac.jp/cgi-open-bin/ctr_e/ctr_view.cgi?recptno=R000061336. </sec>
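The statistical workflow described in METHODS (a Wilcoxon signed-rank test on paired scores and an ICC for human–model agreement) can be sketched as follows. This is a minimal illustration on synthetic data: the scores, the deflation pattern, and all variable names below are assumptions for demonstration, not the study's data, and the ICC form shown is ICC(2,1) (two-way random effects, absolute agreement, single rater).

```python
# Illustrative sketch of the abstract's analysis on synthetic ratings.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n = 40  # number of transcripts (matches the study's dataset size)

# Hypothetical 1-6 Likert scores: educators rate high, the model deflates.
human = rng.integers(4, 7, size=n).astype(float)
model = np.clip(human - rng.integers(1, 3, size=n), 1, 6).astype(float)

# Paired comparison of scores (Wilcoxon signed-rank test).
stat, p = wilcoxon(human, model)

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` is an (n subjects) x (k raters) array."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)  # per-subject means
    col_means = ratings.mean(axis=0)  # per-rater means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)  # between-subjects
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)  # between-raters
    sse = ((ratings - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))  # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

icc = icc2_1(np.column_stack([human, model]))
print(f"Wilcoxon p = {p:.4g}, ICC(2,1) = {icc:.2f}")
```

Because absolute-agreement ICC penalizes systematic mean differences, a model that scores consistently lower than humans yields a low ICC even when its rank ordering of transcripts is similar, which mirrors the poor agreement reported in RESULTS.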
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,287 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,140 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,534 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,450 citations