This is an overview page with metadata about this scholarly work. The full article is available from the publisher.

Comparing AI- and Human-Based Assessments of Medical Interview Transcripts Using a Generative AI Simulated Patient System. (Preprint)

2025 · 0 citations · 15 authors · Open Access

Abstract

BACKGROUND: Generative AI is increasingly used in medical education, including AI-based virtual patients for practicing interview skills. However, it remains unclear how much AI-based assessment (ABA) differs from human-based assessment (HBA).

OBJECTIVE: This study aimed to compare the quality of clinical interview assessments generated by an ABA using a virtual patient with those provided by an HBA conducted by clinical instructors. It also evaluated whether the use of AI could lead to a measurable reduction in evaluation time and examined the level of agreement across participants with differing levels of clinical experience.

METHODS: A standardized leg-weakness case was implemented in an AI-based virtual patient. Seven participants (two medical students, three resident physicians, and two attending physicians) each conducted an interview, and transcripts were scored with the Master Interview Rating Scale (MIRS; 25 items, 0–5 scale; total 0–125). Two evaluation strategies were compared. (1) ChatGPT o1-Pro scored each transcript five times with different random seeds to assess case specificity; total runtime for the five scores was logged automatically. (2) Five blinded clinical instructors, after a preparatory webinar reviewing the rubric and practicing on sample transcripts, each rated every transcript once and recorded clock time per rating. Because the five AI outputs are replicates of the same algorithm, intraclass correlation coefficients (ICC) were used to quantify repeatability rather than inter-rater reliability; for human raters, we calculated ICC(2,1). Mean scores from both methods were compared, and agreement was quantified with Pearson's r, Lin's concordance correlation coefficient (ρc), Bland–Altman limits of agreement (LoA), internal consistency (Cronbach's α), and ICC. Time efficiency was expressed as mean minutes per transcript and as the relative percentage reduction achieved by AI scoring.

RESULTS: Mean interview scores were similar for ABA and HBA (52.1 ± 6.9 vs 53.7 ± 6.8). Agreement was strong (r = 0.92; ρc = 0.92) with minimal bias (+0.4 points; LoA −4.9 to +5.7). ABA showed higher internal consistency (α = 0.936 vs 0.863) and greater inter-rater reliability (ICC = 0.77 vs 0.38). The coefficient of variation of ABA scores was roughly half that of HBA scores (6.6% vs 13.9%). In addition, ChatGPT completed each five-run analysis in 4.3 ± 1.7 minutes compared with 10.3 ± 3.3 minutes for physicians, a 58% reduction in assessment time.

CONCLUSIONS: ABA scores closely matched HBA scores while showing superior consistency and reliability. In the setting of virtual clinical interview transcripts, these preliminary findings suggest that ABA has potential as a valid, rapid, and scalable alternative to HBA. Applied strategically, it could furnish timely formative feedback, yield efficiency gains, and reduce faculty workload; further research is needed to determine whether this can be achieved without compromising assessment quality.
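
As an illustration of the agreement statistics named in the abstract (Pearson's r, Lin's ρc, Bland–Altman limits of agreement, and the coefficient of variation), the Python sketch below shows how such values can be computed from two vectors of per-interview MIRS totals. The score arrays and variable names are hypothetical placeholders, not study data or the authors' analysis code; ICC(2,1) and Cronbach's α are omitted and would typically be computed with a dedicated statistics package.

# Minimal sketch (not the authors' code): agreement statistics for two
# sets of MIRS totals. The arrays below are hypothetical placeholders.
import numpy as np

def lins_ccc(x, y):
    # Lin's concordance correlation coefficient (rho_c)
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    return 2 * sxy / (x.var(ddof=1) + y.var(ddof=1) + (x.mean() - y.mean()) ** 2)

def bland_altman(x, y):
    # Mean bias and 95% limits of agreement (bias +/- 1.96 SD of differences)
    d = np.asarray(x, float) - np.asarray(y, float)
    bias, sd = d.mean(), d.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical mean totals for the seven interviews (one value per interview)
aba = np.array([45.0, 50.0, 52.0, 55.0, 58.0, 48.0, 57.0])  # AI-based assessment
hba = np.array([46.0, 52.0, 54.0, 56.0, 60.0, 50.0, 58.0])  # human-based assessment

r = np.corrcoef(aba, hba)[0, 1]               # Pearson's r
ccc = lins_ccc(aba, hba)                      # Lin's rho_c
bias, loa = bland_altman(aba, hba)            # Bland-Altman bias and LoA
cv_aba = aba.std(ddof=1) / aba.mean() * 100   # coefficient of variation, %
time_saving = (10.3 - 4.3) / 10.3 * 100       # reported means imply a ~58% time reduction

print(f"r={r:.2f}  rho_c={ccc:.2f}  bias={bias:+.1f}  "
      f"LoA=({loa[0]:.1f}, {loa[1]:.1f})  CV={cv_aba:.1f}%  saving={time_saving:.0f}%")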
