OpenAlex · Updated hourly · Last updated: 15.03.2026, 10:43

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Large Language Models in Rubric-Based Scoring of Students’ Clinical Summary Statements: Experimental Comparative Study on Test–Retest Reliability and Concordance with Human Experts (Preprint)

2026 · 0 citations · Open Access
Open full text at the publisher

Citations: 0
Authors: 6
Year: 2026

Abstract

<sec> <title>BACKGROUND</title> Clinical summary statements convert unstructured patient information into a concise format. They are widely used in virtual patients (VPs) to train medical students in diagnostic reasoning. Most VP systems provide little individualized feedback on the quality of learners’ summaries, and earlier automated approaches have required complex natural language processing, limiting their scalability. Large language models (LLMs) may offer a more flexible, rubric-guided way to score summary statements and support feedback, but their test–retest reliability and concordance with expert ratings in this context remain unclear. </sec> <sec> <title>OBJECTIVE</title> To examine the test–retest reliability of LLM-generated medical summary statement ratings and their concordance with human expert assessments under different prompting strategies. </sec> <sec> <title>METHODS</title> We conducted an experimental comparative study using 122 rubric-scored summary statements (English and German) from six internal medicine VP cases previously rated by two expert raters. Statements were evaluated along four ternary components (Semantic Qualifiers, Transformation, Narrowing, Global Rating) and two binary components (Factual Accuracy, Person). Four LLMs (Claude, GPT-4o, Mistral Large, Mistral 7B) were run under four prompting strategies (zero-shot, zero-shot with chain-of-thought [CoT], few-shot, and few-shot with CoT), each repeated five times. Test–retest reliability was assessed with intraclass correlation coefficients (ICCs) and Fleiss’ kappa (κ). Concordance with human ratings was quantified using accuracy (%) and chance-corrected interrater agreement (Gwet’s AC1/AC2). </sec> <sec> <title>RESULTS</title> All models showed high test–retest reliability (ICC ≥0.89 and Fleiss’ kappa 0.32–1.00). 
Claude was essentially deterministic (ICC ≥0.99; Fleiss’ kappa ≥0.99), and GPT-4o and Mistral Large also showed high test–retest reliability (ICC ≥0.97; Fleiss’ kappa ≥0.80, excluding zero-shot), whereas Mistral 7B had high ICCs (≥0.88) but relatively low Fleiss’ kappa values (0.32–0.53). Few-shot with CoT prompting achieved the best concordance with human ratings for all models, peaking for Mistral Large (accuracy: mean=77%; Gwet’s AC1/AC2: mean=0.85, CI 0.83–0.87), followed by Claude (accuracy: mean=71%; Gwet’s AC1/AC2: mean=0.77, CI 0.74–0.81), GPT-4o (accuracy: mean=67%; Gwet’s AC1/AC2: mean=0.72, CI 0.68–0.77), and Mistral 7B (accuracy: mean=60%; Gwet’s AC1/AC2: mean=0.64, CI 0.56–0.73). Concordance with human ratings was higher for the binary rubric components (Person, Factual Accuracy) than for the ternary components. Extreme deviations and invalid outputs were rare but more frequent for the ternary components and with the smaller local model. </sec> <sec> <title>CONCLUSIONS</title> Large, general-purpose LLMs can function as reliable, rubric-guided raters of VP-based clinical summary statements, achieving high concordance with expert ratings under clearly defined criteria and when using few-shot with CoT prompting. Performance on the more nuanced, ternary reasoning dimensions was more variable. LLMs can support scalable, rubric-based formative feedback on clinical reasoning within VP systems but should complement rather than replace human judgment. </sec>
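The abstract reports concordance as Gwet’s AC1, a chance-corrected agreement coefficient that is more stable than Cohen’s kappa when category prevalence is skewed. As a minimal sketch (not the authors’ code, and with made-up example ratings), the two-rater AC1 for a nominal rubric component can be computed like this:

```python
# Sketch: Gwet's AC1 for two raters (e.g. an LLM and a human expert) scoring
# the same items on a nominal rubric component. The ratings below are
# illustrative, not data from the study.
from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 agreement coefficient for two raters over the same items."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    k = len(categories)
    # Observed agreement: fraction of items on which both raters match.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Average category prevalence across both raters.
    counts = Counter(ratings_a) + Counter(ratings_b)
    pi = {c: counts[c] / (2 * n) for c in categories}
    # Chance-agreement term under Gwet's model.
    p_e = sum(p * (1 - p) for p in pi.values()) / (k - 1)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ternary component (0/1/2) scored by a human and an LLM:
human = [2, 1, 0, 2, 2, 1, 0, 1, 2, 2]
llm   = [2, 1, 0, 2, 1, 1, 0, 1, 2, 2]
print(round(gwet_ac1(human, llm), 3))
```

For the ordinal (weighted) variant AC2 used alongside AC1 in the paper, both the observed and chance terms are additionally weighted by inter-category distance; dedicated packages implement that extension.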

Topics

Clinical Reasoning and Diagnostic Skills · Artificial Intelligence in Healthcare and Education · Topic Modeling