OpenAlex · Updated hourly · Last updated: 26.03.2026, 06:15

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Can AI Match Human Experts? Evaluating LLM-Generated Feedback on Resident Scholarly Projects

2026 · 0 citations · medRxiv · Open Access

Citations: 0 · Authors: 9 · Year: 2026

Abstract

Background
Delivering timely, high-quality feedback on resident scholarly projects is labour-intensive, especially in large programmes. We developed an AI-assisted evaluation system, powered by the open-weight LLaMA-3.1 large language model (LLM), to generate formative feedback on Family Medicine residents' scholarly projects and compared its performance with that of expert human evaluators.

Methods
We evaluated whether AI-generated feedback achieves quality comparable to expert feedback. The tool ingests heterogeneous resident submissions (PDFs, scans, photographs) via OCR and produces section-by-section feedback aligned with programme rubrics. In a three-phase study we evaluated 240 feedback reports (Short, Question and Timeline, and Final; n = 80 each). Within each phase, 40 reports were AI-generated and 40 were produced by research experts, across four project types: Quality Improvement, Survey-Based, Research, and Literature Review. Blinded raters used a 25-item survey covering five constructs: understanding & reasoning, trust & confidence, quality of information, expression style & persona, and safety & harm.

Results
Survey reliability was high across phases (α = .71–.98). Human feedback generally outscored AI feedback. In short reports, humans led on quality (mean ± SD; 4.14 ± 0.57 vs 3.09 ± 1.05) and trust (3.96 ± 0.71 vs 2.78 ± 1.15). In final reports, differences were small for quality (4.09 ± 0.65 vs 3.49 ± 0.68) and persona (4.16 ± 0.40 vs 3.91 ± 0.50), while AI was preferred for safety (4.50 ± 0.60 vs 4.36 ± 0.56). Performance varied by project type: in survey-based final reports the AI led on quality (4.28 ± 0.50 vs 3.98 ± 0.44) and safety (4.58 ± 0.40 vs 4.24 ± 0.67), whereas in quality-improvement short reports humans were markedly superior in reasoning (4.27 ± 0.68 vs 2.33 ± 1.00).

Conclusions
An open-weight LLM with curated prompts can generate rubric-aligned feedback at scale that approaches the quality of expert human feedback. While expert feedback remained superior overall, AI surpassed humans in selected contexts and in safety assessments. The tool's performance will improve over time as newer, more capable open-weight models are released. Our code and system prompts are open source.
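The Methods describe a pipeline that pairs OCR-extracted section text with programme rubric criteria before sending it to the LLM. The following is a minimal illustrative sketch of that prompt-assembly step only; the rubric items, section names, and function name below are hypothetical assumptions, not the authors' released system prompts (those are in their open-source repository).

```python
# Hypothetical rubric for one report phase; the real programme rubrics differ.
RUBRIC = {
    "Question and Timeline": [
        "Is the research question focused and answerable?",
        "Is the timeline realistic for a residency project?",
    ],
    "Methods": [
        "Does the design match the project type (QI, survey, research, review)?",
    ],
}

def build_feedback_prompt(section: str, section_text: str) -> str:
    """Combine one OCR-extracted section with its rubric criteria
    to form a section-by-section feedback prompt for an open-weight LLM."""
    criteria = "\n".join(f"- {c}" for c in RUBRIC.get(section, []))
    return (
        f"You are giving formative feedback on the '{section}' section of a "
        "Family Medicine resident's scholarly project.\n"
        f"Address each criterion explicitly:\n{criteria}\n\n"
        f"Section text:\n{section_text}"
    )

print(build_feedback_prompt("Question and Timeline", "We will survey residents about burnout."))
```

In a full pipeline this prompt would be sent to the model (e.g. LLaMA-3.1 served locally) once per section, and the per-section responses concatenated into the final feedback report.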
