This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Can AI Match Human Experts? Evaluating LLM-Generated Feedback on Resident Scholarly Projects
Citations: 0
Authors: 9
Year: 2026
Abstract
Background: Delivering timely, high-quality feedback on resident scholarly projects is labour-intensive, especially in large programmes. We developed an AI-assisted evaluation system, powered by the open-weight LLaMA-3.1 large language model (LLM), to generate formative feedback on Family Medicine residents' scholarly projects and compared its performance with that of expert human evaluators.

Methods: We evaluated whether the AI-generated feedback matches the quality of expert feedback. The tool ingests heterogeneous resident submissions (PDFs, scans, photographs) via OCR and produces section-by-section feedback aligned with programme rubrics. In a three-phase study we evaluated 240 feedback reports (Short, Question and Timeline, Final; n = 80 each). Within each phase, 40 reports were AI-generated and 40 were produced by research experts across four project types: Quality Improvement, Survey-Based, Research, and Literature Review. Blinded raters used a 25-item survey covering five constructs: understanding & reasoning, trust & confidence, quality of information, expression style & persona, and safety & harm.

Results: Survey reliability was high across phases (α = .71–.98). Human feedback generally outscored AI feedback. In short reports, humans led on quality (mean ± SD: 4.14 ± 0.57 vs 3.09 ± 1.05) and trust (3.96 ± 0.71 vs 2.78 ± 1.15). In final reports, differences became small for quality (4.09 ± 0.65 vs 3.49 ± 0.68) and persona (4.16 ± 0.40 vs 3.91 ± 0.50), while AI was preferred for safety (4.50 ± 0.60 vs 4.36 ± 0.56). Performance varied by project type: in survey-based final reports the AI led on quality (4.28 ± 0.50 vs 3.98 ± 0.44) and safety (4.58 ± 0.40 vs 4.24 ± 0.67), whereas in quality-improvement short reports humans were markedly superior in reasoning (4.27 ± 0.68 vs 2.33 ± 1.00).

Conclusions: An open-weight LLM with curated prompts can generate rubric-aligned feedback at scale that approaches the quality of expert human feedback. While expert feedback remained superior overall, AI surpassed humans in selected contexts and in safety assessments. The tool's performance will improve over time as newer, more capable open-weight models are released. Our code and system prompts are open source.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,312 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,169 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,564 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,466 citations