OpenAlex · Updated hourly · Last updated: 25.04.2026, 21:45

This is an overview page with metadata for this scientific work. The full article is available from the publisher.

Expert Evaluation and Consensus on Large Language Model Summaries of Clinical Letters: Results from the FRAIT Project (Preprint)

2025 · 0 citations · Open Access
Open full text at the publisher

Citations: 0 · Authors: 3 · Year: 2025

Abstract

<sec> <title>BACKGROUND</title> Large language models (LLMs) are increasingly used to summarize clinical documents. However, existing automated metrics often inadequately capture clinical relevance and safety. In the initial phase of the “Framework &amp; Implementation of AI Tools” (FRAIT) project, an expert-driven, co-created evaluation methodology was established to assess LLM-generated discharge-letter summaries. Both prompt content and intuitive assessment were integral to this collaborative process. </sec> <sec> <title>OBJECTIVE</title> To quantify expert agreement and interrater reliability on LLM-generated summaries of clinical discharge letters, identify frequent and clinically relevant errors, and evaluate practical implications for integrating LLMs into documentation workflows. </sec> <sec> <title>METHODS</title> Thirty expert-curated synthetic discharge letters were summarized. Thirty-one clinicians from diverse Belgian (Flemish) care settings, including one university hospital, two private hospitals, and two general practice circles, evaluated the LLM-generated summaries. The evaluation framework consisted of 61 binary layout items assessing the presence of requested sections according to the specified layout; 33 content items rated as correct and complete versus incorrect (with subcategories for missing information, irrelevant details, and incorrect or hallucinated content); an overall quality rating on a 4-point Likert scale (very bad to very good); and an open question for comments. Statistical analyses included descriptive statistics, mixed-effects ordinal regression on the global score, consensus percentages (agreement per question/letter), interrater reliability (Cohen’s kappa, intraclass correlation coefficient (ICC), Fleiss’ kappa), and thematic synthesis of comments. </sec> <sec> <title>RESULTS</title> Layout adherence was high (88%), especially in the Conclusion section. Content accuracy was moderate overall (78%), with the best performance observed in the Medical History section and the lowest in the Medication section, which also showed the highest rate of hallucinations and the weakest interrater consensus. Across all sections, missing information was the most common error. Nearly 70% of global ratings were “good” or “very good”. Higher content accuracy predicted better global scores (β = 0.079, p &lt; 0.001), while layout and participant specialty were not significant predictors of the global score. Consensus was high for layout questions (median 96.8%) and somewhat lower for content (median 83.9%), with the lowest agreement in the Medication section. Interrater agreement was moderate (median Cohen’s κ = 0.36, range 0.07–0.56), but overall reliability was high (ICC = 0.945), indicating strong consistency at the global level despite interrater variability. </sec> <sec> <title>CONCLUSIONS</title> The FRAIT framework offers a robust approach for evaluating LLM-generated discharge summaries, balancing usability and clinical relevance. Semantic integrity, especially regarding medication details, was identified as a key vulnerability. Perceived overall quality was driven by content accuracy. The high ICC for the global score, combined with lower item-level agreement, points to the need for clearer, context-specific prompts and standardized evaluation criteria to reduce interrater variability. Human oversight and targeted automated checks (for omissions and hallucinations) remain essential for the safe integration of LLMs into clinical workflows. </sec>
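The abstract reports Cohen’s kappa alongside raw consensus percentages; kappa corrects raw agreement for the agreement expected by chance from each rater’s marginal label frequencies. As a minimal illustration (not code from the paper, and using made-up ratings), a dependency-free sketch for two raters scoring the same items:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters over the same set of items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from the raters' marginal frequencies.
    """
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # Observed agreement: fraction of items where the raters match.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's label frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(ratings_a) | set(ratings_b)
    p_e = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary content ratings (1 = correct, 0 = incorrect)
# for four letters from two clinicians:
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
print(kappa)  # -> 0.5 (raw agreement is 0.75, chance agreement 0.5)
```

This is why a Medication section can show, say, 80% raw consensus but a much lower kappa: if most raters mark most items “correct”, much of that agreement is already expected by chance.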

Topics

Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling