This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Evaluating the Reliability and Safety of Large Language Model-Generated Transfer Notes: A Retrospective Validation Study

2026 · 0 citations · Intelligent Medicine · Open Access

Abstract

Large language models (LLMs) show great promise in processing medical texts, and the increasing specialization of clinical medicine has led to greater demand for efficient and accurate referral. This study evaluates the ability of the large language model DeepSeek-R1 to generate transfer notes for gastrointestinal surgery patients, comparing its performance with that of clinician-provided transfer notes in terms of completeness, accuracy, and clinician preference. A retrospective clinical analysis was conducted on 204 referral patients who underwent gastrointestinal surgery at Qingdao University Affiliated Hospital between January 2022 and June 2025. The LLM was trained on a small sample of four cases, and the remaining 200 cases were used as the test set. LLM-generated transfer notes were based on a structured template comprising predefined units. Two trained clinicians reviewed the completeness of the LLM-generated transfer notes (κ = 0.719, p < 0.05). The LLM's extraction performance was assessed quantitatively by calculating recall, precision, and F1 scores within the LLM-generated transfer notes. McNemar's test was used to compare the completeness of LLM-generated and clinician-provided transfer notes. Five clinicians conducted blinded, paired preference evaluations. Across the 200 generated transfer notes, DeepSeek showed excellent overall information-extraction performance, achieving a precision of 99% [95% CI: 98%, 99%], a recall of 97% [95% CI: 96%, 98%], and an F1 score of 0.98 [95% CI: 0.97, 0.98]. LLM-generated transfer notes were comparable in completeness to clinician-provided notes. Within the “Current Diagnosis” unit, LLM-generated notes were significantly more complete than clinician-provided notes (90% vs. 81.5%; 180 vs. 163; p < 0.05). There were no statistically significant differences across the remaining five assessment units (all p > 0.05). In the preference evaluation, clinicians preferred the LLM-generated notes in 39% (78/200) of cases and the clinician-provided notes in 13% (26/200), and rated the two as equivalent in 48% (96/200). DeepSeek can generate transfer notes that are accurate and of a quality similar to that of clinician-provided notes. Further evaluation in actual clinical settings is necessary.
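As a rough illustration of the statistics named in the abstract, the sketch below computes precision, recall, and F1 from raw counts, and an exact two-sided McNemar p-value from discordant pair counts. It is not the authors' analysis code: the function names are hypothetical, the choice of the exact (binomial) form of McNemar's test is an assumption, and the discordant counts used in the example are placeholders, since the abstract reports only the marginal totals (180 vs. 163) and not the paired breakdown.

```python
from math import comb


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts (hypothetical helper)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def f1_from_rates(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b: pairs where only the first note is complete
    c: pairs where only the second note is complete
    Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# The reported precision (0.99) and recall (0.97) are consistent
# with the reported F1 of 0.98.
print(round(f1_from_rates(0.99, 0.97), 2))

# Placeholder discordant counts, NOT taken from the study.
print(mcnemar_exact(9, 1))
```

The harmonic-mean check uses only figures stated in the abstract. The McNemar call shows the shape of the computation, and its result says nothing about the study's actual p-values.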

Topics

Topic Modeling · Biomedical Text Mining and Ontologies · Artificial Intelligence in Healthcare and Education