OpenAlex · Updated hourly · Last updated: 15.03.2026, 03:23

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Train-Time and Test-Time Computation in Large Language Models for Error Detection and Correction in Electronic Medical Records: A Retrospective Study

2025 · 1 citation · Diagnostics · Open Access
Open full text at the publisher

Citations: 1 · Authors: 6 · Year: 2025

Abstract

<b>Background/Objectives:</b> This study examines the effectiveness of train-time computation, test-time computation, and their combination on the performance of large language models applied to an electronic medical record quality management system. It identifies the most effective combination of models for enhancing clinical documentation performance and efficiency. <b>Methods:</b> A total of 597 clinical medical records were selected from the MEDEC-MS dataset, 10 of which were used for prompt engineering to guide model training. Eight large language models were employed, focusing on train-time computation and test-time computation. Model performance on specific error types was assessed using precision, recall, F1 score, and error correction accuracy. The dataset was divided into training and testing sets in a 7:3 ratio. An assembly model was built by applying binary logistic regression to the outputs of the top-performing models; its performance was evaluated using area-under-the-curve values and model weights. <b>Results:</b> GPT-4 and DeepSeek R1 demonstrated higher overall accuracy in detecting errors. Models focusing on train-time computation exhibited shorter reasoning times and stricter error detection, while models emphasizing test-time computation achieved higher error correction accuracy. GPT-4 was particularly effective on issues related to causal organisms, management, and pharmacotherapy, whereas models focusing on test-time computation performed better on tasks involving diagnosis and treatment. The assembly model, combining train-time and test-time computation, outperformed any single large language model (assembly model accuracy: 0.690 vs. GPT-4 accuracy: 0.477).
<b>Conclusions:</b> Models focusing on train-time computation demonstrated greater processing efficiency, while models focusing on test-time computation showed higher accuracy and interpretability in identifying quality issues in electronic medical records. Assembling the train-time and test-time computation strategies may strike a balance between high accuracy and model efficiency, thereby enhancing the development of electronic medical records and improving medical care.
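The Methods section reports per-error-type performance using precision, recall, and F1. As a minimal sketch of how these metrics relate to confusion-matrix counts (the counts below are illustrative only, not taken from the study):

```python
# Sketch of the evaluation metrics named in the Methods section:
# precision, recall, and F1 computed from confusion-matrix counts.
# All counts here are hypothetical, not results from the paper.

def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical error-detection counts for one model on one error type:
p, r, f = precision_recall_f1(tp=40, fp=10, fn=20)
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.8 0.667 0.727
```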

Topics

AI in cancer detection · Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging