This is an overview page with metadata for this scientific work. The full article is available from the publisher.

Evaluation Framework of Large Language Models in Medical Documentation: Development and Usability Study (Preprint)

2024 · 0 citations

Abstract

Background: The advancement of large language models (LLMs) offers significant opportunities for health care, particularly in the generation of medical documentation. However, challenges related to ensuring the accuracy and reliability of LLM outputs, coupled with the absence of established quality standards, have raised concerns about their clinical application.

Objective: This study aimed to develop and validate an evaluation framework for assessing the accuracy and clinical applicability of LLM-generated emergency department (ED) records, with the goal of improving the integration of artificial intelligence into health care documentation.

Methods: We organized the Healthcare Prompt-a-thon, a competitive event designed to explore the capabilities of LLMs in generating accurate medical records. The event involved 52 participants who generated 33 initial ED records using HyperCLOVA X, a Korean-specialized LLM. We applied a dual evaluation approach. First, clinical evaluation: 4 medical professionals evaluated the records using a 5-point Likert scale across 5 criteria: appropriateness, accuracy, structure/format, conciseness, and clinical validity. Second, quantitative evaluation: we developed a framework to categorize and count errors in the LLM outputs, identifying 7 key error types. Statistical methods, including Pearson correlation and intraclass correlation coefficients (ICC), were used to assess consistency and agreement among evaluators.

Results: The clinical evaluation demonstrated strong interrater reliability, with ICC values ranging from 0.653 to 0.887 (P<.001), and a test-retest reliability Pearson correlation coefficient of 0.776 (P<.001). Quantitative analysis revealed that invalid generation errors were the most common, constituting 35.38% of total errors, while structural malformation errors had the most significant negative impact on the clinical evaluation score (Pearson r=-0.654; P<.001). A strong negative correlation was found between the number of quantitative errors and clinical evaluation scores (Pearson r=-0.633; P<.001), indicating that higher error rates corresponded to lower clinical acceptability.

Conclusions: Our research provides robust support for the reliability and clinical acceptability of the proposed evaluation framework. It underscores the framework's potential to mitigate clinical burdens and foster the responsible integration of artificial intelligence technologies in health care, suggesting a promising direction for future research and practical applications in the field.
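The two statistics named in the Methods, Pearson correlation and ICC, can be sketched in plain Python as follows. This is a minimal illustration, not the authors' analysis code: the abstract does not say which ICC form was used, so the two-way random-effects, absolute-agreement, single-rater form (often written ICC(2,1)) is an assumption here, and the function names and the small data set in the demo are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` holds one row per record and one column per rater.
    The choice of this ICC form is an assumption, not taken from the abstract.
    """
    n = len(ratings)      # number of records (subjects)
    k = len(ratings[0])   # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 rows and columns")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Sums of squares for a two-way layout without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-record mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    denom = msr + (k - 1) * mse + k * (msc - mse) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (msr - mse) / denom


if __name__ == "__main__":
    # Hypothetical data: 5 records scored by 4 raters on a 5-point scale.
    scores = [
        [5, 4, 5, 4],
        [3, 3, 2, 3],
        [4, 4, 4, 5],
        [2, 1, 2, 2],
        [3, 4, 3, 3],
    ]
    # Hypothetical error counts for the same 5 records.
    error_counts = [0, 3, 1, 6, 2]
    mean_scores = [sum(row) / len(row) for row in scores]
    print("ICC(2,1):", round(icc_2_1(scores), 3))
    print("Pearson r (errors vs. mean score):",
          round(pearson_r(error_counts, mean_scores), 3))
```

In a real analysis, an established statistics package would normally be preferable to hand-written formulas, since it also reports confidence intervals and P values.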

Topics

Artificial Intelligence in Healthcare and Education; Artificial Intelligence in Healthcare; Electronic Health Records Systems
