This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Toward the Comprehensive Evaluation of Medical Text Generation by Large Language Models: Programmatic Metrics, Human Assessment, and Large Language Models Judgment
Citations: 2
Authors: 1
Year: 2025
Abstract
This commentary discusses three evaluation approaches for assessing large language models' generation in healthcare: programmatic metrics, human assessment, and large language models judgment. No single approach can address all challenges; however, the combination of these three methods provides a pipeline toward the comprehensive evaluation of medical text generation.

Large language models (LLMs) such as Generative Pre-trained Transformer 4 (GPT-4), characterized by their extensive parameterization (e.g., exceeding 100 billion parameters) [1], predict the likelihood of subsequent word tokens based on the input context and demonstrate exceptional performance across a broad range of medical specialties, such as radiology [2], nephrology [3], and dermatology [4]. Unlike general tasks, where erroneous outputs from LLMs are either readily identifiable or have limited consequences, errors in medical text are often imperceptible to individuals without specialized medical knowledge and pose significant risks to patient safety [5, 6]. Consequently, a comprehensive evaluation of LLMs in clinical text generation is essential before their real-world release for users such as patients and healthcare professionals [7].

Broadly speaking, medical text generation can be categorized into two types: closed-ended and open-ended generation. Closed-ended generation addresses tasks with predefined answers, as exemplified by Wu et al. [3], who used LLMs to answer multiple-choice questions created by the American Society of Nephrology. By contrast, open-ended generation supports more flexible outputs, which enables the handling of complex and dynamic tasks. For instance, Wan et al. [8] used LLMs to facilitate conversations in the medical reception area, where interactions encompass the diverse topics of general administration, real-time triaging, and addressing primary care concerns, which are not easily formalized within a closed-ended framework.
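For the closed-ended setting, scoring can be fully automated. The sketch below is a minimal, hypothetical illustration (the function names and example outputs are invented here, not taken from the cited studies): a regular expression pulls the chosen option letter out of free-form model output, and accuracy is the fraction of extracted letters that match the gold answers.

```python
import re

def extract_choice(output):
    """Heuristically pull a standalone multiple-choice letter (A-E) from free-form text."""
    match = re.search(r"\b([A-E])\b", output)
    return match.group(1) if match else None

def accuracy(outputs, gold):
    """Fraction of outputs whose extracted choice matches the gold letter."""
    hits = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return hits / len(gold)

# Toy example: two of three answers match the gold labels.
outputs = ["The answer is C.", "Option A, because eGFR is reduced.", "B"]
gold = ["C", "A", "D"]
score = accuracy(outputs, gold)
```

The regex post-processing step reflects the kind of pattern extraction needed for weakly aligned models; real benchmarks would add tie-breaking rules for outputs that mention several option letters.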
For closed-ended generation, programmatic metrics serve as the gold standard for evaluation. By contrast, evaluating open-ended generation, as shown in Figure 1, requires a more comprehensive approach that incorporates programmatic metrics, human assessment, and LLMs judgment to ensure a well-rounded analysis.

Figure 1: Representative techniques toward the comprehensive evaluation of medical text generation.

Programmatic metrics refer to clearly defined mathematical formulas used to compare the LLMs' generation and ground-truth answers without the involvement of humans. For closed-ended tasks, the LLMs' outputs and the ground-truth labels are structurally predefined. Advanced LLMs, following instruction tuning, can accurately adhere to user prompts and generate corresponding responses. By contrast, LLMs with weaker alignment may require additional processing, such as regular expressions to identify patterns and extract relevant answers [3]. Then, LLMs' answers are compared with the ground-truth labels using standard metrics, such as accuracy [3, 9, 10], F1-score [11], and area under the receiver operating characteristic curve [11, 12].

For open-ended tasks, where LLMs' generation and golden labels are in a free-text format, two commonly used metrics in the general domain are bilingual evaluation understudy (BLEU) [13] and recall-oriented understudy for gisting evaluation (ROUGE) [14]. Both metrics calculate the similarity between LLMs-generated sentences and expert-provided sentences based on contiguous sequences of n tokens, a.k.a. n-grams. BLEU emphasizes the precision of overlapped n-grams with a penalty for overly short outputs, whereas ROUGE offers variants that measure precision, recall, and F1-score to provide more flexibility. The two metrics are also widely used in medical text generation [15]. For instance, Sushil et al. [16] quantified LLMs' performance by BLEU and ROUGE in extracting clinically meaningful, complex concepts and relations from oncology reports.
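As a concrete illustration of the n-gram overlap underlying ROUGE, here is a minimal ROUGE-N sketch; the example sentences and function names are our own, and production work should rely on an established library rather than this simplification.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall, and F1 from clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped: counts capped by reference
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1

# Toy example: the candidate drops two reference words, so precision is
# perfect while recall is penalized.
reference = "the patient shows no sign of pneumonia".split()
candidate = "the patient shows no pneumonia".split()
p, r, f = rouge_n(candidate, reference, n=1)
```

BLEU differs mainly in aggregating clipped n-gram precision over several orders of n and multiplying by a brevity penalty for short candidates, which this sketch omits.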
In addition to BLEU and ROUGE, programmatic metrics such as the Word Error Rate [17], the Metric for Evaluation of Translation with Explicit Ordering [18], and the Bidirectional Encoder Representations from Transformers Score [19] have also been used [3, 20]. For readers seeking detailed computational insights into these programmatic metrics, we recommend consulting the comprehensive survey by Sai, Mohankumar, and Khapra [21]. Although programmatic metrics provide automatic quantification and are validated to correlate positively with user preferences [19], they fall short of perfectly capturing key aspects such as readability, usability, factuality, toxicity, and empathy [4, 8, 20, 22].

To address these limitations, human assessment is frequently used as an additional safeguard to ensure that the quality of LLMs-generated content meets the stringent criteria required for clinical applicability [23, 24]. For instance, Sandmann et al. [25] systematically analyzed LLMs in the suggestion of initial diagnoses, examination steps, and treatment plans for diverse clinical cases. Two independent physicians evaluated LLMs' outputs based on three criteria: inclusion of relevant options, avoidance of redundancy, and prevention of unjustified statements. A more comprehensive human assessment was conducted by Singhal et al. [22], which serves as an excellent reference for medical professionals developing in-house assessment frameworks. Specifically, three qualified clinicians manually evaluated LLMs' outputs across five key dimensions: (1) scientific consensus; (2) comprehension, knowledge retrieval, and reasoning capabilities; (3) potential physical or mental harm; (4) incorrect or missing content; and (5) bias for medical demographics. Additionally, five laypeople without formal medical training were invited to assess the intent fulfillment, helpfulness, and actionability of LLMs' generation. Integrating evaluation by clinicians and laypersons, as also evidenced in Wan et al. [8], can address the needs of diverse stakeholders. Future researchers are encouraged to conduct human assessment to evaluate both clinical accuracy and user accessibility.

Human assessment, implemented by experienced medical practitioners, remains the gold standard for evaluation. However, it is time-consuming, lacks scalability in high-volume contexts, and requires multiple evaluators to reduce variance and ensure consistency. Programmatic metrics, while scalable, primarily focus on surface-level evaluation and often fail to assess deeper qualities such as contextual relevance, logical coherence, or factual accuracy. Recent advancements in LLMs have enhanced their language comprehension and knowledge synthesis capabilities to a near-human level [9], which has led to their use as an intermediary evaluation method between human assessment and programmatic metrics [26]. Fast et al. [27] reported an average F1-score of 97% for agreement between GPT-4 judgment and human assessment across 20 diagnostic scenarios spanning 13 specialties. Notably, this study revealed no significant bias in GPT-4's self-assessment and highlighted the potential of using LLMs judgment as a robust and objective method for assessing both their own performance and that of other LLMs.

To summarize, this commentary discussed three evaluation approaches for assessing LLMs' generation in healthcare and illustrated their applications with exemplificative cases. No single approach can address all challenges; however, the combination of these three methods, exemplified by Qiu et al. [9], provides a pipeline toward the comprehensive evaluation of medical text generation.

Author Contributions
Han Yuan: conceptualization, formal analysis, investigation, software, validation, visualization, writing–original draft, writing–review and editing.

The author has nothing to report.

Ethics Statement
This study is exempted from review by the ethics committee as it does not involve human participants, animal subjects, or the collection of sensitive data.
Conflicts of Interest
The author declares no conflicts of interest.
Related Works

"Why Should I Trust You?"
2016 · 14,286 citations

A Comprehensive Survey on Graph Neural Networks
2020 · 8,651 citations

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,177 citations

High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,575 citations

Artificial intelligence in healthcare: past, present and future
2017 · 4,404 citations