OpenAlex · Updated hourly · Last updated: 24 Mar 2026, 14:49

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluation of Large Language Model Performance on the Biomedical Language Understanding and Reasoning Benchmark: Comparative Study (Preprint)

2024 · 1 citation · Open Access
Open full text at the publisher

Citations: 1 · Authors: 10 · Year: 2024

Abstract

BACKGROUND: The availability of increasingly powerful large language models (LLMs) has attracted substantial interest in their potential for interpreting and generating human-like text for biomedical and clinical applications. However, demands for high accuracy, concerns about balancing generalizability and domain specificity, and questions about prompting robustness hamper the adoption of LLMs for specific use cases. There is also no established framework or method to help choose which LLMs, or which prompting strategies, should be adopted for specific biomedical or clinical tasks.

OBJECTIVE: To address concerns about the applicability and principled usage of generative LLMs for solving complex biomedical tasks, this study (1) proposes a framework to comprehensively evaluate and compare the performance of a range of LLMs and prompting techniques on a suite of biomedical natural language processing (NLP) tasks; (2) uses the framework to benchmark several general-purpose LLMs and biomedical domain-specific LLMs; and (3) demonstrates how principled prompting strategies can increase model performance, often enabling smaller models to surpass their larger counterparts.

METHODS: We evaluated and compared six general-purpose LLMs (GPT-4, GPT-3.5-Turbo, Flan-T5-XXL, Llama-3-8B-Instruct, Yi-1.5-34B-Chat, and Zephyr-7B-Beta) and three healthcare-specific LLMs (Medicine-Llama3-8B, Meditron-7B, and MedLLaMA-13B) on a set of 13 datasets, referred to as the Biomedical Language Understanding and Reasoning Benchmark (BLURB). BLURB covers six widespread biomedical NLP tasks: named entity recognition (NER); relation extraction (RE); population, interventions, comparators, and outcomes (PICO); sentence similarity (SS); document classification (Class.); and question answering (QA). Model performance was assessed under a systematic, reusable prompting-strategy framework and measured with the standard BLURB task-specific evaluation metrics.

RESULTS: Across all tasks, GPT-4 outperformed the other LLMs, achieving a BLURB score of 64.6 on the benchmark, though other models, such as Flan-T5-XXL and Llama-3-8B-Instruct, were competitive on multiple tasks. General-purpose models achieved better overall scores than domain-specific models, sometimes by substantial margins. Strategically editing the prompt describing the task, in particular by including examples semantically similar to the input text, substantially improved performance. Additionally, for nearly half the models, the best-performing prompts exceeded the previously reported best results for the PubMedQA dataset on the BLURB leaderboard.

CONCLUSIONS: These results demonstrate the potential of LLMs for biomedical applications and highlight the importance of robust evaluation before adopting them for any specific use case. Notably, performant open-source LLMs such as Llama-3-8B-Instruct and Flan-T5-XXL show promise for use cases where trustworthiness and data confidentiality are concerns, as these models can be hosted locally, offering better security, transparency, and explainability. Further exploring how these emerging technologies can be adapted to the healthcare setting, paired with human expertise, and enhanced through quality-control measures will be fundamental to enabling responsible innovation with LLMs in the biomedical domain.
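The RESULTS section credits much of the prompting gain to including few-shot examples semantically similar to the input text. The sketch below illustrates that idea in Python; it is not the paper's implementation, and the embedding model, the text/label fields of the example pool, and both function names are assumptions made for illustration only.

# Minimal sketch (assumed, not the authors' code) of few-shot example
# selection by semantic similarity, the prompting strategy the abstract
# reports as improving performance.
from sentence_transformers import SentenceTransformer, util

# Assumed general-purpose sentence-embedding model; any comparable
# encoder could stand in here.
model = SentenceTransformer("all-MiniLM-L6-v2")

def select_few_shot_examples(input_text, candidate_pool, k=3):
    """Return the k pool items most semantically similar to input_text.

    candidate_pool is assumed to be a list of dicts with "text" and
    "label" keys (an illustrative schema, not the paper's).
    """
    input_emb = model.encode(input_text, convert_to_tensor=True)
    pool_embs = model.encode([c["text"] for c in candidate_pool],
                             convert_to_tensor=True)
    # Cosine similarity between the input and every candidate example.
    scores = util.cos_sim(input_emb, pool_embs)[0]
    top_idx = scores.topk(k).indices.tolist()
    return [candidate_pool[i] for i in top_idx]

def build_prompt(task_description, examples, input_text):
    """Assemble a few-shot prompt: task description, then the retrieved
    examples, then the new input awaiting a completion."""
    shots = "\n\n".join(f"Input: {ex['text']}\nOutput: {ex['label']}"
                        for ex in examples)
    return f"{task_description}\n\n{shots}\n\nInput: {input_text}\nOutput:"

Retrieving shots per input, rather than reusing one fixed set of examples, is the design choice the abstract's results point to; it is one plausible reading of how smaller models were able to close the gap on larger ones.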

Topics

Topic Modeling · Artificial Intelligence in Healthcare and Education · Biomedical Text Mining and Ontologies