This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

A Study of Calibration as a Measurement of Trustworthiness of Large Language Models in Biomedical Research

2025 · 1 citation · bioRxiv (Cold Spring Harbor Laboratory) · Open Access


Abstract

Objectives: To assess the calibration of 9 large language models (LLMs) within biomedical natural language processing (BioNLP) tasks, furthering understanding of trustworthiness and reliability in real-world settings.

Materials and Methods: For each LLM, we collected responses and corresponding confidence scores for all 13 datasets (grouped into 6 tasks) of the Biomedical Language Understanding & Reasoning Benchmark (BLURB). Confidence scores were assigned using 3 strategies: Verbal, Self-consistency, and Hybrid. For evaluation, we introduced Flex-ECE (Flexible Expected Calibration Error): a novel adaptation of ECE that accounts for partial correctness in model responses, allowing for a more realistic assessment of calibration in language-based settings. Two post-hoc calibration techniques, isotonic regression and histogram binning, were evaluated.

Results: Across tasks, mean calibration ranged from 23.9% (Population-Intervention-Comparison-Outcome extraction) to 46.6% (Relation Extraction). Across LLMs, Medicine-Llama3-8B had the best mean overall calibration (29.8%); Flan-T5-XXL had the highest ranking on 5/13 datasets. Across strategies, Self-consistency (mean: 27.3%) had better calibration than Verbal (mean: 42.0%) and Hybrid (mean: 44.2%). Post-hoc methods substantially improved calibration, with best mean calibrated Flex-ECEs ranging from 0.1% to 4.1%.

Discussion: The poor out-of-the-box calibration of LLMs poses a risk to trustworthy deployment of such models in real-world BioNLP applications. Calibration can be improved post-hoc and is a recommended practice. Non-binary metrics for LLM evaluation such as Flex-ECE provide a more realistic assessment of trustworthiness of LLMs, and indeed any model that can be partially right/wrong.

Conclusion: This study shows that out-of-the-box calibration of LLMs is very poor, but traditional post-hoc calibration techniques are useful to calibrate LLMs.
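The abstract does not give the formula for Flex-ECE, so the following is only a minimal sketch of one way an ECE-style metric could account for partial correctness. It assumes that each response carries a correctness score in [0, 1] (rather than a binary right/wrong label), that confidences fall into equal-width bins, and that a bin's "accuracy" is the mean of those scores. The function name `flexible_ece`, the bin count, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def flexible_ece(
    confidences: Sequence[float],
    correctness: Sequence[float],
    n_bins: int = 10,
) -> float:
    """ECE-style score where correctness may be fractional (0.0 to 1.0).

    Assumption: each bin's "accuracy" is the mean partial-correctness score
    of its samples; the paper's exact Flex-ECE definition may differ.
    """
    if len(confidences) != len(correctness):
        raise ValueError("confidences and correctness must have equal length")
    if not confidences:
        raise ValueError("at least one sample is required")

    # Per-bin running sums of confidence and correctness, plus sample counts.
    conf_sum = [0.0] * n_bins
    corr_sum = [0.0] * n_bins
    count = [0] * n_bins

    for conf, corr in zip(confidences, correctness):
        # Equal-width bins over [0, 1]; out-of-range values are clamped
        # into the first or last bin.
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        conf_sum[idx] += conf
        corr_sum[idx] += corr
        count[idx] += 1

    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        # Gap between mean (partial) correctness and mean confidence.
        gap = abs(corr_sum[b] / count[b] - conf_sum[b] / count[b])
        ece += (count[b] / total) * gap
    return ece


# Hypothetical toy data: two confident answers (one only half right)
# and two answers at 50% confidence that are each half right.
confs = [0.9, 0.9, 0.5, 0.5]
scores = [1.0, 0.5, 0.5, 0.5]
print(f"{flexible_ece(confs, scores):.3f}")
```

With binary scores (0 or 1) this reduces to the standard equal-width-bin ECE, which is why the sketch treats partial correctness as a drop-in replacement for accuracy. Lower values indicate better calibration.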
