This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Uncertainty quantification by large language models
Citations: 1
Authors: 3
Year: 2025
Abstract
As reasoning capabilities of large language models (LLMs) continue to advance, they are being integrated into increasingly complex scientific workflows, with the goal of developing agents capable of generating evidence-based explanations and testing hypotheses and theories. However, despite this rapid progress, most existing evaluations of LLM reasoning focus on accuracy or consistency rather than on uncertainty quantification (UQ), which is essential for evidence-based reasoning because it quantifies the trustworthiness of evidence-based explanations. Current approaches to LLM uncertainty remain fragmented, often lacking standardized benchmarks that test models under varying task complexities. To address this gap, we introduce the first benchmark suite designed to evaluate UQ by LLM-based agents and tools. The benchmark targets one of the most fundamental UQ problems: estimating whether one quantity is probably larger than another under uncertainty. It includes two progressively complex tasks: a simple inequality test, where models judge whether one of two sets of samples is "larger," "smaller," or "uncertain" with 95% confidence, and a complex inequality test, where models assess interventional probabilities requiring multiple intermediate calculations. We found that reasoning models are generally capable of UQ (scores ≳70%) in the simple inequality case but do not score appreciably better than random guessing (scores ∼33%) in the complex inequality case if the UQ method and intermediate steps are not provided in the prompt. Our implementation is available at https://github.com/bekaiser-LANL/tether.
• LLMs integrated in scientific workflows must be able to perform fundamental UQ tasks.
• We evaluate and present how LLMs perform uncertainty quantification.
• Implementation and documentation available at https://github.com/bekaiser-LANL/tether.
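The simple inequality test described in the abstract asks whether one set of samples is "larger," "smaller," or "uncertain" at 95% confidence. A minimal sketch of one conventional way to answer such a question is a bootstrap on the difference in means; this is an illustrative assumption, not necessarily the method used by the benchmark, and the function name `compare_samples` is hypothetical:

```python
import random
import statistics

def compare_samples(a, b, n_boot=2000, conf=0.95, seed=0):
    """Classify whether sample set `a` is probably larger than `b`.

    Bootstrap sketch (an assumption, not the paper's documented method):
    resample the difference in means and check whether the central
    `conf` interval excludes zero.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each set with replacement and record the mean difference.
        da = statistics.fmean(rng.choices(a, k=len(a)))
        db = statistics.fmean(rng.choices(b, k=len(b)))
        diffs.append(da - db)
    diffs.sort()
    lo = diffs[int((1 - conf) / 2 * n_boot)]
    hi = diffs[int((1 + conf) / 2 * n_boot) - 1]
    if lo > 0:
        return "larger"
    if hi < 0:
        return "smaller"
    return "uncertain"

# Clearly separated samples: the 95% interval excludes zero.
a = [5.1, 5.3, 4.9, 5.2, 5.0, 5.4, 5.1, 5.2]
b = [3.0, 3.2, 2.9, 3.1, 3.0, 3.3, 2.8, 3.1]
print(compare_samples(a, b))  # "larger"
```

The key design point, as the abstract notes, is that the verdict "uncertain" is a legitimate answer whenever the confidence interval straddles zero, rather than a failure mode.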
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,292 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,143 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,539 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,452 citations