This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Uncertainty quantification by large language models
Citations: 1
Authors: 3
Year: 2025
Abstract
As reasoning capabilities of large language models (LLMs) continue to advance, they are being integrated into increasingly complex scientific workflows, with the goal of developing agents capable of generating evidence-based explanations and testing hypotheses and theories. However, despite this rapid progress, most existing evaluations of LLM reasoning focus on accuracy or consistency rather than on uncertainty quantification (UQ), which is essential for evidence-based reasoning because it quantifies the trustworthiness of evidence-based explanations. Current approaches to LLM uncertainty remain fragmented, often lacking standardized benchmarks that test models under varying task complexities. To address this gap, we introduce the first benchmark suite designed to evaluate UQ by LLM-based agents and tools. The benchmark targets one of the most fundamental UQ problems: estimating whether one quantity is probably larger than another under uncertainty. It includes two progressively complex tasks: a simple inequality test, where models judge whether one of two sets of samples is "larger," "smaller," or "uncertain" with 95% confidence, and a complex inequality test, where models assess interventional probabilities requiring multiple intermediate calculations. We found that reasoning models are generally capable of UQ (scores ≳70%) in the simple inequality case but do not score appreciably better than random guessing (scores ∼33%) in the complex inequality case if the UQ method and intermediate steps are not provided in the prompt. Our implementation is available at https://github.com/bekaiser-LANL/tether.
• LLMs integrated in scientific workflows must be able to perform fundamental UQ tasks.
• We evaluate and present how LLMs perform uncertainty quantification.
• Implementation and documentation available at https://github.com/bekaiser-LANL/tether.
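The simple inequality test described in the abstract asks whether one set of samples is "larger," "smaller," or "uncertain" at 95% confidence. A minimal sketch of one conventional way to answer such a question is a bootstrap on the difference in means; this is an illustrative assumption, not necessarily the method used by the benchmark, and the function name `compare_samples` is hypothetical:

```python
import random
import statistics

def compare_samples(a, b, n_boot=2000, conf=0.95, seed=0):
    """Classify whether sample set `a` is probably larger than `b`.

    Bootstrap sketch (an assumption, not the paper's documented method):
    resample the difference in means and check whether the central
    `conf` interval excludes zero.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each set with replacement and record the mean difference.
        da = statistics.fmean(rng.choices(a, k=len(a)))
        db = statistics.fmean(rng.choices(b, k=len(b)))
        diffs.append(da - db)
    diffs.sort()
    lo = diffs[int((1 - conf) / 2 * n_boot)]
    hi = diffs[int((1 + conf) / 2 * n_boot) - 1]
    if lo > 0:
        return "larger"
    if hi < 0:
        return "smaller"
    return "uncertain"

# Clearly separated samples: the 95% interval excludes zero.
a = [5.1, 5.3, 4.9, 5.2, 5.0, 5.4, 5.1, 5.2]
b = [3.0, 3.2, 2.9, 3.1, 3.0, 3.3, 2.8, 3.1]
print(compare_samples(a, b))  # "larger"
```

The key design point, as the abstract notes, is that the verdict "uncertain" is a legitimate answer whenever the confidence interval straddles zero, rather than a failure mode.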
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,292 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,143 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,539 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,452 citations