OpenAlex · Updated hourly · Last updated: 16 Mar 2026, 20:39

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Uncertainty Quantification for Multiple-Choice Questions is Just One-Token Deep

2025 · 0 citations · 14 authors · Open Access

Abstract

Multiple-choice question (MCQ) benchmarks such as MMLU and GPQA are widely used to assess the capabilities of large language models (LLMs). While accuracy remains the standard evaluation metric, recent work has introduced uncertainty quantification (UQ) methods, such as entropy, conformal prediction, and verbalized confidence, as complementary measures of model reliability and calibration. However, we find that these UQ methods, when applied to MCQ tasks, are unexpectedly fragile. Specifically, we show that fine-tuning a model on just 1,000 examples to adjust the probability of the first generated token, under the common prompting setup where the model is instructed to output only a single answer choice, can systematically distort a broad range of UQ methods across models, prompts, and domains, all while leaving answer accuracy unchanged. We validate this phenomenon through extensive experiments on five instruction-tuned LLMs, tested under standard prompting, zero-shot chain-of-thought reasoning, and a biomedical question answering setting. In all cases, models retain similar accuracy but exhibit significantly degraded calibration. These results suggest that current UQ practices for MCQs are "one-token deep", driven more by first-token decoding behavior than by any deeper representation of uncertainty, and are easily manipulated through minimal interventions. Our findings call for more robust and interpretable approaches to uncertainty estimation, particularly in structured formats like MCQs, where confidence signals are often reduced to token-level heuristics.
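
For context, here is a minimal sketch (not the paper's code) of the first-token uncertainty signal the abstract refers to: under the standard single-letter prompting setup, entropy-style UQ is read off the model's distribution over the very first generated token. The model name, prompt text, and single-token encoding of each answer letter are assumptions made for illustration; any HuggingFace-style causal LM would work similarly.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical choice; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Standard MCQ prompting setup: the model is instructed to output a single letter.
prompt = (
    "Answer with a single letter (A, B, C, or D).\n"
    "Question: ...\n"
    "A) ...\nB) ...\nC) ...\nD) ...\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Logits at the last prompt position = distribution over the *first* generated token.
    logits = model(**inputs).logits[0, -1]

# Keep only the four answer-letter tokens and renormalize over them.
# (Assumes each " A".." D" encodes as a single token, which holds for common tokenizers.)
choice_ids = [tokenizer.encode(f" {c}", add_special_tokens=False)[0] for c in "ABCD"]
probs = torch.softmax(logits[choice_ids], dim=-1)

# Shannon entropy of the first-token choice distribution: the signal the paper
# argues most MCQ UQ methods effectively reduce to. Fine-tuning that reshapes
# this distribution without moving the argmax can degrade calibration while
# leaving accuracy unchanged.
entropy = -(probs * probs.log()).sum().item()
print({c: round(p, 4) for c, p in zip("ABCD", probs.tolist())}, f"entropy={entropy:.4f}")
```

Because this estimate depends only on one next-token distribution, a small targeted fine-tune that sharpens or flattens those four logits leaves the argmax answer, and hence accuracy, untouched while distorting the reported uncertainty.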
