This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework
0
Citations
5
Authors
2025
Year
Abstract
Large language models (LLMs) are increasingly adopted in medical question answering (QA) scenarios. However, LLMs are known to generate hallucinations and nonfactual information, which undermines their trustworthiness in high-stakes medical tasks. Conformal Prediction (CP) is recognized as a robust framework within machine learning that offers statistically rigorous guarantees of marginal (average) coverage for prediction sets. However, the applicability of CP to medical QA remains largely unexplored. To address this gap, this study proposes an enhanced CP framework for medical multiple-choice question answering (MCQA) tasks. The framework generates multiple outputs for the same medical query, following the self-consistency approach, and calculates the frequency score of each option, which avoids the need for access to the model’s internal information. The nonconformity score is then defined in terms of the frequency score of the correct option. Furthermore, a risk control framework is incorporated into the enhanced CP framework to manage task-specific metrics through a monotonically decreasing loss function. The enhanced CP framework is evaluated on three popular MCQA datasets using off-the-shelf LLMs. Empirical results demonstrate that the enhanced CP framework achieves user-specified average (marginal) error rates on the test set. Moreover, the results show that the average prediction set size (APSS) on the test set decreases as the risk level increases. The study concludes that APSS is a promising evaluation metric for the uncertainty of LLMs.
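The pipeline described in the abstract can be illustrated with a minimal Python sketch of split conformal prediction over frequency scores. This is not the authors' implementation. It assumes that the nonconformity score is 1 minus the frequency of the correct option and that the threshold is the standard finite-sample conformal quantile; the abstract does not state either formula. All function names are hypothetical, and the risk control extension is not covered.

```python
import math
from collections import Counter


def frequency_scores(samples, options):
    """Fraction of sampled answers that chose each option (self-consistency sampling)."""
    counts = Counter(samples)
    return {opt: counts.get(opt, 0) / len(samples) for opt in options}


def calibrate(cal_freqs, cal_labels, alpha):
    """Conformal threshold from calibration questions with known answers.

    Assumption: nonconformity score = 1 - frequency of the correct option.
    """
    scores = sorted(1.0 - f[y] for f, y in zip(cal_freqs, cal_labels))
    n = len(scores)
    # Standard finite-sample rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every option.
        return 1.0
    return scores[k - 1]


def prediction_set(freqs, qhat):
    """Options whose nonconformity score does not exceed the threshold."""
    return {opt for opt, f in freqs.items() if 1.0 - f <= qhat}


def average_set_size(sets):
    """Average prediction set size (APSS) over a collection of test questions."""
    return sum(len(s) for s in sets) / len(sets)
```

In use, each calibration and test question would be answered several times by the LLM, the sampled option letters passed to `frequency_scores`, the threshold fitted once with `calibrate` at the chosen risk level `alpha`, and `prediction_set` applied to every test question. A larger `alpha` lowers the threshold, which is consistent with the reported decrease in APSS as the risk level increases.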
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,316 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,177 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,575 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,468 citations