This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Deep Reinforcement Learning-Driven Adaptive Prompting for Robust Medical LLM Evaluation
Citations: 0
Authors: 4
Year: 2026
Abstract
The accurate and reliable evaluation of large language models (LLMs) in medical domains is critical for real-world clinical deployment, automated medical reasoning, and patient safety. However, the evaluation process is highly sensitive to prompt design, and the prevalent reliance on fixed or randomly sampled prompt policies often fails to adapt dynamically to clinical context, question complexity, or evolving safety requirements. This article presents a novel reinforcement learning-based framework for multi-prompt selection that dynamically optimizes the prompt policy per input for medical LLM evaluation across the Medical Knowledge Question-Answering dataset (MKQA), the Medical Multiple-Choice Question dataset (MCQ), and the Doctor-Patient Dialogue dataset. We formulate prompt selection as a Markov Decision Process (MDP) and employ a deep Q-Network (DQN) agent to maximize a reward signal incorporating textual accuracy, domain terminology coverage, safety, and dialogue relevance. Experiments on three medical LLM benchmarks demonstrate consistent improvements in composite reward over baselines (e.g., a 6.66% increase on MKQA vs. the Random Baseline and a 2.41% increase on Dialogue vs. the Fixed Baseline). These gains were accompanied by robust enhancements in Safety (e.g., 1.0000 on MKQA, a 5.26% increase vs. the Fixed Baseline, and a 5.03% increase on Dialogue vs. the Fixed Baseline) and substantial gains in Medical Terminology Coverage (e.g., a 74.61% increase on MKQA and a 9.13% increase on MCQ, both vs. the Fixed Baseline). Although accuracy gains varied across tasks, an improvement was observed on MKQA, and the framework effectively optimizes the multi-objective reward function even where minor trade-offs in other metrics, such as Accuracy and Contextual Relevance, were observed in some contexts.
Our framework enables robust, context-aware, and adaptive evaluation, laying a foundation for safer and more reliable LLM application in healthcare.
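The abstract describes prompt selection as an MDP in which an agent picks a prompt template per input to maximize a composite reward over accuracy, terminology coverage, safety, and relevance. The following is a minimal sketch of that idea, with several loud assumptions: the prompt pool, reward weights, and simulated evaluator metrics are all illustrative inventions (the paper's actual values are not given here), and a simple tabular Q-update stands in for the paper's deep Q-Network.

```python
import random

# Hypothetical prompt templates; the paper's actual prompt pool is an assumption here.
PROMPTS = ["direct", "chain_of_thought", "safety_first", "terminology_rich"]

def composite_reward(metrics, w=(0.4, 0.2, 0.3, 0.1)):
    """Weighted sum over (accuracy, terminology coverage, safety, relevance).
    The weights are illustrative, not the paper's."""
    return sum(wi * m for wi, m in zip(w, metrics))

def select_prompt(q_table, state, epsilon=0.1):
    """Epsilon-greedy action selection over prompt templates."""
    if random.random() < epsilon:
        return random.randrange(len(PROMPTS))
    qs = q_table[state]
    return max(range(len(PROMPTS)), key=lambda a: qs[a])

def q_update(q_table, state, action, reward, alpha=0.5):
    """One-step tabular Q-value update; the paper uses a DQN instead."""
    q_table[state][action] += alpha * (reward - q_table[state][action])

# Usage: states are coarse question types; rewards come from a simulated evaluator.
random.seed(0)
q = {"mkqa": [0.0] * len(PROMPTS), "mcq": [0.0] * len(PROMPTS)}
for _ in range(200):
    s = random.choice(list(q))
    a = select_prompt(q, s)
    # Simulated per-prompt metrics (accuracy, coverage, safety, relevance);
    # each task type favors a different prompt, so the policy must adapt per state.
    base = 0.5 + 0.1 * a if s == "mkqa" else 0.8 - 0.1 * a
    r = composite_reward((base, base, 1.0, base))
    q_update(q, s, a, r)
```

The per-state Q-values converge toward the composite reward of each prompt, so the greedy policy learns a different prompt choice for each question type, which is the adaptive behavior the abstract contrasts with fixed or random baselines.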
Similar Works
"Why Should I Trust You?"
2016 · 14,218 citations
A Comprehensive Survey on Graph Neural Networks
2020 · 8,589 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,109 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,482 citations
Artificial intelligence in healthcare: past, present and future
2017 · 4,386 citations