This is an overview page with metadata for this scientific article. The full article is available from the publisher.
MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents
Citations: 1
Authors: 6
Year: 2025
Abstract
BACKGROUND: Recent large language models (LLMs) have demonstrated significant advancements, particularly in their ability to serve as agents, thereby surpassing their traditional role as chatbots. These agents can leverage their planning and tool-utilization capabilities to address tasks specified at a high level. This suggests new potential to reduce the burden of administrative tasks and address current healthcare staff shortages. However, a standardized dataset to benchmark the agent capabilities of LLMs in medical applications is currently lacking, making the evaluation of LLMs on complex tasks in interactive healthcare environments challenging. METHODS: To address this gap in the deployment of agentic AI in healthcare, we introduce MedAgentBench, a broad evaluation suite designed to assess the agent capabilities of large language models within medical-records contexts. MedAgentBench encompasses 300 patient-specific, clinically derived tasks from 10 categories written by human physicians; realistic profiles of 100 patients with over 700,000 data elements; a FHIR-compliant interactive environment; and an accompanying codebase. The environment uses the standard APIs and communication infrastructure of modern EMR systems, so it can be easily migrated into live EMR systems. RESULTS: MedAgentBench presents an unsaturated, agent-oriented benchmark at which current state-of-the-art LLMs exhibit some ability to succeed. The best model (Claude 3.5 Sonnet v2) achieves a success rate of 69.67%. However, there is still substantial room for improvement, which gives the community a clear direction for further optimization. Furthermore, there is significant variation in performance across task categories. CONCLUSION: Agent-based task frameworks and benchmarks are the necessary next step toward effectively improving and integrating AI systems into clinical workflows.
MedAgentBench establishes this and is publicly available at https://github.com/stanfordmlgroup/MedAgentBench, offering a valuable framework for model developers to track progress and drive continuous improvements in the agent capabilities of large language models within the medical domain.
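Because the environment is FHIR-compliant, agents interact with it through standard FHIR REST calls. A minimal sketch of that interaction pattern is shown below; the base URL, search helper, and sample response are illustrative assumptions, not the benchmark's actual endpoints (those are defined in the MedAgentBench repository):

```python
import json

# Hypothetical FHIR base URL; the real endpoint is configured by the
# MedAgentBench environment (see the GitHub repository).
FHIR_BASE = "http://localhost:8080/fhir"

def patient_search_url(base, family_name):
    """Build a standard FHIR R4 Patient search URL (search by family name)."""
    return f"{base}/Patient?family={family_name}"

def extract_patient_ids(bundle_json):
    """Pull resource IDs out of a FHIR searchset Bundle response."""
    bundle = json.loads(bundle_json)
    return [entry["resource"]["id"] for entry in bundle.get("entry", [])]

# Example searchset Bundle, shaped like a typical FHIR server response.
sample_bundle = json.dumps({
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [{"resource": {"resourceType": "Patient", "id": "example-1"}}],
})

print(patient_search_url(FHIR_BASE, "Smith"))
print(extract_patient_ids(sample_bundle))
```

Because the interface is plain FHIR, the same calls an agent issues against the benchmark should transfer to any FHIR-conformant EMR endpoint.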
Related Works
Machine Learning in Medicine
2019 · 3,591 citations
Systematic Review: Impact of Health Information Technology on Quality, Efficiency, and Costs of Medical Care
2006 · 3,166 citations
Effects of Computerized Clinical Decision Support Systems on Practitioner Performance and Patient Outcomes
2005 · 2,963 citations
Studies in health technology and informatics
2008 · 2,903 citations
Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success
2005 · 2,677 citations