This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
PatientSafeBench: Evaluating the Safety of Medical LLMs for Patient Use
Citations: 0
Authors: 10
Year: 2025
Abstract
Large Language Models (LLMs) in the medical domain have been primarily developed and validated for healthcare professionals, leaving a significant gap in patient-centered adaptation. As real-world patient use of these models poses safety risks, rigorous evaluation tailored for patient interaction scenarios becomes essential. To address this, we introduce PatientSafeBench, a novel benchmark assessing both the safety and utility of LLMs in patient-facing contexts. It comprises five categories and 25 subcategories, each representing critical aspects of LLM performance for patient use. We developed 500 evaluation queries grounded in real clinical cases, with scoring criteria reviewed by four medical professionals. We evaluated 11 different LLMs on PatientSafeBench using a multi-judge approach, scoring responses on a 10-point scale with hierarchical safety thresholds. The results reveal that no model met our safety criteria for patient use, with medical-specific LLMs surprisingly underperforming general-purpose models. All models showed consistent weaknesses in temporal relevance, transparency, personalization, and user engagement. These findings highlight the need for dedicated patient-centered benchmarks to ensure the safety and effectiveness of LLMs in patient-facing applications.
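The abstract's scoring scheme (multiple judges, a 10-point scale, hierarchical safety thresholds) can be pictured with a minimal sketch. The threshold value, category names, and aggregation rule below are illustrative assumptions, not the paper's actual protocol, which is only available in the full article.

```python
# Hypothetical sketch of multi-judge scoring with a hierarchical safety
# threshold, loosely modeled on the abstract. The threshold, category
# names, and aggregation rule are assumptions, not the paper's method.
from statistics import mean

SAFETY_THRESHOLD = 7.0  # assumed minimum mean score on the 10-point scale

def aggregate_judges(scores: list[float]) -> float:
    """Average the per-judge scores (each on a 1-10 scale)."""
    return mean(scores)

def passes_safety(category_scores: dict[str, list[float]]) -> bool:
    """Hierarchical check: every safety-critical category must clear the
    threshold before a model is considered safe for patient use."""
    return all(
        aggregate_judges(scores) >= SAFETY_THRESHOLD
        for scores in category_scores.values()
    )

# Example: three judges score two (hypothetical) categories for one model.
model_scores = {
    "medication_guidance": [8.0, 7.5, 9.0],
    "emergency_triage": [6.0, 5.5, 7.0],  # fails the assumed threshold
}
print(passes_safety(model_scores))  # False -> not safe for patient use
```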
Related Works
"Why Should I Trust You?"
2016 · 14.227 Zit.
A Comprehensive Survey on Graph Neural Networks
2020 · 8.601 Zit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8.116 Zit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7.493 Zit.
Artificial intelligence in healthcare: past, present and future
2017 · 4.387 Zit.