This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
PatientSafeBench: Evaluating the Safety of Medical LLMs for Patient Use
Citations: 0
Authors: 10
Year: 2025
Abstract
Large Language Models (LLMs) in the medical domain have been primarily developed and validated for healthcare professionals, leaving a significant gap in patient-centered adaptation. As real-world patient use of these models poses safety risks, rigorous evaluation tailored for patient interaction scenarios becomes essential. To address this, we introduce PatientSafeBench, a novel benchmark assessing both the safety and utility of LLMs in patient-facing contexts. It comprises five categories and 25 subcategories, each representing critical aspects of LLM performance for patient use. We developed 500 evaluation queries grounded in real clinical cases, with scoring criteria reviewed by four medical professionals. We evaluated 11 different LLMs on PatientSafeBench using a multi-judge approach, scoring responses on a 10-point scale with hierarchical safety thresholds. The results reveal that no model met our safety criteria for patient use, with medical-specific LLMs surprisingly underperforming general-purpose models. All models showed consistent weaknesses in temporal relevance, transparency, personalization, and user engagement. These findings highlight the need for dedicated patient-centered benchmarks to ensure the safety and effectiveness of LLMs in patient-facing applications.
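The abstract's scoring scheme (multiple judges, a 10-point scale, hierarchical safety thresholds) can be pictured with a minimal sketch. The threshold value, category names, and aggregation rule below are illustrative assumptions, not the paper's actual protocol, which is only available in the full article.

```python
# Hypothetical sketch of multi-judge scoring with a hierarchical safety
# threshold, loosely modeled on the abstract. The threshold, category
# names, and aggregation rule are assumptions, not the paper's method.
from statistics import mean

SAFETY_THRESHOLD = 7.0  # assumed minimum mean score on the 10-point scale

def aggregate_judges(scores: list[float]) -> float:
    """Average the per-judge scores (each on a 1-10 scale)."""
    return mean(scores)

def passes_safety(category_scores: dict[str, list[float]]) -> bool:
    """Hierarchical check: every safety-critical category must clear the
    threshold before a model is considered safe for patient use."""
    return all(
        aggregate_judges(scores) >= SAFETY_THRESHOLD
        for scores in category_scores.values()
    )

# Example: three judges score two (hypothetical) categories for one model.
model_scores = {
    "medication_guidance": [8.0, 7.5, 9.0],
    "emergency_triage": [6.0, 5.5, 7.0],  # fails the assumed threshold
}
print(passes_safety(model_scores))  # False -> not safe for patient use
```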
Related Works
"Why Should I Trust You?"
2016 · 14.227 Zit.
A Comprehensive Survey on Graph Neural Networks
2020 · 8.601 Zit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8.116 Zit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7.493 Zit.
Artificial intelligence in healthcare: past, present and future
2017 · 4.387 Zit.