This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
A novel evaluation benchmark for medical LLMs illuminating safety and effectiveness in clinical domains
Citations: 0
Authors: 38
Year: 2025
Abstract
Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus. It comprises 30 metrics covering key areas such as critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and revised 2069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). These findings provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios. They also have the potential to promote safer and more effective deployment of LLMs in healthcare environments.
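The abstract mentions weighted consequence measures and separate safety and effectiveness scores but does not give the scoring formula. The following is a minimal sketch of one plausible aggregation, a weight-normalized mean per track, and should not be read as the authors' method. The `MetricResult` type, the function names, the assignment of metrics to tracks, and all weights and scores are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricResult:
    name: str      # metric label (illustrative)
    track: str     # "safety" or "effectiveness" (assignment is an assumption)
    weight: float  # hypothetical consequence weight; higher means more severe
    score: float   # fraction of criteria met for this metric, in [0, 1]


def track_score(results, track):
    """Weight-normalized mean score over the metrics of one track."""
    selected = [r for r in results if r.track == track]
    total_weight = sum(r.weight for r in selected)
    if total_weight == 0:
        raise ValueError(f"no weighted metrics for track {track!r}")
    return sum(r.weight * r.score for r in selected) / total_weight


def total_score(results):
    """Weight-normalized mean score over all metrics, regardless of track."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        raise ValueError("no weighted metrics")
    return sum(r.weight * r.score for r in results) / total_weight


# Made-up example values, not results from the paper.
results = [
    MetricResult("critical illness recognition", "safety", 5, 0.60),
    MetricResult("medication safety", "safety", 4, 0.50),
    MetricResult("guideline adherence", "effectiveness", 3, 0.70),
]

safety = track_score(results, "safety")
effectiveness = track_score(results, "effectiveness")
overall = total_score(results)
```

Under this assumed scheme, a low score on a heavily weighted metric lowers its track score more than the same shortfall on a lightly weighted one, which is one way a consequence weight could be applied.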
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,231 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,084 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,444 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,423 citations
Authors
- Shirui Wang
- Zhihui Tang
- Huaxia Yang
- Qiuhong Gong
- Tiantian Gu
- Hongyang Ma
- Yongxin Wang
- Wubin Sun
- Zeliang Lian
- Kehang Mao
- Yinan Jiang
- Zhicheng Huang
- Lingyun Ma
- W D Shen
- Yajie Ji
- Yunhui Tan
- Chunbo Wang
- Yunlu Gao
- Qianling Ye
- Rui Lin
- MingYu Chen
- Lijuan Niu
- Zhihao Wang
- Peng Yu
- Mengran Lang
- Yue Liu
- Huimin Zhang
- Haitao Shen
- Long Chen
- Q M Zhao
- Si-Xuan Liu
- Lina Zhou
- He Gao
- Dongqiang Ye
- Lingmin Meng
- Youtao Yu
- Nan Liang
- Jianxiong Wu
Institutions
- Intelligent Health (United Kingdom) (GB)
- King University (US)
- Peking University (CN)
- Beijing Haidian Hospital (CN)
- Chinese Academy of Medical Sciences & Peking Union Medical College (CN)
- Peking Union Medical College Hospital (CN)
- Chinese PLA General Hospital (CN)
- Shanghai University of Traditional Chinese Medicine (CN)
- Shuguang Hospital (CN)
- Harbin Medical University (CN)
- Second Affiliated Hospital of Harbin Medical University (CN)
- Third Affiliated Hospital of Harbin Medical University (CN)
- Tongji University (CN)
- Shanghai Skin Disease Hospital (CN)
- Tongji Hospital (CN)
- Fudan University (CN)
- Huashan Hospital (CN)
- Xinjiang Medical University (CN)
- People's Hospital of Xinjiang Uygur Autonomous Region (CN)
- Shanxi Academy of Medical Sciences (CN)
- Seventh People's Hospital of Shanghai (CN)
- Jinan University (CN)
- Beijing Anzhen Hospital (CN)
- EuroEspes Biomedical Research Center (ES)
- National Cancer Center (US)