This is an overview page with metadata for this scientific work. The full article is available from the publisher.
An interdisciplinary, randomized, single-blind evaluation of state-of-the-art large language models for their implications and risks in medical diagnosis and management
Citations: 0
Authors: 43
Year: 2025
Abstract
Background: State-of-the-art (SOTA) large language models (LLMs) are poised to revolutionize clinical medicine by transforming diagnostic, therapeutic, and interdisciplinary reasoning. Despite their promising capabilities, rigorous benchmarking of these models is essential to address concerns about their clinical proficiency and safety, particularly in high-risk environments.

Methods: This study implemented a multi-disciplinary, randomized, single-blind evaluation framework involving 27 experienced specialty clinicians with an average of 25.9 years of practice. The assessment covered 685 simulated and real clinical cases across 13 subspecialties, including both common and rare conditions. Evaluators rated LLM responses on medical strength (0–10 scale, where >9.5 signified leading-expert proficiency) and hallucination severity (0–5 scale for fabricated or misleading medical elements). Seven SOTA LLMs were tested, including top-ranked models from the ARENA leaderboard, with statistical analyses applied to adjust for confounders such as response length.

Findings: The evaluation revealed clinical plausibility in general-purpose LLMs, with Gemini 2.0 Flash leading in raw scores and DeepSeek R1 excelling in adjusted analyses. Top models demonstrated proficiency comparable to a physician with six years of post-qualification experience (score ≈6.0), yet significant risks were noted. Instances of incompetence (scores ≤4) were detected across specialties, along with 40 hallucination instances involving fabricated conditions, medications, and classification errors. These findings underscore the importance of implementing stringent safeguards to mitigate potential adverse outcomes in clinical applications.

Interpretation: While SOTA LLMs show substantial promise in enhancing clinical reasoning and decision-making, their unguarded application in medicine could pose serious risks, such as misinformation and diagnostic errors. Human expert oversight remains crucial, particularly given the reported incompetence and hallucination risks. Larger, multi-center studies are warranted to evaluate their real-world performance and track their evolution before broader clinical adoption.
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,250 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,109 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,482 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,434 citations
Authors
- Peikai Chen
- Jifu Cai
- Jiaying Zhou
- Shaoxi Chen
- Chenguang Xu
- Lihua Yuan
- Xiaoying Dai
- Xiaowei Chen
- Yumei Wei
- Xia Li
- Shaofeng Gong
- Xiaolong Liang
- Jiancheng Yang
- Jun‐O Jin
- Kai Dai
- Yuzhen Cui
- Guan‐Ming Kuang
- Ju-Jun Xie
- Libing Luo
- Haibing Xiao
- Shijie Yin
- H. J. Yang
- Yulan Yan
- Jianliang Chen
- Yi‐Hua Chen
- Q.J. Zhang
- Qingshan Zhou
- Lina Zhao
- Min Wu
- Xin Tang
- Rong Lei
- Zanxin Wang
- Wenbin Qiu
- Yanli Wang
- Liwen Cui
- Xiangyang Li
- Yong Hu
- Huiren Tao
- Nan Wu
- Pearl Pai
- M. Wei
- Michael To
- Kmc Cheung