This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Beyond multiple-choice questions: rethinking evaluation frameworks for large language models for clinical medicine
0
Citations
8
Authors
2026
Year
Abstract
Large language models (LLMs) have demonstrated encouraging performance on medical natural language processing (NLP) tasks, approaching human-equivalent performance on some standard benchmarks and positioning them as potential game-changers in healthcare. However, a persistent gap remains between high benchmark performance and the clinical utility of NLP algorithms, owing to limitations of existing evaluation paradigms. Current approaches tend to rely on static, task-specific benchmarks and therefore fail to capture the full dimensions of complexity, safety, interpretability, and workflow integration required for safe deployment in the clinic. This editorial advocates a shift in evaluation paradigms from narrow score-based metrics to a dynamic, multi-dimensional, and patient-centered system of clinical gatekeeping. The proposed framework integrates a four-phase process comprising retrospective benchmarking, pilot testing, multi-center validation, and real-world monitoring, alongside a capability-task-behavior-value progression and a continuous human-in-the-loop feedback mechanism. This comprehensive strategy ensures not only technical robustness but also clinical relevance, ethical accountability, and adaptive improvement, transforming LLMs from experimental tools into reliable clinical partners for safer and more patient-centric healthcare delivery.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,239 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,095 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,463 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,428 citations
Authors
Institutions
- Beijing Tsinghua Chang Gung Hospital (CN)
- Chinese Academy of Medical Sciences & Peking Union Medical College (CN)
- Heidelberg University (DE)
- University Hospital Heidelberg (DE)
- Shanghai Jiao Tong University (CN)
- Beijing Academy of Artificial Intelligence (CN)
- Shanghai Sixth People's Hospital (CN)
- Artificial Intelligence in Medicine (Canada) (CA)
- Singapore National Eye Center (SG)
- Singapore Eye Research Institute (SG)
- Tsinghua University (CN)