This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks
Citations: 7
Authors: 81
Year: 2025
Abstract
While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance on medical tasks, with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks, developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) that provides complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs using improved evaluation methods (an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs on the 35 benchmarks revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines, including ROUGE-L (0.36) and BERTScore-F1 (0.44). These findings highlight the importance of real-world, task-specific evaluation for medical applications of LLMs, and we provide an open-source framework to enable such evaluation.
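To make the win-rate metric cited above concrete, here is a minimal sketch of how a mean win-rate can be computed from a model-by-benchmark score matrix. The score values, the tie-handling convention, and the aggregation shown here are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

# Hypothetical scores: rows are models, columns are benchmarks.
# These numbers are made up for illustration; they are not MedHELM results.
scores = np.array([
    [0.85, 0.72, 0.66, 0.58],  # model A
    [0.78, 0.70, 0.61, 0.55],  # model B
    [0.83, 0.75, 0.58, 0.60],  # model C
])

def mean_win_rate(scores: np.ndarray) -> np.ndarray:
    """Fraction of pairwise (other model, benchmark) comparisons each model
    wins; a tie counts as half a win (one common convention)."""
    n_models, n_benchmarks = scores.shape
    wins = np.zeros(n_models)
    for b in range(n_benchmarks):
        col = scores[:, b]
        for i in range(n_models):
            others = np.delete(col, i)
            wins[i] += np.sum(col[i] > others) + 0.5 * np.sum(col[i] == others)
    return wins / (n_benchmarks * (n_models - 1))

print(mean_win_rate(scores))  # [0.75, 0.125, 0.625] for the toy data above
```

Averaging pairwise wins across benchmarks in this way yields a single leaderboard-style number per model; the paper's exact aggregation may differ in detail.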
Authors
- Suhana Bedi
- Hejie Cui
- Miguel Fuentes
- Alyssa Unell
- Michael Wornow
- Juan M. Banda
- Nikesh Kotecha
- Timothy Keyes
- Yifan Mai
- Mert Oez
- Hao Qiu
- Shrey Jain
- Leonardo Schettini
- Mehr Kashyap
- Jason Fries
- Akshay Swaminathan
- Philip Chung
- Fateme Nateghi
- Asad Aali
- Ashwin Nayak
- Shivam Vedak
- Sneha S. Jain
- Birju Patel
- Oluseyi Fayanju
- Shreya Shah
- Ethan Goh
- Dong-han Yao
- Brian Soetikno
- Eduardo M. Reis
- Sergios Gatidis
- Vasu Divi
- Rita Capasso
- Rachna Saralkar
- Chia‐Chun Chiang
- Jenelle Jindal
- Thi Lan Pham
- Faraz Ghoddusi
- Steven Lin
- Albert S. Chiou
- Hong Chen
- Mohana Roy
- Michael F. Gensheimer
- H. R. Patel
- Kevin A. Schulman
- Dev Dash
- Danton Char
- N. Lance Downing
- François Grolleau
- Kameron Collin Black
- Bethel Mieso
- Aydin Zahedivash
- Wen-wai Yim
- Harshita Sharma
- Tony Szu‐Hsien Lee
- Harald Kirsch
- Jennifer Lee
- Nerissa Ambers
- Carlene Lugtu
- Aditya Sharma
- Bilal Mawji
- A. A. Alekseyev
- Vicky Zhou
- Vikas Kakkar
- Jarrod Helzer
- Anurang Revri
- Yair Bannett
- Roxana Daneshjou
- Jonathan H. Chen
- Emily Alsentzer
- Keith Morse
- Nirmal Ravi
- Nima Aghaeepour
- Vanessa E. Kennedy
- Akshay Chaudhari
- Thomas J. Wang
- Oluwasanmi Koyejo
- Matthew P. Lungren
- Eric Horvitz
- Percy Liang
- Michael Pfeffer
- Nigam H. Shah