This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluation of Large Language Models as Emergency Department Revisit Predictors
Citations: 0
Authors: 8
Year: 2025
Abstract
Large Language Models (LLMs) have shown promise in clinical reasoning and question answering, yet their effectiveness for real-world clinical prediction remains an open question. We present the first large-scale study evaluating LLMs for predicting 30-day emergency department (ED) revisits using 138,010 visits from the Adult Emergency Department at Stanford. We assessed two modeling paradigms: (1) direct prediction, where the LLM generates revisit risk assessments in natural language, and (2) embedding-based approaches that leverage LLM-derived vector representations (LLM2Vec) of patient data for downstream modeling. Retrieval augmentation improved direct prediction performance (e.g., Claude 3.7 F1 from 0.3755 (95% CI [0.3647, 0.3864]) to 0.4160 (95% CI [0.4024, 0.4294])), and embedding-based methods consistently outperformed direct approaches, with LLM2Vec achieving F1=0.4505 (95% CI [0.4345, 0.4666]). Despite having access to comprehensive structured and unstructured clinical data, all LLM approaches (F1=0.3022-0.4505) failed to exceed a traditional LightGBM model using only structured data (F1=0.4614 (95% CI [0.4496, 0.4789])). Through systematic analysis of the reasoning chains in 17,488 predictions, we suggest potential failure patterns: reasoning may systematically degrade performance through overweighting medical histories and similar visits, neglecting protective factors, and risk aversion. Our work establishes essential baseline performance while revealing fundamental limitations in current-generation LLMs for clinical prediction tasks.
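The abstract reports each F1 score with a 95% confidence interval but does not say how those intervals were computed. As a minimal sketch, assuming a percentile bootstrap over visits (a common choice, not confirmed by the source), an interval of this kind could be obtained as follows. The function names `f1_score` and `bootstrap_f1_ci` and the toy labels are hypothetical and used only for illustration.

```python
import random


def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = revisit within 30 days, 0 = no revisit)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Convention: return 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Point estimate plus percentile-bootstrap (1 - alpha) interval for F1.

    Assumption: visits are resampled independently with replacement; the
    paper may have used a different procedure.
    """
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return f1_score(y_true, y_pred), lower, upper


if __name__ == "__main__":
    # Synthetic toy labels, not data from the study.
    truth = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0] * 20
    preds = [1, 0, 1, 1, 0, 0, 0, 0, 1, 1] * 20
    point, lower, upper = bootstrap_f1_ci(truth, preds)
    print(f"F1={point:.4f} (95% CI [{lower:.4f}, {upper:.4f}])")
```

Read this way, two models whose intervals overlap substantially, such as the reported LLM2Vec and LightGBM results, cannot be cleanly separated on the point estimates alone.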
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,357 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,221 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,640 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,482 citations