This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Evaluating Reasoning Capabilities of Large Language Models for Medical Coding and Hospital Readmission Risk Stratification with Zero Shot Prompting (Preprint)
Citations: 0
Authors: 4
Year: 2025
Abstract
BACKGROUND
The proliferation of large language models (LLMs) through accessible chatbot interfaces has created unprecedented opportunities in healthcare, with state-of-the-art models such as ChatGPT-4, LLaMA-3·1, Gemini-1·5, DeepSeek-R1, and OpenAI-O3 offering artificial intelligence-driven clinical support. Some studies showcase the potential of LLMs in managing complex healthcare tasks, while others raise concerns about their accuracy, reliability, and compliance with the rigorous standards of clinical settings. This study was conducted to better understand their true potential and to identify the areas in which they can be most effective in healthcare.

OBJECTIVE
This study presents a comprehensive comparative analysis of leading reasoning and non-reasoning LLMs (ChatGPT-4, LLaMA-3·1, Gemini-1·5, DeepSeek-R1, and OpenAI-O3) evaluated across three critical healthcare tasks using the Medical Information Mart for Intensive Care IV (MIMIC-IV) dataset.

METHODS
We assessed the models' capabilities in: (1) generating primary diagnoses, (2) mapping diagnoses to ICD-9 codes, and (3) predicting hospital readmission risk stratification through zero-shot prompting protocols. The study used a cohort of 300 randomly selected subjects from MIMIC-IV, with standardized prompts systematically generated from discharge summary sections. Each prompt was engineered to combine patient clinical information and the specific task requirements in a unified input format. To enhance result interpretability, we implemented explicit rationale elicitation within the prompting structure, requiring models to articulate their reasoning process for diagnostic and prognostic predictions. Because this is a zero-shot prompting approach, each prompt was issued once rather than tested repeatedly.

RESULTS
In our comparative analysis of non-reasoning models, LLaMA-3·1 demonstrated superior aggregate performance across all evaluation metrics, with 85% correctness in primary diagnosis prediction, 42·6% in ICD-9 code prediction, and 41·3% in hospital readmission risk prediction. The reasoning models DeepSeek-R1 and OpenAI-O3 showed similar performance, with O3 achieving slightly higher accuracy in primary diagnosis (90%) and ICD-9 prediction (45·3%), while R1 performed slightly better in readmission risk prediction (72·66%).

CONCLUSIONS
Our findings show that none of the evaluated models met clinical standards across all tasks, with medical coding showing the weakest performance. This aligns with findings in the literature indicating that pretrained LLMs struggle with medical coding, and it underscores the need for further refinement of these models to enhance their clinical applicability.
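To illustrate the zero-shot prompting protocol described in the Methods section, the following minimal Python sketch shows how a prompt might combine a discharge summary with a task instruction and an explicit rationale request in a single input. The function name build_prompt, its parameters, and the example text are hypothetical illustrations and are not taken from the paper.

    # Illustrative sketch only: assembling a zero-shot prompt with rationale
    # elicitation from a discharge summary, as outlined in the abstract.
    # All names and example text below are hypothetical.

    def build_prompt(discharge_summary: str, task: str) -> str:
        """Combine patient clinical information and a task instruction into a
        single zero-shot prompt that also asks the model to explain itself."""
        return (
            "You are a clinical decision-support assistant.\n\n"
            f"Discharge summary:\n{discharge_summary}\n\n"
            f"Task: {task}\n"
            "Answer the task and briefly explain the reasoning behind your prediction."
        )

    if __name__ == "__main__":
        example_summary = "72-year-old male admitted with chest pain ..."  # truncated example
        print(build_prompt(example_summary, "State the primary diagnosis."))

In this sketch, the same builder could be reused for each of the three tasks (primary diagnosis, ICD-9 coding, readmission risk) by changing only the task string, which matches the unified input format the abstract describes.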
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,291 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,143 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,535 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,452 citations