This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Exploring the prognostic utility of large language models versus traditional clinical models in heart failure: a pilot study
Citations: 0
Authors: 7
Year: 2025
Abstract
BACKGROUND: Large language models (LLMs) show promise in clinical decision support; however, their role in risk prediction for heart failure (HF) remains uncertain.
OBJECTIVE: This pilot study evaluated the prognostic performance and reproducibility of two general-purpose LLMs, ChatGPT and DeepSeek, using structured clinical data and unstructured discharge summaries, compared with a conventional clinical model.
METHODS: Structured data from the Zigong HF study included 473 hospitalized HF patients with 33 clinical variables predicting a 90-day composite outcome of all-cause death or rehospitalization. Discharge summaries from the MIMIC-IV cohort included 2,091 ICU HF patients predicting 1-year all-cause mortality. Standardized prompts were used to obtain predicted probabilities from each LLM. Model predictions were compared with logistic regression results, and reproducibility was assessed using intraclass correlation coefficients.
RESULTS: In the Zigong HF study, both LLMs showed limited discrimination (AUC 0.59 for ChatGPT, 0.56 for DeepSeek), performing below the conventional model (AUC 0.63). In the MIMIC-IV cohort, ChatGPT achieved higher discrimination (AUC 0.72) than DeepSeek (AUC 0.67, P < 0.001) and comparable performance to the clinical model (AUC 0.74, P = 0.31). Decision curve analysis showed modest benefit for ChatGPT at low-to-moderate thresholds, while DeepSeek offered minimal benefit. Repeated predictions showed significant variability for both models.
CONCLUSIONS: This pilot study provides preliminary evidence that LLMs have limited predictive value for structured data but show comparable performance in text-based risk prediction. These findings suggest potential for LLMs in processing unstructured clinical information and highlight the need for validation in larger, contemporary cohorts.
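The abstract evaluates models with two standard metrics: AUC for discrimination and the intraclass correlation coefficient (ICC) for reproducibility of repeated predictions. As a minimal illustration (not the authors' actual analysis pipeline, whose exact ICC variant and software are not stated here), both can be computed from predicted probabilities in a few lines:

```python
# Illustrative sketch: pure-Python versions of the two metrics named in the
# abstract. `labels`/`scores`/`ratings` are hypothetical example inputs.

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Fraction of positive/negative pairs ranked correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def icc_oneway(ratings):
    """ICC(1,1): agreement of k repeated predictions over n subjects
    (one-way random-effects model; an assumption, as the paper's exact
    ICC form is not given on this page)."""
    n, k = len(ratings), len(ratings[0])
    row_means = [sum(row) / k for row in ratings]
    grand = sum(row_means) / n
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((x - m) ** 2
              for row, m in zip(ratings, row_means)
              for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Toy example: 4 patients, 2 of whom had the outcome.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))          # 0.75
# Two identical repeated predictions -> perfect reproducibility.
print(icc_oneway([[0.2, 0.2], [0.5, 0.5], [0.9, 0.9]]))  # 1.0
```

An AUC near 0.5 (as the LLMs approached on structured data) means ranking barely better than chance, while a low ICC across repeated runs is what the abstract reports as "significant variability."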
Related works
"Why Should I Trust You?"
2016 · 14.605 Zit.
Coding Algorithms for Defining Comorbidities in ICD-9-CM and ICD-10 Administrative Data
2005 · 10.522 Zit.
A Comprehensive Survey on Graph Neural Networks
2020 · 8.874 Zit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8.443 Zit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7.941 Zit.