OpenAlex · Updated hourly · Last updated: 15.05.2026, 20:23

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comparing prognostic performance and reasoning between large language models and physicians

2026 · 0 citations · medRxiv · Open Access

17 authors

Abstract

Importance: Physicians routinely prognosticate to guide care delivery and shared decision making, particularly when caring for patients with critical illnesses. Yet these physician estimates are prone to inaccuracy and uncertainty. Artificial intelligence, including large language models (LLMs), shows promise in supporting or improving this prognostication. However, the performance of contemporary LLMs in prognosticating for the heterogeneous population of critically ill patients remains poorly understood.

Objective: To characterize and compare the performance of LLMs and physicians when predicting 6-month mortality for hospitalized adults who survived critical illness.

Design: Embedded mixed methods study with elicitation and comparison of prognostic estimates and reasoning from LLMs and practicing physicians.

Setting: The publicly available, deidentified Medical Information Mart for Intensive Care (MIMIC)-IV v2.2 dataset.

Participants: We randomly selected 100 hospitalizations of adult survivors of critical illness. Four contemporary LLMs (OpenAI GPT-4o, o3- and o4-mini, and DeepSeek-R1) and 7 physicians provided independent prognostic estimates for each case (1,100 total estimates; 400 LLM and 700 physician).

Main outcomes and measures: For each case, LLMs and physicians used the hospital discharge summary and demographics to predict 6-month mortality (yes/no) and provide their reasoning (free text). We assessed prognostic performance using accuracy, sensitivity, and specificity, and used inductive, qualitative content analysis to characterize the reasoning.

Results: Mean physician accuracy for predicting mortality was 70.1% (95% CI 63.7-76.4%), with sensitivity of 59.7% (95% CI 50.6-68.8%) and specificity of 80.6% (95% CI 71.7-88.2%). The top-performing LLM (OpenAI o4-mini) achieved accuracy of 78.0% (95% CI 70.0-86.0%), with sensitivity of 80.0% (95% CI 67.4-90.2%) and specificity of 76.0% (95% CI 63.3-88.0%). The difference between mean physician and top-performing LLM accuracy was not statistically significant (p = 0.5). Qualitative analysis revealed similar patterns in LLM and physician expressed reasoning, except that physicians regularly and explicitly reported uncertainty while LLMs did not.

Conclusions and Relevance: In this study, LLMs and physicians achieved comparable, moderate performance in predicting 6-month mortality after critical illness, with similar patterns in expressed reasoning. Our findings suggest LLMs could be used to support prognostication in clinical practice but also raise safety concerns due to the lack of LLM uncertainty expression.

Key Points

Question: How do large language model (LLM) prognostic accuracy and reasoning compare to those of physicians when predicting 6-month mortality for adult survivors of critical illness?

Findings: In this embedded mixed methods study, physicians and large language models had comparable, moderate prognostic accuracy with similar expressed reasoning patterns, except that LLMs did not explicitly express uncertainty.

Meaning: Large language models may be able to support physician prognostication, although the inability of LLMs to express uncertainty poses an important safety consideration.
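The performance measures reported in the abstract (accuracy, sensitivity, specificity for binary 6-month mortality predictions) can be sketched as follows. This is an illustrative reconstruction, not the study's code; the function name and example data are assumptions.

```python
def prognostic_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) as fractions for
    binary labels: 1 = died within 6 months, 0 = survived."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)                # fraction correct overall
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")  # recall on deaths
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")  # recall on survivors
    return accuracy, sensitivity, specificity

# Illustrative example with 10 hypothetical cases (not study data):
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]
acc, sens, spec = prognostic_metrics(y_true, y_pred)
print(acc, sens, spec)  # 0.8 0.8 0.8
```

Sensitivity here is the fraction of patients who died that were correctly flagged, and specificity the fraction of survivors correctly predicted to survive, which is why the abstract reports all three rather than accuracy alone.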

Related works