This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating large language models in real-world hematologic clinical decision-making: Performance, limitations, and clinical implications
Citations: 0
Authors: 19
Year: 2025
Abstract
Background
Recent advances in Artificial Intelligence (AI), particularly in Large Language Models (LLMs) such as GPT-4o (ChatGPT) and others, have shown impressive performance in medical domains, including passing licensing exams and, in some cases, surpassing physicians in general diagnostic and reasoning tasks. However, their reliability and clinical utility in highly specialized, real-world medical settings, such as hematology diagnostics and therapy, have not been rigorously evaluated. Malignant hematology poses unique challenges due to its complex pathophysiology, layered diagnostic frameworks, and the need for nuanced, high-stakes clinical decision-making that ordinarily requires highly specialized physicians, making it an ideal testbed for assessing the true capabilities and limitations of these models.

Objectives
To evaluate how well state-of-the-art LLMs handle real-world hematology cases, focusing on their ability to make accurate diagnoses, predict outcomes, follow treatment guidelines, and suggest relevant clinical trials.

Method
We developed a test set of 30 complex, real-world clinical cases of myelodysplastic syndromes (MDS). We chose MDS as a representative hematologic malignancy due to its diagnostic complexity and need for expert subspecialty care. Each case required integration of clinical, morphological, cytogenetic, and molecular data, mirroring real-life decision-making in hematology. A standardized prompt was used to query multiple LLMs: ChatGPT (GPT-4o and GPT-o3), Claude, and DeepSeek. Models were tasked with providing a diagnosis per WHO 2022/ICC criteria, calculating IPSS-R/IPSS-M risk scores, and recommending appropriate treatment and clinical trials. Responses were independently reviewed by a blinded panel of eleven international MDS experts, who scored them on diagnostic accuracy, prognostic assessment, and treatment relevance on a 1–5 Likert scale, with a score ≥ 4 considered correct per expert opinion. Factual errors were also categorized as none, minor, or major. To evaluate the consistency of expert ratings, we used the intraclass correlation coefficient (ICC) to measure how well experts agreed on numerical scores and Cohen's κ (kappa) to assess their agreement when identifying errors.

Results
The highest-performing model was GPT-o3, achieving 58% agreement with expert clinical assessments, followed by GPT-4o (42%), DeepSeek (31%), and Claude (26%). On the 1–5 scale, the average expert-assigned scores across domains were: GPT-o3 (overall 3.48; Diagnosis 3.68, Prognosis 3.58, Treatment 3.56, Clinical Trials 3.09), GPT-4o (3.22; 3.14 / 3.20 / 3.39 / 3.16), DeepSeek (2.98; 2.92 / 3.01 / 3.15 / 2.83), and Claude (2.86; 2.72 / 2.90 / 3.09 / 2.73). Major factual errors (hallucinations) were frequent across all models, each exceeding a 25% rate: GPT-o3 and GPT-4o (both 26%), DeepSeek (33%), and Claude (36%). Minor factual error rates were similarly high: GPT-o3 and Claude (47%), DeepSeek (49%), and GPT-4o (52%). Experts showed strong agreement in their evaluations, with high consistency in scoring (ICC = 0.81) and in identifying AI errors or hallucinations (κ = 0.76), confirming the reliability of the review process.

Conclusion
Despite recent reports suggesting that models like ChatGPT have outperformed physicians in diagnostic accuracy and clinical decision-making, current state-of-the-art LLMs underperform in highly specialized and complex clinical scenarios in hematologic malignancies. Even advanced reasoning models such as GPT-o3 fall short of expert expectations. These findings underscore that general-purpose LLMs are not yet suitable for autonomous clinical use in hematology. Their deployment should be approached with caution, and further research is essential to rigorously evaluate their performance across all subdomains of hematology.
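The abstract reports inter-rater agreement via Cohen's κ for the categorical error labels (none/minor/major). As an illustration of that statistic only, here is a minimal from-scratch sketch; the two reviewers and their labels are invented for the example and are not the study's data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical error labels from two reviewers on eight model responses.
a = ["major", "none", "minor", "none", "major", "minor", "none", "none"]
b = ["major", "none", "minor", "minor", "major", "minor", "none", "major"]
print(round(cohens_kappa(a, b), 2))  # → 0.64
```

A κ of 0.76, as reported in the study, is conventionally read as substantial agreement; the correction for chance is what distinguishes κ from raw percent agreement.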
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,214 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,071 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,429 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,418 citations
Authors
- David M. Swoboda
- Amy E. DeZern
- James T. England
- Sangeetha Venugopal
- Thomas J. Kehoe
- Brandon J. Aubrey
- Marco Gabriele Raddi
- Angela Consagra
- Jiasheng Wang
- Gustavo Rivero
- Maximilian Stahl
- Amer M. Zeidan
- Torsten Haferlach
- Andrew M. Brunner
- Rena Buckstein
- Valeria Santini
- Matteo Giovanni Della Porta
- Mikkael A. Sekeres
- Aziz Nazha
Institutions
- Tampa General Hospital (US)
- Sidney Kimmel Comprehensive Cancer Center (US)
- Sunnybrook Health Science Centre (CA)
- Health Sciences Centre (CA)
- Sylvester Comprehensive Cancer Center (US)
- University of South Florida (US)
- Massachusetts General Hospital (US)
- University of Florence (IT)
- The Ohio State University (US)
- Yale University (US)
- Munich Leukemia Laboratory (DE)
- IRCCS Humanitas Research Hospital (IT)
- Sidney Kimmel Cancer Center (US)