This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Abstract PS3-04-06: Benchmarking Large Language Models for Clinical Decision Support in Breast Cancer Care: A Multi-Institutional Expert Evaluation
Citations: 0
Authors: 21
Year: 2026
Abstract
Background: Artificial intelligence (AI) and large language models (LLMs) are increasingly explored as tools to support clinical decision-making in oncology. However, evidence validating their performance in complex breast cancer clinical scenarios (BCCS) remains limited. Given breast cancer’s diverse subtypes, evolving standards of care, and the need for nuanced, personalized treatment, we compared three LLMs for treatment decision-making to assess their capabilities and determine their readiness for integration into real-world breast oncology clinics. Methods: Ten breast cancer cases mimicking real-world scenarios were posed to three LLMs: ChatGPT-4o (GPT), DeepSeek-R1 (DS), and OpenEvidence (OE). Sixteen breast medical oncologists (BMOs; assistant to full professor) from nine academic centers graded responses using a 5-point Likert scale (1 = poor, 5 = excellent) for clinical accuracy, clarity, relevance, and usability. Questions (Q) 1-9 assessed treatment decision-making cases; Q10 tested multimodal image interpretation (MMI) skills. Repeated measures ANOVA evaluated model differences, followed by Tukey’s post hoc comparisons. Results: On BCCS Q1-9 spanning all major breast cancer subtypes and treatment settings, including early-stage, metastatic, neoadjuvant, and adjuvant, OE achieved the highest mean score (3.91 ± 0.48; 2.57-4.43), significantly outperforming both GPT (3.19 ± 0.67; 2.36-3.93) and DS (2.93 ± 0.54; 1.50-3.86) in overall performance (p < 0.0001), with large effect sizes (Cohen’s d = 1.12 vs. GPT; d = 2.84 vs. DS). Repeated measures ANOVA identified significant differences among models in 5/9 (56%) Q (p < 0.05), as shown in the Table. Pairwise comparisons showed OE outperformed GPT in 4/5 (80%) and DS in 5/5 (100%) of the significant BCCS. GPT modestly outperformed DS in 2/5 (40%) (Cohen’s d = 0.41). GPT showed the highest inter-reviewer variability (SD = 0.67 vs. 0.48-0.54), indicating less agreement among BMOs on its responses.
OE provided well-supported treatment recommendations and recurrence risk assessments with citations, though it lacked MMI. DS relied on optical character recognition in Q10 (inflammatory breast cancer image; 2.93 ± 1.33), limiting its utility in image-based BCCS. GPT, the only vision-enabled model, scored highest (4.29 ± 1.14), highlighting its strong potential for MMI integration in breast cancer diagnostic workflows. Conclusions: This is the first comparison of GPT, DS, and OE in BCCS. OE generated the most guideline-concordant treatment choices across BCCS, showing strong potential as a clinical decision support tool, though its verbosity may require streamlining. GPT showed moderate performance, while DS lagged in clinical relevance and accuracy. These findings highlight the promise of LLMs in breast oncology and the need for further refinement to ensure reliability and real-world applicability. Citation Format: Z. Shah, S. S. Afridi, M. Ombada, A. M. Roy, A. LeVee, S. Premji, V. Gupta, N. M. Lopetegui, D. M. Quiroga, R. L. Sacks, S. Shaikh, Y. Abdou, H. Yu, A. Madabhushi, R. Parikh, L. N. Chaudhary, E. Levine, M. Lambertini, K. Kalinsky, S. Kabraji, S. Gandhi. Benchmarking Large Language Models for Clinical Decision Support in Breast Cancer Care: A Multi-Institutional Expert Evaluation [abstract]. In: Proceedings of the San Antonio Breast Cancer Symposium 2025; 2025 Dec 9-12; San Antonio, TX. Philadelphia (PA): AACR; Clin Cancer Res 2026;32(4 Suppl):Abstract nr PS3-04-06.
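The abstract reports pairwise Cohen's d effect sizes (e.g., d = 1.12 for OE vs. GPT) alongside mean Likert scores. As a minimal sketch of how such an effect size is computed from two sets of reviewer scores, using a pooled standard deviation; the score arrays below are invented for illustration and are not the study's data:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled standard deviation.

    Pooling weights the unbiased (ddof=1) variances by their degrees of freedom.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(
        ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    )
    return (a.mean() - b.mean()) / pooled_sd

# Hypothetical 5-point Likert ratings for two models on the same questions.
oe_scores  = [4, 4, 5, 3, 4, 4]
gpt_scores = [3, 3, 4, 3, 3, 3]
print(f"Cohen's d (OE vs. GPT): {cohens_d(oe_scores, gpt_scores):.2f}")
```

By convention, |d| around 0.8 or above is considered a large effect, which is how the reported values of 1.12 and 2.84 are characterized in the abstract.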
Authors
Institutions
- Roswell Park Comprehensive Cancer Center (US)
- SUNY Upstate Medical University (US)
- Comprehensive Cancer Center Vienna (AT)
- Centinela Hospital Medical Center (US)
- Sarah Cannon (US)
- Emory Healthcare (US)
- Piedmont Cancer Institute (US)
- The University of Texas Health Science Center at San Antonio (US)
- University of North Carolina Health Care (US)
- Emory University (US)
- Emory University Hospital (US)
- Medical College of Wisconsin (US)
- University of Genoa (IT)