This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Abstract PS3-04-06: Benchmarking Large Language Models for Clinical Decision Support in Breast Cancer Care: A Multi-Institutional Expert Evaluation
Citations: 0
Authors: 21
Year: 2026
Abstract
Background: Artificial intelligence (AI) and large language models (LLMs) are increasingly explored as tools to support clinical decision-making in oncology. However, evidence validating their performance in complex breast cancer clinical scenarios (BCCS) remains limited. Given breast cancer’s diverse subtypes, evolving standards of care, and the need for nuanced, personalized treatment, we compared three LLMs for treatment decision-making to assess their capabilities and determine their readiness for integration into real-world breast oncology clinics. Methods: Ten breast cancer cases mimicking real-world scenarios were posed to three LLMs: ChatGPT-4o (GPT), DeepSeek-R1 (DS), and OpenEvidence (OE). Sixteen breast medical oncologists (BMOs; assistant to full professor) from nine academic centers graded responses using a 5-point Likert scale (1 = poor, 5 = excellent) for clinical accuracy, clarity, relevance, and usability. Questions (Q) 1-9 assessed treatment decision-making cases; Q10 tested multimodal image interpretation (MMI) skills. Repeated measures ANOVA evaluated model differences, followed by Tukey’s post hoc comparisons. Results: On BCCS Q1-9 spanning all major breast cancer subtypes and treatment settings, including early-stage, metastatic, neoadjuvant, and adjuvant, OE achieved the highest mean score (3.91 ± 0.48; 2.57-4.43), significantly outperforming both GPT (3.19 ± 0.67; 2.36-3.93) and DS (2.93 ± 0.54; 1.50-3.86) in overall performance (p < 0.0001), with large effect sizes (Cohen’s d = 1.12 vs. GPT; d = 2.84 vs. DS). Repeated measures ANOVA identified significant differences among models in 5/9 (56%) Q (p < 0.05), as shown in the Table. Pairwise comparisons showed OE outperformed GPT in 4/5 (80%) and DS in 5/5 (100%) of the significant BCCS. GPT modestly outperformed DS in 2/5 (40%) (Cohen’s d = 0.41). GPT showed the highest inter-reviewer variability (SD = 0.67 vs. 0.48-0.54), indicating less agreement among BMOs on its responses.
OE provided well-supported treatment recommendations and recurrence risk assessments with citations, though it lacked MMI. DS relied on optical character recognition in Q10 (inflammatory breast cancer image; 2.93 ± 1.33), limiting its utility in image-based BCCS. GPT, the only vision-enabled model, scored highest (4.29 ± 1.14), highlighting its strong potential for MMI integration in breast cancer diagnostic workflows. Conclusions: This is the first comparison of GPT, DS, and OE in BCCS. OE generated the most guideline-concordant treatment choices across BCCS, showing strong potential as a clinical decision support tool, though its verbosity may require streamlining. GPT showed moderate performance, while DS lagged in clinical relevance and accuracy. These findings highlight the promise of LLMs in breast oncology and the need for further refinement to ensure reliability and real-world applicability. Citation Format: Z. Shah, S. S. Afridi, M. Ombada, A. M. Roy, A. LeVee, S. Premji, V. Gupta, N. M. Lopetegui, D. M. Quiroga, R. L. Sacks, S. Shaikh, Y. Abdou, H. Yu, A. Madabhushi, R. Parikh, L. N. Chaudhary, E. Levine, M. Lambertini, K. Kalinsky, S. Kabraji, S. Gandhi. Benchmarking Large Language Models for Clinical Decision Support in Breast Cancer Care: A Multi-Institutional Expert Evaluation [abstract]. In: Proceedings of the San Antonio Breast Cancer Symposium 2025; 2025 Dec 9-12; San Antonio, TX. Philadelphia (PA): AACR; Clin Cancer Res 2026;32(4 Suppl):Abstract nr PS3-04-06.
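The abstract reports pairwise Cohen's d effect sizes (e.g., d = 1.12 for OE vs. GPT) alongside mean Likert scores. As a minimal sketch of how such an effect size is computed from two sets of reviewer scores, using a pooled standard deviation; the score arrays below are invented for illustration and are not the study's data:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled standard deviation.

    Pooling weights the unbiased (ddof=1) variances by their degrees of freedom.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(
        ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    )
    return (a.mean() - b.mean()) / pooled_sd

# Hypothetical 5-point Likert ratings for two models on the same questions.
oe_scores  = [4, 4, 5, 3, 4, 4]
gpt_scores = [3, 3, 4, 3, 3, 3]
print(f"Cohen's d (OE vs. GPT): {cohens_d(oe_scores, gpt_scores):.2f}")
```

By convention, |d| around 0.8 or above is considered a large effect, which is how the reported values of 1.12 and 2.84 are characterized in the abstract.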
Authors
Institutions
- Roswell Park Comprehensive Cancer Center (US)
- SUNY Upstate Medical University (US)
- Comprehensive Cancer Center Vienna (AT)
- Centinela Hospital Medical Center (US)
- Sarah Cannon (US)
- Emory Healthcare (US)
- Piedmont Cancer Institute (US)
- The University of Texas Health Science Center at San Antonio (US)
- University of North Carolina Health Care (US)
- Emory University (US)
- Emory University Hospital (US)
- Medical College of Wisconsin (US)
- University of Genoa (IT)