OpenAlex · Updated hourly · Last updated: 17.03.2026, 14:05

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluating artificial intelligence (AI) as a clinical decision support tool for lung cancer treatment recommendations.

2025 · 0 citations · 13 authors · Journal of Clinical Oncology

Abstract

e20011

Background: The therapeutic landscape of lung cancer is rapidly evolving, presenting oncologists with the challenge of staying updated amidst an overwhelming influx of data. Clinical decision support (CDS) tools, including artificial intelligence (AI) and large language models (LLMs), may help bridge this gap. Evaluating the accuracy of LLMs in complex, real-world oncology scenarios is crucial to understanding their potential.

Methods: Twenty-five de-identified lung cancer cases from the fellows' clinic at Karmanos Cancer Institute, Detroit, MI, were analyzed. Two LLMs, GPT-4 (OpenAI) and Claude Opus (Anthropic), were assessed using advanced prompting techniques such as persona-based and chain-of-thought prompting. Five board-certified lung cancer oncologists from NCI-designated centers evaluated LLM-generated responses on accuracy, comprehensiveness of treatment recommendations, and supportive care planning, using a 1–5 scale. Novel insights, fabricated information, and harmful recommendations were flagged as binary outcomes. Oncologists were blinded to the LLM source and to the actual treatment decisions.

Results: Table 1 presents patient characteristics. GPT-4 achieved an average accuracy score of 4.2 (95% CI, 3.9–4.4), with 3.7 for comprehensiveness of medical/surgical treatment recommendations and 3.7 for supportive care planning. Six responses (32%) were flagged as potentially harmful, and two (8%) contained inaccuracies. Sixteen GPT-4 responses (64%) were rated trustworthy as a CDS tool. Claude Opus had an average accuracy score of 3.6 (95% CI, 3.1–4.1), scoring 3.6 for comprehensiveness of treatment recommendations and 3.5 for supportive care planning. Nine responses (36%) were flagged for potential harm, and five (20%) included inaccuracies. Eleven Claude responses (44%) were deemed trustworthy. Significant differences between models were observed in accuracy (p = 0.04) and trustworthiness (p = 0.03) using McNemar's test.
Other factors showed no statistically significant differences.

Conclusions: GPT-4 outperformed Claude Opus in accuracy and trustworthiness, but both models demonstrated limitations, including harmful recommendations and inaccuracies. These findings highlight the need for further LLM refinement before routine use as CDS tools in lung cancer treatment.

Table 1. Patient demographics and clinical characteristics.

Category     Subcategory                      Number
Age          Median (range), yr               65 (26–78)
Sex          Female                           7
             Male                             18
Histology    Adenocarcinoma                   10
             Squamous cell carcinoma (SCC)    7
             Small cell carcinoma             6
             Poorly differentiated            2
             Total                            25
Stage        NSCLC stage 3                    7
             NSCLC stage 4                    13
             Small cell, limited stage        3
             Small cell, extensive stage      2
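The abstract reports McNemar's test for comparing the two models on paired binary outcomes (e.g., whether each response was rated trustworthy). As an illustration of how this statistic works, the sketch below implements the exact McNemar test from the two discordant-pair counts; the counts used in the example are hypothetical, not the study's actual data.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact (binomial) McNemar test p-value for paired binary outcomes.

    b: number of cases where model A succeeded and model B failed
    c: number of cases where model B succeeded and model A failed
    Concordant pairs (both succeed or both fail) do not enter the test.
    """
    n = b + c                      # total discordant pairs
    k = min(b, c)
    # Two-sided p-value: probability of a split at least this extreme
    # under the null hypothesis that discordant pairs are 50/50.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical discordant counts for 25 paired case evaluations:
# model A trustworthy where B was not in 8 cases, the reverse in 1.
print(round(mcnemar_exact(8, 1), 4))  # → 0.0391
```

With few discordant pairs, the exact binomial form is preferable to the chi-squared approximation; `statsmodels.stats.contingency_tables.mcnemar` offers both variants if a library implementation is preferred.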
