This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating artificial intelligence (AI) as a clinical decision support tool for lung cancer treatment recommendations.
Citations: 0
Authors: 13
Year: 2025
Abstract
e20011

Background: The therapeutic landscape of lung cancer is rapidly evolving, presenting oncologists with the challenge of staying current amid an overwhelming influx of data. Clinical decision support (CDS) tools, including artificial intelligence (AI) and large language models (LLMs), may help bridge this gap. Evaluating the accuracy of LLMs in complex, real-world oncology scenarios is crucial to understanding their potential.

Methods: Twenty-five de-identified lung cancer cases from the fellows' clinic at Karmanos Cancer Institute, Detroit, MI, were analyzed. Two LLMs, GPT-4 (OpenAI) and Claude Opus (Anthropic), were assessed using advanced prompting techniques such as persona-based and chain-of-thought prompting. Five board-certified lung cancer oncologists from NCI-designated centers rated the LLM-generated responses on a 1–5 scale for accuracy, comprehensiveness of treatment recommendations, and supportive care planning. Novel insights, fabricated information, and harmful recommendations were flagged as binary outcomes. Oncologists were blinded to the LLM source and to the actual treatment decisions.

Results: Table 1 presents patient characteristics. GPT-4 achieved an average accuracy score of 4.2 (95% CI, 3.9–4.4), with 3.7 for comprehensiveness of medical/surgical treatment recommendations and 3.7 for supportive care planning. Six responses (32%) were flagged as potentially harmful, and two (8%) contained inaccuracies. Sixteen GPT-4 responses (64%) were rated trustworthy as a CDS tool. Claude Opus had an average accuracy score of 3.6 (95% CI, 3.1–4.1), scoring 3.6 for comprehensiveness of treatment recommendations and 3.5 for supportive care planning. Nine responses (36%) were flagged for potential harm, and five (20%) included inaccuracies. Eleven Claude responses (44%) were deemed trustworthy. By McNemar's test, the models differed significantly in accuracy (p = 0.04) and trustworthiness (p = 0.03); other measures showed no statistically significant differences.

Conclusions: GPT-4 outperformed Claude Opus in accuracy and trustworthiness, but both models demonstrated limitations, including harmful recommendations and inaccuracies. These findings highlight the need for further LLM refinement before routine use as CDS tools in lung cancer treatment.

Table 1. Patient demographics and clinical characteristics (N = 25).
Median age (range), yr: 65 (26–78)
Sex: Female 7; Male 18
Histology: Adenocarcinoma 10; Squamous cell carcinoma (SCC) 7; Small cell carcinoma 6; Poorly differentiated 2
Stage: NSCLC stage 3, 7; NSCLC stage 4, 13; Small cell limited stage, 3; Small cell extensive stage, 2
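The paired comparison reported above (McNemar's test on per-case binary flags such as trustworthiness) can be sketched in a few lines. The discordant counts below are made up for illustration; the abstract does not report the underlying 2×2 table, only the resulting p-values.

```python
import math

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test.

    b = cases flagged for model A only, c = cases flagged for model B only
    (concordant cases do not enter the test). Uses the exact binomial
    distribution on the discordant pairs rather than the chi-square
    approximation, which suits small samples like n = 25 cases.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # P(X <= k) under Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts (NOT the study's data):
# 1 case rated trustworthy only for model A, 8 only for model B.
print(round(mcnemar_exact(1, 8), 4))  # → 0.0391
```

With larger discordant counts one would typically use `statsmodels.stats.contingency_tables.mcnemar`, which offers both the exact and the chi-square form.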
Authors
Institutions
- The Barbara Ann Karmanos Cancer Institute (US)
- Wayne State University (US)
- Detroit Medical Center (US)
- Mayo Clinic Hospital (US)
- The University of Texas Health Science Center at San Antonio (US)
- Mayo Clinic in Florida (US)
- WinnMed (US)
- Jacksonville College (US)
- Mayo Clinic in Arizona (US)
- University of Alabama at Birmingham (US)