This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Revisiting LLMs and Lung Cancer Questions: How AI Responds to Common Lung Cancer Questions Two Years Later

2025 · 0 citations · Academic Radiology · Open Access

Abstract

RATIONALE AND OBJECTIVES: This study aimed to evaluate and compare the accuracy, consistency, and reliability of five state-of-the-art, publicly available LLM chatbots (GPT-o3-mini, Gemini, DeepSeek R1, Claude, and Perplexity) in answering standardized lung cancer-related questions, building upon prior assessments of earlier models in 2023.

METHODS: Forty lung cancer-related questions were developed by experienced radiologists, based on Lung-RADS and Fleischner Society guidelines. On February 16, 2025, each of the five models answered these standardized questions. Responses were independently graded by three radiology experts using a majority-voting system to classify answers as correct, partially correct, incorrect, or no answer (refusal). Statistical analyses included accuracy calculations, logistic regression models to compare response correctness across chatbots, and inter-rater agreement assessments (Cohen's Kappa).

RESULTS: GPT-o3-mini achieved the highest accuracy (75.83% correct responses), closely followed by Gemini (74.17%). Claude demonstrated the highest rate of partially correct answers (17.5%). Statistical analysis showed GPT-o3-mini had significantly higher odds of fully correct answers compared to Claude (p=0.02). DeepSeek R1 showed the highest inter-rater agreement, while GPT-o3-mini had substantial disagreement among raters on weighted Kappa. Notably, GPT-o3-mini identified intentionally incorrect Lung-RADS questions more reliably than DeepSeek or Gemini.

CONCLUSION: GPT-o3-mini showed the best overall accuracy among the models tested but had notable inter-rater variability. While GPT-o3-mini and Gemini showed improved accuracy (75.83% and 74.17%) over their older iterations in prior assessments (GPT-3.5 70.8%, Bard 51.7%), ongoing issues persist. Nearly two years later, problems such as inconsistent performance, incorrect responses, and refusals emphasize the continued importance of expert oversight.
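The grading pipeline named in the abstract (three raters, majority vote, accuracy, Cohen's Kappa) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the toy grades, the choice to leave a three-way split unresolved, and the use of unweighted two-rater Kappa (the abstract mentions weighted Kappa but gives no details). It is not the authors' code and uses no study data.

```python
from collections import Counter

def majority_label(grades):
    """Return the label given by more than half of the raters, else None.

    How the study resolved three-way splits is not stated in the abstract;
    returning None here is an assumption of this sketch.
    """
    label, count = Counter(grades).most_common(1)[0]
    return label if count > len(grades) / 2 else None

def accuracy(final_labels):
    """Percentage of responses whose majority label is 'correct'."""
    hits = sum(label == "correct" for label in final_labels)
    return 100.0 * hits / len(final_labels)

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa for two raters grading the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical grades from three raters for four responses (not study data).
toy_grades = [
    ("correct", "correct", "partially correct"),
    ("incorrect", "incorrect", "incorrect"),
    ("correct", "partially correct", "incorrect"),  # three-way split
    ("correct", "correct", "correct"),
]

final = [majority_label(g) for g in toy_grades]
rater_1 = [g[0] for g in toy_grades]
rater_2 = [g[1] for g in toy_grades]

print(final)
print(accuracy(final))
print(cohen_kappa(rater_1, rater_2))
```

With three raters, a pairwise statistic like this would be computed for each rater pair; a weighted variant would additionally penalize "correct" versus "incorrect" disagreements more than "correct" versus "partially correct" ones.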

Topics

Artificial Intelligence in Healthcare and Education · Radiology practices and education · AI in Service Interactions
