OpenAlex · Updated hourly · Last updated: 20.05.2026, 01:15

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Revisiting LLMs and Lung Cancer Questions: How AI Responds to Common Lung Cancer Questions Two Years Later

2025 · 0 citations · Academic Radiology · Open Access

Citations: 0 · Authors: 6 · Year: 2025

Abstract

RATIONALE AND OBJECTIVES: This study aimed to evaluate and compare the accuracy, consistency, and reliability of five state-of-the-art, publicly available LLMs (chatbots), namely GPT-o3-mini, Gemini, DeepSeek R1, Claude, and Perplexity, in answering standardized lung cancer-related questions, building upon prior assessments of earlier models in 2023.

METHODS: Forty lung cancer-related questions were developed by experienced radiologists, based on Lung-RADS and Fleischner Society guidelines. On February 16, 2025, each of the five models answered these standardized questions. Responses were independently graded by three radiology experts, and a majority-voting system was used to classify each answer as correct, partially correct, incorrect, or no answer (refusal). Statistical analyses included accuracy calculations, logistic regression models to compare response correctness across chatbots, and inter-rater agreement assessments (Cohen's Kappa).

RESULTS: GPT-o3-mini achieved the highest accuracy (75.83% correct responses), closely followed by Gemini (74.17%). Claude demonstrated the highest rate of partially correct answers (17.5%). Statistical analysis showed that GPT-o3-mini had significantly higher odds of fully correct answers than Claude (p=0.02). DeepSeek R1 showed the highest inter-rater agreement, while GPT-o3-mini had substantial disagreement among raters on weighted Kappa. Notably, GPT-o3-mini identified intentionally incorrect Lung-RADS questions more reliably than DeepSeek or Gemini.

CONCLUSION: GPT-o3-mini showed the best overall accuracy among the models tested but had notable inter-rater variability. While GPT-o3-mini and Gemini showed improved accuracy (75.83% and 74.17%) over their older iterations in prior assessments (GPT-3.5 70.8%, Bard 51.7%), problems persist. Nearly two years later, inconsistent performance, incorrect responses, and refusals underscore the continued importance of expert oversight.
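The grading procedure described in the abstract (three raters, majority vote, accuracy, Cohen's Kappa) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration only: the function names, the tie-handling rule, and the four example ratings are hypothetical and are not taken from the study. The sketch also computes plain (unweighted) Cohen's Kappa for one pair of raters, whereas the abstract reports a weighted Kappa whose weighting scheme is not specified.

```python
from collections import Counter

def majority_label(grades):
    """Return the label given by more than half of the raters, or None if no majority exists."""
    label, count = Counter(grades).most_common(1)[0]
    return label if count > len(grades) / 2 else None

def accuracy(final_labels):
    """Share of answers whose final label is 'correct' (answers without a majority count as not correct)."""
    return sum(1 for label in final_labels if label == "correct") / len(final_labels)

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa for two raters who graded the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # Agreement expected by chance, from each rater's marginal label frequencies.
    expected = sum(freq_a[l] * freq_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical grades: one tuple per answer, one entry per rater (NOT study data).
ratings = [
    ("correct", "correct", "correct"),
    ("correct", "partially correct", "correct"),
    ("incorrect", "incorrect", "partially correct"),
    ("correct", "partially correct", "incorrect"),  # three-way split, no majority
]

final = [majority_label(r) for r in ratings]
acc = accuracy(final)
kappa_12 = cohen_kappa([r[0] for r in ratings], [r[1] for r in ratings])
print(final, acc, round(kappa_12, 3))
```

With three raters, a pairwise statistic like this would have to be computed for each rater pair; how the study aggregated agreement across raters is not stated in the abstract.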

Similar works