This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Evaluating Large Reasoning Models Versus Human Multidisciplinary Teams in Lung Cancer Decision-Making: A Real-World Study (Preprint)
0
Citations
13
Authors
2026
Year
Abstract
<sec> <title>BACKGROUND</title> Large Language Models (LLMs) and Large Reasoning Models (LRMs) have shown excellent performance on medical benchmarks, but evaluations in real-world medical workflows are still lacking. Lung cancer care is particularly dependent on multidisciplinary team (MDT) integration of radiology, pathology, staging, and treatment planning, making it a high-bar setting for evaluating LRMs. </sec> <sec> <title>OBJECTIVE</title> To compare the quality of recommendations generated by two LRMs (GPT-5-Thinking and Deepseek-v3-r1) with each other and with human MDT decisions in real-world lung cancer cases, and to assess whether MDT awareness of the AI comparison influences MDT decision quality. </sec> <sec> <title>METHODS</title> Single-centre real-world comparative study of 100 consecutive lung cancer MDT cases (50 retrograde, 50 anterograde) from the University Hospital of Split, Croatia. For each case, de-identified structured reports (containing all relevant patient and case data but excluding MDT conclusions) were submitted once to GPT-5-Thinking and Deepseek-v3-r1 to generate recommendations for radiologic diagnostics, pathologic diagnostics, and oncologic therapy, plus an overall usefulness rating. Two independent lung oncologists graded MDT decisions and model outputs on 1–5 Likert scales. An average recommendation score (avg_rec) was calculated as the mean of the radiology, pathology, and therapy scores. Analyses used Wilcoxon signed-rank tests for paired model comparisons, Mann–Whitney U tests for between-phase comparisons, and Spearman correlations (two-sided α=0.05). </sec> <sec> <title>RESULTS</title> Ratings showed ceiling effects. 
In the retrograde phase (n=50), GPT-5-Thinking scored higher than Deepseek-v3-r1 for radiologic diagnostics (mean 4.89 vs 4.76; mean paired difference +0.24; Wilcoxon Z=3.30, P<.001), oncologic therapy (4.82 vs 4.18; +0.63; Z=3.53, P<.001), and usefulness (4.82 vs 4.18; +0.63; Z=3.65, P<.001); pathologic diagnostics were similar (4.88 vs 4.73; +0.14; Z=1.44, P=.15). In the anterograde phase (n=50), GPT-5-Thinking remained higher for radiology (4.94 vs 4.64; +0.30; Z=3.67, P<.001) and pathology (4.96 vs 4.78; +0.18; Z=2.66, P=.008), with smaller differences for therapy (4.46 vs 4.20; +0.26; Z=1.17, P=.24) and usefulness (4.50 vs 4.16; +0.34; Z=1.55, P=.12). Compared with the MDT grade, GPT-5-Thinking avg_rec exceeded MDT in both phases (retrograde 4.90 vs 4.14; mean paired difference +0.76 [95% CI 0.58 to 0.93]; Wilcoxon Z=6.59, P<.001; anterograde 4.79 vs 4.34; +0.45 [95% CI 0.22 to 0.67]; Z=3.59, P<.001). Deepseek-v3-r1 exceeded MDT in the retrograde phase (4.56 vs 4.14; +0.41 [95% CI 0.19 to 0.64]; Z=3.39, P<.001) but not in the anterograde phase (4.54 vs 4.34; +0.20 [95% CI −0.05 to 0.45]; Z=1.44, P=.15). MDT grades did not differ between phases (mean 4.14 vs 4.34; Mann–Whitney U=1033.0, P=.13). </sec> <sec> <title>CONCLUSIONS</title> In 100 real-world lung cancer MDT cases, both LRMs produced high-quality recommendations, with GPT-5-Thinking consistently outperforming Deepseek-v3-r1 in key diagnostic domains and exceeding expert-graded MDT decision quality in both study phases. MDT decision quality did not significantly change with awareness of AI benchmarking. These findings support LRMs as structured decision-support tools for MDT workflows, while underscoring the need for explicit clinician oversight and governance. </sec>
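The statistical workflow described in METHODS (per-case avg_rec as the mean of the three domain scores, a paired Wilcoxon signed-rank test for model-vs-MDT comparisons, and a Mann–Whitney U test for between-phase comparisons) can be sketched as follows. The scores below are randomly generated placeholders, not the study data, and all variable names are illustrative assumptions.

```python
# Sketch of the abstract's analysis pipeline on illustrative placeholder
# scores (NOT the study data). Assumes SciPy is available.
import numpy as np
from scipy.stats import wilcoxon, mannwhitneyu

rng = np.random.default_rng(0)
n_cases = 50  # per phase, as in the study design

# Per-case 1-5 Likert grades for radiology, pathology, therapy (placeholders).
model_scores = rng.integers(4, 6, size=(n_cases, 3)).astype(float)
mdt_scores = rng.integers(3, 6, size=n_cases).astype(float)

# avg_rec: mean of the radiology, pathology, and therapy scores per case.
avg_rec = model_scores.mean(axis=1)

# Paired model-vs-MDT comparison on the same cases: Wilcoxon signed-rank.
w_stat, w_p = wilcoxon(avg_rec, mdt_scores)

# Between-phase comparison (independent groups): Mann-Whitney U.
# Here the two halves of the placeholder data stand in for the two phases.
u_stat, u_p = mannwhitneyu(avg_rec[:25], avg_rec[25:])

print(f"avg_rec mean={avg_rec.mean():.2f}, Wilcoxon P={w_p:.3f}, U P={u_p:.3f}")
```

Note the difference between the two tests: the Wilcoxon signed-rank test requires paired observations (each case graded under both conditions), whereas the Mann–Whitney U test compares two independent samples, which is why the study uses it only for the retrograde-vs-anterograde phase comparison.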
Similar Works
New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1)
2008 · 28,906 citations
TNM Classification of Malignant Tumours
1987 · 16,123 citations
A survey on deep learning in medical image analysis
2017 · 13,591 citations
Reduced Lung-Cancer Mortality with Low-Dose Computed Tomographic Screening
2011 · 10,770 citations
The American Joint Committee on Cancer: the 7th Edition of the AJCC Cancer Staging Manual and the Future of TNM
2010 · 9,110 citations