OpenAlex · Updated hourly · Last updated: 27.03.2026, 00:36

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluating Large Reasoning Models Versus Human Multidisciplinary Teams in Lung Cancer Decision-Making: A Real-World Study (Preprint)

2026 · 0 citations · 13 authors · Open Access

Abstract

BACKGROUND: Large Language Models (LLMs) and Large Reasoning Models (LRMs) have shown excellent performance on medical benchmarks, but evaluations in real-world medical workflows are still lacking. Lung cancer care depends heavily on multidisciplinary team (MDT) integration of radiology, pathology, staging, and treatment planning, making it a demanding setting for evaluating LRMs.

OBJECTIVE: To compare the quality of recommendations generated by two LRMs (GPT-5-Thinking and Deepseek-v3-r1) with each other and with human MDT decisions in real-world lung cancer cases, and to assess whether MDT awareness of the AI comparison influences MDT decision quality.

METHODS: Single-centre, real-world comparative study of 100 consecutive lung cancer MDT cases (50 retrograde, 50 anterograde) from the University Hospital of Split, Croatia. For each case, de-identified structured reports (containing all required patient and case data but excluding MDT conclusions) were submitted once to GPT-5-Thinking and Deepseek-v3-r1 to generate recommendations for radiologic diagnostics, pathologic diagnostics, and oncologic therapy, plus an overall usefulness rating. Two independent lung oncologists graded MDT decisions and model outputs on 1–5 Likert scales. An average recommendation score (avg_rec) was calculated as the mean of the radiology, pathology, and therapy scores. Analyses used Wilcoxon signed-rank tests for paired model comparisons, Mann–Whitney U tests for between-phase comparisons, and Spearman correlations (two-sided α=0.05).

RESULTS: Ratings showed ceiling effects. In the retrograde phase (n=50), GPT-5-Thinking scored higher than Deepseek-v3-r1 for radiologic diagnostics (mean 4.89 vs 4.76; mean paired difference +0.24; Wilcoxon Z=3.30, P<.001), oncologic therapy (4.82 vs 4.18; +0.63; Z=3.53, P<.001), and usefulness (4.82 vs 4.18; +0.63; Z=3.65, P<.001); pathologic diagnostics were similar (4.88 vs 4.73; +0.14; Z=1.44, P=.15). In the anterograde phase (n=50), GPT-5-Thinking remained higher for radiology (4.94 vs 4.64; +0.30; Z=3.67, P<.001) and pathology (4.96 vs 4.78; +0.18; Z=2.66, P=.008), with smaller differences for therapy (4.46 vs 4.20; +0.26; Z=1.17, P=.24) and usefulness (4.50 vs 4.16; +0.34; Z=1.55, P=.12). Compared with the MDT grade, GPT-5-Thinking avg_rec exceeded the MDT in both phases (retrograde 4.90 vs 4.14; mean paired difference +0.76 [95% CI 0.58–0.93]; Wilcoxon Z=6.59, P<.001; anterograde 4.79 vs 4.34; +0.45 [95% CI 0.22–0.67]; Z=3.59, P<.001). Deepseek-v3-r1 exceeded the MDT in the retrograde phase (4.56 vs 4.14; +0.41 [95% CI 0.19–0.64]; Z=3.39, P<.001) but not in the anterograde phase (4.54 vs 4.34; +0.20 [95% CI −0.05 to 0.45]; Z=1.44, P=.15). MDT grades did not differ between phases (mean 4.14 vs 4.34; Mann–Whitney U=1033.0, P=.13).

CONCLUSIONS: In 100 real-world lung cancer MDT cases, both LRMs produced high-quality recommendations, with GPT-5-Thinking consistently outperforming Deepseek-v3-r1 in key diagnostic domains and exceeding expert-graded MDT decision quality in both study phases. MDT decision quality did not change significantly with awareness of AI benchmarking. These findings support LRMs as structured decision-support tools for MDT workflows, while underscoring the need for explicit clinician oversight and governance.
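The paired analysis described in the Methods (per-case avg_rec as the mean of the radiology, pathology, and therapy scores, compared with a two-sided Wilcoxon signed-rank test) can be sketched as follows. The scores below are hypothetical illustrations, not the study's data, and SciPy's `wilcoxon` is assumed as the test implementation:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical 1-5 Likert ratings for ten paired cases (illustrative only):
# columns are radiology, pathology, and therapy scores per case.
model_scores = np.array([
    [5, 5, 5], [5, 4, 5], [4, 5, 4], [5, 5, 5], [5, 5, 4],
    [5, 4, 5], [4, 5, 5], [5, 5, 5], [5, 4, 4], [5, 5, 5],
])
mdt_scores = np.array([
    [4, 4, 4], [5, 4, 4], [4, 4, 4], [4, 5, 4], [5, 4, 4],
    [4, 4, 5], [4, 4, 4], [5, 4, 4], [4, 4, 4], [4, 5, 4],
])

# avg_rec: per-case mean of the three domain scores.
model_avg = model_scores.mean(axis=1)
mdt_avg = mdt_scores.mean(axis=1)

# Two-sided Wilcoxon signed-rank test on the paired per-case averages.
stat, p = wilcoxon(model_avg, mdt_avg)
print(f"mean paired difference: {np.mean(model_avg - mdt_avg):+.2f}")
print(f"Wilcoxon statistic={stat:.1f}, P={p:.3f}")
```

The same pairing logic applies to each individual domain score; the between-phase comparison in the study instead uses an unpaired Mann–Whitney U test, since retrograde and anterograde cases are different patients.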


Topics

Radiomics and Machine Learning in Medical Imaging · Artificial Intelligence in Healthcare and Education · Lung Cancer Diagnosis and Treatment