OpenAlex · Updated hourly · Last updated: 20.03.2026, 11:47

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Source matters: Performance of guideline-anchored RAG versus broad evidence LLMs in GI oncology.

2026 · 0 citations · Journal of Clinical Oncology
Open full text at the publisher

Citations: 0 · Authors: 6 · Year: 2026

Abstract

805

Background: Adherence to NCCN guidelines in GI oncology is inconsistent, and large language models (LLMs) can stray from evidence-based care via hallucinations or outdated references. We examined whether an LLM's evidence source influences its decision support quality by comparing baseline GPT-4 (O1), a retrieval-augmented model anchored to NCCN GI guidelines (RAG; O2), and an OpenEvidence model retrieving broadly from high-impact journals (O3).

Methods: Fifty GI oncology case vignettes were used to test each model (O3 failed to produce output for 2 cases). Outputs were scored with a modified Generative Performance Score (mGPS, range -1.0 to 1.0), combining Guideline Concordance (a five-tier rubric from non-concordant to current standard therapy, focusing on correct regimen, required qualifiers, and recognition of missing information) and Hallucination Severity (deductions of 0 to -1.0 for minor to critical errors such as fabricated trials, unsafe doses, or incorrect biomarker–therapy links). Each output's readability and rationale clarity were rated on a 5-point Likert scale. Two oncologists independently scored all cases with substantial agreement (Cohen's κ = 0.74). We used repeated-measures ANOVA to compare performance across models, using cases where all three outputs were available.

Results: Mean mGPS scores (±SD) were: O1 0.53 ± 0.37; O2 0.70 ± 0.45; O3 0.30 ± 0.45. The difference across models was significant (F(2,90) = 18.62, p < 0.000001), with O2 scoring highest. Readability/rationale clarity was also highest for O2 (mean 4.12 vs 3.38 for O1 and 3.06 for O3). Hallucination severity was lowest with O2, whereas O1 and O3 outputs more often contained serious errors, consistent with broad retrieval increasing off-target information.

Conclusions: An LLM's evidence source materially impacts its clinical performance. The NCCN-anchored RAG model (O2) produced the most guideline-concordant treatment recommendations with the clearest rationale and fewest hallucinations, outperforming both baseline GPT-4 (O1) and the OpenEvidence model (O3). These findings suggest that grounding LLMs in a single up-to-date guideline repository (e.g., NCCN) can limit off-target or outdated information and unsafe suggestions, whereas pulling from a broad, heterogeneous literature increases the risk of hallucinations and misapplied evidence.
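The mGPS described in the Methods can be sketched as a simple scoring function. The tier point values below are illustrative assumptions — the abstract specifies only the overall range (-1.0 to 1.0), a five-tier concordance rubric, and hallucination deductions from 0 to -1.0:

```python
# Illustrative sketch of the modified Generative Performance Score (mGPS).
# Tier point values are assumed; the abstract defines only the overall
# range (-1.0 to 1.0) and hallucination deductions of 0 to -1.0.

CONCORDANCE_TIERS = {
    "non_concordant": 0.0,               # hypothetical tier values
    "partially_concordant": 0.25,
    "concordant_missing_qualifiers": 0.5,
    "concordant": 0.75,
    "current_standard_therapy": 1.0,
}

def mgps(tier: str, hallucination_deduction: float) -> float:
    """Combine guideline concordance with a hallucination penalty.

    hallucination_deduction: 0.0 (no errors) down to -1.0 (critical error,
    e.g. a fabricated trial, unsafe dose, or wrong biomarker-therapy link).
    """
    if not -1.0 <= hallucination_deduction <= 0.0:
        raise ValueError("deduction must lie in [-1.0, 0.0]")
    score = CONCORDANCE_TIERS[tier] + hallucination_deduction
    # Clamp to the published score range.
    return max(-1.0, min(1.0, score))
```

Under these assumed tier values, an output matching current standard therapy with no hallucinations scores 1.0, while a non-concordant output with a critical fabrication bottoms out at -1.0.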

Topics

Meta-analysis and systematic reviews · Artificial Intelligence in Healthcare and Education · Economic and Financial Impacts of Cancer