This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Source matters: Performance of guideline-anchored RAG versus broad evidence LLMs in GI oncology.
Citations: 0
Authors: 6
Year: 2026
Abstract
805 Background: Adherence to NCCN guidelines in GI oncology is inconsistent, and large language models (LLMs) can stray from evidence-based care via hallucinations or outdated references. We examined whether an LLM’s evidence source influences its decision-support quality by comparing baseline GPT-4 (O1), a retrieval-augmented model anchored to NCCN GI guidelines (RAG; O2), and an OpenEvidence model retrieving broadly from high-impact journals (O3).

Methods: Fifty GI oncology case vignettes were used to test each model (O3 failed to produce output for 2 cases). Outputs were scored with a modified Generative Performance Score (mGPS, range -1.0 to 1.0), combining Guideline Concordance (five-tier rubric from non-concordant to current standard therapy, focusing on correct regimen, required qualifiers, and recognition of missing information) and Hallucination Severity (deductions 0 to -1.0 for minor to critical errors such as fabricated trials, unsafe doses, or incorrect biomarker–therapy links). Each output’s readability and rationale clarity were rated on a 5-point Likert scale. Two oncologists independently scored all cases with substantial agreement (Cohen’s κ = 0.74). We used repeated-measures ANOVA to compare performance across models (using cases where all three outputs were available).

Results: For mGPS, mean scores (±SD) were: O1 0.53 ± 0.37; O2 0.70 ± 0.45; O3 0.30 ± 0.45. The difference across models was significant (F(2,90) = 18.62, p < 0.000001), with O2 scoring highest. Readability/rationale clarity was also highest for O2 (mean 4.12 vs 3.38 for O1 and 3.06 for O3). Hallucination severity was lowest with O2, whereas O1 and O3 outputs more often contained serious errors, consistent with broad retrieval increasing off-target information.

Conclusions: An LLM’s evidence source materially impacts its clinical performance. The NCCN-anchored RAG model (O2) produced the most guideline-concordant treatment recommendations with the clearest rationale and fewest hallucinations, outperforming both baseline GPT-4 (O1) and the OpenEvidence model (O3). These findings suggest that grounding LLMs in a single up-to-date guideline repository (e.g., NCCN) can limit off-target or outdated information and unsafe suggestions, whereas pulling from a broad, heterogeneous literature increases the risk of hallucinations and misapplied evidence.
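As a reading aid, the Methods paragraph can be made concrete with a minimal Python sketch of the scoring and analysis pipeline. Only the mGPS range (-1.0 to 1.0), the deduction range (0 to -1.0), and the use of repeated-measures ANOVA on complete cases come from the abstract; the tier names, their even spacing, the column names, and the toy data below are assumptions of this sketch, not details from the paper.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Five-tier Guideline Concordance rubric mapped onto [0.0, 1.0].
# Tier labels and even spacing are assumptions of this sketch.
TIER_SCORE = {
    "non_concordant": 0.00,
    "minimally_concordant": 0.25,
    "partially_concordant": 0.50,
    "concordant_with_gaps": 0.75,
    "current_standard_therapy": 1.00,
}

def mgps(tier: str, hallucination_deduction: float) -> float:
    """Modified Generative Performance Score: concordance plus a
    Hallucination Severity deduction in [-1.0, 0.0], yielding the
    [-1.0, 1.0] range stated in the abstract."""
    if not -1.0 <= hallucination_deduction <= 0.0:
        raise ValueError("deduction must lie in [-1.0, 0.0]")
    return TIER_SCORE[tier] + hallucination_deduction

# Toy long-format data: one mGPS value per (case, model), restricted to
# cases where all three models produced output, as described in Methods.
df = pd.DataFrame({
    "case":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "model": ["O1", "O2", "O3"] * 3,
    "mgps":  [0.50, 0.75, 0.25, 0.75, 1.00, 0.00, 0.25, 0.50, 0.50],
})

# Repeated-measures ANOVA with model as the within-subject factor and
# case as the subject, mirroring the comparison across the three models.
result = AnovaRM(df, depvar="mgps", subject="case", within=["model"]).fit()
print(result)
```

Inter-rater agreement on the categorical tiers could, for example, be checked with sklearn.metrics.cohen_kappa_score on the two oncologists' tier assignments before the scores are combined.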
Related works
The PRISMA 2020 statement: an updated guideline for reporting systematic reviews
2021 · 85,575 citations
Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement
2009 · 82,820 citations
The Measurement of Observer Agreement for Categorical Data
1977 · 77,011 citations
Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement
2009 · 62,852 citations
Measuring inconsistency in meta-analyses
2003 · 61,558 citations