OpenAlex · Updated hourly · Last updated: 20 March 2026, 17:17

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluating Web Retrieval–Assisted Large Language Models With and Without Whitelisting for Evidence-Based Neurology: Comparative Study (Preprint)

2025 · 0 citations · 4 authors

Abstract

BACKGROUND: Large language models (LLMs) coupled with real-time web retrieval are reshaping how clinicians and patients locate medical evidence, and as major search providers fuse LLMs into their interfaces, this hybrid approach might become the new “gateway” to the internet. However, open-web retrieval exposes models to nonprofessional sources, risking hallucinations and factual errors that might jeopardize evidence-based care.

OBJECTIVE: We aimed to quantify the impact of guideline-domain whitelisting on the answer quality of 3 publicly available Perplexity web-based retrieval-augmented generation (RAG) models and compare their performance using a purpose-built, biomedical literature RAG system (OpenEvidence).

METHODS: We applied a validated 130-item question set derived from the American Academy of Neurology (AAN) guidelines (65 factual and 65 case based). Perplexity Sonar, Sonar-Pro, and Sonar-Reasoning-Pro were each queried 4 times per question with open-web retrieval and again with retrieval restricted to aan.com and neurology.org (“whitelisted”). OpenEvidence was queried 4 times. Two neurologists, blinded to condition, scored each response (0=wrong, 1=inaccurate, and 2=correct); any disagreements that arose were resolved by a third neurologist. Ordinal logistic models were used to assess the influence of question type and source category (AAN or neurology vs nonprofessional) on accuracy.

RESULTS: From the 3640 LLM answers that were rated (interrater agreement: κ=0.86), correct-answer rates were as follows (open vs whitelisted, respectively): Sonar, 60% vs 78%, Sonar-Pro, 80% vs 88%, and Sonar-Reasoning-Pro, 81% vs 89%; for OpenEvidence, the correct-answer rate was 82%. A Friedman test on modal scores across the 7 configurations was significant (χ²(6)=73.7; P<.001). Whitelisting improved mean accuracy on the 0 to 2 scale by 0.23 for Sonar (95% CI 0.12-0.34), 0.08 for Sonar-Pro (95% CI 0.01-0.16), and 0.08 for Sonar-Reasoning-Pro (95% CI 0.02-0.13). Including ≥1 nonprofessional source halved the odds of a higher rating in Sonar (odds ratio [OR] 0.50, 95% CI 0.37-0.66; P<.001), whereas citing an AAN or neurology document doubled it (OR 2.18, 95% CI 1.64-2.89; P<.001). Furthermore, factual questions outperformed case vignettes across Perplexity models (ORs ranged from 1.95, 95% CI 1.28-2.98 [Sonar + whitelisting] to 4.28, 95% CI 2.59-7.09 [Sonar-Reasoning-Pro]; all P<.01) but not for OpenEvidence (OR 1.44, 95% CI 0.92-2.27; P=.11).

CONCLUSIONS: Restricting retrieval to authoritative neurology domains yielded a clinically meaningful 8 to 18 percentage-point gain in correctness and halved output variability, upgrading a consumer search assistant to a decision-support-level tool that at least performed on par with a specialized literature engine. Lightweight source control is therefore a pragmatic safety lever for maintaining continuously updated, web-based RAG-augmented LLMs fit for evidence-based neurology.
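
As a reading aid, the study design and headline figures reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is not part of the study; it only recombines numbers stated in the abstract, and all variable names are illustrative assumptions.

```python
# Consistency check on figures quoted in the abstract.
# All names are illustrative; the numbers are copied from the abstract.

QUESTIONS = 130            # 65 factual + 65 case-based items
RUNS_PER_QUESTION = 4      # each configuration was queried 4 times per question
PERPLEXITY_MODELS = ["Sonar", "Sonar-Pro", "Sonar-Reasoning-Pro"]
RETRIEVAL_MODES = ["open", "whitelisted"]

# 3 Perplexity models x 2 retrieval modes, plus OpenEvidence as a single configuration.
configurations = len(PERPLEXITY_MODELS) * len(RETRIEVAL_MODES) + 1
total_answers = configurations * QUESTIONS * RUNS_PER_QUESTION

# Correct-answer rates in percent: (open web, whitelisted).
correct_rate_pct = {
    "Sonar": (60, 78),
    "Sonar-Pro": (80, 88),
    "Sonar-Reasoning-Pro": (81, 89),
}

# Percentage-point gain from whitelisting for each model.
gains_pp = {
    model: whitelisted - open_web
    for model, (open_web, whitelisted) in correct_rate_pct.items()
}

print(f"configurations: {configurations}")
print(f"rated answers:  {total_answers}")
for model, gain in gains_pp.items():
    print(f"{model}: +{gain} percentage points")
```

The product of configurations, questions, and runs reproduces the 3640 rated answers, and the per-model differences span the 8 to 18 percentage-point range cited in the conclusions.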

Topics

Artificial Intelligence in Healthcare and Education · Topic Modeling · Text Readability and Simplification