OpenAlex · Updated hourly · Last updated: 19 March 2026, 11:01

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Promises and Pitfalls of Large Language Models use to interpret Healthcare Guidelines

2025 · 0 citations · 11 authors

Abstract

Healthcare guidelines provide evidence-based recommendations for disease management, but their complexity can make interpretation challenging in patient-specific contexts. Large Language Models (LLMs) have been proposed to help query such guidelines, but they may produce inaccurate or incomplete responses when applied to clinical tasks. Retrieval-Augmented Generation (RAG) methods, which enhance LLM performance by incorporating relevant excerpts from healthcare guidelines, are one strategy for improving accuracy. In this study, we present a threefold contribution toward evaluating LLMs for healthcare guideline interpretation. First, in collaboration with board-certified physicians, we developed GuidelineQA, a clinically curated, in-house question-answer dataset covering three widely used guidelines on cardiovascular disease, diabetes, and colon cancer prevention. The dataset includes a variety of questions: those commonly needed by physicians, a frequently asked patient question, and adversarial questions designed to challenge the LLMs. Second, we assessed the performance of two state-of-the-art models, GPT-4o (closed-source) and LLaMA-2-7B (open-source), under both standard (non-augmented) and RAG settings. Third, we conducted a comprehensive evaluation using both quantitative metrics (e.g., BERTScore) and qualitative human assessments.
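To make the RAG setting mentioned in the abstract concrete, the following is a minimal sketch of the general idea: pick the guideline excerpts most similar to a question and place them in the prompt. It is not the authors' pipeline. The bag-of-words cosine similarity, the function names (`tokenize`, `cosine`, `retrieve`, `build_prompt`), the prompt wording, and the placeholder excerpts are all assumptions made for illustration; a real system would more likely use learned embeddings and actual guideline text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, excerpts, k=1):
    """Return the k excerpts most similar to the question (hypothetical helper)."""
    q = Counter(tokenize(question))
    ranked = sorted(
        excerpts,
        key=lambda e: cosine(q, Counter(tokenize(e))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, excerpts, k=1):
    """Assemble an augmented prompt: retrieved excerpts first, then the question."""
    context = "\n".join(f"- {e}" for e in retrieve(question, excerpts, k))
    return (
        f"Guideline excerpts:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the excerpts above."
    )


# Placeholder excerpts invented for illustration; not real guideline text.
EXCERPTS = [
    "Placeholder excerpt about blood pressure targets in adults.",
    "Placeholder excerpt about screening intervals for colon cancer.",
    "Placeholder excerpt about glucose monitoring in diabetes.",
]

if __name__ == "__main__":
    print(build_prompt("How often should colon cancer screening happen?", EXCERPTS))
```

In the standard (non-augmented) setting the model would receive only the question; in the sketch above, the augmented prompt additionally carries the retrieved excerpt, which is the difference the study's two settings are meant to compare.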
