OpenAlex · Updated hourly · Last updated: 20.03.2026, 02:36

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Large Language Models as Tools for Clinical Guideline Development

2025 · 0 Citations · American Journal of Respiratory and Critical Care Medicine

0 Citations · 8 Authors · Year: 2025

Abstract

Rationale: Clinical practice guidelines are essential for evidence-based management, yet their development is rigorous and time-consuming. Large language models (LLMs) such as ChatGPT-4 offer the potential to streamline the literature review process by selecting relevant articles for guideline development. However, the effectiveness of LLMs in conducting comprehensive analyses is unclear due to limited transparency regarding training data and accessible resources, and research has not yet assessed their utility in aiding guideline development. This study aimed to evaluate ChatGPT-4's performance in selecting articles relevant to guideline development.

Methods: We conducted a comparative study of traditional vs. LLM methods for article selection in guideline development. We focused on specific evidence synthesis steps described in the ATS Clinical Practice Guideline Development Manual: (1) database searches; (2) title and abstract screening; and (3) full-text screening. We selected three published ATS guidelines on interstitial lung disease diagnosis and management because the associated guideline methodologist (KW) kept detailed records of the search strategies and article selections. From the selected guidelines, we identified four representative guideline questions, each following a population-intervention-comparison-outcome (PICO) format. We chose ChatGPT-4 due to its popularity and recognition as a leading LLM. We generated prompts based on published recommendations to optimize agreement with traditional methods, and we present findings from the best prompts. We quantitatively determined differences between articles selected by traditional vs. LLM methods. We also qualitatively assessed the differences for two PICO questions based on perspectives from the lead guideline methodologist (YK) and follow-up prompts to ChatGPT.

Results: ChatGPT-4's performance differed substantially from traditional methods at each step of evidence synthesis. For example, in PICO question 1, ChatGPT-4 selected two of the 35 articles chosen by humans during the database search step, two of the five articles from the title and abstract screening step, and three of the four articles from the full-text screening step (Figure). The best prompts often required a conversational approach explaining why the search was being conducted (rather than including the exact PICO question) and included a request to select a specific number of articles. Across evidence synthesis steps, ChatGPT-4 identified articles that were not relevant, often selecting manuscripts that were not primary research and/or were not appropriate based on the criteria provided. In one instance, ChatGPT-4 produced a hallucinated article.

Conclusions: ChatGPT-4 performed poorly in selecting articles for guideline development. The study highlights the need for human-in-the-loop strategies when using LLMs for tasks like article selection in guideline development.
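The agreement figures reported above can be expressed as simple set-overlap metrics between the human-selected and LLM-selected article sets. The following is a minimal illustrative sketch, not code from the study; the article IDs and the `selection_overlap` helper are invented for illustration, and only the counts (e.g., 3 of 4 articles at full-text screening for PICO question 1) come from the abstract.

```python
def selection_overlap(human: set[str], llm: set[str]) -> dict[str, float]:
    """Compare an LLM's article selections against human selections.

    Returns recall (share of human-selected articles the LLM found),
    precision (share of LLM selections that humans also chose), and
    Jaccard similarity of the two sets.
    """
    shared = human & llm
    union = human | llm
    return {
        "recall": len(shared) / len(human) if human else 0.0,
        "precision": len(shared) / len(llm) if llm else 0.0,
        "jaccard": len(shared) / len(union) if union else 0.0,
    }

# Hypothetical IDs mirroring the full-text screening step of PICO
# question 1, where ChatGPT-4 found 3 of the 4 human-selected articles:
human_set = {"A1", "A2", "A3", "A4"}
llm_set = {"A1", "A2", "A3"}
print(selection_overlap(human_set, llm_set))  # recall = 0.75
```

At the database search step the same calculation for PICO question 1 (2 of 35 articles) would give a recall of roughly 0.06, which is the quantitative sense in which the abstract's "performed poorly" can be read.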
