
This is an overview page with metadata for this scientific work. The full article is available from the publisher.

Large language models show promising performance for some systematic review tasks but call for cautious implementation: a systematic review

2026 · 1 citation · Journal of Clinical Epidemiology · Open Access

Citations: 1 · Authors: 10 · Year: 2026

Abstract

OBJECTIVES: With the exponential growth of biomedical literature, conducting systematic reviews is becoming increasingly burdensome. We aimed to evaluate the performance of large language models (LLMs) in automating some or all steps of systematic reviews and meta-analyses.

STUDY DESIGN AND SETTING: In this systematic review, we searched PubMed, Embase, the Cochrane Library, and preprint platforms up to January 14, 2025. We included any study assessing the performance of LLMs (eg, generative pre-trained transformer [GPT], Claude, Mistral) in any step of the systematic review process. Pairs of reviewers independently extracted data and assessed risk of bias. We summarized agreement between LLMs and human reviewers as medians (interquartile ranges [IQRs]) of positive percent agreement (PPA) and negative percent agreement (NPA), analogous to sensitivity and specificity, respectively.

RESULTS: From 3889 unique references, we included 63 studies, of which 52 reported performance metrics, for a total of 148 LLM performance assessments. Most assessments concerned GPT models (n = 114, 77%). The most frequently evaluated tasks were title and abstract screening (n = 78, 53%), data extraction (n = 23, 16%), and full-text screening (n = 20, 14%). For title and abstract screening, the overall median PPA was 0.92 (IQR 0.69-0.98) and the median NPA was 0.89 (IQR 0.72-0.95). For full-text screening, the overall median PPA was 0.93 (IQR 0.87-1.00) and the median NPA was 0.92 (IQR 0.78-0.97). Late-generation LLMs released after GPT-4 seemed to perform better than earlier models. For other tasks, authors reported overall good performance, but variability in performance metrics precluded a complete quantitative synthesis. Global accuracy for data extraction tasks ranged from 0.36 to 1.00, with a median of 0.95 (IQR 0.91-0.97, n = 11). For the "risk of bias assessment" task, accuracy ranged from 0.44 to 0.90 (median 0.62, IQR 0.53-0.76, n = 6).

CONCLUSIONS: The performance of LLMs, particularly newer generations, shows promise for automating some repetitive steps of systematic reviews, such as screening. However, their successful integration will require appropriate safeguards and careful implementation.

PLAIN LANGUAGE SUMMARY: Systematic reviews are one of the most reliable ways to answer medical and public health questions. They bring together all available studies on a topic and help clinicians and policymakers make informed decisions. However, producing a high-quality systematic review takes a lot of time and effort: whole teams of researchers spend months screening thousands of articles, extracting data, and double-checking results. With a little more than a million new publications every year, keeping reviews up to date is becoming increasingly difficult. LLMs, such as ChatGPT, may help reduce this workload. These tools can read and summarize text and might assist with repetitive tasks like selecting relevant studies or extracting information from articles, but it is still unclear how reliable they are for research purposes. This is the first systematic review to assess the performance of LLMs in facilitating systematic reviews. We reviewed all studies that tested LLMs in the different steps of systematic reviews and found 63 studies evaluating how well these tools performed compared with human reviewers. Overall, LLMs showed good agreement with humans for tasks such as screening titles and abstracts and screening full-text articles. Newer models seemed to perform better than older ones. However, performance was more variable for complex tasks that require interpretation, such as extracting detailed data or assessing methodological quality. Our findings suggest that LLMs could help researchers work faster and make systematic reviews more efficient, but they are not ready to replace human judgment. These tools can make mistakes, produce inconsistent results, or generate inaccurate information if not carefully supervised. In practice, LLMs should be used as assistants rather than substitutes. With proper safeguards, transparent reporting, and human oversight, they may become valuable tools to support evidence-based healthcare and help keep research up to date.
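To make the agreement metrics in the abstract concrete, the sketch below is a hedged illustration with made-up data, not anything taken from the paper. It shows how PPA and NPA can be computed from paired LLM/human screening decisions and how a set of such values reduces to a median with IQR.

```python
# Hedged sketch (not from the paper): computing positive/negative percent
# agreement (PPA/NPA) between LLM and human screening decisions, and
# summarizing such values as median (IQR). All data here are hypothetical.
from statistics import median, quantiles

def percent_agreements(llm: list[bool], human: list[bool]) -> tuple[float, float]:
    """PPA = share of human 'include' decisions the LLM also included
    (analogous to sensitivity); NPA = share of human 'exclude' decisions
    the LLM also excluded (analogous to specificity)."""
    agree_pos = sum(1 for l, h in zip(llm, human) if l and h)
    agree_neg = sum(1 for l, h in zip(llm, human) if not l and not h)
    n_pos = sum(human)
    n_neg = len(human) - n_pos
    return agree_pos / n_pos, agree_neg / n_neg

# Hypothetical screening decisions: True = include, False = exclude
human = [True, True, False, False, True, False, False, True]
llm = [True, False, False, False, True, False, True, True]
ppa, npa = percent_agreements(llm, human)
print(f"PPA = {ppa:.2f}, NPA = {npa:.2f}")  # PPA = 0.75, NPA = 0.75

# Summarizing PPA across several hypothetical assessments as median (IQR)
ppas = [0.69, 0.75, 0.85, 0.92, 0.98]
q1, _, q3 = quantiles(ppas, n=4)  # quartile cut points
print(f"median PPA = {median(ppas):.2f} (IQR {q1:.2f}-{q3:.2f})")
```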
