This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Prompt engineering of large language models for paper screening in medical meta-analyses and systematic reviews: A prospective comparative study
Citations: 0
Authors: 11
Year: 2026
Abstract
Interest in large language models (LLMs) as a tool for meta-analyses and systematic reviews (MA/SRs) is growing. We prospectively developed 515 unique prompts across predefined screening-related categories and tested them with open-access LLMs (Llama, Mistral) against four gold-standard MA/SRs from different medical fields, all published after the LLMs' training cut-offs, using a Python-based pipeline. Heterogeneity between prompts was quantified, and the hypothetical workload/cost reduction with top-performing prompts was calculated. Across 12,360 pipeline runs, LLMs versus MA/SRs reached average recall/sensitivity = 83.6 ± 17.0%, precision = 18.5 ± 15.6%, specificity = 36.6 ± 23.7%, F1-score = 27.6 ± 17.2%, and accuracy = 61.1 ± 11.0%. F1-scores were significantly higher when prompts focused on methods (0.78 ± 0.40%), explicitly mentioned MA/SR screening (0.81 ± 0.37%), included the comparison MA/SR's title (5.64 ± 0.37%) or selection criteria (8.05 ± 0.68%), and with more LLM parameters (70b = 4.48 ± 0.31%, 123b = 7.77 ± 0.31%), but lower when screening abstracts instead of titles (-3.67 ± 0.28%). In LLM-based preselection, the top-performing F1-score prompts (recall/sensitivity = 72.2%, specificity = 66.1%, precision = 28.6%) would reduce screening demands by 34.5%-37.5%, saving 8.4-8.8 weeks of work and 17,592-18,552 in costs. Contrasting with the F1-score results, recall/sensitivity increased with less MA/SR information, highlighting a recall/sensitivity-precision/specificity trade-off: F1-score increased with detailed MA/SR information, while recall/sensitivity increased with shorter, zero-shot prompts. We provide the first prospectively assessed prompt-engineering framework for early-stage LLM-based paper screening across medical fields. The publicly available Python pipeline and full prompt list used here support further development of LLM-based evidence synthesis.
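For readers unfamiliar with the reported metrics, the sketch below shows how recall/sensitivity, precision, specificity, F1-score, accuracy, and the hypothetical workload reduction can be computed from binary include/exclude screening decisions. This is a minimal illustration, not the authors' published pipeline; function and variable names are assumptions for this example.

# Minimal sketch (illustrative, not the authors' code): metrics for comparing
# LLM include/exclude decisions against a gold-standard MA/SR.

def screening_metrics(llm_decisions, gold_decisions):
    """Both arguments are sequences of booleans: True = include, False = exclude."""
    pairs = list(zip(llm_decisions, gold_decisions))
    tp = sum(l and g for l, g in pairs)          # correctly included
    fp = sum(l and not g for l, g in pairs)      # wrongly included
    tn = sum(not l and not g for l, g in pairs)  # correctly excluded
    fn = sum(not l and g for l, g in pairs)      # wrongly excluded

    recall = tp / (tp + fn) if tp + fn else 0.0        # a.k.a. sensitivity
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"recall": recall, "precision": precision,
            "specificity": specificity, "f1": f1, "accuracy": accuracy}

def workload_reduction(llm_decisions):
    """Fraction of papers the LLM excludes in preselection: only LLM-included
    papers would go on to human screening (assumes LLM exclusions are trusted)."""
    return sum(not d for d in llm_decisions) / len(llm_decisions)

Under this reading, the reported 34.5%-37.5% screening-demand reduction corresponds to the share of candidate papers the top-performing prompts would exclude before human review.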
Similar works
The PRISMA 2020 statement: an updated guideline for reporting systematic reviews
2021 · 88,709 citations
Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement
2009 · 82,990 citations
The Measurement of Observer Agreement for Categorical Data
1977 · 77,651 citations
Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement
2009 · 63,309 citations
Measuring inconsistency in meta-analyses
2003 · 61,972 citations