This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Machine Learning Tools To (Semi-) Automate Evidence Synthesis
Citations: 0
Authors: 12
Year: 2025
Abstract
Introduction. Tools that leverage machine learning, a subset of artificial intelligence, are becoming increasingly important for conducting evidence synthesis as the volume and complexity of primary literature expand exponentially. In response, we have created a living rapid review and evidence map to understand existing research and identify available tools.
Methods. We searched PubMed, Embase, and the ACM Digital Library from January 1, 2021, to April 3, 2024, for comparative studies, and identified older studies using the reference lists of existing evidence synthesis products (ESPs). We plan to update searches every 6 months. We included evaluations of machine learning or artificial intelligence tools to automate or semi-automate any stage of systematic review production. Two reviewers conducted title and abstract screening independently, with disagreements resolved through discussion or adjudication by a third reviewer. A single reviewer performed full-text screening and data extraction. We did not assess the quality of individual studies or the strength of evidence across studies. Extracted data included key characteristics of the tools (e.g., type of automation method, systematic review tasks automated), evaluation methods, and performance results (e.g., recall, measures of workload, accuracy, and the authors’ conclusions). The protocol was prospectively registered on the AHRQ website (https://effectivehealthcare.ahrq.gov/products/tools/protocol).
Results. We included 56 studies, which evaluated the performance of tools primarily relative to standard human processes across various systematic review tasks. For search-related tools (7 studies), recall (the percent of relevant citations correctly identified) ranged from 0 to 97 percent (median 26%) compared to human-developed search strategies, while precision (the percent of identified citations that are relevant) ranged from 0 to 13.4 percent (median 4.3%). Tools designed to identify randomized controlled trials (RCTs) (6 studies) had recalls between 96 and 100 percent (median 98.5%), with precision ranging from 8 to 92 percent (median 44%), compared to either manual identification or PubMed’s “publication type” tags. Abstract screening tools (22 studies) had a median recall of 93 percent (range 1–100%) with human screening as the standard, while median burden reduction was 50 percent (range 1–93%), and median work saved over sampling to achieve 95 percent recall (WSS@95) was 54 percent (range 33–90%). Data extraction tools (9 studies) showed highly variable performance, with the percentage of data correctly extracted compared to manual extraction ranging from 0 to 99 percent (median 10%). Finally, tools used for risk of bias assessment (7 studies) showed modest agreement with human reviewers, with Cohen’s weighted kappa ranging from 0.11 to 0.48 (median 0.16).
Discussion. Certain tools, particularly those for automatically identifying RCTs and prioritizing relevant abstracts in screening, show a high level of recall and precision, suggesting they are nearing widespread use with human oversight. However, other tools, such as those for searching and data extraction, show highly variable performance and are not yet reliable enough for semi-automation. This work revealed the importance of developing standardized evaluation frameworks for assessing the performance of machine learning and artificial intelligence tools in systematic review tasks.
Limitations. We did not assess the risk of bias or methodological quality of the included studies, which may affect the reliability and comparability of the reported performance outcomes. Additionally, the tools were evaluated in a variety of settings, tasks, and review questions, which introduces heterogeneity that makes direct comparisons across tools challenging. Lastly, the rapidly evolving nature of machine learning technologies means that our findings may quickly become outdated. Therefore, we have planned ongoing updates every 6 months.
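For readers unfamiliar with the performance metrics cited above, the following minimal Python sketch illustrates how recall, precision, WSS@95, and Cohen's weighted kappa are conventionally computed. This is an illustration, not the review's own code: the confusion counts and risk-of-bias ratings are hypothetical toy values, and the weighted kappa uses scikit-learn's cohen_kappa_score.

from sklearn.metrics import cohen_kappa_score

def recall(tp, fn):
    # Percent of relevant citations correctly identified.
    return tp / (tp + fn)

def precision(tp, fp):
    # Percent of identified citations that are relevant.
    return tp / (tp + fp)

def wss_at_95(tn, fn, n_total):
    # Work saved over sampling at 95% recall: the fraction of screening
    # effort avoided compared with random screening that reaches the
    # same recall level (hence the 0.05 penalty for the 5% missed).
    return (tn + fn) / n_total - 0.05

# Hypothetical screening outcome: a tool flags 400 of 1,000 abstracts,
# capturing 95 of the 100 truly relevant ones.
tp, fp, fn, tn = 95, 305, 5, 595
print(round(recall(tp, fn), 2))           # 0.95
print(round(precision(tp, fp), 2))        # 0.24
print(round(wss_at_95(tn, fn, 1000), 2))  # 0.55

# Hypothetical risk-of-bias ratings (0 = low, 1 = some concerns,
# 2 = high) from a human reviewer and a tool; linear weights penalize
# larger disagreements more than adjacent-category disagreements.
human = [0, 1, 2, 1, 0, 2, 1, 0]
tool = [0, 1, 1, 1, 0, 2, 2, 1]
print(cohen_kappa_score(human, tool, weights="linear"))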
Similar Works
The PRISMA 2020 statement: an updated guideline for reporting systematic reviews
2021 · 85,301 citations
Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement
2009 · 82,806 citations
The Measurement of Observer Agreement for Categorical Data
1977 · 76,958 citations
Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement
2009 · 62,803 citations
Measuring inconsistency in meta-analyses
2003 · 61,524 citations