This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Large language model (LLM)-based agentic artificial intelligence tool streamlines research processes in biomarker studies: a proof of concept
Citations: 0
Authors: 10
Year: 2026
Abstract
Background/Introduction: AI tools utilizing large language models (LLMs) can significantly accelerate literature reviews by automating repetitive tasks and analyses. However, initial evaluations have been limited to title and abstract screening.

Purpose: This study evaluates the full-text screening performance of an agentic AI tool leveraging LLM technology to accurately identify relevant publications for a systematic review of circulating biomarkers in heart failure with reduced ejection fraction (HFrEF).

Methods: Within the iCARE4CVD public-private partnership, we developed a knowledge model combined with an agentic AI tool that screened the full text of 5523 publications against predefined selection criteria. The inclusion and exclusion criteria were decomposed into 136 specific tasks, each addressed by an individual LLM agent using a Retrieval-Augmented Generation (RAG) approach. This process involved segmenting each full text into manageable chunks, vectorizing them, and using RAG to identify the most relevant segments for analysis by the LLM agents (a minimal sketch of this step is given after the abstract). Results were aggregated, and unusual responses were automatically validated by a critique LLM agent; these responses then informed the final inclusion or exclusion decisions. We evaluated five LLMs on privacy, openness, and effectiveness (precision and recall) to select the most accurate model. The AI tool was trained and validated against human-reviewed papers, arbitrated by a senior reviewer, with 197 papers used for training and 97 for validation (Fig 1). Performance metrics included sensitivity, specificity, false positive and false negative rates, and Cohen's κ to measure agreement between the LLM and human reviewers (a worked example of these definitions also follows the abstract).

Results: Our findings demonstrate significant improvement in sensitivity and specificity across the training (batches 1 and 2) and validation phases. In batch 1, sensitivity was 77.8% and specificity was 62.5%. These metrics improved in batch 2 to 81% and 79%, respectively. Subsequently, the model settings were updated to prioritize minimizing the false negative rate. In the validation phase, the tool reached a sensitivity of 91%, a specificity of 53%, a false positive rate of 46.8%, and a false negative rate of 8.6% (Fig 2). Notably, the AI tool showed greater inter-rater consistency (κ = 0.38) than the human reviewers (κ = 0.23), suggesting that the tool provided more reliable results when screening for relevant publications.

Conclusion(s): Our study demonstrated the potential of the AI tool to reduce labor-intensive effort while maintaining accuracy in literature reviews. Its reliability is further reinforced by greater inter-rater agreement compared to human reviewers. These findings suggest that LLM-based AI tools can significantly accelerate systematic reviews and enhance research efficiency in medicine.
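The tool's implementation is not part of this abstract; the sketch below only illustrates the RAG step the Methods describe, under stated assumptions: the full text is split into chunks, chunks and criterion are vectorized (here with a toy bag-of-random-vectors embedding standing in for a real embedding model), the top-scoring chunks are retrieved, and a stubbed LLM-agent call makes the per-criterion judgment. All function names, the chunk size, and the stub logic are hypothetical placeholders, not the authors' code.

```python
import numpy as np

def chunk_text(full_text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split a full-text article into overlapping character chunks."""
    step = size - overlap
    return [full_text[i:i + size]
            for i in range(0, max(len(full_text) - overlap, 1), step)]

def embed(texts: list[str]) -> np.ndarray:
    """Toy stand-in for an embedding model: random token vectors, summed per
    text and L2-normalised. A real pipeline would call an embedding model."""
    rng = np.random.default_rng(0)
    vocab: dict[str, np.ndarray] = {}
    vecs = np.zeros((len(texts), 64))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            if tok not in vocab:
                vocab[tok] = rng.normal(size=64)
            vecs[i] += vocab[tok]
    return vecs / np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-9)

def llm_agent_decision(criterion: str, evidence: list[str]) -> bool:
    """Hypothetical stub for one LLM agent: a real implementation would prompt
    an LLM with the criterion plus retrieved chunks and parse a yes/no answer."""
    return any(criterion.split()[0].lower() in chunk.lower() for chunk in evidence)

def screen_criterion(full_text: str, criterion: str, top_k: int = 3) -> bool:
    """RAG step: retrieve the chunks most similar to the criterion and hand
    only those to the LLM agent."""
    chunks = chunk_text(full_text)
    vecs = embed([criterion] + chunks)   # shared vocabulary for query and chunks
    scores = vecs[1:] @ vecs[0]          # cosine similarity (unit-norm vectors)
    top = [chunks[i] for i in np.argsort(scores)[::-1][:top_k]]
    return llm_agent_decision(criterion, top)
```

In the study's setup, 136 such criterion tasks would run per paper, their answers aggregated, and a critique agent would re-check unusual responses before the final include/exclude decision.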
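All of the reported metrics derive from a 2×2 confusion matrix of tool decisions against the human (reference) labels. The function below is a small worked example of those definitions, including Cohen's κ as observed agreement corrected for chance agreement; the counts passed in are invented for illustration and are not the study's data.

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Screening metrics from a 2x2 confusion matrix, with the human reviewer
    as reference: tp/fn = relevant papers the tool included/missed,
    tn/fp = irrelevant papers the tool excluded/kept."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)   # false positive rate = 1 - specificity
    fnr = fn / (fn + tp)   # false negative rate = 1 - sensitivity
    # Cohen's kappa: (observed - chance) agreement, chance from the marginals
    p_o = (tp + tn) / n
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (p_o - p_e) / (1 - p_e)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "fpr": fpr, "fnr": fnr, "kappa": kappa}

# Invented counts for illustration only (not the study's data):
print(screening_metrics(tp=40, fp=10, fn=5, tn=45))
```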
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,214 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,071 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,429 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,418 citations