OpenAlex · Aktualisierung stündlich · Letzte Aktualisierung: 13.03.2026, 19:17

Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

Assessing risk of bias of cohort studies with large language models

2025·1 Zitationen·Research Synthesis MethodsOpen Access
Volltext beim Verlag öffnen

1

Zitationen

11

Autoren

2025

Jahr

Abstract

This study aims to explore the feasibility and accuracy of utilizing large language models (LLMs) to assess the risk of bias (ROB) in cohort studies. We conducted a pilot and feasibility study in 30 cohort studies randomly selected from reference lists of published Cochrane reviews. We developed a structured prompt to guide the ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3 to assess the ROB of each cohort twice. We used the ROB results assessed by three evidence-based medicine experts as the gold standard, and then we evaluated the accuracy of LLMs by calculating the correct assessment rate, sensitivity, specificity, and <i>F</i>1 scores for overall and item-specific levels. The consistency of the overall and item-specific assessment results was evaluated using Cohen's kappa (κ) and prevalence-adjusted bias-adjusted kappa. Efficiency was estimated by the mean assessment time required. This study assessed three LLMs (ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3) and revealed distinct performance across eight assessment items. Overall accuracy was comparable (80.8%-83.3%). Moonshot-v1-128k showed superior sensitivity in population selection (0.92 versus ChatGPT-4o's 0.55, <i>P</i> < 0.001). In terms of <i>F</i>1 scores, Moonshot-v1-128k led in population selection (<i>F</i> = 0.80 versus ChatGPT-4o's 0.67, <i>P</i> = 0.004). ChatGPT-4o demonstrated the highest consistency (mean κ = 96.5%), with perfect agreement (100%) in outcome confidence. ChatGPT-4o was 97.3% faster per article (32.8 seconds versus 20 minutes manually) and outperformed Moonshot-v1-128k and DeepSeek-V3 by 47-50% in processing speed. The efficient and accurate assessment of ROB in cohort studies by ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3 highlights the potential of LLMs to enhance the systematic review process.

Ähnliche Arbeiten