Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

The Alberta Quality Assessment Tool: Risk of Bias (AQAT:RoB) for the Evaluation of Medical Large Language Model Question-Answer Studies: Development and Pilot Validation

2026·1 Zitationen·Journal of Medical Internet ResearchOpen Access

Volltext beim Verlag öffnen

Zitationen

Autoren

2026

Jahr

Abstract

BACKGROUND: Despite the transformative potential of Large Language Models (LLMs) in healthcare, the rapid development of these tools has outpaced their rigorous evaluation. While AI-specific reporting guidelines have been developed to address standardized reporting of AI studies, there is currently no specific tool available for risk of bias assessment of LLM-QA studies. Existing risk-of-bias tools for medical research are not well-suited for the unique challenges of evaluating LLM Question-Answer (LLM-QA) studies, which creates a critical gap in assessing their safety and effectiveness. OBJECTIVE: To develop the Alberta Risk of Bias Assessment Tool for LLM-QA studies (AQAT:RoB) to systematically evaluate validity and risk of bias of LLM-QA studies. METHODS: We conducted two literature reviews. The first was on quality assessment tools for LLM-QA studies and the second was on LLM-QA studies, which informed the first draft of AQAT:RoB. The draft AQAT:ROB was further refined through a pre-specified iterative process of modified-Delphi, consensus meeting, and validation. The first Delphi process occurred between May 1 and May 20, 2025, and the first consensus meeting was held on May 22. The first round of validation was completed by 4 evaluators, who were not part of the consensus meeting, on 16 randomly selected studies. As this first round of validation surpassed our a priori threshold of ≥80% agreement and ≥Cohen's Kappa of 0.61 between evaluators, no further rounds of development and validation were undertaken. A second Delphi process occurred between February 20 and Feb 23, 2026 to vote on post-pilot changes in response to peer review. RESULTS: The AQAT:RoB consists of five high level domains (Questions, Reference Answers, LLM Answers, Evaluators, Outcomes, Reporting, and Other). These domains are sub-divided into 9 sub-domains. Each sub-domain includes at least one "Support for Judgement" and at least one "Type of Bias" and are to be rated "low", "high" or "unclear" for risk of bias. Pilot evaluation was completed by internal validators who were not part of the consensus discussion and were asked to complete the AQAT:RoB form on each assigned study. Each of the 16 studies were evaluated by two evaluators independently. Pilot validation showed a percent agreement of 86.1% and a Cohen's Kappa of 0.70 between assessors. CONCLUSIONS: The AQAT:RoB demonstrates promising initial reliability for assessing the validity/risk of bias of LLM-QA studies. The tool will benefit from future refinements, external validation, and periodic updates to keep pace with the evolving technology.

Autoren

Institutionen

Themen

Artificial Intelligence in Healthcare and EducationClinical Reasoning and Diagnostic SkillsPsychometric Methodologies and Testing

Volltext beim Verlag öffnen

The Alberta Quality Assessment Tool: Risk of Bias (AQAT:RoB) for the Evaluation of Medical Large Language Model Question-Answer Studies: Development and Pilot Validation

Abstract

Ähnliche Arbeiten

Autoren

Institutionen

Themen