This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
The Alberta Quality Assessment Tool: Risk of Bias (AQAT:RoB) for the Evaluation of Medical Large Language Model Question-Answer Studies: Development and Pilot Validation
Citations: 1
Authors: 25
Year: 2026
Abstract
BACKGROUND: Despite the transformative potential of Large Language Models (LLMs) in healthcare, the rapid development of these tools has outpaced their rigorous evaluation. While AI-specific reporting guidelines have been developed to standardize the reporting of AI studies, no specific tool is currently available for risk-of-bias assessment of LLM Question-Answer (LLM-QA) studies. Existing risk-of-bias tools for medical research are not well suited to the unique challenges of evaluating LLM-QA studies, which creates a critical gap in assessing their safety and effectiveness. OBJECTIVE: To develop the Alberta Quality Assessment Tool: Risk of Bias (AQAT:RoB) for LLM-QA studies to systematically evaluate the validity and risk of bias of LLM-QA studies. METHODS: We conducted two literature reviews: the first on quality assessment tools for LLM-QA studies and the second on LLM-QA studies themselves, which informed the first draft of the AQAT:RoB. The draft AQAT:RoB was further refined through a pre-specified iterative process of modified Delphi rounds, a consensus meeting, and validation. The first Delphi process occurred between May 1 and May 20, 2025, and the first consensus meeting was held on May 22. The first round of validation was completed by 4 evaluators, who were not part of the consensus meeting, on 16 randomly selected studies. As this first round of validation surpassed our a priori thresholds of ≥80% agreement and Cohen's Kappa ≥0.61 between evaluators, no further rounds of development and validation were undertaken. A second Delphi process occurred between February 20 and February 23, 2026 to vote on post-pilot changes in response to peer review. RESULTS: The AQAT:RoB consists of seven high-level domains (Questions, Reference Answers, LLM Answers, Evaluators, Outcomes, Reporting, and Other).
Each sub-domain includes at least one "Support for Judgement" and at least one "Type of Bias" and is rated "low", "high", or "unclear" for risk of bias. The pilot evaluation was completed by internal validators who were not part of the consensus discussion and were asked to complete the AQAT:RoB form for each assigned study. Each of the 16 studies was evaluated independently by two evaluators. Pilot validation showed a percent agreement of 86.1% and a Cohen's Kappa of 0.70 between assessors. CONCLUSIONS: The AQAT:RoB demonstrates promising initial reliability for assessing the validity and risk of bias of LLM-QA studies. The tool will benefit from future refinements, external validation, and periodic updates to keep pace with the evolving technology.
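The agreement statistics reported above (percent agreement and Cohen's Kappa) can be computed directly from paired ratings. The sketch below is illustrative only and uses hypothetical ratings, not the study's data; Cohen's Kappa corrects observed agreement for the agreement expected by chance given each rater's marginal label frequencies.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items rated identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal rate per label.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical sub-domain ratings from two evaluators (not study data)
a = ["low", "low", "high", "unclear", "low", "high", "low", "low"]
b = ["low", "low", "high", "low",     "low", "high", "high", "low"]
percent_agreement = sum(x == y for x, y in zip(a, b)) / len(a)  # 0.75
kappa = cohens_kappa(a, b)
```

By the common Landis-Koch convention, Kappa values between 0.61 and 0.80 (the study's a priori threshold and its observed 0.70) indicate substantial agreement.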
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,549 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,443 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,941 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,792 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Authors
- Carrie Ye
- J. Ross Mitchell
- Daniel C. Baumgart
- Zechen Ma
- Angela Lim Fung
- Daniela Garcia Orellana
- Juel Chowdhury
- Abass Abdullah
- Steven J. Katz
- Jacob L. Jaremko
- Pierre Boulanger
- Claire E.H. Barber
- Gillian Lemermeyer
- Hosna Jabbari
- Lili Mou
- Maryam Mirzaei
- Mary Waithera Beckett Githumbi
- Puneeta Tandon
- Randy Goebel
- Rhys H. Clark
- Whitney Hung
- Marjan Abbasi
- Farhad Maleki
- Scott Klarenbach
- Mohamed Abdalla