This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Assessing the accuracy of ChatGPT responses to guideline-based inquiries: A cross-sectional study
Citations: 1
Authors: 8
Year: 2025
Abstract
Introduction: Physicians treating multisystem diseases face challenges in consulting expanding, complex clinical guidelines. Large language models such as ChatGPT may help consolidate this information, providing quick access to guideline recommendations. The objective of this study was to assess the accuracy of ChatGPT 3.5 and 4o responses to questions based on specialist-level guideline recommendations. Methods: A framework was developed for authors to pose questions, each based on a guideline recommendation, to ChatGPT. A validation tool graded responses as concordant, partially concordant, or discordant with the guideline recommendation. A total of 581 recommendations from three guidelines were analyzed. The primary outcome was overall accuracy. Subgroup analyses assessed accuracy by number of criteria, strength of evidence, and type of recommendation. Results: For ChatGPT 3.5, 347 recommendations were concordant (59.72%), 128 partially concordant (22.03%), and 106 discordant (18.24%). Questions seeking a single response (Z = 5.289, p < .001) and questions based on recommendations with strong levels of evidence (OR 2.23, p = .001) generated higher levels of concordance. For ChatGPT 4o, 474 recommendations were concordant (81.6%), 82 partially concordant (14.1%), and 25 discordant (4.3%). Mean concordance ratings for single questions were significantly higher than for multipart questions (Z = 3.08, p = .002). Mean concordance ratings for ChatGPT 4o were substantially higher than for ChatGPT 3.5 (Z = 8.66, p < .00001). Discussion: ChatGPT 3.5 had a moderate level of accuracy, with persistent weaknesses in answering multipart questions and those backed by weaker evidence. ChatGPT 4o performed substantially better than ChatGPT 3.5, though both models were vulnerable to hallucination.
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,402 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,270 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,702 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,507 citations