OpenAlex · Updated hourly · Last updated: 08.04.2026, 08:52

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Assessing the accuracy of ChatGPT responses to guideline-based inquiries: A cross-sectional study

2025 · 1 citation · Canadian Journal of General Internal Medicine · Open Access

Citations: 1 · Authors: 8 · Year: 2025

Abstract

Introduction: Physicians treating multisystem diseases face challenges in consulting expanding, complex clinical guidelines. Large language models like ChatGPT may help consolidate this information, providing quick access to guideline recommendations. The objective of this study was to assess the accuracy of ChatGPT 3.5 and 4o responses to questions based on specialist-level guideline recommendations.

Methods: A framework was developed for authors to pose questions, each based on a guideline recommendation, to ChatGPT. A validation tool graded responses as concordant, partially concordant, or discordant with the guideline recommendation. A total of 581 recommendations from three guidelines were analyzed. The primary outcome was overall accuracy. Subgroup analyses assessed accuracy by number of criteria, strength of evidence, and type of recommendation.

Results: For ChatGPT 3.5, 347 recommendations were concordant (59.72%), 128 partially concordant (22.03%), and 106 discordant (18.24%). Questions seeking a single response (Z = 5.289, p < .001) and questions based on recommendations with strong levels of evidence (OR 2.23, p = .001) generated higher levels of concordance. For ChatGPT 4o, 474 recommendations were concordant (81.6%), 82 partially concordant (14.1%), and 25 discordant (4.3%). Mean concordance ratings for single questions were significantly higher than for multi-part questions (Z = 3.08, p = .002). Mean concordance ratings for ChatGPT 4o were substantially higher than for ChatGPT 3.5 (Z = 8.66, p < .00001).

Discussion: ChatGPT 3.5 had a moderate level of accuracy, with persistent weaknesses in answering multi-part questions and questions backed by weaker evidence. ChatGPT 4o performed substantially better than ChatGPT 3.5, though both models were vulnerable to hallucination.


Topics

Artificial Intelligence in Healthcare and Education · Healthcare cost, quality, practices · Clinical Reasoning and Diagnostic Skills