This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
Comprehensive analysis of the performance of GPT-3.5 and GPT-4 on the American Urological Association self-assessment study program exams from 2012-2023.
Citations: 1
Authors: 2
Year: 2023
Abstract
INTRODUCTION: Artificial intelligence (AI) applications, specifically generative pre-trained transformers, have shown potential in medical education and board-style examinations. To assess this capability, we conducted a study comparing the performance of GPT-3.5 and GPT-4 on the American Urological Association (AUA) self-assessment study program (SASP) exams from 2012-2023.

METHODS: We used a standardized prompt to administer questions from the AUA SASP exams spanning 2012-2023, totalling 1679 questions. The performance of the two AI models, GPT-3.5 and GPT-4, was evaluated based on the number of questions answered correctly. Statistical analysis was performed using Fisher's exact test and independent sample t-tests to compare the performance of GPT-4 to that of GPT-3.5 across test years and urology topic areas. Percentile scores were not calculable; however, a score of 50% is required to acquire CME credits on AUA SASP exams.

RESULTS: The analysis showed significantly superior performance by GPT-4, which scored above 50% across all exam years except 2018, with scores ranging from 48-64%. In contrast, GPT-3.5 consistently scored below this threshold, with scores ranging from 26-38%. The total combined score for GPT-4 was 55%, significantly higher than the 33% achieved by GPT-3.5 (odds ratio [OR] 2.5, 95% confidence interval [CI] 2.2-2.9, p<0.001). GPT-4 significantly outperformed GPT-3.5 across AUA SASP test years from 2012-2023 (mean difference 23, t(22)=14, 95% CI 19-26, p<0.001), as well as across urology topic areas (mean difference 21, t(52)=5.5, 95% CI 13-29, p<0.001).

CONCLUSIONS: GPT-4 scored significantly higher than GPT-3.5 on the AUA SASP exams in overall performance, across all test years, and in various urology topic areas. This suggests improvement in evolving AI language models in answering clinical urology questions; however, certain aspects of medical knowledge and clinical reasoning remain challenging for AI language models.
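The reported odds ratio can be sanity-checked from the abstract's summary figures alone. The sketch below is a minimal reconstruction, assuming per-model correct counts derived by rounding the reported percentages (55% and 33% of 1679 questions); these counts are approximations, not the study's raw data.

```python
# Reconstruct the reported odds ratio from the abstract's summary figures.
# Assumed counts (derived from 55% and 33% of 1679), not the study's raw data.

TOTAL = 1679
gpt4_correct = round(0.55 * TOTAL)   # ≈ 923, assumed from the 55% total score
gpt35_correct = round(0.33 * TOTAL)  # ≈ 554, assumed from the 33% total score

# 2x2 contingency table: rows = model, columns = correct / incorrect
a, b = gpt4_correct, TOTAL - gpt4_correct
c, d = gpt35_correct, TOTAL - gpt35_correct

# Odds ratio = (odds of GPT-4 answering correctly) / (odds of GPT-3.5)
odds_ratio = (a * d) / (b * c)
print(f"Odds ratio: {odds_ratio:.1f}")  # → 2.5, matching the reported OR
```

The same table could be passed to `scipy.stats.fisher_exact` to recover a p-value as well; the pure-arithmetic version above is enough to confirm that the reported OR of 2.5 is consistent with the stated percentages.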
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,549 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,443 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,941 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,792 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations