This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
Comprehensive analysis of the performance of GPT-3.5 and GPT-4 on the American Urological Association self-assessment study program exams from 2012-2023.
Citations: 1
Authors: 2
Year: 2023
Abstract
INTRODUCTION: Artificial intelligence (AI) applications, specifically generative pre-trained transformers, have shown potential in medical education and board-style examinations. To assess this capability, we conducted a study comparing the performance of GPT-3.5 and GPT-4 on the American Urological Association (AUA) self-assessment study program (SASP) exams from 2012-2023.

METHODS: We used a standardized prompt to administer questions from the AUA SASP exams spanning 2012-2023, totalling 1679 questions. The performance of the two AI models, GPT-3.5 and GPT-4, was evaluated based on the number of questions answered correctly. Statistical analysis was performed using Fisher's exact test and independent sample t-tests to compare the performance of GPT-4 to that of GPT-3.5 across test years and urology topic areas. Percentile scores were not calculable; however, a score of 50% is required to acquire CME credits on AUA SASP exams.

RESULTS: The analysis showed significantly superior performance by GPT-4, which scored above 50% across all exam years except 2018, with scores ranging from 48-64%. In contrast, GPT-3.5 consistently scored below this threshold, with scores ranging from 26-38%. The total combined score for GPT-4 was 55%, significantly higher than the 33% achieved by GPT-3.5 (odds ratio [OR] 2.5, 95% confidence interval [CI] 2.2-2.9, p<0.001). GPT-4 significantly outperformed GPT-3.5 across AUA SASP test years from 2012-2023 (mean difference 23, t(22)=14, 95% CI 19-26, p<0.001), as well as across urology topic areas (mean difference 21, t(52)=5.5, 95% CI 13-29, p<0.001).

CONCLUSIONS: GPT-4 scored significantly higher than GPT-3.5 on the AUA SASP exams in overall performance, across all test years, and in various urology topic areas. This suggests improvement in evolving AI language models in answering clinical urology questions; however, certain aspects of medical knowledge and clinical reasoning remain challenging for AI language models.
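The reported odds ratio can be sanity-checked from the abstract's summary figures alone. The sketch below is a minimal reconstruction, assuming per-model correct counts derived by rounding the reported percentages (55% and 33% of 1679 questions); these counts are approximations, not the study's raw data.

```python
# Reconstruct the reported odds ratio from the abstract's summary figures.
# Assumed counts (derived from 55% and 33% of 1679), not the study's raw data.

TOTAL = 1679
gpt4_correct = round(0.55 * TOTAL)   # ≈ 923, assumed from the 55% total score
gpt35_correct = round(0.33 * TOTAL)  # ≈ 554, assumed from the 33% total score

# 2x2 contingency table: rows = model, columns = correct / incorrect
a, b = gpt4_correct, TOTAL - gpt4_correct
c, d = gpt35_correct, TOTAL - gpt35_correct

# Odds ratio = (odds of GPT-4 answering correctly) / (odds of GPT-3.5)
odds_ratio = (a * d) / (b * c)
print(f"Odds ratio: {odds_ratio:.1f}")  # → 2.5, matching the reported OR
```

The same table could be passed to `scipy.stats.fisher_exact` to recover a p-value as well; the pure-arithmetic version above is enough to confirm that the reported OR of 2.5 is consistent with the stated percentages.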
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,549 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,443 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,941 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,792 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations