This is an overview page with metadata about this scientific work. The full article is available from the publisher.
Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank
Citations: 79
Authors: 12
Year: 2023
Abstract
Background: General large language models (LLMs), such as ChatGPT (GPT-3.5), have demonstrated the capability to pass multiple-choice medical board examinations. However, the comparative accuracy of different LLMs and LLM performance on assessments of predominantly higher-order management questions are poorly understood.
Objective: To assess the performance of three LLMs (GPT-3.5, GPT-4, and Google Bard) on a question bank designed specifically for neurosurgery oral boards examination preparation.
Methods: The 149-question Self-Assessment Neurosurgery Exam (SANS) Indications Exam was used to query LLM accuracy. Questions were input in a single-best-answer, multiple-choice format. Chi-squared, Fisher's exact, and univariable logistic regression tests assessed differences in performance by question characteristics.
Results: On a question bank with predominantly higher-order questions (85.2%), ChatGPT (GPT-3.5) and GPT-4 answered 62.4% (95% confidence interval [CI]: 54.1-70.1%) and 82.6% (95% CI: 75.2-88.1%) of questions correctly, respectively. In contrast, Bard scored 44.2% (66/149, 95% CI: 36.2-52.6%). GPT-3.5 and GPT-4 demonstrated significantly higher scores than Bard (both P < 0.01), and GPT-4 significantly outperformed GPT-3.5 (P = 0.023). Among six subspecialties, GPT-4 had significantly higher accuracy in the Spine category relative to GPT-3.5 and in four categories relative to Bard (all P < 0.01). Incorporation of higher-order problem solving was associated with lower question accuracy for GPT-3.5 (OR = 0.80, P = 0.042) and Bard (OR = 0.76, P = 0.014), but not GPT-4 (OR = 0.86, P = 0.085). GPT-4's performance on imaging-related questions surpassed GPT-3.5's (68.6% vs. 47.1%, P = 0.044) and was comparable to Bard's (68.6% vs. 66.7%, P = 1.000). However, GPT-4 demonstrated significantly lower rates of "hallucination" on imaging-related questions than both GPT-3.5 (2.3% vs. 57.1%, P < 0.001) and Bard (2.3% vs. 27.3%, P = 0.002). Lack of a question text description for imaging predicted significantly higher odds of hallucination for GPT-3.5 (OR = 1.45, P = 0.012) and Bard (OR = 2.09, P < 0.001).
Conclusion: On a question bank of predominantly higher-order management case scenarios intended for neurosurgery oral boards preparation, GPT-4 achieved a score of 82.6%, outperforming ChatGPT and Google's Bard.
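To illustrate the kind of pairwise accuracy comparison the abstract reports, the following is a minimal sketch using SciPy's Fisher's exact test on 2x2 correct/incorrect tables. The per-model correct-answer counts are inferred from the reported percentages on the 149-question bank (only Bard's 66/149 is stated explicitly), and the helper name compare_models is purely illustrative; the authors' actual analysis also included chi-squared tests and univariable logistic regression, which are not reproduced here.

```python
# Hedged sketch: pairwise accuracy comparison with Fisher's exact test.
# Correct-answer counts are inferred from reported percentages (illustrative only).
from scipy.stats import fisher_exact

N_QUESTIONS = 149
correct = {
    "GPT-4": 123,    # ~82.6% correct (inferred from reported percentage)
    "GPT-3.5": 93,   # ~62.4% correct (inferred from reported percentage)
    "Bard": 66,      # 44.2% correct (66/149, reported explicitly)
}

def compare_models(a: str, b: str) -> float:
    """Two-sided Fisher's exact test on a 2x2 correct/incorrect table."""
    table = [
        [correct[a], N_QUESTIONS - correct[a]],
        [correct[b], N_QUESTIONS - correct[b]],
    ]
    _, p_value = fisher_exact(table, alternative="two-sided")
    return p_value

for a, b in [("GPT-4", "GPT-3.5"), ("GPT-4", "Bard"), ("GPT-3.5", "Bard")]:
    print(f"{a} vs {b}: P = {compare_models(a, b):.3f}")
```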
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,200 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,051 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,416 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,410 citations