This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
This too shall pass: the performance of ChatGPT-3.5, ChatGPT-4 and New Bing in an Australian medical licensing examination
Citations: 36
Authors: 3
Year: 2023
Abstract
To the Editor: Following the release of the generative pre-trained transformer (GPT) ChatGPT in November 2022, a wide range of large language models (LLMs), including ChatGPT-3.5 (GPT-3-derived), ChatGPT-4 and New Bing (GPT-4-derived), have been made publicly available. It has been suggested that ChatGPT-4 outperforms ChatGPT-3.5 in answering medical examination questions,1 but it is unknown whether GPT-4-derived LLMs consistently outperform GPT-3-derived LLMs. Despite being GPT-4-derived, New Bing was fine-tuned independently of ChatGPT-4. Fine-tuning provides additional training to an LLM for a specific task, such as responding to user queries, and can incorporate human feedback on LLM-generated responses.2 New Bing and ChatGPT may therefore differ in performance owing to variations in fine-tuning, and, at the time of testing (16–18 March 2023), only New Bing incorporated basic web searches.

We tested ChatGPT-3.5, ChatGPT-4 and New Bing against all 50 publicly available Australian Medical Council licensing examination practice questions.3 The questions had a five-option multiple-choice format, and we copied each one in full to the LLM. Images were omitted during testing because only text inputs were accepted. Two medical student investigators (OK and CG) decoded responses into answer options, which the Australian Medical Council website graded. Each model was tested three times in independent sessions.

ChatGPT-3.5 and ChatGPT-4 answered 49/50 questions in every session, whereas New Bing answered 49, 46 and 43 questions in separate sessions (Supporting Information). ChatGPT-4 provided 46/50 answer options that were identical across sessions; ChatGPT-3.5 gave only 37/50 identical answers, and New Bing 33/50 (Box). ChatGPT-4 scored the highest mean (39.7; standard deviation [SD], 0.6), and the means for New Bing and ChatGPT-3.5 were 36.0 (SD, 2.6) and 33.0 (SD, 2.6) respectively.
GPT-4-derived LLMs (ChatGPT-4 and New Bing) appear to exceed the medical multiple-choice performance of their GPT-3-derived predecessor (ChatGPT-3.5). Such models have previously been shown to outperform PubMedGPT,4 GPT-3 without fine-tuning, and InstructGPT.5 Despite improvements in accuracy, limitations in consistency remain. These results indicate the potential influence of fine-tuning and web-search access on LLM performance. Caution is advised when choosing whether to use an LLM, and which one to use, for a specific task, as it cannot be assumed that all same-generation LLMs perform equally.

We thank Joshua Kovoor for providing editorial and statistical support to this piece. No relevant disclosures.

Supplementary table
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,239 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,095 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,463 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,428 citations