This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
This too shall pass: the performance of ChatGPT-3.5, ChatGPT-4 and New Bing in an Australian medical licensing examination
Citations: 36
Authors: 3
Year: 2023
Abstract
To the Editor: Following the release of the generative pre-trained transformer (GPT) ChatGPT in November 2022, a wide range of large language models (LLMs), including ChatGPT-3.5 (GPT-3-derived), ChatGPT-4 and New Bing (GPT-4-derived), have been made publicly available. It has been suggested that ChatGPT-4 outperforms ChatGPT-3.5 in answering medical examination questions,1 but it is unknown whether GPT-4-derived LLMs consistently outperform GPT-3-derived LLMs. Despite being GPT-4-derived, New Bing was fine-tuned independently of ChatGPT-4. Fine-tuning provides additional training to an LLM for a specific task, such as responding to user queries, and can incorporate human feedback on LLM-generated responses.2 New Bing and ChatGPT may therefore differ in performance owing to variations in fine-tuning, and, at the time of testing (16–18 March 2023), only New Bing incorporated basic web searches.

We tested ChatGPT-3.5, ChatGPT-4 and New Bing against all 50 publicly available Australian Medical Council licensing examination practice questions.3 The questions had a five-option multiple-choice format, and we copied each one in full to the LLM. Images were omitted during testing because only text inputs were accepted. Two medical student investigators (OK and CG) decoded responses into answer options, which the Australian Medical Council website graded. Each model was tested three times in independent sessions.

ChatGPT-3.5 and ChatGPT-4 answered 49/50 questions in every session, whereas New Bing answered 49, 46 and 43 questions in separate sessions (Supporting Information). ChatGPT-4 provided 46/50 answer options that were identical across sessions; ChatGPT-3.5 gave only 37/50 identical answers, and New Bing 33/50 (Box). ChatGPT-4 scored the highest mean (39.7; standard deviation [SD], 0.6), and the means for New Bing and ChatGPT-3.5 were 36.0 (SD, 2.6) and 33.0 (SD, 2.6) respectively.
GPT-4-derived LLMs (ChatGPT-4 and New Bing) appear to exceed the medical multiple-choice performance of their GPT-3-derived predecessor (ChatGPT-3.5). Such models have previously been shown to outperform PubMedGPT,4 GPT-3 without fine-tuning, and InstructGPT.5 Despite improvements in accuracy, limitations in consistency remain. These results indicate the potential influence of fine-tuning and web-search access on LLM performance. Caution is advised when choosing whether to use an LLM, and which one to use, for a specific task, as it cannot be assumed that all same-generation LLMs perform equally.

We thank Joshua Kovoor for providing editorial and statistical support to this piece. No relevant disclosures.

Supplementary table
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,239 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,095 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,463 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,428 citations