Letter: Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations
0 citations · 3 authors · 2025
Abstract
To the Editor: Regarding the article “Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations” by Ali et al, published ratings of accuracy within scientific journals may be an incomplete measure of reporting the performance of large language models (LLMs) such as ChatGPT (OpenAI), GPT-4, and future iterations on neurosurgical questions.1 The rapid development of LLMs, coupled with the lengthy review process required for scientific publication, makes it difficult to assess the significance of these studies over time. This issue is compounded by the closed-source nature of OpenAI models. Since the company's products are not specifically developed for accuracy in neurosurgical domains, care must be taken when extrapolating current performance into the future. Release statements are issued by OpenAI on an ongoing basis, indicating potentially significant updates to their underlying models.2 Given the time disparity between collecting data and publication in scientific literature, a model may be updated or made obsolete by a newer model between the point of data collection and publication, rendering the accuracy ratings outdated as published. The Figure shows the comparative results of 12 studies, including the study by Ali et al, that report accuracy of either ChatGPT (GPT-3.5) or GPT-4 on the Congress of Neurological Surgeons Self-Assessment Neurosurgery Examination (SANS) questions and subsets of the Medical Knowledge Self-Assessment Program (MKSAP) questions.1,3-7 Overlaid are lines indicating OpenAI release statements that either announce new GPT models or were subjectively assessed to have potential impact on the accuracy or fidelity of the models' responses (such as updates referencing performance, fidelity, or model behavior).
Gauging the significance of any single update is difficult, given the nonspecific language used by OpenAI in its releases.2

FIGURE. Timeline of GPT updates overlaid with accuracy on SANS and MKSAP questions. Timeline showing the temporal relation of updates to the GPT suite of LLMs vs performance on SANS neurosurgical boards-style questions and MKSAP questions. Thicker lines correspond to releases of new GPT-family products. In cases where the access method to the LLM was differentiated (ChatGPT, Bing Chat, or the Application Programming Interface), the chatbot accuracy rating was used (ChatGPT or Bing Chat). LLMs, large language models; MKSAP, Medical Knowledge Self-Assessment Program; SANS, Self-Assessment Neurosurgery Examination.

The Figure highlights large variability in reported accuracies of LLM performance. In the case of ChatGPT's performance on SANS questions, accuracy ratings vary from 50.4% to 73.4%. Is this variability due to differences in study design, to data being collected before or after significant updates, or to stochasticity in the underlying models? LLMs provide inconsistent output because they are inherently probabilistic and blind to the underlying meaning of their training data, leading to their description as “stochastic parrots” by Bender et al.8 We believe stochasticity must be accounted for in all LLM studies and could be addressed by performing multiple new sessions with SANS questions, randomly ordering the questions during each run to account for contextualization or correlation. Taken to an extreme, the model may be refreshed after every input to best approximate zero-shot prompting. Differing methodologies in the current literature make comparisons between different LLM versions challenging. Creating a standardized prompting format for these studies, controlling for stochasticity and context, would enable temporal comparisons as LLMs continue to evolve.
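The resampling protocol proposed above (fresh session per question, randomized order, repeated full passes) could be sketched as follows. This is a minimal illustration only: `ask_model` is a hypothetical placeholder for one zero-shot call in a fresh session (here a random stub so the sketch runs without any API), and the question bank is synthetic.

```python
import random
import statistics

# Synthetic question bank of (question_id, correct_choice) pairs.
# In a real study these would be SANS-style multiple-choice items.
QUESTIONS = [(f"q{i}", random.choice("ABCD")) for i in range(20)]

def ask_model(question_id: str) -> str:
    """Hypothetical placeholder for one zero-shot LLM call.
    A real implementation would open a new chat session per question
    so no prior context influences the answer. Stubbed with a random
    choice here purely to make the sketch self-contained."""
    return random.choice("ABCD")

def run_trial(questions, seed: int) -> float:
    """One full pass: fresh random question order, fresh session per item."""
    rng = random.Random(seed)
    order = questions[:]
    rng.shuffle(order)  # randomize order to control for contextualization
    correct = sum(ask_model(qid) == answer for qid, answer in order)
    return correct / len(order)

# Repeat the whole exam several times to expose run-to-run variance,
# rather than reporting a single-pass accuracy.
accuracies = [run_trial(QUESTIONS, seed) for seed in range(10)]
print(f"mean={statistics.mean(accuracies):.3f} "
      f"sd={statistics.stdev(accuracies):.3f}")
```

Reporting the mean and spread across repeated randomized passes, rather than one pass, is the point: a single accuracy figure cannot distinguish model stochasticity from genuine capability differences between versions.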
One study in gastroenterology by Patel et al9 compared multiple GPT versions, providing valuable field-specific accuracy insights. However, the development pace of LLMs is so rapid that this article was published after the release of GPT-4o, making its prospective utility questionable. Two studies using the same model may not be comparable if their data were collected on opposite sides of a significant update. We encourage editors to require version numbers for LLM studies or, if unavailable, to have the authors report the dates when they ran their prompts to provide additional context for comparisons. Many of the articles cited here make statements about the usage of LLMs within medical education. However, an LLM-based tool for neurosurgical training could be verified at the time of publication and then rendered inaccurate or obsolete later by a software update or new release. Although LLM development seems to be heading toward higher fidelity, there are no apparent controls in place to guarantee that improvements in the general accuracy of LLMs will also be reflected in specific fields such as neurosurgery. Even more troubling, if the tool is designed to interface with the general public, such as through clinical text summarization tools,10 there would exist few controls to verify the ongoing accuracy of these tools as the underlying models change. The rapid advancements in the LLM space drastically exceed the pace of scientific review and publication. Care could be taken to implement an open-source database controlling for stochasticity and prompting on a standardized set of neurosurgical questions, with accuracy and reproducibility verifiable by outside users. This approach or others like it may bypass the delays associated with scientific publishing and may provide a more responsive longitudinal comparison to track the fidelity of LLMs in neurosurgical contexts over time.
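The version-reporting requirement suggested above could take the form of a small structured record attached to each study. The sketch below is an illustrative assumption, not a published schema: the field names and every example value (version string, date, counts, accuracy) are hypothetical placeholders chosen only to show the shape of the record.

```python
from dataclasses import dataclass, asdict
from datetime import date
import json

@dataclass
class LLMStudyRecord:
    """Minimal metadata a study could report so that later temporal
    comparisons remain possible. Field names are illustrative."""
    model_name: str      # e.g. "gpt-4" (family name)
    model_version: str   # exact version string, if the vendor exposes one
    prompt_date: date    # when the prompts were actually run
    question_set: str    # e.g. "SANS" or "MKSAP"
    n_questions: int     # size of the question bank
    n_repeats: int       # repeated runs used to estimate stochasticity
    mean_accuracy: float # mean over the repeated runs

# Hypothetical example values, for illustration only.
record = LLMStudyRecord(
    model_name="gpt-4",
    model_version="gpt-4-0613",
    prompt_date=date(2023, 9, 1),
    question_set="SANS",
    n_questions=100,
    n_repeats=5,
    mean_accuracy=0.70,
)

# Serialize for deposit in a shared, openly verifiable database.
payload = {**asdict(record),
           "prompt_date": record.prompt_date.isoformat()}
print(json.dumps(payload))
```

Even when a vendor does not expose an exact version string, the `prompt_date` field alone would let readers place a study's results before or after a known release statement.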
It is critical to have a keen understanding of the current fidelity of these models if they are to have a place in neurosurgical education and clinical practice, because the consequences of misinformation are significant.