This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Performance of the ChatGPT-5 Language Model in Solving a Specialty Examination in Balneology and Physical Medicine
Citations: 0
Authors: 16
Year: 2025
Abstract
Background
In recent years, there has been a breakthrough in the development of advanced computational systems based on neural networks. One such system is ChatGPT, built on the GPT model family first introduced in 2018; its potential was quickly recognized, leading to its global popularity. Language models are increasingly capable of addressing complex problems, making them a promising tool for supporting the training of medical professionals. A particularly important aspect is AI's ability to solve medical examinations, such as the Medical Final Examination (LEK) and the National Specialty Examination (PES), as well as international exams, including the United States Medical Licensing Examination and various specialty board examinations.

Objective
The objective of this study is to analyze the potential of the latest publicly available version of the ChatGPT-5 model in answering examination questions in balneology and physical medicine as part of the PES. The study focuses on the accuracy of the model's answers and on evaluating the confidence of its decisions, in order to assess its potential use as a supportive tool in medical education and specialty exam preparation.

Materials and methods
The experiment was based on the official Spring 2024 PES in Balneology and Physical Medicine, which consisted of 120 questions. The correctness of ChatGPT-5's answers was verified against the official key prepared by the Center for Medical Examinations (CEM), and the model's self-declared confidence was recorded on a 1-5 scale. Both the answer key and the examination database were obtained from the official CEM website. Prior to testing, ChatGPT-5 was introduced to the rules of the examination and provided with the full set of questions in Polish. The questions were divided into two groups: clinical and theoretical. Two questions were excluded due to inconsistency with current medical knowledge. Statistical analyses, including the chi-square test and the Mann-Whitney U test, were performed using Microsoft Excel (Microsoft Corporation, Redmond, WA, USA) and GraphPad Prism (GraphPad Software, San Diego, CA, USA).

Results
ChatGPT-5 provided 83 correct answers out of the 118 scored questions (70.34%), surpassing the passing threshold. No statistically significant difference in answer accuracy was observed between clinical and theoretical questions (p = 0.983), suggesting that the discrepancies were more likely attributable to random variation than to true differences. Answer correctness was positively correlated with the model's self-assessed confidence level (p = 0.029): the higher the declared confidence, the greater the likelihood of a correct response. The Mann-Whitney U test (p = 0.07) indicated that the difference in confidence levels between clinical and theoretical questions did not reach statistical significance (α = 0.05), although a trend toward a potential difference was observed.

Conclusions
ChatGPT-5 demonstrated sufficient performance to pass the specialization examination in Balneology and Physical Medicine. The model displayed lower confidence when solving advanced clinical questions than when solving theoretical ones. Answer accuracy was correlated with the assigned confidence level. While the Mann-Whitney U test (p = 0.07) did not confirm a statistically significant difference in confidence between the two question categories, it suggested a possible trend. Further expert research is required before such models can be widely implemented in medical education.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,214 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,071 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,429 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,418 citations