This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Artificial Intelligence (AI) in rheumatology: a comparative evaluation of the ChatGPT and DeepSeek application
0
Citations
2
Authors
2026
Year
Abstract
The continuous increase in Artificial Intelligence (AI) applications across many areas of human life has brought great changes to many sciences, including the health sector. ChatGPT and DeepSeek are Large Language Models (LLMs) developed using supervised and reinforcement learning techniques. The aim of this article is to evaluate the accuracy and consistency of the ChatGPT and DeepSeek models in the diagnosis and treatment of two rheumatologic diseases, ankylosing spondylitis (axSpA) and psoriatic arthritis (PsA). Both ChatGPT and the DeepSeek chat system have transformed information retrieval and are among the fastest-growing platforms. They are effective tools that produce text responses to human prompts with high accuracy, accessibility, and low cost, but their use has raised many questions about their reliability. The evaluation in this article compares the responses obtained from the two models with clinical findings in axSpA and PsA using four statistical tests. Specifically, the responses were compared with clinical data from 116 patients hospitalized for rheumatological diseases at the Rheumazentrum Ruhrgebiet in Herne, by calculating the differences in mean estimates, Cohen's Kappa coefficient, Fleiss' Kappa coefficient, and the confidence level corresponding to the differences in mean values, together with several other statistical indicators. Regarding the comparison of mean values, agreement for the three cases examined is very good in some results, satisfactory in others, and shows large differences in the rest.
The calculation of Cohen's Kappa coefficient indicated no agreement between the clinical results and the answers of ChatGPT and DeepSeek, specifically the GPT-5 and DeepSeek-R1 models. The Fleiss' Kappa coefficient likewise showed no satisfactory agreement between the clinical data and the models' answers. The results obtained from several other statistical indicators, as well as the probabilities corresponding to the differences between the mean values of the two models' results and the clinical findings, are similar. The final results and quantitative assessments of the analysis showed that the responses of the ChatGPT and DeepSeek models have moderate validity, reliability, and utility in providing information to patients with axSpA and PsA. Therefore, information obtained from these models should be used only after evaluation and validation by physicians and cross-checking of the recommendations against current clinical guidelines.
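For readers unfamiliar with the two agreement statistics named in the abstract, the following is a minimal sketch of how Cohen's Kappa (two raters, e.g. a model versus a clinician) and Fleiss' Kappa (several raters per item) are computed. The ratings below are illustrative examples only, not the study's data.

```python
from collections import Counter

def cohen_kappa(r1, r2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n          # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    labels = set(r1) | set(r2)
    p_e = sum(c1[l] * c2[l] for l in labels) / (n * n)     # chance agreement
    return (p_o - p_e) / (1 - p_e)

def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of raters assigning item i to category j."""
    N = len(counts)
    m = sum(counts[0])  # raters per item (assumed constant across items)
    k = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (N * m) for j in range(k)]
    P_e = sum(p ** 2 for p in p_j)
    P_bar = sum((sum(n * n for n in row) - m) / (m * (m - 1)) for row in counts) / N
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical binary ratings (1 = diagnosis confirmed, 0 = not confirmed):
clinician = [1, 1, 0, 1, 0, 0, 1, 0]
model     = [1, 0, 0, 1, 0, 1, 1, 0]
print(cohen_kappa(clinician, model))  # → 0.5 (moderate agreement)
```

By convention, kappa values near 0 indicate agreement no better than chance, which is how the abstract's "no agreement" findings would be read on this scale.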
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,292 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,143 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,539 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,452 citations