This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
#1699 ChatGPT performance in solving medical residency questions in nephrology: a pilot study in Brazil
Citations: 0
Authors: 8
Year: 2025
Abstract
Background and Aims: Large language models (LLMs) such as ChatGPT have driven significant advances in artificial intelligence (AI). Although they perform competently on general medical examinations, the effectiveness of GPT in answering Nephrology-specific questions remains poorly investigated. This study aimed to evaluate the performance of ChatGPT-4 and ChatGPT-3.5 in answering Nephrology questions from medical residency examinations in Brazil.

Method: This is an experimental study comparing the performance of ChatGPT-4 and ChatGPT-3.5 on 411 Nephrology questions extracted from 310 medical residency examinations administered between 2010 and 2024 in Brazil. Multiple-choice questions, with and without images, were selected; open-ended and true/false questions were excluded. Questions with images were submitted only to GPT-4. The questions were organized into four main themes: chronic kidney disease (CKD), hydroelectrolyte/acid-base disorders (DHAB), tubulointerstitial diseases (TID), and glomerular diseases (GD). The results and potential machine errors were cataloged in a Microsoft Excel spreadsheet, and statistical analysis was conducted in SPSS using the chi-square test.

Results: Of the 411 questions selected, 389 had no images and 22 had images. GPT-4 achieved an overall accuracy of 79.80% (328) versus 56.29% (219) for GPT-3.5 (P < 0.001; OR: 3.086; CI: 2.256–4.220). By theme, GPT-4 vs GPT-3.5 accuracy was: DHAB 79.11% (125) vs 55.17% (80) (P < 0.001; OR: 3.078; CI: 1.859–5.096); TID 88.23% (60) vs 52.63% (35) (P < 0.001; OR: 6.857; CI: 2.845–16.528); CKD 75.51% (74) vs 61.95% (57) (P < 0.001; OR: 1.893; CI: 1.015–3.533); and GD 79.31% (69) vs 54.65% (47) (P < 0.001; OR: 3.099; CI: 1.582–6.070).

On questions without versus with images, GPT-4 achieved an accuracy of 81.49% (317) versus 54.54% (12), with 60% (6) accuracy in electrocardiogram analysis. Regarding errors, the two models made 36 in common: 19 in DHAB, of which 68.42% (13) involved acid-base calculations; 12 in GD, of which 58.33% (7) involved clinical management; 3 in TID; and 2 in CKD.

Conclusion: The widely used, freely available ChatGPT-3.5 shows low reasoning capacity in solving Nephrology questions compared with GPT-4, illustrating knowledge gaps. Other themes in Nephrology remain to be explored with this AI tool.
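As an illustration of the abstract's statistical comparison, the overall 2×2 contingency (GPT-4: 328 of 411 correct; GPT-3.5: 219 of 389 correct, images excluded) can be checked with a short standard-library sketch. This is a minimal reconstruction from the reported percentages, not the authors' SPSS analysis, so small deviations from the published OR and CI (e.g. 3.068 here vs the reported 3.086) are expected from rounding of the counts.

```python
import math

# 2x2 contingency table reconstructed from the reported percentages
# (approximate counts; the paper's SPSS analysis may differ slightly)
a, b = 328, 411 - 328   # GPT-4: correct, incorrect (411 questions)
c, d = 219, 389 - 219   # GPT-3.5: correct, incorrect (389 questions)

# Pearson chi-square for a 2x2 table (no continuity correction)
n = a + b + c + d
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Odds ratio with a 95% CI from the log-odds standard error
odds_ratio = (a * d) / (b * c)
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)

print(f"chi2 = {chi2:.1f}, OR = {odds_ratio:.3f}, 95% CI {lo:.3f}-{hi:.3f}")
```

The chi-square statistic far exceeds the P < 0.001 critical value (10.83 at 1 degree of freedom), consistent with the highly significant difference the abstract reports.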
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,260 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,116 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,493 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,438 citations