This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Performance of Next-Generation Reasoning Models on Self-Assessment Questions for Nephrology Board Recertification
Citations: 0
Authors: 5
Year: 2025
Abstract
Background: Large Language Models (LLMs) show promise in medicine. Standard LLMs primarily leverage pattern recognition from vast datasets, whereas newer reasoning models are architecturally designed to enhance multi-step logical inference. Whether this fundamental difference translates into superior performance by reasoning models in nephrology, however, remains underexplored. We compared cutting-edge reasoning models against standard LLMs on nephrology multiple-choice questions.

Methods: We used 209 self-assessment questions for nephrology board recertification from the Japanese Society of Nephrology (2014-2023). Reasoning models (OpenAI's o3, o3-2025-04-16; Google's Gemini 2.5 Pro, gemini-2.5-pro-preview-03-25) and standard models (OpenAI's GPT-4o, gpt-4o-2024-11-20; Google's Gemini 2.0 Flash, gemini-2.0-flash-001) were evaluated for accuracy via API. Accuracy was also analyzed by question characteristics (taxonomy, question type, image inclusion, subspecialty) and compared using chi-squared or Fisher's exact tests (p<0.05).

Results: The reasoning models o3 (89.5%) and Gemini 2.5 Pro (83.7%) achieved significantly higher overall accuracy than the standard models GPT-4o (69.9%) and Gemini 2.0 Flash (62.7%) (all p<0.001). No significant difference was found between the two reasoning models (p=0.114). The reasoning models met the passing threshold (≥60%) in all 10 years; the standard models did so in 7 of 10 years. Reasoning models were superior on recall, problem-solving, general, clinical, and non-image questions. Subspecialty performance varied; reasoning models generally outperformed standard ones, but no significant differences were observed for interpretation or image questions.

Conclusion: Reasoning models significantly outperformed standard LLMs on nephrology multiple-choice questions, showing high potential as educational and research-support tools in nephrology. Responsible implementation requires further validation and expert oversight.
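The significance testing described in the Methods can be sketched as a Pearson chi-squared test on a 2×2 correct/incorrect contingency table. This is a minimal stdlib-only illustration, not the paper's analysis code: the per-model counts below are reconstructed from the reported accuracies and the 209-question total, and are therefore assumptions.

```python
import math

N = 209  # total questions per model, as reported in the abstract

def counts(accuracy):
    """Reconstruct (correct, incorrect) counts from a reported accuracy."""
    correct = round(accuracy * N)
    return correct, N - correct

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]
    (no continuity correction)."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

def p_value_1dof(chi2):
    """Survival function of the chi-squared distribution with 1 degree of
    freedom: P(X >= chi2) = erfc(sqrt(chi2 / 2))."""
    return math.erfc(math.sqrt(chi2 / 2))

# o3 (89.5%) vs GPT-4o (69.9%): reconstructed counts 187/22 vs 146/63
a, b = counts(0.895)
c, d = counts(0.699)
chi2 = chi2_2x2(a, b, c, d)
print(f"chi2 = {chi2:.2f}, p < 0.001: {p_value_1dof(chi2) < 0.001}")
```

With these reconstructed counts the statistic is about 24.8, giving p well below 0.001, consistent with the "all p<0.001" comparison reported in the Results.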
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,324 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,189 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,588 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,470 citations