OpenAlex · Updated hourly · Last updated: Mar 28, 2026, 10:31

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Performance of Next-Generation Reasoning Models on Self-Assessment Questions for Nephrology Board Recertification

2025 · 0 citations · Journal of the American Society of Nephrology
Open full text at publisher

Citations: 0 · Authors: 5 · Year: 2025

Abstract

Background: Large language models (LLMs) show promise in medicine. Standard LLMs primarily leverage pattern recognition from vast datasets, whereas newer reasoning models are architecturally designed to enhance multi-step logical inference. However, whether this fundamental difference translates to superior performance by reasoning models in nephrology remains underexplored. We compared cutting-edge reasoning models against standard LLMs using nephrology multiple-choice questions.

Methods: We used 209 self-assessment questions for nephrology board recertification from the Japanese Society of Nephrology (2014-2023). Reasoning models (OpenAI's o3, o3-2025-04-16; Google's Gemini 2.5 Pro, gemini-2.5-pro-preview-03-25) and standard models (OpenAI's GPT-4o, gpt-4o-2024-11-20; Google's Gemini 2.0 Flash, gemini-2.0-flash-001) were evaluated for accuracy via API. Accuracy was also analyzed by question characteristics (taxonomy, question type, image inclusion, subspecialty) and compared using chi-squared or Fisher's exact tests (p<0.05).

Results: The reasoning models o3 (89.5%) and Gemini 2.5 Pro (83.7%) had significantly higher overall accuracy than the standard models GPT-4o (69.9%) and Gemini 2.0 Flash (62.7%) (all p<0.001). No significant difference was found between the two reasoning models (p=0.114). Reasoning models met the passing threshold (≥60%) in all 10 years; standard models did so in 7 of 10 years. Reasoning models showed superiority on recall, problem-solving, general, clinical, and non-image questions. Subspecialty performance varied; reasoning models generally outperformed standard ones, but no significant differences were noted for interpretation or image questions.

Conclusion: Reasoning models significantly outperformed standard LLMs on nephrology multiple-choice questions, showing high potential as educational and research support tools in nephrology. Responsible implementation requires further validation and expert oversight.
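The headline comparison in the abstract is a chi-squared test on two accuracy proportions over the same 209 questions. As a minimal sketch (not the authors' analysis code), the o3 vs. GPT-4o comparison can be reconstructed in pure Python, back-calculating correct-answer counts from the reported percentages; the counts here are approximations derived from rounding, not raw data:

```python
N = 209  # questions, per the abstract
# Back-calculated correct counts: 89.5% for o3, 69.9% for GPT-4o
correct_o3 = round(N * 0.895)     # 187
correct_gpt4o = round(N * 0.699)  # 146

def chi2_2x2(a_correct: int, b_correct: int, n: int) -> float:
    """Pearson chi-squared statistic (df=1, no continuity correction)
    for a 2x2 correct/incorrect table from two models on n questions."""
    table = [[a_correct, n - a_correct], [b_correct, n - b_correct]]
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

stat = chi2_2x2(correct_o3, correct_gpt4o, N)
# The critical value for df=1 at p=0.001 is about 10.83, so a statistic
# above it corresponds to p < 0.001, consistent with the abstract.
print(f"chi2 = {stat:.1f}, p < 0.001: {stat > 10.83}")
```

With these reconstructed counts the statistic is about 24.8, well past the p < 0.001 threshold the paper reports for the overall comparison.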

Topics

Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling