This is an overview page with metadata for this scientific publication. The full article is available from the publisher.
Performance of reasoning large language models on nephrology multiple-choice questions
0
Citations
5
Authors
2025
Year
Abstract
Aim: The performance of large language models in medicine is improving, yet it remains unclear how the advantage of reasoning models depends on task characteristics in nephrology.
Methods: We evaluated four large language models from two families—OpenAI (GPT-5, reasoning; GPT-4o, baseline) and Google (Gemini 2.5 Pro, reasoning; Gemini 2.0 Flash, baseline)—on 209 self-assessment questions for nephrology board renewal published by the Japanese Society of Nephrology. Questions were categorized by question type (general vs. clinical), taxonomy (recall, interpretation, problem-solving), and image inclusion (non-image vs. image). Models were accessed via application programming interface with default parameters; images were provided as PNG files. Accuracy was reported with Wilson 95% confidence intervals (CIs); paired comparisons used McNemar's exact test. Primary analyses used logistic generalized linear mixed models with fixed effects, random intercepts, and prespecified interactions.
Results: Overall accuracy was 87.6% (183/209, 95% CI 82.4–91.4) for GPT-5 and 83.7% (175/209, 95% CI 78.1–88.1) for Gemini 2.5 Pro, versus 69.9% (146/209, 95% CI 63.3–75.7) for GPT-4o and 62.7% (131/209, 95% CI 55.9–69.0) for Gemini 2.0 Flash. Paired analyses favored the reasoning models, with odds ratios of 6.29 for OpenAI and 7.29 for Google (both P < 0.001). Adjusted odds ratios for reasoning vs. baseline models were 5.00 for OpenAI and 7.28 for Google (both P < 0.001). Interaction analyses showed stronger effects on clinical questions for OpenAI and taxonomy-dependent effects for Google; image inclusion did not significantly modify the effect.
Conclusion: Reasoning models outperform baseline models in nephrology, with context-dependent advantages; their benefits vary by task, and further validation is essential before routine use.
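For readers who want to see the style of analysis described above in concrete terms, the following is a minimal Python sketch, assuming statsmodels is installed. The Wilson interval uses counts taken directly from the abstract (GPT-5: 183/209); the paired 2×2 table for McNemar's exact test is a hypothetical reconstruction consistent with the reported marginals (183/209 and 146/209) and the reported odds ratio of 6.29, not the study's actual per-question data.

```python
# Minimal sketch of the abstract's accuracy statistics (assumes statsmodels).
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.proportion import proportion_confint

# Wilson 95% confidence interval for overall accuracy, e.g. GPT-5: 183/209.
correct, total = 183, 209
low, high = proportion_confint(correct, total, alpha=0.05, method="wilson")
print(f"GPT-5 accuracy {correct / total:.1%} (95% CI {low:.1%} to {high:.1%})")

# McNemar's exact test on a paired 2x2 table of per-question outcomes:
# rows = GPT-5 (correct, incorrect), columns = GPT-4o (correct, incorrect).
# Cell counts are a reconstruction consistent with the reported marginals
# (183/209 and 146/209) and a discordant-pair odds ratio of 44/7 ~= 6.29;
# the paper's actual table is not shown in the abstract.
table = [[139, 44],
         [7, 19]]
result = mcnemar(table, exact=True)
print(f"McNemar exact test: statistic={result.statistic:.0f}, "
      f"P={result.pvalue:.2e}")
```

The adjusted odds ratios in the primary analysis would come from a logistic mixed model with a random intercept per question (for example, lme4::glmer in R); that step is omitted here because it requires the per-question dataset, which the abstract does not provide.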
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,316 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,177 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,575 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,468 citations