This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Performance of reasoning large language models on nephrology multiple-choice questions

2025 · 0 citations · medRxiv · Open Access

Abstract

Aim: Large language models are performing increasingly well in medicine, yet it remains unclear how the advantage of reasoning models depends on task characteristics in nephrology.

Methods: We evaluated four large language models from two families, OpenAI (GPT-5 as the reasoning model, GPT-4o as the baseline) and Google (Gemini 2.5 Pro as the reasoning model, Gemini 2.0 Flash as the baseline), on 209 self-assessment questions for nephrology board renewal published by the Japanese Society of Nephrology. Questions were categorized by question type (general vs. clinical), taxonomy (recall, interpretation, problem-solving), and image inclusion (non-image vs. image). Models were queried through their application programming interfaces with default parameters; images were provided as PNG files. Accuracy is reported with Wilson 95% confidence intervals (CIs); paired comparisons used McNemar's exact test. Primary analyses used logistic generalized linear mixed models with fixed effects, random intercepts, and prespecified interactions.

Results: Overall accuracy was 87.6% (183/209, 95% CI 82.4–91.4) for GPT-5 and 83.7% (175/209, 95% CI 78.1–88.1) for Gemini 2.5 Pro vs. 69.9% (146/209, 95% CI 63.3–75.7) for GPT-4o and 62.7% (131/209, 95% CI 55.9–69.0) for Gemini 2.0 Flash. Paired analyses favored reasoning models, with odds ratios of 6.29 for OpenAI and 7.29 for Google (both P<0.001). Adjusted odds ratios for reasoning vs. baseline were 5.00 for OpenAI and 7.28 for Google (both P<0.001). Interactions showed stronger effects in clinical questions for OpenAI and taxonomy-dependent effects for Google; image inclusion did not significantly modify the effect.

Conclusion: Reasoning models outperform baseline models in nephrology, but the size of the advantage depends on the task, and further validation is essential before routine use.
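The interval and paired-test methods named in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function names `wilson_ci` and `mcnemar_exact` are hypothetical, the Wilson interval is computed without continuity correction (an assumption), and the discordant-pair counts passed to McNemar's test are inferred rather than reported. The counts 44 vs. 7 (OpenAI) and 51 vs. 7 (Google) are simply one pair each that is consistent with the reported accuracy gaps and odds ratios, assuming the paired odds ratio is the ratio of discordant pairs.

```python
from math import comb, sqrt
from statistics import NormalDist


def wilson_ci(successes: int, n: int, level: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    Under the null hypothesis the smaller count follows Binomial(b + c, 0.5).
    """
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Correct answers out of 209, as reported in the abstract.
N_QUESTIONS = 209
REPORTED_CORRECT = {
    "GPT-5": 183,
    "Gemini 2.5 Pro": 175,
    "GPT-4o": 146,
    "Gemini 2.0 Flash": 131,
}

# ASSUMPTION: discordant pairs (reasoning-only correct, baseline-only correct).
# These are NOT stated in the abstract; they are inferred for illustration.
INFERRED_DISCORDANT = {"OpenAI": (44, 7), "Google": (51, 7)}

if __name__ == "__main__":
    for model, correct in REPORTED_CORRECT.items():
        lo, hi = wilson_ci(correct, N_QUESTIONS)
        print(f"{model}: {correct / N_QUESTIONS:.1%} (95% CI {lo:.1%} to {hi:.1%})")
    for family, (b, c) in INFERRED_DISCORDANT.items():
        print(f"{family}: b/c = {b / c:.2f}, exact McNemar p = {mcnemar_exact(b, c):.2e}")
```

With the reported counts, `wilson_ci` reproduces the four intervals given in the abstract to one decimal place. The McNemar output depends entirely on the inferred discordant counts and should be read as a plausibility check only.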

Institutions

St. Marianna University School of Medicine (JP)

Topics

Artificial Intelligence in Healthcare and Education · Genomics and Rare Diseases · Machine Learning in Healthcare
