OpenAlex · Updated hourly · Last updated: 2026-04-07, 22:47

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

AI achieves board-level performance on the Japan diagnostic radiology board examination through direct image interpretation

2026 · 0 citations · Japanese Journal of Radiology · Open Access

0 citations · 13 authors · 2026

Abstract

To evaluate text-only versus vision-enabled performance of late-2025 large language models (LLMs) on the Japan Diagnostic Radiology Board Examination (JDRBE) and compare model performance with newly board-certified radiologists.

Image-based questions from the JDRBE 2021 and 2023–2025 were collected, and ground truth answers were determined by expert consensus. Four commercial multimodal LLMs were evaluated: Gemini 2.5 Pro (March 2025, baseline), Gemini 3 Pro, GPT-5.1, and Claude Opus 4.5 (all released in November 2025). Each question was answered with image input (“vision”) and without images (“text-only”). For the JDRBE 2025, the subjective legitimacy of responses was independently rated by two radiologists using a five-point Likert scale, and low-rated responses were further analyzed by error type. Additional analyses on the JDRBE 2025 subset included image shuffling and a multi-run variability assessment (five runs). Model accuracies were also compared with those of five newly board-certified radiologists who passed the JDRBE 2025.

Gemini 3 Pro achieved the highest accuracy among all models, scoring 85.3% (279/327) in the vision condition and significantly outperforming its text-only accuracy (74.3%, P < 0.001). Gemini 2.5 Pro and Claude Opus 4.5 also improved with image input, whereas GPT-5.1 did not. For the JDRBE 2025, Gemini 3 Pro in the vision condition received the highest legitimacy ratings, and its accuracy (88%) was above the range observed in a reference group of five newly board-certified radiologists (65%–83%); hallucination was nonetheless the most common error type. Image-shuffling analysis on the 2025 subset showed no performance gain in any model, supporting genuine reliance on visual input. Multi-run variability analysis showed high agreement across runs.

Among late-2025 commercial LLMs, Gemini 3 Pro demonstrated board-level performance on the JDRBE through direct medical image interpretation.
The performance of vision-enabled large language models on the Japan Diagnostic Radiology Board Examination was evaluated. Among the models released in November 2025, Gemini 3 Pro demonstrated significant capabilities in direct medical image interpretation, achieving accuracy above that of a reference group of five newly board-certified radiologists.
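The abstract reports a significant vision versus text-only difference (P < 0.001) on the same set of paired question responses. The paper does not state which test was used; a paired comparison of correct/incorrect answers from two conditions on the same items is commonly done with an exact McNemar test, sketched below. The function and the example counts are illustrative assumptions, not taken from the paper.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from discordant pair counts.

    b: items correct in condition A but wrong in condition B.
    c: items correct in condition B but wrong in condition A.
    Under the null, the discordant pairs split Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), then double for two-sidedness.
    p_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p_tail)

# Hypothetical discordant counts (NOT reported in the paper):
# 40 questions answered correctly only with vision input,
# 8 answered correctly only in the text-only condition.
print(mcnemar_exact(40, 8))
```

Only the discordant pairs enter the test: questions answered identically in both conditions carry no information about which condition is better, which is why the concordant counts never appear in the formula.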
