OpenAlex · Updated hourly · Last updated: 14.03.2026, 00:48

This is an overview page with metadata about this scholarly work. The full article is available from the publisher.

Benchmarking GPT-5 performance and repeatability on the Japanese National Examination for Radiological Technologists over the past decade (2016–2025)

2025 · 1 citation · European Journal of Radiology Artificial Intelligence · Open Access

Citations: 1 · Authors: 5 · Year: 2025

Abstract

Purpose
To evaluate GPT-5 against GPT-4o on the Japanese National Examination for Radiological Technologists (2016–2025), assessing accuracy, repeatability, and factors influencing performance differences.

Materials and Methods
We analyzed 1,992 multiple-choice questions involving text and images, spanning the medical and engineering domains. Both models answered all questions in Japanese under identical conditions across three independent runs. Majority-vote accuracy (correct if ≥ 2 of 3 runs were correct) and first-attempt accuracy were compared using McNemar's test. Repeatability was quantified with Fleiss' κ. Univariable and multivariable analyses were conducted to identify question-level factors associated with GPT-5 improvements.

Results
Across all 10 examination years, GPT-5 achieved a majority-vote accuracy of 92.8% (95% CI: 91.5–93.8), consistently outperforming GPT-4o at 72.4% (95% CI: 70.4–74.4; P < .001). Repeatability was higher for GPT-5 (κ = 0.925, 95% CI: 0.915–0.935) than for GPT-4o (κ = 0.904, 95% CI: 0.894–0.914), with correct answers in all three runs for 88.2% vs. 68.9% of items. GPT-5 outperformed GPT-4o on both text-based (96.5% vs. 78.1%) and image-based questions (72.6% vs. 41.9%). Significant improvements were observed for MRI, CT, and radiography images; however, gains were smaller for clinically oriented ultrasound and nuclear medicine images. The largest advantages appeared in calculation questions (97.3% vs. 39.3%) and engineering-related domains, consistent with external benchmarks highlighting GPT-5's improved reasoning.

Conclusion
GPT-5 demonstrated significantly higher accuracy and repeatability than GPT-4o across a decade of examinations, with improvements in quantitative reasoning, engineering content, and diagram interpretation. Although improvements extended to medical images, performance in clinical image interpretation remains limited.
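The majority-vote scoring described in the abstract (an item counts as correct if at least 2 of 3 independent runs answer it correctly) can be sketched as follows. This is an illustrative sketch only; the data and function names are hypothetical and not taken from the paper.

```python
# Illustrative sketch of majority-vote vs. first-attempt accuracy scoring,
# assuming per-question correctness flags for three independent runs.
# The example data below is made up, not from the study.

def majority_vote_accuracy(runs):
    """runs: list of 3-tuples of booleans, one tuple per question.
    A question counts as correct if >= 2 of 3 runs were correct."""
    return sum(1 for r in runs if sum(r) >= 2) / len(runs)

def first_attempt_accuracy(runs):
    """Accuracy using only the first run per question."""
    return sum(1 for r in runs if r[0]) / len(runs)

questions = [
    (True, True, True),    # correct in all three runs
    (True, False, True),   # majority correct
    (False, False, True),  # majority incorrect
    (False, False, False), # incorrect in all runs
]
print(majority_vote_accuracy(questions))  # 0.5
print(first_attempt_accuracy(questions))  # 0.5
```

Scoring by majority vote smooths over single-run variability, which is why the paper reports it alongside Fleiss' κ as a repeatability measure.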
