This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Benchmarking GPT-5 performance and repeatability on the Japanese National Examination for Radiological Technologists over the past decade (2016–2025)
Citations: 2
Authors: 5
Year: 2025
Abstract
Purpose
To evaluate GPT-5 against GPT-4o on the Japanese National Examination for Radiological Technologists (2016–2025), assessing accuracy, repeatability, and factors influencing performance differences.

Materials and Methods
We analyzed 1,992 multiple-choice questions involving text and images, spanning the medical and engineering domains. Both models answered all questions in Japanese under identical conditions across three independent runs. Majority-vote accuracy (correct if ≥ 2 of 3 runs were correct) and first-attempt accuracy were compared using McNemar's test. Repeatability was quantified with Fleiss' κ. Univariable and multivariable analyses were conducted to identify question-level factors associated with GPT-5 improvements.

Results
Across all 10 examination years, GPT-5 achieved a majority-vote accuracy of 92.8% (95% CI: 91.5–93.8), consistently outperforming GPT-4o at 72.4% (95% CI: 70.4–74.4; P < .001). Repeatability was higher for GPT-5 (κ = 0.925, 95% CI: 0.915–0.935) than for GPT-4o (κ = 0.904, 95% CI: 0.894–0.914), with correct answers in all three runs for 88.2% vs. 68.9% of items. GPT-5 outperformed GPT-4o on both text-based (96.5% vs. 78.1%) and image-based questions (72.6% vs. 41.9%). Significant improvements were observed for MRI, CT, and radiography images; however, gains were smaller for clinically oriented ultrasound and nuclear medicine images. The greatest advantages were observed in calculation questions (97.3% vs. 39.3%) and engineering-related domains, consistent with external benchmarks highlighting GPT-5's improved reasoning.

Conclusion
GPT-5 demonstrated significantly higher accuracy and repeatability than GPT-4o across a decade of examinations, with improvements in quantitative reasoning, engineering content, and diagram interpretation. Although these improvements extended to medical images, performance in clinical image interpretation remains limited.
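As a concrete illustration of the evaluation protocol described above, the Python sketch below computes majority-vote accuracy, McNemar's test on paired per-question outcomes, and Fleiss' κ for run-to-run repeatability. It is a minimal sketch, not the authors' code: the synthetic 0/1 correctness matrices, simulated accuracy rates, and variable names are assumptions for illustration, while the statsmodels routines used are real library calls.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
n_questions = 1992                       # questions analyzed in the study

# Placeholder correctness data, shape (questions, 3 runs); 1 = correct.
# The accuracy rates here are synthetic stand-ins, not the study's data.
gpt5 = rng.binomial(1, 0.93, size=(n_questions, 3))
gpt4o = rng.binomial(1, 0.72, size=(n_questions, 3))

# Majority vote: an item counts as correct if >= 2 of 3 runs were correct.
maj5 = gpt5.sum(axis=1) >= 2
maj4o = gpt4o.sum(axis=1) >= 2
print(f"GPT-5 majority-vote accuracy:  {maj5.mean():.3f}")
print(f"GPT-4o majority-vote accuracy: {maj4o.mean():.3f}")

# McNemar's test on the paired 2x2 table of per-question outcomes.
table = np.array([
    [np.sum(maj5 & maj4o),  np.sum(maj5 & ~maj4o)],
    [np.sum(~maj5 & maj4o), np.sum(~maj5 & ~maj4o)],
])
print(f"McNemar P = {mcnemar(table, exact=True).pvalue:.4f}")

# Repeatability: treat the three runs as raters and compute Fleiss' kappa.
counts, _ = aggregate_raters(gpt5)       # (questions, categories) count table
print(f"GPT-5 Fleiss' kappa: {fleiss_kappa(counts):.3f}")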
Related Works
Refinement and reassessment of the SERVQUAL scale.
1991 · 3,967 citations
Radiobiology for the Radiologist.
1974 · 3,502 citations
ACR Thyroid Imaging, Reporting and Data System (TI-RADS): White Paper of the ACR TI-RADS Committee
2017 · 2,421 citations
Accuracy of Physician Self-assessment Compared With Observed Measures of Competence
2006 · 2,324 citations
Technology as an Occasion for Structuring: Evidence from Observations of CT Scanners and the Social Order of Radiology Departments
1986 · 2,247 citations