This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Benchmarking GPT-5 performance and repeatability on the Japanese National Examination for Radiological Technologists over the past decade (2016–2025)
1
Citations
5
Authors
2025
Year
Abstract
<h2>Abstract</h2><h3>Purpose</h3> To evaluate GPT-5 against GPT-4o on the Japanese National Examination for Radiological Technologists (2016–2025), assessing accuracy, repeatability, and factors influencing performance differences. <h3>Materials and Methods</h3> We analyzed 1,992 multiple-choice questions involving text and images, spanning the medical and engineering domains. Both models answered all questions in Japanese under identical conditions across three independent runs. Majority-vote accuracy (correct if ≥ 2 of 3 runs were correct) and first-attempt accuracy were compared using McNemar's test. Repeatability was quantified with Fleiss' κ. Univariable and multivariable analyses were conducted to identify question-level factors associated with GPT-5 improvements. <h3>Results</h3> Across all 10 examination years, GPT-5 achieved a majority-vote accuracy of 92.8% (95% CI: 91.5–93.8), consistently outperforming GPT-4o at 72.4% (95% CI: 70.4–74.4; P < .001). Repeatability was higher for GPT-5 (κ = 0.925, 95% CI: 0.915–0.935) than for GPT-4o (κ = 0.904, 95% CI: 0.894–0.914), with correct answers in all three runs for 88.2% vs. 68.9% of items. GPT-5 outperformed GPT-4o on both text-based (96.5% vs. 78.1%) and image-based questions (72.6% vs. 41.9%). Significant improvements were observed for MRI, CT, and radiography images; however, gains were smaller for clinically oriented ultrasound and nuclear medicine images. The greatest advantages were observed in calculation questions (97.3% vs. 39.3%) and engineering-related domains, consistent with external benchmarks highlighting GPT-5's improved reasoning. <h3>Conclusion</h3> GPT-5 demonstrated significantly higher accuracy and repeatability than GPT-4o over a decade of examinations, with improvements in quantitative reasoning, engineering content, and diagram interpretation. Although improvements extended to medical images, performance in clinical image interpretation remains limited.
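The two evaluation metrics named in the abstract can be sketched in a few lines of Python. The majority-vote criterion counts an item as correct when at least 2 of the 3 runs are correct, and Fleiss' κ measures agreement across the three runs treated as raters over two categories (correct/incorrect). This is a minimal illustrative sketch, not the authors' code; the sample data below are invented for demonstration.

```python
from typing import List


def majority_vote_accuracy(runs: List[List[bool]]) -> float:
    """runs[i] holds the correctness of item i across the independent runs.

    An item counts as correct if >= 2 of 3 runs answered it correctly."""
    return sum(sum(r) >= 2 for r in runs) / len(runs)


def fleiss_kappa(runs: List[List[bool]]) -> float:
    """Fleiss' kappa for n raters (runs) and 2 categories (correct/incorrect)."""
    n = len(runs[0])  # raters (runs) per item
    N = len(runs)     # number of items
    # per-item category counts: (correct, incorrect)
    counts = [(sum(r), n - sum(r)) for r in runs]
    # per-item observed agreement P_i
    P_i = [(c * (c - 1) + w * (w - 1)) / (n * (n - 1)) for c, w in counts]
    P_bar = sum(P_i) / N
    # expected agreement from the marginal proportion of "correct" ratings
    p_correct = sum(c for c, _ in counts) / (N * n)
    P_e = p_correct ** 2 + (1 - p_correct) ** 2
    return (P_bar - P_e) / (1 - P_e)


# Invented example: 4 items, 3 runs each
data = [
    [True, True, True],    # correct in all runs
    [False, False, False], # wrong in all runs
    [True, True, False],   # majority correct
    [True, False, False],  # majority wrong
]
print(majority_vote_accuracy(data))  # 0.5
print(fleiss_kappa(data))
```

In the study, 88.2% of GPT-5's items fell into the unanimous "correct in all three runs" category, which is what drives its high κ of 0.925.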
Related works
Refinement and reassessment of the SERVQUAL scale.
1991 · 3,966 citations
Features and uses of high-fidelity medical simulations that lead to effective learning: a BEME systematic review
2005 · 3,757 citations
Radiobiology for the Radiologist.
1974 · 3,501 citations
International evidence-based recommendations for point-of-care lung ultrasound
2012 · 2,808 citations
Radiation Dose Associated With Common Computed Tomography Examinations and the Associated Lifetime Attributable Risk of Cancer
2009 · 2,428 citations