This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Benchmarking GPT-5 performance and repeatability on the Japanese National Examination for Radiological Technologists over the past decade (2016–2025)
Citations: 2
Authors: 5
Year: 2025
Abstract
Purpose
To evaluate GPT-5 against GPT-4o on the Japanese National Examination for Radiological Technologists (2016–2025), assessing accuracy, repeatability, and factors influencing performance differences.

Materials and Methods
We analyzed 1,992 multiple-choice questions involving text and images, spanning the medical and engineering domains. Both models answered all questions in Japanese under identical conditions across three independent runs. Majority-vote accuracy (correct if ≥ 2 of 3 runs were correct) and first-attempt accuracy were compared using McNemar's test. Repeatability was quantified with Fleiss' κ. Univariable and multivariable analyses were conducted to identify question-level factors associated with GPT-5 improvements.

Results
Across all 10 examination years, GPT-5 achieved a majority-vote accuracy of 92.8% (95% CI: 91.5–93.8), consistently outperforming GPT-4o at 72.4% (95% CI: 70.4–74.4; P < .001). Repeatability was higher for GPT-5 (κ = 0.925, 95% CI: 0.915–0.935) than for GPT-4o (κ = 0.904, 95% CI: 0.894–0.914), with correct answers in all three runs for 88.2% vs. 68.9% of items. GPT-5 outperformed GPT-4o on both text-based (96.5% vs. 78.1%) and image-based questions (72.6% vs. 41.9%). Significant improvements were observed for MRI, CT, and radiography images; however, gains were smaller for clinically oriented ultrasound and nuclear medicine images. The greatest advantages were observed in calculation questions (97.3% vs. 39.3%) and engineering-related domains, consistent with external benchmarks highlighting GPT-5's improved reasoning.

Conclusion
GPT-5 demonstrated significantly higher accuracy and repeatability than GPT-4o across a decade of examinations, with improvements in quantitative reasoning, engineering content, and diagram interpretation. Although these improvements extended to medical images, performance in clinical image interpretation remains limited.
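As a concrete illustration of the evaluation protocol described above, the Python sketch below computes majority-vote accuracy, McNemar's test on paired per-question outcomes, and Fleiss' κ for run-to-run repeatability. It is a minimal sketch, not the authors' code: the synthetic 0/1 correctness matrices, simulated accuracy rates, and variable names are assumptions for illustration, while the statsmodels routines used are real library calls.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
n_questions = 1992                       # questions analyzed in the study

# Placeholder correctness data, shape (questions, 3 runs); 1 = correct.
# The accuracy rates here are synthetic stand-ins, not the study's data.
gpt5 = rng.binomial(1, 0.93, size=(n_questions, 3))
gpt4o = rng.binomial(1, 0.72, size=(n_questions, 3))

# Majority vote: an item counts as correct if >= 2 of 3 runs were correct.
maj5 = gpt5.sum(axis=1) >= 2
maj4o = gpt4o.sum(axis=1) >= 2
print(f"GPT-5 majority-vote accuracy:  {maj5.mean():.3f}")
print(f"GPT-4o majority-vote accuracy: {maj4o.mean():.3f}")

# McNemar's test on the paired 2x2 table of per-question outcomes.
table = np.array([
    [np.sum(maj5 & maj4o),  np.sum(maj5 & ~maj4o)],
    [np.sum(~maj5 & maj4o), np.sum(~maj5 & ~maj4o)],
])
print(f"McNemar P = {mcnemar(table, exact=True).pvalue:.4f}")

# Repeatability: treat the three runs as raters and compute Fleiss' kappa.
counts, _ = aggregate_raters(gpt5)       # (questions, categories) count table
print(f"GPT-5 Fleiss' kappa: {fleiss_kappa(counts):.3f}")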
Related Works
Refinement and reassessment of the SERVQUAL scale.
1991 · 3,967 citations
Radiobiology for the Radiologist.
1974 · 3,502 citations
ACR Thyroid Imaging, Reporting and Data System (TI-RADS): White Paper of the ACR TI-RADS Committee
2017 · 2,421 citations
Accuracy of Physician Self-assessment Compared With Observed Measures of Competence
2006 · 2,324 citations
Technology as an Occasion for Structuring: Evidence from Observations of CT Scanners and the Social Order of Radiology Departments
1986 · 2,247 citations