This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Impact of Test Set Composition on AI Performance for Pediatric Radiograph Appendicular Skeleton Fracture Detection

2026 · 1 citation · Radiology

Abstract

Background: Performance of artificial intelligence (AI)-based fracture detection is often evaluated on test sets that may not reflect real-world practice and overestimate AI accuracy, limiting clinical usability.

Purpose: To evaluate the impact of test set composition on the performance of AI models used for automated pediatric fracture detection at radiography.

Materials and Methods: This retrospective study analyzed pediatric appendicular trauma radiographs. Training and validation sets used consecutive radiographs, while two internal test sets were created: a "difficult" set containing radiographs with discrepancies between initial and final assessments and a "matched" set selected based on age, body region, and fracture presence. Three pediatric radiologists independently rated images for diagnostic difficulty and image quality. Sets were evaluated by EfficientNet (classification) and You Only Look Once, version 8 (YOLOv8) (object detection), models trained and validated on radiographs. Statistical significance was assessed with generalized estimating equations and odds ratios (ORs).

Results: A total of 39 703 pediatric patients were included (median age, 10.4 years [IQR, 6.3-13.3 years]; 56% [22 079 of 39 703] male). The difficult internal test set (392 patients, 786 radiographs) was associated with a 40% decrease in odds of a correct classification compared with the matched internal test set (526 patients, 786 radiographs) for EfficientNet variants (OR, 0.60 [95% CI: 0.53, 0.68]; P < .001). For YOLOv8 variants, difficult images were associated with an 80% decrease in odds of successful detection (OR, 0.20 [95% CI: 0.17, 0.25]; P < .001). Sensitivity analysis confirmed that images from the difficult set were rated as more challenging by experts than the "matched" set, and the difficult set contained more "complex" images (15% [119 of 786] vs 5% [41 of 786]) and fewer "simple" images (49% [381 of 786] vs 70% [550 of 786]) (P < .001).

Conclusion: AI performance in pediatric fracture detection was influenced by test set composition and radiograph complexity, where an internal test set of complex radiographs was associated with decreased odds of correct prediction.

© RSNA, 2026. Supplemental material is available for this article.
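As a reading aid, the percent decreases in odds quoted in the Results follow arithmetically from the reported odds ratios: an OR of 0.60 corresponds to a 40% decrease, and an OR of 0.20 to an 80% decrease. The short Python sketch below only illustrates that conversion; the function name `percent_change_in_odds` is hypothetical and not taken from the article, and the sketch does not reproduce the study's generalized estimating equation analysis.

```python
def percent_change_in_odds(odds_ratio: float) -> float:
    """Percent change in odds implied by an odds ratio.

    A negative value means a decrease in odds relative to the
    reference group (here: the matched internal test set).
    """
    return (odds_ratio - 1.0) * 100.0


# Odds ratios and 95% CI bounds as reported in the abstract.
reported = {
    "EfficientNet (classification)": (0.60, 0.53, 0.68),
    "YOLOv8 (object detection)": (0.20, 0.17, 0.25),
}

for model, (or_point, ci_low, ci_high) in reported.items():
    change = round(percent_change_in_odds(or_point))
    low = round(percent_change_in_odds(ci_low))
    high = round(percent_change_in_odds(ci_high))
    print(f"{model}: {change}% change in odds (95% CI: {low}%, {high}%)")
```

Note that these figures describe a change in odds, not in accuracy: a 40% decrease in odds does not mean accuracy fell by 40 percentage points.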

Topics

Artificial Intelligence in Healthcare and Education
Medical Imaging and Analysis
Radiomics and Machine Learning in Medical Imaging
