This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Beyond Size and Class Balance: Alpha as a New Dataset Quality Metric for Deep Learning
Citations: 0
Authors: 3
Year: 2024
Abstract
In deep learning, achieving high performance on image classification tasks requires diverse training sets. However, the current best practice, maximizing dataset size and class balance, does not guarantee dataset diversity. We hypothesized that, for a given model architecture, model performance can be improved by maximizing diversity more directly. To test this hypothesis, we introduce a comprehensive framework of diversity measures from ecology that generalizes familiar quantities like Shannon entropy by accounting for similarities among images. (Size and class balance emerge as special cases.) Analyzing thousands of subsets from seven medical datasets showed that the best correlates of performance were not size or class balance but $A$ ("big alpha"), a set of generalized entropy measures interpreted as the effective number of image-class pairs in the dataset, after accounting for image similarities. One of these, $A_0$, explained 67% of the variance in balanced accuracy, vs. 54% for class balance and just 39% for size. The best pair of measures was size-plus-$A_1$ (79%), which outperformed size-plus-class-balance (74%). Subsets with the largest $A_0$ performed up to 16% better than those with the largest size (median improvement, 8%). We propose maximizing $A$ as a way to improve deep learning performance in medical imaging.
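The idea of an "effective number" that accounts for similarity can be illustrated with a small sketch. This page does not give the paper's definition of big alpha, so the code below is not the authors' measure. It assumes the standard similarity-sensitive diversity of order q from ecology (the Leinster-Cobbold form), applied to a toy set of four items. The function name `similarity_sensitive_diversity` and the toy similarity matrices are hypothetical, chosen only for illustration.

```python
import numpy as np

def similarity_sensitive_diversity(p, Z, q):
    """Effective number of items of order q, given weights p and similarity matrix Z.

    Assumed form (Leinster-Cobbold), not necessarily the paper's exact measure:
        D_q = (sum_i p_i * (Zp)_i**(q - 1)) ** (1 / (1 - q)),  q != 1
        D_1 = exp(-sum_i p_i * log((Zp)_i))
    Z[i, j] is the similarity between items i and j, with Z[i, i] = 1.
    """
    p = np.asarray(p, dtype=float)
    Z = np.asarray(Z, dtype=float)
    p = p / p.sum()                # normalize weights to sum to 1
    Zp = Z @ p                     # "ordinariness" of each item
    keep = p > 0                   # items with zero weight contribute nothing
    p, Zp = p[keep], Zp[keep]
    if np.isclose(q, 1.0):
        return float(np.exp(-np.sum(p * np.log(Zp))))
    return float(np.sum(p * Zp ** (q - 1.0)) ** (1.0 / (1.0 - q)))

# Four equally weighted items.
p = np.full(4, 0.25)

# Case 1: all items completely distinct (identity similarity).
Z_distinct = np.eye(4)

# Case 2: items 0 and 1 are exact duplicates of each other.
Z_duplicates = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

d0_distinct = similarity_sensitive_diversity(p, Z_distinct, q=0)
d1_distinct = similarity_sensitive_diversity(p, Z_distinct, q=1)
d0_duplicates = similarity_sensitive_diversity(p, Z_duplicates, q=0)

print(d0_distinct, d1_distinct, d0_duplicates)
```

With the identity similarity matrix, order 0 reduces to the plain item count (dataset size) and order 1 to the exponential of Shannon entropy, which is the sense in which such measures generalize familiar quantities. In the duplicate case the effective number at order 0 drops from 4 to 3, even though the raw size is unchanged.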
Related works
A survey on deep learning in medical image analysis
2017 · 13,563 citations
Dermatologist-level classification of skin cancer with deep neural networks
2017 · 13,184 citations
A survey on Image Data Augmentation for Deep Learning
2019 · 11,794 citations
QuPath: Open source software for digital pathology image analysis
2017 · 8,172 citations
Radiomics: Images Are More than Pictures, They Are Data
2015 · 8,011 citations