This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
Machine Learning Evaluation Metric Discrepancies Across Programming Languages and Their Components in Medical Imaging Domains: Need for Standardization
Citations: 11
Authors: 8
Year: 2025
Abstract
This study evaluates metrics for tasks such as classification, regression, clustering, correlation analysis, statistical tests, segmentation, and image-to-image (I2I) translation in medical imaging domains. Metrics were compared across Python libraries, R packages, and Matlab functions to assess their consistency and highlight discrepancies. The findings underscore the need for a unified roadmap to standardize metrics, ensuring reliable and reproducible ML evaluations across platforms. This study examined a wide range of evaluation metrics across various tasks in medical imaging and found only some to be consistent across platforms, such as Accuracy, Balanced Accuracy, Cohen's Kappa, F-beta Score, MCC, Geometric Mean, AUC, and Log Loss in binary classification; Accuracy, Cohen's Kappa, and F-beta Score in multi-class classification; MAE, MSE, RMSE, MAPE, Explained Variance, Median AE, MSLE, and Huber in regression; Davies-Bouldin Index and Calinski-Harabasz Index in clustering; Pearson, Spearman, Kendall's Tau, Mutual Information, Distance Correlation, Bicor, Percbend, Shepherd, and Partial Correlation in correlation analysis; Paired t-test, Chi-Square Test, ANOVA, Kruskal-Wallis Test, Shapiro-Wilk Test, Welch's t-test, and Bartlett's test in statistical tests; Accuracy, Precision, and Recall in 2D segmentation; Accuracy in 3D segmentation; MAE, MSE, RMSE, and R-Squared in 2D-I2I translation; and MAE, MSE, and RMSE in 3D-I2I translation. Given the observed discrepancies in a number of metrics (e.g., Precision, Recall, and F1 Score in binary classification, WCSS in clustering, and multiple statistical tests), this study concludes that ML evaluation metrics require standardization and recommends that future research use consistent metrics for different tasks to effectively compare ML techniques and solutions.

INDEX TERMS 2D/3D medical images, consistency of evaluation metrics in multi-framework, evaluation metric roadmap, ML evaluation metrics.
Related Works
New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1)
2008 · 28,988 citations
TNM Classification of Malignant Tumours
1987 · 16,123 citations
A survey on deep learning in medical image analysis
2017 · 13,698 citations
Reduced Lung-Cancer Mortality with Low-Dose Computed Tomographic Screening
2011 · 10,808 citations
The American Joint Committee on Cancer: the 7th Edition of the AJCC Cancer Staging Manual and the Future of TNM
2010 · 9,118 citations