This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Impact of Label Noise from Large Language Model–generated Annotations on Evaluation of Diagnostic Model Performance
Citations: 1
Authors: 9
Year: 2025
Abstract
Purpose: To systematically examine how large language model (LLM)-generated label noise impacts real-world evaluation of artificial intelligence (AI) binary classification model performance.

Materials and Methods: A simulation framework was developed to evaluate how LLM label errors affect estimated model performance. A synthetic dataset (10 000 cases) was generated across low-prevalence (10% and 30%) and high-prevalence (70% and 90%) conditions. LLM sensitivity and specificity values were varied independently from 90% to 100%. AI binary classification models were simulated, with true sensitivity and specificity ranging from 90% to 100%. Apparent performance was calculated with LLM-generated labels as the reference standard. Best- and worst-case performance bounds were calculated analytically, and empirical uncertainty distributions were obtained via Monte Carlo trials.

Results: Apparent performance was highly sensitive to LLM label quality, with estimation bias strongly modulated by disease prevalence. In low-prevalence settings, small reductions in LLM specificity substantially underestimated model sensitivity: at 10% prevalence, an LLM with 90% specificity yielded an apparent sensitivity of ~53% for a model with perfect true sensitivity. In high-prevalence settings, reduced LLM sensitivity led to underestimation of model specificity: at 90% prevalence, lowering LLM sensitivity from 100% to 90% reduced apparent specificity from 100% to ~53%, despite perfect true specificity. Monte Carlo simulations revealed a consistent downward bias, with apparent values often falling below true model performance even when within the theoretical error bounds.

Conclusion: LLM-generated labels can introduce systematic, prevalence-dependent bias into model evaluation. In low-prevalence tasks, ensuring high LLM specificity during label extraction is critical, as false-positive labels disproportionately bias estimated sensitivity and lead to underestimation of model performance.

Keywords: Large Language Models, Report Labeling, Model Deployment, Diagnostic Performance, Observer Performance, Outcomes Analysis

Supplemental material is available for this article. © RSNA, 2025. See also the commentary by Maiter and Zapaishchykova in this issue.
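The ~53% figure in the Results follows directly from the confusion arithmetic: at 10% prevalence in 10 000 cases, an LLM with 100% sensitivity and 90% specificity marks all 1000 true positives plus 0.10 × 9000 = 900 false positives as positive, so a perfect model agrees with only 1000 of the 1900 reference positives, an apparent sensitivity of ≈52.6%. The following minimal Python sketch reproduces that number via Monte Carlo; it assumes NumPy, and the function and parameter names are illustrative rather than taken from the paper's code.

```python
# Minimal sketch of the abstract's simulation setup (illustrative names,
# not the authors' implementation).
import numpy as np

rng = np.random.default_rng(0)

def apparent_performance(n=10_000, prevalence=0.10,
                         llm_sens=1.00, llm_spec=0.90,
                         model_sens=1.00, model_spec=1.00):
    """Run one trial using noisy LLM labels as the reference standard."""
    truth = rng.random(n) < prevalence                 # true disease status
    # LLM-generated reference labels with imperfect sensitivity/specificity:
    # positives kept w.p. llm_sens, negatives flipped w.p. (1 - llm_spec)
    llm = np.where(truth,
                   rng.random(n) < llm_sens,
                   rng.random(n) >= llm_spec)
    # AI model predictions, with fixed true sensitivity/specificity
    pred = np.where(truth,
                    rng.random(n) < model_sens,
                    rng.random(n) >= model_spec)
    # Apparent metrics computed against the noisy LLM labels, not the truth
    app_sens = (pred & llm).sum() / llm.sum()
    app_spec = (~pred & ~llm).sum() / (~llm).sum()
    return app_sens, app_spec

# Monte Carlo: a perfect model at 10% prevalence, evaluated against an LLM
# with 90% specificity, shows apparent sensitivity near 53%.
trials = np.array([apparent_performance() for _ in range(1000)])
print(f"apparent sensitivity: {trials[:, 0].mean():.3f}")  # ~0.526
print(f"apparent specificity: {trials[:, 1].mean():.3f}")  # ~1.000 (unaffected here)
```

Running the same function with prevalence=0.90, llm_sens=0.90, llm_spec=1.00 reproduces the mirrored high-prevalence case from the Results, where apparent specificity drops to ≈53% despite perfect true specificity.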
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,245 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,102 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,468 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,429 citations