OpenAlex · Updated hourly · Last updated: 17.03.2026, 11:19

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Impact of Label Noise from Large Language Model–generated Annotations on Evaluation of Diagnostic Model Performance

2025 · 1 citation · Radiology: Artificial Intelligence

Citations: 1 · Authors: 9 · Year: 2025

Abstract

Purpose: To systematically examine how large language model (LLM)-generated label noise affects real-world evaluation of artificial intelligence (AI) binary classification model performance.

Materials and Methods: A simulation framework was developed to evaluate how LLM label errors affect estimated model performance. A synthetic dataset (10 000 cases) was generated across low-prevalence (10% and 30%) and high-prevalence (70% and 90%) conditions. LLM sensitivity and specificity values were varied independently from 90% to 100%. AI binary classification models were simulated, with true sensitivity and specificity ranging from 90% to 100%. Apparent performance was calculated using the LLM-generated labels as the reference standard. Best- and worst-case performance bounds were derived analytically, and empirical uncertainty distributions were obtained via Monte Carlo trials.

Results: Apparent performance was highly sensitive to LLM label quality, with estimation bias strongly modulated by disease prevalence. In low-prevalence settings, small reductions in LLM specificity substantially underestimated model sensitivity. For example, at 10% prevalence, evaluating against labels from an LLM with 90% specificity yielded an apparent sensitivity of ~53% for a model with perfect true sensitivity. In high-prevalence conditions, reduced LLM sensitivity led to underestimation of model specificity: at 90% prevalence, lowering LLM sensitivity from 100% to 90% reduced apparent specificity from 100% to ~53%, despite perfect true specificity. Monte Carlo simulations revealed consistent downward bias, with apparent values often falling below true model performance even when within the theoretical error bounds.

Conclusion: LLM-generated labels can introduce systematic, prevalence-dependent bias into model evaluation. In low-prevalence tasks, ensuring high LLM specificity during label extraction is critical, as false-positive labels disproportionately bias estimated sensitivity and lead to underestimation of model performance.
Keywords: Large Language Models, Report Labeling, Model Deployment, Diagnostic Performance, Observer Performance, Outcomes Analysis. Supplemental material is available for this article. © RSNA, 2025. See also commentary by Maiter and Zapaishchykova in this issue.
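The abstract's worked examples (~53% apparent performance for a perfect model) can be reproduced analytically under a simplifying assumption: the evaluated model has perfect true sensitivity and specificity, so its predictions coincide with the true labels, and LLM labeling errors are independent of case features. The function names below are illustrative, not taken from the paper's code:

```python
def apparent_sensitivity(prevalence, llm_sens, llm_spec):
    """Apparent sensitivity of a perfect model scored against noisy LLM labels.

    Apparent sensitivity = fraction of LLM-labeled positives on which the
    (perfect) model also outputs positive, i.e. the truly positive cases
    among all cases the LLM labels positive.
    """
    llm_pos = prevalence * llm_sens + (1 - prevalence) * (1 - llm_spec)
    return prevalence * llm_sens / llm_pos

def apparent_specificity(prevalence, llm_sens, llm_spec):
    """Apparent specificity of a perfect model scored against noisy LLM labels."""
    llm_neg = (1 - prevalence) * llm_spec + prevalence * (1 - llm_sens)
    return (1 - prevalence) * llm_spec / llm_neg

# Low prevalence (10%), LLM specificity 90%: apparent sensitivity ~53%
print(round(apparent_sensitivity(0.10, 1.0, 0.9) * 100))   # 53

# High prevalence (90%), LLM sensitivity 90%: apparent specificity ~53%
print(round(apparent_specificity(0.90, 0.9, 1.0) * 100))   # 53
```

The intuition: at 10% prevalence, even a small false-positive rate in the LLM labels (10% of the large negative pool) floods the LLM-positive set with truly negative cases, which a well-performing model correctly calls negative and which are then counted as "missed positives".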


Topics

Artificial Intelligence in Healthcare and Education · Machine Learning and Data Classification · Text and Document Classification Technologies