This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
The Performance of ChatGPT-4o and DeepSeek-R1 in Interpreting Thyroid Nodule Ultrasound Text Report: A Multicenter Study
Citations: 0
Authors: 6
Year: 2025
Abstract
Objective: To assess two large language models (LLMs), DeepSeek-R1 and ChatGPT-4o, in interpreting thyroid nodule ultrasound text reports, with emphasis on accuracy in benign-malignant differentiation, agreement in Chinese Thyroid Imaging Reporting and Data System (C-TIRADS) classification and management recommendations, and the stability of each task.
Methods: We analyzed 1,063 ultrasound text reports from three medical centers, with 306 nodules confirmed by histopathology. Each nodule's report was processed by both LLMs using standardized prompts, repeated five times, with the final result determined by mode voting.
Results: DeepSeek-R1 outperformed ChatGPT-4o in differentiating benign from malignant nodules, with higher sensitivity (0.879 vs. 0.692), accuracy (0.729 vs. 0.644), and area under the curve (AUC) (0.694 vs. 0.632). However, senior radiologists achieved notably better results than either LLM, with higher accuracy (0.804) and AUC (0.865). In C-TIRADS classification, DeepSeek-R1 also outperformed ChatGPT-4o (κ = 0.770 vs. κ = 0.688, Δκ = 0.083 [95% CI: 0.048, 0.122]). Both models showed substantial agreement with clinicians on management recommendations (κ = 0.606 vs. κ = 0.608, Δκ = -0.002 [95% CI: -0.044, 0.041]). In terms of stability, both LLMs exhibited almost perfect agreement in C-TIRADS classification (α = 0.864 vs. α = 0.866, Δα = -0.003 [95% CI: -0.023, 0.017]) and management recommendations (κ = 0.853 vs. κ = 0.849, Δκ = 0.004 [95% CI: -0.026, 0.033]). However, in benign-malignant discrimination, DeepSeek-R1 demonstrated significantly greater stability than ChatGPT-4o (κ = 0.849 vs. κ = 0.550, Δκ = 0.260 [95% CI: 0.191, 0.321]).
Conclusion: Our study highlights the potential of LLMs for interpreting thyroid nodule ultrasound text reports. DeepSeek-R1 outperformed ChatGPT-4o in benign-malignant differentiation accuracy and classification consistency, whereas the two models performed similarly in management recommendations.
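The abstract names two computations: mode voting over five repeated runs, and agreement measured with κ. The following is a minimal Python sketch of both, not the authors' code. The function names, the tie-breaking rule (first label seen wins), the choice of unweighted Cohen's kappa, and the sample labels are all assumptions made for illustration; the abstract does not specify any of them.

```python
from collections import Counter


def mode_vote(runs):
    """Return the most frequent label among repeated runs.

    Assumption: ties go to the label that appears first in `runs`;
    the abstract does not say how ties were resolved.
    """
    counts = Counter(runs)
    best = max(counts.values())
    for label in runs:
        if counts[label] == best:
            return label


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical outputs of five repeated runs for a single nodule.
runs = ["malignant", "benign", "malignant", "malignant", "benign"]
final_label = mode_vote(runs)  # "malignant"
```

With binary labels and five runs a tie cannot occur, but with multi-category outputs such as C-TIRADS classes it can, so any real pipeline would need an explicit tie rule.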
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,260 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,116 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,493 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,438 citations