This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Decoupling Visual Parsing and Diagnostic Reasoning for Vision-Language Models (GPT-4o and GPT-5): Analysis Using Thoracic Imaging Quiz Cases
Citations: 2
Authors: 5
Year: 2025
Abstract
<b>BACKGROUND.</b> Vision-language models (VLMs) have the potential to identify findings on radiologic imaging (i.e., visual parsing) and translate findings into diagnoses (i.e., diagnostic reasoning). Current VLMs have shown insufficient performance to support clinical integration. <b>OBJECTIVE.</b> The purpose of our study was to evaluate the separate contributions of visual parsing and diagnostic reasoning toward GPT-based VLMs' performance in generating correct diagnoses for thoracic imaging. <b>METHODS.</b> This retrospective study included 128 publicly available thoracic imaging cases from the Korean Society of Thoracic Imaging quiz platform (accessed on June 15, 2025). Two VLMs (GPT-4o and GPT-5) processed cases, separately when patient metadata and images were inputted and when patient metadata and radiologist-generated image descriptions were inputted. The models provided five ranked differential diagnoses for each case; when metadata and images were inputted, the models first provided a summary of imaging findings. The proportion of cases for which the models' five differential diagnoses included the correct diagnosis was determined (i.e., top-5 accuracy). The performance of quiz participants, who interpreted cases using metadata and images, was extracted from the platform. The quality of the model-provided image summaries was scored on a 4-point scale (4 = best score). Logistic regression analyses assessed associations between model image summary scores and diagnostic performance. Diagnostic concordance was assessed between models' top-ranked diagnoses and quiz participants' top-10 differential diagnoses. <b>RESULTS.</b> Top-5 accuracy for GPT-4o and GPT-5 when metadata and images were inputted was 15.9% and 24.7% and when metadata and descriptions were inputted was 40.1% and 59.1%, respectively; quiz participants' pooled top-5 accuracy was 45.8%.
Median image summary score was 2 for both models; these scores showed significant independent associations with a top-5 match (GPT-4o: OR = 5.95; GPT-5: OR = 2.77; <i>p</i> < .001). Concordance between models' top-ranked diagnosis and quiz participants' differential lists for GPT-4o and GPT-5 when metadata and images were inputted was 31.6% and 39.3% and when metadata and descriptions were inputted was 78.8% and 79.4%, respectively. <b>CONCLUSION.</b> Two VLMs showed limited ability to visually identify thoracic imaging findings but performed more favorably in generating accurate diagnoses when provided radiologist-generated descriptions. <b>CLINICAL IMPACT.</b> The results underscore the need for radiologist expertise in thoracic imaging interpretation and identify visual image parsing rather than diagnostic reasoning as the principal limitation constraining VLM performance.