This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
More Harm than Help? Evaluating the Capabilities of Vision-Language Models in Neurological Image Analysis
Citations: 0
Authors: 11
Year: 2025
Abstract
Objectives: This study evaluates the performance of open-source and commercial Vision Language Models (VLMs) in interpreting radiological images of neurological diseases, comparing their diagnostic accuracy to that of experienced neuroradiologists.
Methods: A dataset of 100 brain and spine pathology cases with confirmed diagnoses was curated from the Radiopaedia database to reflect routine clinical neuroradiology practice. Five neuroradiologists reviewed the cases, including imaging and case presentations, to determine the most probable diagnosis. In parallel, five VLMs (Gemini 2.0, GPT-4o1-Preview, Llama 3.2 90b, Qwen 2.5, and Grok-2-vision) were given the same cases and tasked with generating three differential diagnoses along with their reasoning. Two neuroradiologists then evaluated the accuracy of both the single most probable diagnosis and the top three diagnoses produced by the VLMs, as well as the rationale provided, and assessed the potential for harmful outcomes based on the VLM outputs.
Results: Neuroradiologists achieved a mean diagnostic accuracy of 86.2%, significantly outperforming all VLMs. Among the models, Gemini 2.0 achieved the highest accuracy at 35%, with 28% of its diagnoses deemed potentially harmful, while Grok-2-vision had the lowest accuracy at 9%, with 45% of its outputs categorized as harmful. All models showed a trend toward slightly lower accuracy as the number of images per case increased; however, the strength of this relationship was modest. Evaluation of potential harm revealed that treatment delay was the most common risk for VLMs, ranging from 28% for Gemini 2.0 to 45% for Grok-2-vision. Error analysis indicated that the most frequent causes of misdiagnosis were incorrect anatomic classification (error rates ranging from 26% for Gemini 2.0 to 53% for Grok-2-vision) and inaccurate description of imaging findings (from 35% for Gemini 2.0 to 72% for Grok-2-vision).
Conclusion: While VLMs hold promise for enhancing radiological workflows, current open-source and commercial models remain far from reliable for the interpretation of radiological images of neurological diseases.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,200 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,051 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,416 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,410 citations