This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Performance of State-of-the-Art Multimodal Large Language Models on an Image-Rich Radiology Board Examination: Comparison to Human Examinees
Citations: 2
Authors: 9
Year: 2025
Abstract
RATIONALE AND OBJECTIVES: This study aimed to assess the current multimodal capabilities of leading multimodal large language models (MLLMs) using a 2024 radiology board examination, evaluate their proficiency in utilizing medical image content, compare their performance against human examinees, and consider their cost-effectiveness.

MATERIAL AND METHODS: Six contemporary MLLMs (GPT-4.1, o3, Claude 3.7 Sonnet, Claude 3.7 Sonnet-thinking, Gemini 2.5 Pro Preview, and Gemini 2.5 Flash Preview-thinking) were evaluated using the 100 multiple-choice questions (96 image-based) from the 2024 official board examination of the Japan Radiological Society. The MLLMs were instructed to translate the questions, originally written in Japanese, into English. For certain models, performance was also analyzed with and without images to assess multimodal utility.

RESULTS: Gemini 2.5 Pro Preview achieved the highest accuracy (76.0%), followed by o3 (75.0%), both surpassing the average human examinee score (72.9%). Gemini 2.5 Pro Preview showed 75.0% accuracy with images versus 63.5% without (p = 0.035), and Gemini 2.5 Flash Preview-thinking demonstrated 68.8% accuracy with images versus 57.3% without (p = 0.019), indicating significant performance gains with image inclusion. Notably, Gemini models demonstrated top-tier performance at a highly competitive cost.

CONCLUSION: The latest generation of MLLMs, particularly Gemini 2.5 Pro Preview and o3, can exceed average human performance on radiology board examinations and effectively leverage image information. The Gemini series, in particular, shows rapid improvements and offers a compelling combination of high performance and cost-efficiency for potential applications in radiology.

SUMMARY STATEMENT: Modern multimodal large language models, notably Gemini 2.5 Pro Preview and o3, surpassed average human performance on the 2024 Japanese Radiology Board Examination. Gemini models showed significant score improvements when utilizing image data and offer top-tier performance at a competitive cost, indicating rapid advancements and excellent cost-effectiveness for radiology applications.
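To make the with-image versus without-image accuracies in the abstract more concrete, the sketch below converts the reported percentages into implied counts of correct answers. The abstract does not state the denominator for this comparison; the code assumes it was the 96 image-based questions, which is an assumption on our part. The names `REPORTED`, `N_QUESTIONS`, `implied_correct`, and `summarize` are illustrative and do not come from the paper.

```python
# Accuracies (in percent) exactly as reported in the abstract.
REPORTED = {
    "Gemini 2.5 Pro Preview": {"with_images": 75.0, "without_images": 63.5},
    "Gemini 2.5 Flash Preview-thinking": {"with_images": 68.8, "without_images": 57.3},
}

# ASSUMPTION (not stated in the abstract): the with/without comparison
# was run on the 96 image-based questions only.
N_QUESTIONS = 96


def implied_correct(accuracy_pct: float, n: int) -> int:
    """Return the whole number of correct answers closest to accuracy_pct of n."""
    return round(accuracy_pct / 100 * n)


def summarize(reported: dict, n: int) -> list:
    """Return (model, correct_with, correct_without, gain) for each model."""
    rows = []
    for model, acc in reported.items():
        with_img = implied_correct(acc["with_images"], n)
        without_img = implied_correct(acc["without_images"], n)
        rows.append((model, with_img, without_img, with_img - without_img))
    return rows


if __name__ == "__main__":
    for model, with_img, without_img, gain in summarize(REPORTED, N_QUESTIONS):
        print(
            f"{model}: {with_img}/{N_QUESTIONS} with images, "
            f"{without_img}/{N_QUESTIONS} without, gain of {gain} questions"
        )
```

Under this assumed denominator, both models would gain 11 correctly answered questions when images are included. The p-values in the abstract cannot be reproduced from these counts alone, because the abstract does not name the statistical test or give the per-question paired outcomes.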
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,561 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,452 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,948 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,797 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations