This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Performance of State-of-the-Art Multimodal Large Language Models on an Image-Rich Radiology Board Examination: Comparison to Human Examinees

2025 · 2 citations · Academic Radiology · Open Access

Abstract

RATIONALE AND OBJECTIVES: This study aimed to assess the current multimodal capabilities of leading multimodal large language models (MLLMs) using a 2024 radiology board examination, evaluate how well they use medical image content, compare their performance against human examinees, and consider their cost-effectiveness.

MATERIAL AND METHODS: Six contemporary MLLMs (GPT-4.1, o3, Claude 3.7 Sonnet, Claude 3.7 Sonnet-thinking, Gemini 2.5 Pro Preview, and Gemini 2.5 Flash Preview-thinking) were evaluated using the 100 multiple-choice questions (96 image-based) from the 2024 official board examination of the Japan Radiological Society. The MLLMs were instructed to translate the questions, originally in Japanese, into English. For certain models, performance was also analyzed with and without images to assess multimodal utility.

RESULTS: Gemini 2.5 Pro Preview achieved the highest accuracy (76.0%), followed by o3 (75.0%), both surpassing the average human examinee score (72.9%). Gemini 2.5 Pro Preview showed 75.0% accuracy with images versus 63.5% without (p = 0.035), and Gemini 2.5 Flash Preview-thinking demonstrated 68.8% accuracy with images versus 57.3% without (p = 0.019), indicating significant performance gains with image inclusion. Notably, Gemini models demonstrated top-tier performance at a highly competitive cost.

CONCLUSION: The latest generation of MLLMs, particularly Gemini 2.5 Pro Preview and o3, can exceed average human performance on radiology board examinations and effectively leverage image information. The Gemini series, in particular, shows rapid improvements and offers a compelling combination of high performance and cost-efficiency for potential applications in radiology.

SUMMARY STATEMENT: Modern multimodal large language models, notably Gemini 2.5 Pro Preview and o3, surpassed average human performance on the 2024 Japanese Radiology Board Examination. Gemini models showed significant score improvements when utilizing image data and offer top-tier performance at a competitive cost, indicating rapid advancements and excellent cost-effectiveness for radiology applications.
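
The abstract reports accuracies as rounded percentages but does not say which question set each figure is based on. As a reading aid only, the following sketch checks which whole-number counts of correct answers would reproduce a reported percentage, for the full set of 100 questions and for the 96 image-based questions. The function name, the rounding tolerance, and the choice of candidate denominators are assumptions made for illustration; none of this comes from the paper's methods.

```python
def consistent_counts(pct, n_questions, tol=0.05):
    """Return every integer count k (0..n_questions) of correct answers
    whose accuracy 100*k/n_questions lies within `tol` percentage points
    of the reported value, i.e. would round to it at one decimal place.
    Hypothetical helper, not taken from the paper."""
    return [
        k
        for k in range(n_questions + 1)
        if abs(100.0 * k / n_questions - pct) <= tol + 1e-9
    ]


# Percentages as reported in the abstract.
reported = {
    "Gemini 2.5 Pro Preview, overall": 76.0,
    "Gemini 2.5 Pro Preview, with images": 75.0,
    "Gemini 2.5 Pro Preview, without images": 63.5,
    "Gemini 2.5 Flash Preview-thinking, with images": 68.8,
    "Gemini 2.5 Flash Preview-thinking, without images": 57.3,
}

# Assumed candidate denominators: all questions vs. image-based questions.
for label, pct in reported.items():
    matches = {n: consistent_counts(pct, n) for n in (100, 96)}
    print(f"{label}: {pct}% -> {matches}")
```

Under these assumptions, 63.5% and 57.3% match no whole-number count out of 100 but do match 61 and 55 correct answers out of 96. This suggests, without confirming, that the with-image and without-image comparison was run on the 96 image-based questions.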

Topics

Artificial Intelligence in Healthcare and Education
Radiology practices and education
COVID-19 diagnosis using AI
