This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Evaluating GPT‑5 for zero‑shot multimodal reasoning in digital pathology
Citations: 0
Authors: 7
Year: 2026
Abstract
Large multimodal models (LMMs) are increasingly being explored for medical imaging tasks, yet their reliability under zero-shot conditions in pathology remains poorly understood. We systematically evaluate ChatGPT-family models (GPT-5, GPT-5 Mini, GPT-5 Nano, and GPT-4) in a strictly zero-shot setting, without fine-tuning or in-context exemplars, across three pathology-oriented benchmarks: PathVQA (closed-form, stratified into whole-slide image patches, diagrams, gross images, and clinical images), OmniMedVQA-BreaKHis (histopathology), and OmniMedVQA-Blood Cell VQA (hematology). All experiments use standardized prompts and closed answer sets; questions assessing basic modality recognition reach ceiling performance (100%) across all models and are therefore excluded from discriminative analysis. On PathVQA, GPT-5 achieves the highest weighted accuracy (70.9%) and consistently leads or ties across image sources, including WSI patches (69.4%), diagrams (76.7%), gross images (70.3%), and clinical images (78.9%). On OmniMedVQA-BreaKHis (Specific Question subset), GPT-5 attains 71.7% accuracy, comparable to GPT-4 (71.5%) and exceeding GPT-5 Mini (61.2%) and GPT-5 Nano (46.0%). In contrast, on OmniMedVQA-Blood Cell VQA (Specific Question subset), GPT-4 performs best (80.0%), followed by GPT-5 (75.2%), GPT-5 Mini (74.2%), and GPT-5 Nano (55.3%). Relative to supervised or domain-adapted methods reported for PathVQA-Closed (approximately 85–91% accuracy), zero-shot ChatGPT models exhibit a performance gap of 15–21 percentage points while still providing strong plug-and-play baselines without task-specific training.
Overall, performance scales with model capacity and modality-specific visual cues, with GPT-5 generally leading on histopathology and cross-source visual question answering, and GPT-4 retaining an advantage in hematology; bridging the remaining gap will likely require medical instruction tuning or domain adaptation, retrieval or knowledge augmentation, and improved pathology-aware uncertainty calibration.
Related Works
A survey on deep learning in medical image analysis
2017 · 13,819 citations
Dermatologist-level classification of skin cancer with deep neural networks
2017 · 13,394 citations
A survey on Image Data Augmentation for Deep Learning
2019 · 11,983 citations
QuPath: Open source software for digital pathology image analysis
2017 · 8,334 citations
Radiomics: Images Are More than Pictures, They Are Data
2015 · 8,101 citations