OpenAlex · Updated hourly · Last updated: 22.04.2026, 02:03

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluating GPT‑5 for zero‑shot multimodal reasoning in digital pathology

2026 · 0 citations
Open full text at publisher

0 citations · 7 authors · Year 2026

Abstract

Large multimodal models (LMMs) are increasingly being explored for medical imaging tasks, yet their reliability under zero-shot conditions in pathology remains poorly understood. We systematically evaluate the GPT model family (GPT-5, GPT-5 Mini, GPT-5 Nano, and GPT-4) under a strictly zero-shot setting, without fine-tuning or in-context exemplars, across three pathology-oriented benchmarks: PathVQA (closed-form, stratified into whole-slide image patches, diagrams, gross images, and clinical images), OmniMedVQA–BreaKHis (histopathology), and OmniMedVQA–Blood Cell VQA (hematology). All experiments use standardized prompts and closed answer sets; questions assessing basic modality recognition reach ceiling performance (100%) across all models and are therefore excluded from discriminative analysis. On PathVQA, GPT-5 achieves the highest weighted accuracy (70.9%) and consistently leads or ties across image sources, including WSI patches (69.4%), diagrams (76.7%), gross images (70.3%), and clinical images (78.9%). On OmniMedVQA–BreaKHis (Specific Question subset), GPT-5 attains 71.7% accuracy, comparable to GPT-4 (71.5%) and exceeding GPT-5 Mini (61.2%) and GPT-5 Nano (46.0%). In contrast, on OmniMedVQA–Blood Cell VQA (Specific Question subset), GPT-4 performs best (80.0%), followed by GPT-5 (75.2%), GPT-5 Mini (74.2%), and GPT-5 Nano (55.3%). Relative to supervised or domain-adapted methods reported for PathVQA-Closed (approximately 85–91% accuracy), zero-shot GPT models exhibit a performance gap of 15–21 percentage points while still providing strong plug-and-play baselines without task-specific training.
Overall, performance scales with model capacity and modality-specific visual cues, with GPT-5 generally leading on histopathology and cross-source visual question answering, and GPT-4 retaining an advantage in hematology; bridging the remaining gap will likely require medical instruction tuning or domain adaptation, retrieval or knowledge augmentation, and improved pathology-aware uncertainty calibration.
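The "weighted accuracy" reported for PathVQA pools per-source accuracies in proportion to each stratum's question count. A minimal sketch of that pooling, using hypothetical stratum sizes (the paper's actual per-source counts are not given on this page):

```python
def weighted_accuracy(strata):
    """Pool per-stratum accuracies, weighted by question count.

    strata: dict mapping stratum name -> (accuracy, n_questions)
    """
    total = sum(n for _, n in strata.values())
    return sum(acc * n for acc, n in strata.values()) / total

# Per-source accuracies from the abstract; the question counts below are
# hypothetical placeholders, not the benchmark's real stratum sizes.
pathvqa_gpt5 = {
    "wsi_patches": (0.694, 2500),
    "diagrams":    (0.767, 300),
    "gross":       (0.703, 400),
    "clinical":    (0.789, 200),
}

print(f"{weighted_accuracy(pathvqa_gpt5):.3f}")
```

With these placeholder counts the pooled figure lands near the per-source values, as expected: the weighted mean can never fall outside the range of the stratum accuracies, and it is dominated by the largest stratum (here, WSI patches).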


Topics

AI in cancer detection · Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills