
This is an overview page with metadata for this scientific work. The full article is available from the publisher.

326P Pathology’s last exam? A curated text-based benchmark dataset for diagnostic pathology

2025 · 0 citations · ESMO Real World Data and Digital Oncology · Open Access

Citations: 0 · Authors: 5 · Year: 2025

Abstract

Background: Robust evaluation of large language models (LLMs) and agentic artificial intelligence (AI) in diagnostic pathology requires datasets that reflect the sequential, multimodal, and integrative nature of real-world practice. Existing resources rarely capture the structured interplay of clinical presentation, histology, immunohistochemistry, and molecular findings. To address this gap, we developed Pathology's Last Exam, a text-based benchmark of pathology cases designed to rigorously assess LLM-based diagnostic systems.

Methods: We curated 100 pathology cases from practice and leading journals (e.g., Am J Surg Pathol, Mod Pathol), enriched for rare and emerging entities, aberrant immunophenotypes, lesions of intermediate biological potential, and other challenging scenarios. Each case comprises a clinical summary, histopathology, special stains/immunohistochemistry, molecular findings, final diagnosis with references, and standardized metadata. All diagnostic evidence was provided to four large language models (MedGemma-27B, GPT-OSS-120B, Llama-4-Maverick-17B, GPT-5-Mini), each tasked with generating a final diagnostic interpretation. The dataset further supports stepwise information release to emulate the temporal progression of real diagnostic workflows, enabling systematic evaluation of model reasoning at both early and fully informed stages.

Results: The dataset spans several organ systems and includes rare and complex diagnoses among neoplastic and non-neoplastic pathology cases (e.g., RUNX1-mutant AML mimicking B-lymphoblastic leukemia with aberrant B-cell immunophenotype; pilomatrix-like high-grade endometrioid carcinoma; POU2F3-positive, neuroendocrine marker-low small cell carcinoma). On the full-information diagnostic task, accuracy ranged from 29% (MedGemma-27B) to 75% (GPT-5-Mini).

Conclusions: Pathology's Last Exam provides a unique dataset for diagnostic reasoning in surgical pathology. Its structured, literature- and practice-derived cases support rigorous evaluation of AI models. Our findings underscore the need for expanded, pathology-specific reasoning benchmarks that combine curated literature-derived cases with new, expert-generated scenarios.
