This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Pathology’s Last Exam: Stress-Testing Diagnostic Reasoning and Safety in Large Language Models
Citations: 0
Authors: 6
Year: 2025
Abstract
Large language models (LLMs) are evolving into diagnostic co-pilots, yet current benchmarks fail to test the integrated, stepwise reasoning required in diagnostic pathology. Here, we present Pathology’s Last Exam (PLE), a curated, highly detailed, text-based benchmark of 100 complex cases spanning organ systems, enriched for rare and challenging entities, plus 20 adversarial cases designed to stress-test model safety. Each case provides structured blocks (Primary, Clinical, Histopathology, IHC/Special Stains, Molecular Pathology) with stepwise information release mirroring real sign-out. We evaluated five LLMs (one proprietary, four open-source) across the different stages. While the best model (GPT-5) achieved 70% accuracy on full evidence, performance on safety tests was alarming: models frequently failed to detect biological contradictions, confidently diagnosing nonsensical “mix-up” cases rather than refusing them. This reveals a critical safety gap: high diagnostic capability is currently coupled with a dangerous inability to recognize impossible clinical scenarios. PLE provides a framework to measure and mitigate these risks before clinical deployment, as well as a foundation for multimodal evaluation protocols that can be extended to vision-language models and autonomous diagnostic agents in the future.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,349 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,219 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,631 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,480 citations