This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Pathology’s Last Exam: Stress-Testing Diagnostic Reasoning and Safety in Large Language Models
Citations: 0
Authors: 6
Year: 2025
Abstract
Large language models (LLMs) are evolving into diagnostic co-pilots, yet current benchmarks fail to test the integrated, stepwise reasoning required in diagnostic pathology. Here, we present Pathology’s Last Exam (PLE), a curated, highly detailed, text-based benchmark of 100 complex cases spanning organ systems, enriched for rare and challenging entities, plus 20 adversarial cases designed to stress-test model safety. Each case provides structured blocks (Primary, Clinical, Histopathology, IHC/Special Stains, Molecular Pathology) with stepwise information release mirroring real sign-out. We evaluated five LLMs (one proprietary, four open-source) across the different stages. While the best model (GPT-5) achieved 70% accuracy on full evidence, performance on safety tests was alarming: models frequently failed to detect biological contradictions, confidently diagnosing nonsensical “mix-up” cases rather than refusing them. This reveals a critical safety gap: high diagnostic capability is currently coupled with a dangerous inability to recognize impossible clinical scenarios. PLE provides a framework to measure and mitigate these risks before clinical deployment, as well as a foundation for multimodal evaluation protocols that can be extended to vision-language models and autonomous diagnostic agents in the future.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,349 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,219 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,631 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,480 citations