This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Comparative evaluation of AI architectures for medical triage safety: a real-world validation study (Preprint)
Citations: 0
Authors: 7
Year: 2026
Abstract
BACKGROUND
Health care organizations increasingly deploy artificial intelligence (AI) systems for medical triage, yet systematic comparisons of architectural approaches on real-world patient communications remain limited. Most validation studies rely on physician-created vignettes that may not reflect the naturalistic patient language encountered in deployment.

OBJECTIVE
We aimed to compare eight AI architectures on hazard detection and triage appropriateness using both physician-created scenarios and authentic patient messages from a Medicaid population health program.

METHODS
This retrospective evaluation study compared AI architectures on two held-out test sets: 200 physician-created triage scenarios and 2,000 real-world patient messages from a multistate Medicaid population health program (January 2023 to November 2025). Architectures included decision-theoretic controllers using conservative Q-learning, constellation architectures, rule-based guardrails, classical machine learning baselines (XGBoost, logistic regression), and large language models (GPT-5.1, DeepSeek-R1) with and without retrieval-augmented generation. Three board-certified physicians established ground truth for clinical safety hazards and appropriate triage actions (Fleiss kappa=0.82).

RESULTS
Real-world messages differed substantially from physician scenarios, featuring higher rates of colloquialisms (1410/3000, 47% vs 15/200, 8%), abbreviations (930/3000, 31% vs 24/200, 12%), and implicit context (690/3000, 23% vs 9/200, 5%). Decision-theoretic controllers achieved the highest sensitivity (0.727; 95% CI 0.655-0.789) while maintaining specificity (0.728; 95% CI 0.705-0.750). Large language models showed severe performance degradation (a 34-48 percentage point sensitivity decline) when moving from physician scenarios to real-world messages. At clinically common operating points (specificity ≥0.70), decision-theoretic and constellation architectures outperformed large language models by 30-35 percentage points in sensitivity. Retrieval-augmented generation and few-shot prompting provided minimal improvement.

CONCLUSIONS
Architectures with explicit safety mechanisms substantially outperformed large language models under real-world deployment conditions. The performance degradation on authentic patient communications reveals the brittleness of current large language model approaches to naturalistic language patterns. Safety evaluation using actual patient communications is essential before deploying AI triage systems.
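The RESULTS section reports operating-point metrics (sensitivity, specificity) with 95% confidence intervals. As a minimal sketch of how such figures are commonly computed (this is not the authors' code, and the confusion-matrix counts below are hypothetical, chosen only for illustration), using the standard Wilson score interval for a binomial proportion:

```python
import math


def wilson_ci(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion (z=1.96 -> 95% CI)."""
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - half, center + half


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts for illustration only (not taken from the study data).
sens, spec = sensitivity_specificity(tp=120, fn=45, tn=1400, fp=520)
lo, hi = wilson_ci(120, 165)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

The Wilson interval is preferred over the simple normal approximation for proportions near 0 or 1, which is relevant when sensitivity on rare hazards is estimated from modest positive counts.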
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,402 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,270 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,702 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,507 citations