This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Comparative evaluation of AI architectures for medical triage safety: a real-world validation study (Preprint)
Citations: 0
Authors: 7
Year: 2026
Abstract
BACKGROUND
Health care organizations increasingly deploy artificial intelligence (AI) systems for medical triage, yet systematic comparisons of architectural approaches on real-world patient communications remain limited. Most validation studies rely on physician-created vignettes that may not reflect the naturalistic patient language encountered in deployment.

OBJECTIVE
We aimed to compare eight AI architectures on hazard detection and triage appropriateness using both physician-created scenarios and authentic patient messages from a Medicaid population health program.

METHODS
This retrospective evaluation study compared AI architectures on two held-out test sets: 200 physician-created triage scenarios and 2,000 real-world patient messages from a multistate Medicaid population health program (January 2023 to November 2025). Architectures included decision-theoretic controllers using conservative Q-learning, constellation architectures, rule-based guardrails, classical machine learning baselines (XGBoost, logistic regression), and large language models (GPT-5.1, DeepSeek-R1) with and without retrieval-augmented generation. Three board-certified physicians established ground truth for clinical safety hazards and appropriate triage actions (Fleiss kappa=0.82).

RESULTS
Real-world messages differed substantially from physician scenarios, featuring higher rates of colloquialisms (1410/3000, 47% vs 15/200, 8%), abbreviations (930/3000, 31% vs 24/200, 12%), and implicit context (690/3000, 23% vs 9/200, 5%). Decision-theoretic controllers achieved the highest sensitivity (0.727; 95% CI 0.655-0.789) while maintaining specificity (0.728; 95% CI 0.705-0.750). Large language models showed severe performance degradation (a 34-48 percentage point sensitivity decline) when moving from physician scenarios to real-world messages. At clinically common operating points (specificity ≥0.70), decision-theoretic and constellation architectures outperformed large language models by 30-35 percentage points in sensitivity. Retrieval-augmented generation and few-shot prompting provided minimal improvement.

CONCLUSIONS
Architectures with explicit safety mechanisms substantially outperformed large language models under real-world deployment conditions. The performance degradation on authentic patient communications reveals the brittleness of current large language model approaches to naturalistic language patterns. Safety evaluation using actual patient communications is essential before deploying AI triage systems.
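The RESULTS section reports operating-point metrics (sensitivity, specificity) with 95% confidence intervals. As a minimal sketch of how such figures are commonly computed (this is not the authors' code, and the confusion-matrix counts below are hypothetical, chosen only for illustration), using the standard Wilson score interval for a binomial proportion:

```python
import math


def wilson_ci(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion (z=1.96 -> 95% CI)."""
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - half, center + half


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts for illustration only (not taken from the study data).
sens, spec = sensitivity_specificity(tp=120, fn=45, tn=1400, fp=520)
lo, hi = wilson_ci(120, 165)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

The Wilson interval is preferred over the simple normal approximation for proportions near 0 or 1, which is relevant when sensitivity on rare hazards is estimated from modest positive counts.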
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,402 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,270 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,702 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,507 citations