This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
AI vs Human Performance in Conversational Hospital-Based Neurological Diagnosis
Citations: 3
Authors: 6
Year: 2025
Abstract
Background
Most evaluations of artificial intelligence (AI) in medicine rely on static, multiple-choice benchmarks that fail to capture the dynamic, sequential nature of clinical diagnosis. While conversational AI has shown promise in telemedicine, these systems rarely test the iterative decision-making process in which clinicians gather information, order tests, and refine diagnoses.

Methods
We developed DiagnosticXchange, a web-based platform simulating realistic clinical interactions between providers and specialist consultants. A 'nurse' agent responds to requests from human physicians or AI systems acting as diagnosticians. Sixteen neurological diagnostic challenges of varying complexity were drawn from diverse educational and peer-reviewed sources. We evaluated 14 neurologists at different training stages and multiple state-of-the-art large language models (LLMs) using efficiency metrics including diagnostic accuracy, procedural cost efficiency (based on CPT codes and hospital pricing), and time to diagnosis (using actual procedure durations). We also developed Gregory, a specialized multi-agent system that systematically generates differential diagnoses, challenges initial hypotheses, and strategically selects high-yield diagnostic tests.

Results
Human neurologists achieved 81% diagnostic accuracy (79% residents, 88% specialists) across 97 sessions; base LLMs ranged from 81% to 94%. Gregory achieved perfect diagnostic accuracy with markedly lower diagnostic costs (average $1,423; 95% CI: $450-$2,860) than human neurologists (average $3,041; 95% CI: $2,464-$3,677; p=0.008) and base LLMs (average $2,759; 95% CI: $2,137-$3,476; p=0.002). Time to diagnosis was also shorter with Gregory (23 days; 95% CI: 6-48) than with human neurologists (43 days; 95% CI: 31-58; p=0.002) and base models (41 days; 95% CI: 31-51; p=0.07). The platform revealed distinct diagnostic patterns: human users and some base LLMs frequently ordered broad, expensive testing, while Gregory employed targeted strategies that avoided unnecessary procedures without sacrificing thoroughness.

Conclusions
A well-designed multi-agent AI system outperformed both human physicians and base LLMs in diagnostic accuracy while reducing costs and time. DiagnosticXchange enables systematic evaluation of diagnostic efficiency and reasoning in realistic, interactive scenarios, offering a clinically relevant alternative to static benchmarks and a pathway toward more effective AI-assisted diagnosis.
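The iterative loop the abstract describes — propose a differential, challenge it, order a targeted test, narrow the differential — can be sketched in a few lines. This is a hypothetical toy illustration, not the paper's Gregory implementation: the diagnoses, findings, test names, and costs below are invented, and the "critic" and "nurse" agents are reduced to simple deterministic functions.

```python
from typing import Dict, Set, Tuple

# Toy knowledge base (invented): diagnosis -> findings positive for it.
KB: Dict[str, Set[str]] = {
    "migraine":   {"headache", "photophobia"},
    "meningitis": {"headache", "fever", "neck_stiffness"},
    "stroke":     {"focal_deficit"},
}

# Toy test menu (invented): test name -> (finding it checks, cost in dollars).
TESTS: Dict[str, Tuple[str, int]] = {
    "history":     ("photophobia", 0),
    "temperature": ("fever", 5),
    "neck_exam":   ("neck_stiffness", 20),
    "ct_head":     ("focal_deficit", 800),
}

def diagnose(true_dx: str) -> Tuple[Set[str], int]:
    """Iteratively order the cheapest informative test until the
    differential collapses; return (remaining differential, total cost)."""
    differential = set(KB)
    total_cost = 0
    while len(differential) > 1:
        # "Critic" step: keep only tests whose finding actually splits the
        # remaining differential (positive for some hypotheses, not all).
        useful = [
            (cost, name, finding)
            for name, (finding, cost) in TESTS.items()
            if 0 < sum(finding in KB[d] for d in differential) < len(differential)
        ]
        if not useful:
            break  # nothing left can discriminate; stop with what we have
        cost, name, finding = min(useful)  # cheapest informative test first
        total_cost += cost
        # "Nurse" agent: report the ordered test's result for the true case.
        result = finding in KB[true_dx]
        differential = {d for d in differential if (finding in KB[d]) == result}
    return differential, total_cost
```

With this toy data, `diagnose("meningitis")` first orders the free history check (ruling out migraine) and then the $5 temperature check (ruling out stroke), reaching a single diagnosis for $5 total — the point being that cheap, discriminating tests are preferred over broad expensive ones, the behavior the abstract attributes to Gregory.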
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,239 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,095 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,463 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,428 citations