This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Benchmarking four large language models on emergency rheumatology scenarios: evaluating AI in acute rheumatologic scenarios
Citations: 0 · Authors: 3 · Year: 2026
Abstract
Aim: Large language models (LLMs) are increasingly integrated into medical education and decision support systems. However, their capabilities in acute care settings, such as emergency rheumatology, require systematic evaluation. This study aimed to compare the educational performance of four LLMs (ChatGPT-4o, DeepSeek v3.2, Gemini 2.5 Pro, and Perplexity Academic) across five domains: clinical accuracy, safety, diagnostic reasoning, realism, and use of evidence.

Materials and Methods: Each model generated responses to 20 standardized emergency rheumatology scenarios. Two board-certified rheumatologists independently evaluated the outputs using a 10-point scoring rubric. Inter-rater reliability was calculated using the intraclass correlation coefficient (ICC). Differences among models were assessed using Friedman tests with post-hoc Wilcoxon signed-rank tests corrected for multiple comparisons. A Likert scale was also used to assess scenario complexity and educational utility.

Results: ChatGPT-4o and DeepSeek v3.2 achieved the highest average total scores (mean: 7.75 each), significantly outperforming Perplexity Academic (mean: 6.70; p < 0.001). Although both models showed higher mean scores than Gemini 2.5 Pro (p ≈ 0.03), these differences were not statistically significant after correction for multiple comparisons. DeepSeek v3.2 showed slightly greater performance consistency. ICC analyses confirmed high inter-rater agreement across all domains (ICC > 0.80).

Conclusion: ChatGPT-4o and DeepSeek v3.2 demonstrated superior clinical reasoning, safety, and educational utility in emergency rheumatology scenarios. These findings support their potential role as adjunctive tools in medical training, provided expert oversight and validation mechanisms are in place.
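The statistical workflow described in Materials and Methods (an omnibus Friedman test followed by pairwise Wilcoxon signed-rank tests with a multiple-comparison correction) can be sketched in Python. This is a minimal illustration, not the study's analysis: the scores are simulated around the reported means, and Holm correction is assumed since the abstract does not name the specific correction used.

```python
import numpy as np
from scipy import stats
from itertools import combinations

rng = np.random.default_rng(0)
# Simulated 10-point rubric scores for 20 scenarios x 4 models.
# Means loosely match the abstract; the data are illustrative only.
scores = {
    "ChatGPT-4o":          rng.normal(7.75, 0.7, 20).clip(0, 10),
    "DeepSeek v3.2":       rng.normal(7.75, 0.5, 20).clip(0, 10),
    "Gemini 2.5 Pro":      rng.normal(7.20, 0.8, 20).clip(0, 10),
    "Perplexity Academic": rng.normal(6.70, 0.9, 20).clip(0, 10),
}

# Omnibus Friedman test across the four related samples
# (each scenario scored by every model).
chi2, p_friedman = stats.friedmanchisquare(*scores.values())

# Post-hoc pairwise Wilcoxon signed-rank tests.
pairs = list(combinations(scores, 2))
raw_p = [stats.wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]

# Holm step-down correction (assumed; the abstract only says
# "corrected for multiple comparisons").
m = len(raw_p)
order = np.argsort(raw_p)
adj_p = [0.0] * m
running_max = 0.0
for rank, idx in enumerate(order):
    running_max = max(running_max, (m - rank) * raw_p[idx])
    adj_p[idx] = min(1.0, running_max)

print(f"Friedman chi2 = {chi2:.2f}, p = {p_friedman:.4f}")
for (a, b), p in zip(pairs, adj_p):
    print(f"{a} vs {b}: Holm-adjusted p = {p:.4f}")
```

The step-down loop multiplies each raw p-value by the number of remaining hypotheses and enforces monotonicity, which is the standard Holm procedure; `statsmodels.stats.multitest.multipletests` offers the same correction off the shelf.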
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,349 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,219 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,631 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,480 citations