This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Trauma triage performance of large language models on raw Turkish emergency notes: Artificial intelligence versus human expertise
Citations: 0
Authors: 9
Year: 2026
Abstract
Objectives: Large language models, such as GPT‐4o, have demonstrated potential in clinical decision‐making; however, their reliability in high‐stakes environments, including emergency department triage, remains uncertain. This study assessed the triage performance of GPT‐4o and GPT‐o1‐mini using raw Turkish trauma notes, focusing on accuracy, expert agreement, and severity of misclassification as determined by the Emergency Severity Index (ESI).

Methods: The GPT‐4o and GPT‐o1‐mini models were prompted with unstructured Turkish triage notes to retrospectively predict ESI levels. Model outputs were compared to expert‐labeled ESI scores. Performance was evaluated using F1 scores, accuracy, weighted error analysis, and interrater agreement (Cohen's kappa). Spearman correlation was used to assess associations between model predictions and ESI scores.

Results: A total of 5031 patients were analyzed based on age, sex, and raw triage notes. GPT‐4o achieved its best F1 scores in ESI‐3 (0.42) and ESI‐5 (0.57), whereas GPT‐o1‐mini performed better in ESI‐3 and ESI‐4. Both models achieved high overall accuracy in the binary classification of high‐acuity trauma cases (GPT‐4o: 91%, GPT‐o1‐mini: 93%). However, sensitivity was limited (GPT‐4o: 0.45; GPT‐o1‐mini: 0.29). GPT‐4o performed better in identifying critical cases (ESI 1–2), whereas GPT‐o1‐mini excelled in low‐acuity predictions (ESI 4–5). Agreement with expert ratings was fair (κ ≈ 0.22–0.25). The area under the curve was 0.69 (95% CI: 0.645–0; p ≤ 0.001) for GPT‐4o and 0.62 (95% CI: 0.577–0.669; p ≤ 0.001) for GPT‐o1‐mini, suggesting limited clinical reliability. Weighted error analysis showed that GPT‐o1‐mini resulted in less severe misclassifications.

Conclusion: Neither GPT‐4o nor GPT‐o1‐mini achieved expert‐level performance in trauma triage, particularly for high‐acuity cases. Although GPT‐4o showed slightly better accuracy and GPT‐o1‐mini produced fewer severe errors, both models require substantial refinement before they can be considered for clinical use.
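The abstract reports high binary accuracy together with low sensitivity and only fair kappa agreement. The sketch below illustrates how such a pattern can arise when high‐acuity cases are rare. It is a minimal illustration on invented toy labels, not the study's data or code; the function names `binary_metrics` and `cohen_kappa` and the 10-in-100 prevalence are assumptions made for this example.

```python
from collections import Counter


def binary_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity) for binary labels, where 1 = high acuity."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Sensitivity is undefined when there are no positive cases.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, sensitivity


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the product of each rater's marginal frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:
        return float("nan")  # both raters used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: 100 patients, 10 of them truly high acuity.
# The "model" finds 4 of the 10 and raises 2 false alarms.
expert = [1] * 10 + [0] * 90
model = [1] * 4 + [0] * 6 + [1] * 2 + [0] * 88

accuracy, sensitivity = binary_metrics(expert, model)
kappa = cohen_kappa(expert, model)
print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} kappa={kappa:.2f}")
```

In this toy setting the many correctly classified low‐acuity patients dominate accuracy, while more than half of the high‐acuity patients are still missed. That is why sensitivity and chance‐corrected agreement are more informative than accuracy for triage.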
Similar works
A new method of classifying prognostic comorbidity in longitudinal studies: Development and validation
1987 · 49,113 citations
Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990–2017: a systematic analysis for the Global Burden of Disease Study 2017
2018 · 13,773 citations
Global, regional, and national incidence, prevalence, and years lived with disability for 328 diseases and injuries for 195 countries, 1990–2016: a systematic analysis for the Global Burden of Disease Study 2016
2017 · 13,414 citations
The injury severity score: a method for describing patients with multiple injuries and evaluating emergency care.
1974 · 8,023 citations
Global, regional, and national incidence, prevalence, and years lived with disability for 310 diseases and injuries, 1990–2015: a systematic analysis for the Global Burden of Disease Study 2015
2016 · 7,311 citations