OpenAlex · Updated hourly · Last updated: 20.03.2026, 20:02

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Trauma triage performance of large language models on raw Turkish emergency notes: Artificial intelligence versus human expertise

2026 · 0 citations · Hong Kong Journal of Emergency Medicine · Open Access

Citations: 0 · Authors: 9 · Year: 2026

Abstract

Objectives: Large language models, such as GPT‐4o, have demonstrated potential in clinical decision‐making; however, their reliability in high‐stakes environments, including emergency department triage, remains uncertain. This study assessed the triage performance of GPT‐4o and GPT‐o1‐mini using raw Turkish trauma notes, focusing on accuracy, expert agreement, and severity of misclassification as determined by the Emergency Severity Index (ESI).

Methods: The GPT‐4o and GPT‐o1‐mini models were prompted with unstructured Turkish triage notes to retrospectively predict ESI levels. Model outputs were compared to expert‐labeled ESI scores. Performance was evaluated using F1 scores, accuracy, weighted error analysis, and interrater agreement (Cohen's kappa). Spearman correlation was used to assess associations between model predictions and ESI scores.

Results: A total of 5031 patients were analyzed based on age, sex, and raw triage notes. GPT‐4o achieved its best F1 scores in ESI‐3 (0.42) and ESI‐5 (0.57), whereas GPT‐o1‐mini performed better in ESI‐3 and ESI‐4. Both models achieved high overall accuracy in the binary classification of high‐acuity trauma cases (GPT‐4o: 91%; GPT‐o1‐mini: 93%). However, sensitivity was limited (GPT‐4o: 0.45; GPT‐o1‐mini: 0.29). GPT‐4o performed better in identifying critical cases (ESI 1–2), whereas GPT‐o1‐mini excelled in low‐acuity predictions (ESI 4–5). Agreement with expert ratings was fair (κ ≈ 0.22–0.25). The area under the curve was 0.69 (95% CI: 0.645–0; p ≤ 0.001) for GPT‐4o and 0.62 (95% CI: 0.577–0.669; p ≤ 0.001) for GPT‐o1‐mini, suggesting limited clinical reliability. Weighted error analysis showed that GPT‐o1‐mini produced less severe misclassifications.

Conclusion: Neither GPT‐4o nor GPT‐o1‐mini achieved expert‐level performance in trauma triage, particularly for high‐acuity cases. Although GPT‐4o showed slightly better accuracy and GPT‐o1‐mini produced fewer severe errors, both models require substantial refinement before they can be considered for clinical use.
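To illustrate the kind of evaluation the abstract describes, the sketch below computes Cohen's kappa, a distance-weighted error, and high-acuity (ESI 1–2) sensitivity on hypothetical toy data. The helper names, the toy labels, and the specific weighting scheme (absolute distance between ESI levels) are assumptions for illustration, not the study's actual protocol or data.

```python
from collections import Counter

def cohen_kappa(truth, pred):
    # Cohen's kappa: observed agreement corrected for agreement
    # expected by chance from the marginal label frequencies.
    n = len(truth)
    po = sum(t == p for t, p in zip(truth, pred)) / n
    ct, cp = Counter(truth), Counter(pred)
    pe = sum(ct[k] * cp.get(k, 0) for k in ct) / (n * n)
    return (po - pe) / (1 - pe)

def weighted_error(truth, pred):
    # Illustrative weighted error: mean absolute distance between
    # predicted and expert ESI levels, so a two-level miss
    # (e.g. ESI-5 predicted for a true ESI-3) counts double.
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

def high_acuity_sensitivity(truth, pred):
    # Sensitivity for the binary high-acuity task: of all expert
    # ESI 1-2 cases, how many did the model also rate ESI 1-2?
    positives = [(t, p) for t, p in zip(truth, pred) if t <= 2]
    return sum(p <= 2 for _, p in positives) / len(positives)

# Hypothetical toy data: expert ESI labels vs. model predictions
# (1 = most acute, 5 = least acute).
truth = [1, 2, 3, 3, 4, 5, 3, 2]
pred  = [2, 2, 3, 4, 4, 5, 3, 3]

print(f"kappa = {cohen_kappa(truth, pred):.3f}")          # 0.500
print(f"weighted error = {weighted_error(truth, pred):.3f}")  # 0.375
print(f"high-acuity sensitivity = {high_acuity_sensitivity(truth, pred):.3f}")  # 0.667
```

On these toy labels the kappa of 0.5 ("moderate") is well above the paper's reported 0.22–0.25 ("fair"), which conveys how far the models' agreement fell below a usable threshold.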
