OpenAlex · Updated hourly · Last updated: 20 March 2026, 11:47

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Symptom-Only Localization of Brainstem Ischemia: Large Language Models vs. Neurologists in 109 Diffusion-Weighted Imaging–Positive Cases: A Retrospective Study (Preprint)

2025 · 0 citations · Open Access

Citations: 0 · Authors: 14 · Year: 2025

Abstract

<sec> <title>BACKGROUND</title> Localizing brainstem ischemic lesions based solely on neurological symptoms is challenging because of the brainstem's complex anatomy and variable symptom presentation. Large language models (LLMs) are playing an emerging role in medical diagnostics by identifying patterns within clinical narratives. </sec>
<sec> <title>OBJECTIVE</title> This study evaluates the diagnostic accuracy of LLMs compared with neurologists in localizing brainstem ischemia from symptoms alone. </sec>
<sec> <title>METHODS</title> We retrospectively analyzed 109 patients with diffusion-weighted imaging (DWI)-confirmed acute brainstem ischemia. Three neurologists and six LLMs (GPT-5, GPT-4, GPT-4.1, GPT-4o, o3, o3 pro) predicted lesion localization (midbrain, pons, medulla) and laterality (left/right) based on clinical symptoms alone. Accuracy, Cohen’s κ, regional performance, and correlations with symptom count were assessed; pairwise χ² tests with false discovery rate (FDR) correction were performed to compare model performance. </sec>
<sec> <title>RESULTS</title> GPT-4 and GPT-4o achieved the highest overall accuracy (56.0 %, 95 % CI 46.1–65.5), significantly outperforming all neurologists (χ² = 7.4–20.1, p &lt; 0.01) and reasoning-based models. No significant differences were observed among GPT-4, GPT-4o, GPT-4.1, and GPT-5 (p &gt; 0.05). In regional analysis, significant effects were restricted to pontine infarcts, where GPT-4 (74 %) and GPT-4o (69 %) exceeded all neurologists (χ² = 6.4–18.3, p &lt; 0.01). For mesencephalic and medullary lesions, accuracies did not differ significantly (p &gt; 0.05). o3 pro performed worst overall (10 %, p &lt; 0.001). Cohen’s κ reached 0.29 for GPT-4o, and accuracy correlated with symptom count (r = 0.28, p &lt; 0.01). </sec>
<sec> <title>CONCLUSIONS</title> GPT-4 and GPT-4o outperformed experienced neurologists in this constrained diagnostic task. Accuracy remained modest, particularly for non-pontine lesions, and reasoning-augmented models provided no additional benefit. These findings highlight both the potential and the current limitations of LLMs in clinical reasoning, reinforcing the need for multimodal input and prospective validation. </sec>
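
The statistics named in the abstract (accuracy with a 95 % confidence interval, FDR-corrected pairwise comparisons) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not say which interval method or which FDR procedure the authors used, so the Wilson score interval and the Benjamini-Hochberg procedure are stand-ins; the count of 61 correct cases is inferred from the reported 56.0 % of 109; and the p-values are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95 %)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Assumption: 61 of 109 correct, inferred from the reported 56.0 %.
low, high = wilson_interval(61, 109)
print(f"accuracy = {61 / 109:.3f}, 95% CI = ({low:.3f}, {high:.3f})")

# Hypothetical raw p-values from pairwise comparisons, not taken from the study.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.27]))
```

Because the interval method is not stated in the abstract, the bounds printed by this sketch need not match the reported 46.1–65.5 exactly.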

Related works