OpenAlex · Updated hourly · Last updated: 17.04.2026, 20:16

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Where Are We Now? Benchmarking Large Language Models (LLMs) in Computed Tomography (CT)-Based Detection of Intracranial Hemorrhage

2026 · 0 citations · Cureus · Open Access

Citations: 0 · Authors: 12 · Year: 2026

Abstract

Introduction: Rapid computed tomography (CT) interpretation for intracranial hemorrhage is vital for timely care. Large language models (LLMs) have advanced rapidly in image analysis, and high accuracy in medical imaging interpretation has been claimed for some of them. This study evaluated whether LLMs such as Grok-2, ChatGPT-4o, and Gemini 1.5 Flash can outperform a medical student in detecting and classifying intracranial hemorrhages.

Methods: Non-contrast, axial CT head scans were sourced from the Radiological Society of North America (RSNA) 2019 database, in which each slice is annotated by expert neuroradiologists. A random sample of 400 scans was selected, consisting of 200 normal cases and 200 hemorrhage cases, with 40 cases representing each major hemorrhage subtype. Grok-2, ChatGPT-4o, Gemini 1.5 Flash, and a blinded medical student were each given an image and a prompt to determine (1) whether an intracranial hemorrhage was present and (2) the specific type of hemorrhage. McNemar’s test was used to compare paired classification accuracies, and Cohen’s kappa was used to measure inter-rater agreement.

Results: LLM accuracy in detecting hemorrhage ranged from 59.3% to 61.0%, with Grok-2 showing the highest specificity and Gemini 1.5 Flash the highest sensitivity. The medical student outperformed all LLMs in accuracy and specificity. Subarachnoid hemorrhages were the hardest to detect. Agreement was lowest between Grok-2 and the human reviewer (κ = 0.0637).

Conclusion: Current general-purpose LLMs demonstrate moderate but inconsistent ability to detect and classify intracranial hemorrhages and underperform a human medical student. None of the LLMs matched human specificity or accuracy. Refinement of task-specific systems may be required to enhance clinical applicability in neuroimaging.
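The statistics named in the Methods (sensitivity, specificity, accuracy, Cohen's kappa, and McNemar's test) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the eight labels are made-up toy data and not values from the study, and the exact binomial form of McNemar's test is used because the abstract does not say which variant the authors applied.

```python
from math import comb

def sensitivity_specificity_accuracy(truth, pred):
    """Binary labels: 1 = hemorrhage present, 0 = normal."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(truth)

def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same cases."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    return (observed - expected) / (1 - expected)

def mcnemar_exact(truth, pred_a, pred_b):
    """Two-sided exact McNemar p-value for paired correctness (assumed variant)."""
    # Discordant pairs: cases where exactly one of the two readers is correct.
    a_only = sum(1 for t, a, b in zip(truth, pred_a, pred_b) if a == t and b != t)
    b_only = sum(1 for t, a, b in zip(truth, pred_a, pred_b) if a != t and b == t)
    n = a_only + b_only
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)

# Toy labels only (hypothetical, not study data).
truth    = [1, 1, 1, 1, 0, 0, 0, 0]
reader_a = [1, 1, 1, 0, 0, 0, 0, 1]
reader_b = [1, 1, 0, 0, 1, 1, 0, 0]

print(sensitivity_specificity_accuracy(truth, reader_a))
print(cohen_kappa(reader_a, reader_b))
print(mcnemar_exact(truth, reader_a, reader_b))
```

Kappa corrects raw agreement for the agreement expected by chance, so a value near zero, like the κ = 0.0637 reported in the Results, indicates agreement barely above chance even when two readers have similar accuracy.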

Similar works