OpenAlex · Updated hourly · Last updated: 20.03.2026, 04:26

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Is it time for the neurologist to use Large Language Models in everyday practice?

2025 · 1 citation · Open Access

Citations: 1 · Authors: 10 · Year: 2025

Abstract

Large Language Models (LLMs) such as ChatGPT and Gemini are gaining momentum in healthcare for their diagnostic potential. However, their real-world applicability in specialized medical fields such as neurology remains inadequately explored. Using these tools in everyday diagnostic practice depends on evaluating their ability to support the clinician in assessing the patient, identifying possible diagnoses, and designing the diagnostic pathway. To this end, in this study we (1) examined the available literature on the evaluation of LLMs in neurological diagnosis, to determine whether the methodologies applied were adequate for translating LLM use into everyday practice, and (2) designed and performed an experiment evaluating the diagnostic accuracy and clinical recommendations of ChatGPT-3.5 and Gemini against those of neurologists, using real-world clinical cases presented according to everyday diagnostic practice. Within the vast literature on LLM applications in neurology, only 24 studies reported experiences using LLMs in clinical neurology. The experiments reported showed a heterogeneous scenario of prompt engineering and input formats. While responses to structured prompts are well documented, studies using real-world clinical scenarios, everyday workflows, and routine practice are lacking. We therefore conducted a real-world experiment using a cohort of 28 anonymized patient records from the neurology department of the ASST Santi Paolo e Carlo Hospital (Milan, Italy). Cases were presented to ChatGPT-3.5 and Gemini in a way that replicated typical clinical workflows. Diagnostic accuracy and the appropriateness of recommended diagnostic tests were assessed against discharge diagnoses and neurologists' performance. Neurologists achieved a diagnostic accuracy of 75%, outperforming ChatGPT-3.5 (54%) and Gemini (46%).
Both LLMs exhibited difficulties in nuanced clinical reasoning and over-prescribed diagnostic tests in 17–25% of cases. Despite their ability to generate structured recommendations, they struggled with complex or ambiguous presentations, requiring additional prompts in some cases. We therefore conclude that LLMs have potential as supportive tools in neurology but currently lack the depth required for nuanced clinical decision-making. These findings emphasize the need for further refinement of LLMs and for evaluation methodologies that reflect the complexities of real-world neurology practice.
