This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Diagnostic accuracy and bias in open access and subscription-based large language models for multiple sclerosis and neuromyelitis optica spectrum disorder
0
Citations
10
Authors
2025
Year
Abstract
Overlapping clinical symptoms between people with multiple sclerosis (PwMS) and those with neuromyelitis optica spectrum disorder (PwNMOSD) can result in misdiagnosis. Large language models, such as ChatGPT, offer accessible tools for preliminary health guidance. We assessed the accuracy of open-access (GPT-3.5) and subscription-based (GPT-4) models in diagnosing MS and NMOSD, and the influence of key diagnostic inflection points (initial MRI findings and aquaporin-4 (AQP4) antibody testing) and subject demographics on model performance. PwMS and PwNMOSD were retrospectively identified within a single academic center, and structured clinical timelines were processed through GPT-3.5 and GPT-4. Seven digital derivatives per subject, varying race, ethnicity, and sex, were also created to assess demographic influences. ChatGPT provided one diagnosis after each timepoint, and diagnostic accuracy was determined using mixed-effects logistic regression. A total of 98 PwMS and 157 PwNMOSD were included, generating 4,080 ChatGPT conversations across models and digital derivatives. GPT-4 demonstrated higher diagnostic accuracy than GPT-3.5 for both MS (OR=2.67) and NMOSD (OR=1.31). Accuracy improved as the clinical timeline progressed, although GPT-4 paradoxically performed worse after the initial MRI report for MS cases (OR=0.56). For PwMS, diagnostic accuracy was lower in males (OR=0.81) and in older individuals (OR=0.56 per 10-year age increase). Conversely, for PwNMOSD, accuracy was higher for African American (OR=1.30) and Asian (OR=1.38) individuals. GPT-4 thus demonstrated higher diagnostic accuracy for both diseases, but its superior performance was not uniform across demographic groups. Further, the paradoxical decline in accuracy after MRI interpretation in MS cases suggests context-dependent performance, and responsible interpretation remains necessary.
• ChatGPT-4 (paid model) outperformed ChatGPT-3.5 (open access) in diagnosing MS and NMOSD
• Accuracy improved as additional events on the clinical timeline were presented
• ChatGPT-4 paradoxically underperformed after MRI reports were presented in MS cases
• Males and certain racial groups showed lower diagnostic accuracy across both models
• Inconsistent output and demographic biases underscore limitations of ChatGPT
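The conversation count reported in the abstract can be checked with simple arithmetic. The sketch below is only an illustration, not the authors' code. It assumes that each subject contributes the original record plus the seven digital derivatives (eight versions in total) and that every version is run through both models. The abstract does not state this breakdown explicitly, and the function name is hypothetical.

```python
# Counts taken from the abstract.
N_MS = 98                      # people with MS
N_NMOSD = 157                  # people with NMOSD
N_DERIVATIVES = 7              # digital derivatives per subject
REPORTED_CONVERSATIONS = 4080  # total reported in the abstract

# Assumption: both models see every version of every subject.
MODELS = ("GPT-3.5", "GPT-4")


def expected_conversations(n_subjects, n_derivatives, n_models):
    """Conversations if each subject's original record plus all of its
    derivatives is processed once by every model (assumed design)."""
    versions_per_subject = 1 + n_derivatives  # original + derivatives
    return n_subjects * versions_per_subject * n_models


total = expected_conversations(N_MS + N_NMOSD, N_DERIVATIVES, len(MODELS))
print(total, total == REPORTED_CONVERSATIONS)
```

Under these assumptions, 255 subjects × 8 versions × 2 models reproduces the reported total of 4,080 conversations, which suggests (but does not prove) that this is how the count arises.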
Related works
Rating neurologic impairment in multiple sclerosis
1983 · 14,780 citations
Diagnostic criteria for multiple sclerosis: 2010 Revisions to the McDonald criteria
2011 · 9,759 citations
Diagnosis of multiple sclerosis: 2017 revisions of the McDonald criteria
2017 · 7,651 citations
New diagnostic criteria for multiple sclerosis: Guidelines for research protocols
1983 · 7,390 citations
Recommended diagnostic criteria for multiple sclerosis: Guidelines from the international panel on the diagnosis of multiple sclerosis
2001 · 6,924 citations