This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
Are clinical improvements in large language models a reality? Longitudinal comparisons of ChatGPT models and DeepSeek-R1 for psychiatric assessments and interventions
Citations: 1
Authors: 6
Year: 2025
Abstract
BACKGROUND: Potential clinical applications for emerging large language models (LLMs; e.g. ChatGPT) are well-documented, and newer systems (e.g. DeepSeek) have attracted increasing attention. Yet, important questions endure about their reliability and cultural responsiveness in psychiatric settings. METHODS: This study explored the diagnostic accuracy, therapeutic appropriateness and cultural sensitivity of ChatGPT-4o, ChatGPT-4.5, and DeepSeek-R1 (all March 2025 versions). DeepSeek-R1 was evaluated for one of the first times in this context, and this also marks one of the first longitudinal inquiries into LLMs in psychiatry. Three psychiatric cases from earlier literature about sleep-related problems and co-occurring issues were utilised, allowing for cross-comparisons with a 2023 ChatGPT version, alongside culturally specific vignette adaptations. Thus, overall, outputs for six scenarios were derived and were subsequently qualitatively reviewed by four psychiatrists for their strengths and limitations. RESULTS: ChatGPT-4o, ChatGPT-4.5, and DeepSeek-R1 showed modest improvements over the 2023 ChatGPT model but still exhibited significant limitations. Communication was empathetic, and non-pharmacological advice typically adhered to evidence-based practices. Primary diagnoses were broadly accurate but often omitted somatic factors and comorbidities. Nevertheless, consistent with past findings, clinical reasoning worsened as case complexity increased; this was especially apparent for suicidality safeguards and risk stratification. Pharmacological recommendations frequently diverged from established guidelines, whilst cultural adaptations remained largely superficial. Finally, output variance was noted in several cases, and the LLMs occasionally failed to clarify their inability to prescribe medication.
CONCLUSION: Despite incremental advancements, ChatGPT-4o, ChatGPT-4.5 and DeepSeek-R1 were affected by major shortcomings, particularly in risk evaluation, adherence to evidence-based practice, and cultural awareness. Presently, we conclude that these tools cannot substitute for mental health professionals but may confer adjunctive benefits. Notably, DeepSeek-R1 did not fall behind its counterparts, warranting further inquiries in jurisdictions permitting its use. Equally, greater emphasis on transparency and prompt engineering would be necessary for safe and equitable LLM deployment in psychiatry.
Related works
Amazon's Mechanical Turk
2011 · 10,034 citations
The Epidemiology of Major Depressive Disorder
2003 · 7,969 citations
The Transtheoretical Model of Health Behavior Change
1997 · 7,710 citations
Acute and Longer-Term Outcomes in Depressed Outpatients Requiring One or Several Treatment Steps: A STAR*D Report
2006 · 5,452 citations
Depression Is a Risk Factor for Noncompliance With Medical Treatment
2000 · 4,140 citations