OpenAlex · Updated hourly · Last updated: 31.03.2026, 01:53

This is an overview page with metadata on this scholarly work. The full article is available from the publisher.

The quality of AI-generated answers for patient inquiries on urolithiasis: a comparative study of ChatGPT and Deepseek

2025 · 0 citations · Artificial Intelligence Review · Open Access
Open full text at the publisher

0 citations · 10 authors · Year: 2025

Abstract

Patients increasingly rely on easily accessible online resources, often ignoring source credibility. Large Language Models (LLMs) such as ChatGPT and DeepSeek provide free, near-human interaction on any imaginable topic, including medical conditions. While the benefits provided by this technology are evident and undeniable, concerns regarding its reliability and safety remain. In this study, we assessed the quality, safety, and reproducibility of responses generated by ChatGPT-4o mini and DeepSeek-R1 on urolithiasis, an increasingly prevalent condition with complex aetiology and diverse management options. We screened for the most frequently asked questions on kidney stone disease. A set of 76 questions was generated and divided into six categories: general information, risk factors, symptoms, diagnosis, treatment, and prognosis. Each question was entered into DeepSeek-R1 and ChatGPT-4o mini. Responses were independently evaluated by two attending urologists using a four-point scale based on clearly defined, pre-established criteria. Discrepancies were resolved by a third expert. The cosine similarity index was applied to evaluate the degree to which LLM responses remained stable over time in wording and meaning. Direct comparisons of response lengths were also conducted. Initial analysis with no category differentiation favoured DeepSeek-R1 (p < 0.001). The worst outcomes for both models were recorded in the “treatment” category, yet with a statistically significant advantage for DeepSeek. Moreover, the Chinese LLM provided more accurate responses in the “general information” category. The median cosine similarity score for responses generated by DeepSeek-R1 and ChatGPT-4o mini was 0.7 (IQR 0.655–0.736) and 0.86 (IQR 0.805–0.9), respectively. Responses from DeepSeek-R1 were significantly shorter, with a median word count of 385.5 (330.5–448.5) compared to 672.5 (438–873.25) words for ChatGPT-4o mini (p < 0.001).
Additionally, DeepSeek-R1 responses were more consistent in length, exhibiting a narrower distribution than those of ChatGPT-4o mini. Among the evaluated LLMs available free of charge, DeepSeek-R1 emerged as the more accurate and concise source of patient information, while ChatGPT-4o mini demonstrated significantly greater response reproducibility. The reasoning process of DeepSeek-R1 has the potential to enhance patient comprehension of complex medical concepts, thereby improving treatment adherence. Nevertheless, limitations of LLMs, such as susceptibility to hallucinations and biases derived from their training data, must be carefully considered.
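The abstract does not specify how the cosine similarity index was computed; a minimal sketch, assuming a simple bag-of-words vectorisation over whitespace-tokenised answers (published studies often use TF-IDF or embedding vectors instead), could look like this. The example answer strings are hypothetical and not taken from the study:

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Bag-of-words cosine similarity between two texts (1.0 = identical wording)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the shared vocabulary only; absent words contribute 0.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

# Asking the same question at two time points and comparing the answers
# (hypothetical responses, for illustration only):
answer_t1 = "Kidney stones often cause flank pain and blood in the urine."
answer_t2 = "Kidney stones typically cause severe flank pain and hematuria."
score = cosine_similarity(answer_t1, answer_t2)
```

Under this scheme a score near 1 means the model repeated itself almost verbatim (as ChatGPT-4o mini tended to, median 0.86), while lower scores indicate more rewording between runs (DeepSeek-R1, median 0.7).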

Topics

Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Meta-analysis and systematic reviews