This is an overview page with metadata for this scientific publication. The full article is available from the publisher.
The quality of AI-generated answers for patient inquiries on urolithiasis: a comparative study of ChatGPT and Deepseek
0 citations · 10 authors · 2025
Abstract
Patients increasingly rely on easily accessible online resources, often ignoring source credibility. Large Language Models (LLMs) such as ChatGPT and DeepSeek provide free, near-human interaction on any imaginable topic, including medical conditions. While the benefits of this technology are evident and undeniable, concerns regarding its reliability and safety remain. In this study, we assessed the quality, safety, and reproducibility of responses generated by ChatGPT-4o mini and DeepSeek-R1 on urolithiasis, an increasingly prevalent condition with complex aetiology and diverse management options. We screened for the most frequently asked questions on kidney stone disease. A set of 76 questions was generated and divided into six categories: general information, risk factors, symptoms, diagnosis, treatment, and prognosis. Each question was entered into DeepSeek-R1 and ChatGPT-4o mini. Responses were independently evaluated by two attending urologists using a four-point scale based on clearly defined, pre-established criteria; discrepancies were resolved by a third expert. A cosine similarity index was applied to evaluate the degree to which LLM responses remained stable over time in wording and meaning. Direct comparisons of response lengths were also conducted. Initial analysis without category differentiation favoured DeepSeek-R1 (p < 0.001). The worst outcomes for both models were recorded in the “treatment” category, though DeepSeek-R1 retained a statistically significant advantage there. Moreover, the Chinese LLM provided more accurate responses in the “general information” category. The median cosine similarity scores for responses generated by DeepSeek-R1 and ChatGPT-4o mini were 0.7 (IQR 0.655–0.736) and 0.86 (IQR 0.805–0.9), respectively. Responses from DeepSeek-R1 were significantly shorter, with a median word count of 385.5 (IQR 330.5–448.5) compared to 672.5 (IQR 438–873.25) for ChatGPT-4o mini (p < 0.001). DeepSeek-R1 responses were also more consistent in length, exhibiting a narrower distribution than those of ChatGPT-4o mini. Among the evaluated LLMs available free of charge, DeepSeek-R1 emerged as the more accurate and concise source of patient information, while ChatGPT-4o mini demonstrated significantly greater response reproducibility. The reasoning process of DeepSeek-R1 has the potential to enhance patient comprehension of complex medical concepts, thereby improving treatment adherence. Nevertheless, limitations of LLMs, such as susceptibility to hallucinations and biases derived from their training data, must be carefully considered.
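For readers curious how a cosine similarity index and the median/IQR summaries reported above might be computed, the sketch below illustrates one possible approach in Python. The abstract does not specify the text representation the authors used; the TF-IDF vectors, the helper function `pairwise_similarity`, and the sample answer strings are assumptions made here for illustration, not the study's actual pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pairwise_similarity(answer_a: str, answer_b: str) -> float:
    """Cosine similarity between two answers to the same question,
    using a TF-IDF representation (an assumption, not the authors' method)."""
    vectors = TfidfVectorizer().fit_transform([answer_a, answer_b])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

# Hypothetical repeated answers to one patient question, generated at different times.
runs = [
    "Kidney stones form when urine contains more crystal-forming substances than it can dilute.",
    "Stones develop when crystal-forming substances in urine exceed what fluid intake can dilute.",
    "A kidney stone forms when minerals in concentrated urine crystallise and clump together.",
]

# Similarity of every pair of runs: higher scores mean more stable wording over time.
scores = [pairwise_similarity(runs[i], runs[j])
          for i in range(len(runs)) for j in range(i + 1, len(runs))]
q1, median, q3 = np.percentile(scores, [25, 50, 75])
print(f"similarity: median {median:.2f} (IQR {q1:.2f}-{q3:.2f})")

# Response length summarised the way the abstract reports it: median word count with IQR.
lengths = [len(r.split()) for r in runs]
q1, median, q3 = np.percentile(lengths, [25, 50, 75])
print(f"word count: median {median:.1f} (IQR {q1:.1f}-{q3:.1f})")
```

Fitting the TF-IDF vocabulary per answer pair keeps each comparison self-contained; building a shared vocabulary across all runs of a question would be an equally reasonable design choice.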
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,339 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,211 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,614 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,478 citations