This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating the performance of large language models in sarcopenia-related patient queries: a foundational assessment for patient-centered validation
Citations: 0
Authors: 13
Year: 2026
Abstract
Background: Large Language Models (LLMs) have shown promise in clinical applications, but their performance in specialized areas such as sarcopenia remains understudied.

Methods: A panel of sarcopenia clinician researchers developed 20 standardized patient-centered questions across six clinical domains. Each question was input into three LLMs (ChatGPT, Deepseek, and Gemini), and responses were anonymized, randomized, and independently assessed by three clinician researchers. Accuracy was graded on a four-point scale ("Poor" to "Excellent"), and comprehensiveness was evaluated for responses rated "Good" or higher using a five-point scale.

Results: All LLMs achieved good performance, with no responses rated "Poor" in any domain. Deepseek produced the longest and most detailed responses (mean word count: 583.75 ± 71.89) and showed superior performance in "risk factors" and "prognosis." ChatGPT provided the most concise replies (359.5 ± 87.89 words, p = 0.0011) but achieved the highest proportion of "Good" ratings (90%). Gemini excelled in "pathogenesis" and "diagnosis" but received the most critical feedback in "prevention and treatment." Although trends in performance differences were noted, they did not reach statistical significance. Mean comprehensiveness scores were also similar across models (Deepseek: 4.017 ± 0.77; Gemini: 3.97 ± 0.88; ChatGPT: 3.953 ± 0.83; p > 0.05).

Conclusion: Despite minor differences in performance across domains, all three LLMs demonstrated acceptable accuracy and comprehensiveness when responding to sarcopenia-related queries. Their comparable results may reflect similarly recent training data and language capabilities. These findings suggest that LLMs could serve as a valuable tool in patient education and care for sarcopenia. This study provides an initial, expert-based assessment of LLM information quality regarding sarcopenia; while the responses demonstrated good accuracy, the evaluation focuses on content correctness from a clinical perspective. Future research must complement these findings by directly engaging older adult cohorts before clinical implementation can be considered. In any case, human oversight remains essential to ensure safe and appropriate assessment as well as individually tailored advice and management.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,214 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,071 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,429 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,418 citations
Authors
Institutions
- Chinese University of Hong Kong (HK)
- The University of Melbourne (AU)
- Neuroscience Research Australia (AU)
- New Generation University College (ET)
- Seoul National University Bundang Hospital (KR)
- National University College (PR)
- McGill University Health Centre (CA)
- Monash University (AU)
- Peking University (CN)
- Beijing Jishuitan Hospital (CN)
- Harbin Medical University (CN)