This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Evaluation of ChatGPT as a supplementary tool for pituitary adenomas: An observational study based on simulated consultations
Citations: 2
Authors: 5
Year: 2025
Abstract
Chat Generative Pre-trained Transformer (ChatGPT), a large language model developed by OpenAI, has shown potential in healthcare communication and patient education. However, its performance in specialized medical domains, such as pituitary adenomas (PAs), remains unclear. Therefore, this study aimed to evaluate the reliability and consistency of ChatGPT in answering PA-related questions. We hypothesized that ChatGPT would demonstrate high reliability in responding to general patient-oriented queries but lower reliability for specialized clinical questions. A total of 256 PA-related questions were collected from patients and families, clinical practice guidelines, and medical question banks. Each question was input into ChatGPT (GPT-4, March 2025 version), and the generated responses were independently reviewed by 2 senior neurosurgeons. Any discrepancies in their assessments were resolved by a third neurosurgeon with over 30 years of clinical experience. Responses were categorized as completely correct, partially correct but usable, partially correct, or incorrect. Responses rated as completely correct or partially correct but usable were considered reliable. Consistency was assessed based on the stability of response quality across similar question types. Comparisons were made by question type (general vs professional) and source using univariate analysis. Among the 256 responses, 143 (55.8%) were completely correct, 68 (26.6%) were partially correct but usable, 19 (7.4%) were partially correct, and 26 (10.2%) were incorrect. Overall, 82.4% of the responses were considered reliable, and 68.4% demonstrated consistency. Reliability was significantly higher for general questions than for professional ones (95.0% vs 78.6%, OR = 5.182, 95% CI: 1.545-17.378, P = .003), and for guideline-derived questions than for question-bank-derived ones (100.0% vs 75.7%, OR = 1.321, 95% CI: 1.214-1.437, P = .017). Differences in consistency across subgroups were not statistically significant. ChatGPT exhibits high reliability and moderate consistency in answering PA-related questions, especially for general and guideline-based content. It may serve as a supplementary source of patient information but should not replace professional medical consultation, particularly in complex or surgical contexts. As this study was conducted in an artificial testing environment without validation in real patient consultations, the generalizability of the findings remains limited.
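The headline comparison can be reconstructed from the abstract alone: with 256 questions, reliability of 95.0% vs 78.6%, and OR = 5.182, the per-group counts work out to 57/60 reliable general questions and 154/196 reliable professional questions (these reproduce both percentages and the odds ratio exactly, although the paper does not state them). The following is a minimal sketch, assuming those back-solved counts and Fisher's exact test (the abstract does not name its univariate method), that recomputes the odds ratio:

```python
# Minimal sketch: recompute the general-vs-professional odds ratio.
# The counts below are back-solved from the abstract's percentages
# (95.0% of 60; 78.6% of 196) and are an assumption, not figures
# stated verbatim in the paper.
from scipy.stats import fisher_exact

# 2x2 contingency table: rows = question type, cols = [reliable, unreliable]
table = [[57, 3],      # general: 57/60 reliable (95.0%)
         [154, 42]]    # professional: 154/196 reliable (78.6%)

odds_ratio, p_value = fisher_exact(table)
print(f"OR = {odds_ratio:.3f}, p = {p_value:.4f}")
# OR = (57*42)/(3*154) ≈ 5.182, matching the reported value. The paper's
# univariate test may differ (e.g., chi-square), so the p-value here is
# illustrative rather than a reproduction of P = .003.
```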
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,400 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,261 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,695 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,506 citations