This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
AI at the Sella Turcica: Multi-Model Large Language Model Evaluation in Pituitary Adenomas
Citations: 0
Authors: 6
Year: 2026
Abstract
Large language models (LLMs) are being explored as clinical decision-support tools in complex medical fields. However, their reliability and clinical usefulness in the multidisciplinary management of pituitary adenomas remain insufficiently evaluated using validated, clinician-based frameworks. Do LLMs differ in informational quality, clinical reasoning, and expert satisfaction when applied to pituitary adenoma–related clinical scenarios? A prospective comparative study evaluated three LLMs: ChatGPT-5.0, Claude Opus 4.1, and Gemini 2.5 Flash. A standardized prompt set covering general knowledge, surgical decision-making, endocrine evaluation, patient education, and MRI-based scenarios was submitted identically to each model. Outputs were anonymized and independently assessed by 10 board-certified physicians using three validated instruments: the Quality Assessment of Medical Artificial Intelligence (QAMAI), the Artificial Intelligence Performance Instrument (AIPI), and the Artificial Intelligence Satisfaction and Performance Evaluation Questionnaire (AISPE-Q). Claude Opus 4.1 achieved the highest performance across all major domains. Aggregate QAMAI scores were highest for Claude Opus 4.1 (4.39 ± 0.66), compared with ChatGPT-5.0 (4.12 ± 0.74) and Gemini 2.5 Flash (4.07 ± 0.76; p = 0.018). Clinical reasoning assessed by AIPI was superior for Claude Opus 4.1 versus Gemini 2.5 Flash and ChatGPT-5.0. Strong correlations were observed between informational quality, reasoning performance, and satisfaction. LLMs exhibit significant variability in performance when managing pituitary adenomas. Claude Opus 4.1 demonstrated the highest levels of informational quality, reasoning depth, and expert trust. While LLMs may serve as supportive adjuncts in multidisciplinary pituitary care, structured evaluation and expert oversight remain essential before clinical integration.
2 – Prospective comparative diagnostic accuracy study
• A prospective comparative framework was used to evaluate ChatGPT-5.0, Claude Opus 4.1, and Gemini 2.5 Flash across standardized pituitary adenoma–related clinical, radiological, and patient-education scenarios using the QAMAI, AIPI, and AISPE-Q instruments.
• Claude Opus 4.1 demonstrated superior aggregate performance, achieving the highest mean QAMAI composite score, the highest AIPI clinical reasoning score, and the highest AISPE-Q evaluator satisfaction rating.
• Domain-level analysis showed that Claude Opus 4.1 outperformed comparator models, particularly in accuracy, completeness, relevance, and usefulness of pituitary adenoma–related information.
• Significant positive correlations between QAMAI, AIPI, and AISPE-Q scores confirmed alignment between informational quality, clinical reasoning capacity, and expert-perceived utility.
• High inter-rater reliability across all models, with excellent agreement for Claude Opus 4.1, supports the methodological robustness and reproducibility of the expert-based evaluation approach.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,200 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,051 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,416 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,410 citations