This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Blind Expert Evaluation of Open-Weight LLMs for Thyroid Cancer Patient Education in a Non-English Setting: GPT-OSS-20B vs MedGemma-27B-Instruct
Citations: 0
Authors: 6
Year: 2026
Abstract
<bold>Background:</bold> Non-English and resource-constrained clinical contexts are underrepresented in current large language model (LLM) benchmarking. Open-weight LLMs are increasingly used for patient education, yet it remains unclear whether medical domain specialization improves patient-facing answers in non-English settings, particularly when models are deployed locally. We compared a general-purpose open-weight model (GPT-OSS-20B) with a medically specialized open-weight model (MedGemma-27B-Instruct) for thyroid cancer patient education in Turkish.
<bold>Methods:</bold> Sixty Turkish patient questions about thyroid cancer were answered by both models. Five endocrinologists, blinded to model identity and study hypotheses, rated each response on 5-point Likert scales for Accuracy, Completeness, Clarity, Clinical Utility, and Satisfaction. Primary inference used per-question median ratings (N = 60 paired observations per criterion) with Wilcoxon signed-rank tests and Holm adjustment; effect size was rank-biserial correlation (RBC), and location shift was estimated with Hodges–Lehmann differences. Inter-rater reliability was assessed using ICC(2,k), and ceiling-aware summaries included perfect-score and top-box analyses.
<bold>Results:</bold> GPT-OSS-20B achieved higher question-level median ratings than MedGemma-27B-Instruct across all five criteria after Holm correction. The largest differences were observed for Satisfaction (median 5.0 vs 4.0; RBC = 0.788; Holm-adjusted p < 0.001) and Completeness (median 5.0 vs 4.0; RBC = 0.599; Holm-adjusted p < 0.001). Inter-rater reliability was good and comparable across models (ICC(2,k) ≈ 0.74–0.80). Ceiling-aware reporting showed consistently higher perfect-score proportions for GPT-OSS-20B across criteria, with the most pronounced gaps in Satisfaction and Completeness.
<bold>Conclusions:</bold> In this first head-to-head comparison of open-weight LLMs for thyroid cancer patient education in Turkish, the general-purpose GPT-OSS-20B significantly outperformed the medically fine-tuned MedGemma-27B-Instruct across all evaluation criteria. These findings suggest that medical domain specialization does not necessarily yield superior patient-facing educational content in non-English settings and that general-purpose open-weight models may offer advantages for patient education tasks in resource-constrained contexts.
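The primary analysis described in the Methods (paired Wilcoxon signed-rank tests on per-question median ratings, Holm adjustment across criteria, rank-biserial correlation, and Hodges–Lehmann shift estimates) can be sketched as follows. This is a minimal illustration with hypothetical toy ratings, not the study's data; the helper functions and the 10-question sample are assumptions for demonstration only.

```python
import numpy as np
from scipy.stats import wilcoxon, rankdata

def rank_biserial(diffs):
    # Matched-pairs rank-biserial correlation: (W+ - W-) / (W+ + W-),
    # using ranks of |differences| with zero differences dropped.
    d = np.asarray(diffs, dtype=float)
    d = d[d != 0]
    r = rankdata(np.abs(d))
    w_pos = r[d > 0].sum()
    w_neg = r[d < 0].sum()
    return (w_pos - w_neg) / (w_pos + w_neg)

def hodges_lehmann(diffs):
    # Hodges-Lehmann location shift: median of Walsh averages
    # (pairwise means, including each difference with itself).
    d = np.asarray(diffs, dtype=float)
    i, j = np.triu_indices(len(d))
    return float(np.median((d[i] + d[j]) / 2.0))

def holm(pvals):
    # Holm step-down adjustment: sort p ascending, multiply the k-th
    # smallest by (m - k), enforce monotonicity, cap at 1.
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    m = len(p)
    adj = np.empty(m)
    running = 0.0
    for k, idx in enumerate(order):
        running = max(running, (m - k) * p[idx])
        adj[idx] = min(running, 1.0)
    return adj

# Hypothetical per-question median ratings for one criterion (10 questions).
a = np.array([5, 5, 4, 5, 5, 4, 5, 5, 5, 4], dtype=float)  # model A
b = np.array([4, 4, 4, 3, 5, 4, 4, 3, 4, 4], dtype=float)  # model B

stat, p = wilcoxon(a, b)          # paired test on per-question medians
rbc = rank_biserial(a - b)        # effect size
hl = hodges_lehmann(a - b)        # location shift estimate
print(f"p={p:.4f}, RBC={rbc:.3f}, HL shift={hl:.2f}")
```

In the study, `holm` would be applied to the five per-criterion Wilcoxon p-values jointly; the snippet shows a single criterion for brevity.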