This is an overview page with metadata for this scientific publication. The full article is available from the publisher.
A Systematic Performance Evaluation of Three Large Language Models in Answering Questions on Moderate Hyperthermia
Citations: 0
Authors: 23
Year: 2026
Abstract
Background
Large Language Models (LLMs) have demonstrated expert-level performance across many medical domains, suggesting potential utility in clinical practice. However, their reliability in the highly specialized domain of moderate hyperthermia (HT) remains unknown. We therefore evaluated the performance of three modern LLMs in answering HT-related questions.
Methods
We conducted an evaluation study by posing 40 open-ended questions (22 clinical and 18 physics-related) to three modern LLMs: DeepSeek-V3, Llama-3.3-70B-Instruct, and GPT-4o. Responses were blinded, randomized, and evaluated by 19 international experts with either a clinical or a physics background, both for quality (5-point Likert scale: 1=very bad, 2=bad, 3=acceptable, 4=good, 5=very good) and for potential harmfulness in clinical decision-making.
Results
A total of 1144 quality evaluations were collected. Mean quality scores were similar across models: 3.26 for DeepSeek, 3.18 for Llama, and 3.07 for GPT-4o, each corresponding to an "acceptable" rating. Across expert evaluations, responses were considered potentially harmful in 17.8% of cases for DeepSeek, 19.3% for Llama, and 15.3% for GPT-4o. Notably, despite "acceptable" mean scores, approximately 25% of responses were rated "bad" or "very bad", and potentially harmful answers occurred in roughly 15–19% of evaluations, indicating a non-trivial risk if these models are used without domain expertise.
Conclusion
Our findings indicate that the performance on HT questions of the LLM versions available at the time of investigation is only partially satisfactory. The proportion of poor-quality responses is too high and may lead non-domain experts to misinterpret the available clinical evidence and draw inappropriate clinical conclusions.
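As a concrete illustration of the aggregation reported in the Results, the minimal Python sketch below computes per-model mean Likert quality scores and harmful-response rates. This is not the authors' code; all ratings shown are made-up placeholders, not data from the study.

```python
# Minimal sketch of the score aggregation described in the abstract.
# Each rating is a pair: (quality on a 1-5 Likert scale, flagged-as-potentially-harmful).
# The values below are illustrative placeholders only.
from statistics import mean

ratings = {
    "DeepSeek-V3": [(4, False), (3, False), (2, True), (4, False)],
    "Llama-3.3-70B-Instruct": [(3, False), (3, True), (4, False), (2, False)],
    "GPT-4o": [(3, False), (4, False), (3, False), (2, True)],
}

for model, evals in ratings.items():
    quality = mean(score for score, _ in evals)             # mean Likert quality
    harmful = sum(flag for _, flag in evals) / len(evals)   # share flagged as harmful
    print(f"{model}: mean quality {quality:.2f}, potentially harmful {harmful:.1%}")
```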
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,402 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,270 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,702 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,507 citations
Authors
- Fabio Dennstaedt
- N. Cihoric
- Nicola Bachmann
- I. Filchenko
- Luc M. Berclaz
- H. Crezee
- Sergio Curto
- Pirus Ghadjar
- Boris Huebenthal
- Mark Hurwitz
- Petra Kok
- Lars H. Lindner
- Dietmar Marder
- Jason K. Molitoris
- Markus Notter
- Sultan Rahman
- Oliver Riesterer
- Mateusz Spalek
- Hana Dobšíček Trefná
- Thomas Zilli
- Dário B. Rodrigues
- M. Fuerstner
- Emanuel Stutz
Institutions
- University of Bern (CH)
- University Hospital of Bern (CH)
- LMU Klinikum (DE)
- Ludwig-Maximilians-Universität München (DE)
- Amsterdam Neuroscience (NL)
- University of Amsterdam (NL)
- Erasmus MC Cancer Institute (NL)
- Charité - Universitätsmedizin Berlin (DE)
- Immunologie-Zentrum Zürich (CH)
- Westchester Medical Center (US)
- Kantonsspital Aarau (CH)
- University of Maryland, Baltimore (US)
- Lindenhofspital (CH)
- National Institute of Oncology (HU)
- The Maria Sklodowska-Curie National Research Institute of Oncology (PL)
- Chalmers University of Technology (SE)
- Ente Ospedaliero Cantonale (CH)
- Università della Svizzera italiana (CH)