OpenAlex · Updated hourly · Last updated: 09.04.2026, 04:07

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

A Systematic Performance Evaluation of Three Large Language Models in Answering Questions on Moderate Hyperthermia

2026 · 0 citations · medRxiv · Open Access

0 citations · 23 authors · Year: 2026

Abstract

Background: Large Language Models (LLMs) have demonstrated expert-level performance across many medical domains, suggesting potential utility in clinical practice. However, their reliability in the highly specialized domain of moderate hyperthermia (HT) remains unknown. We therefore evaluated the performance of three modern LLMs in answering HT-related questions.

Methods: We conducted an evaluation study by posing 40 open-ended questions (22 clinical and 18 physics-related) to three modern LLMs (DeepSeek-V3, Llama-3.3-70B-Instruct, and GPT-4o). Responses were blinded, randomized, and evaluated by 19 international experts with either a clinical or a physics background for quality (5-point Likert scale: 1=very bad, 2=bad, 3=acceptable, 4=good, 5=very good) and for potential harmfulness in clinical decision-making.

Results: A total of 1144 quality evaluations were collected. Mean quality scores were similar across models: DeepSeek 3.26, Llama 3.18, and GPT-4o 3.07, corresponding to an "acceptable" rating. Across expert evaluations, responses were considered potentially harmful in 17.8% of cases for DeepSeek, 19.3% for Llama, and 15.3% for GPT-4o. Notably, despite "acceptable" mean scores, approximately 25% of responses were rated "bad" or "very bad", and potentially harmful answers occurred in ~15–19% of evaluations, indicating a non-trivial risk if the models are used without domain expertise.

Conclusion: Our findings indicate that the performance of the LLM versions available at the time of investigation on HT questions is only partially satisfactory. The proportion of poor-quality responses is too high and may lead non-domain experts to misinterpret the available clinical evidence and draw inappropriate clinical conclusions.
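The aggregation reported in the abstract (per-model mean Likert score and share of "potentially harmful" flags) reduces to simple averaging over expert evaluations. A minimal sketch with invented example ratings (the data below is illustrative, not the study's data):

```python
# Sketch of the per-model aggregation: mean 5-point Likert quality score
# and percentage of evaluations flagged as potentially harmful.
# All ratings below are made up for illustration only.
from statistics import mean

# model -> list of (likert_score, harmful_flag) from individual experts
ratings = {
    "DeepSeek-V3": [(4, False), (3, True), (2, False), (4, False)],
    "GPT-4o":      [(3, False), (3, False), (4, True), (2, False)],
}

for model, evals in ratings.items():
    scores = [score for score, _ in evals]
    harmful = [flag for _, flag in evals]
    print(f"{model}: mean quality {mean(scores):.2f}, "
          f"harmful {100 * sum(harmful) / len(harmful):.1f}%")
```

With the example ratings above, DeepSeek-V3 would average 3.25 with 25.0% harmful flags, and GPT-4o 3.00 with 25.0% harmful flags.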
