OpenAlex · Updated hourly · Last updated: 31.03.2026, 12:53

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comparative Evaluation and Performance of Large Language Models in Clinical Infection Control Scenarios: A Benchmark Study

2025 · 0 citations · Healthcare · Open Access

Citations: 0 · Authors: 17 · Year: 2025

Abstract

<b>Background</b>: Infection prevention and control (IPC) in hospitals relies heavily on infection control nurses (ICNs), who manage complex consultations to prevent and control infections. This study evaluated large language models (LLMs) as artificial intelligence (AI) tools to support ICNs in IPC decision-making. The goal was to enhance the efficiency of IPC practices while maintaining the highest standards of safety and accuracy. <b>Methods</b>: A cross-sectional benchmarking study at Queen Mary Hospital, Hong Kong, assessed three LLMs (GPT-4.1, DeepSeek V3, and Gemini 2.5 Pro Exp) on 30 clinical infection control scenarios. Each model generated clarifying questions to understand the scenarios before providing IPC recommendations through two prompting methods: an open-ended inquiry and a structured template. Sixteen experts, including senior and junior ICNs and physicians, rated these responses on coherence, conciseness, usefulness and relevance, evidence quality, and actionability (1-10 scale). Quantitative and qualitative analyses assessed AI performance, reliability, and clinical applicability. <b>Results</b>: GPT-4.1 and DeepSeek V3 scored significantly higher on the composite quality scale, with adjusted means (95% CI) of 36.77 (33.98-39.57) and 36.25 (33.45-39.04), respectively, compared with Gemini 2.5 Pro Exp at 33.19 (30.39-35.99) (<i>p</i> < 0.001). GPT-4.1 led in evidence quality, usefulness, and relevance. Gemini 2.5 Pro Exp failed to generate responses in 50% of scenarios under structured prompt conditions. Structured prompting yielded significant improvements, primarily by enhancing evidence quality (<i>p</i> < 0.001). Evaluator background influenced scoring, with doctors rating outputs higher than nurses (38.83 vs. 32.06, <i>p</i> < 0.001).
However, a qualitative review revealed critical deficiencies across all models: for example, DeepSeek V3 recommended tuberculosis treatment based solely on a positive acid-fast bacilli (AFB) smear without considering nontuberculous mycobacteria, and Gemini 2.5 Pro Exp gave an impractical, noncommittal response regarding the de-escalation of precautions for <i>Candida auris</i>. These errors highlight potential safety risks and limited real-world applicability, despite generally positive scores. <b>Conclusions</b>: While GPT-4.1 and DeepSeek V3 deliver useful IPC advice, they are not yet reliable for autonomous use. Critical errors in clinical judgment and practical applicability show that LLMs cannot replace the expertise of ICNs. These technologies should serve as adjunct tools to support, rather than automate, clinical decision-making.
