This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Evaluating the Reliability and Utility of GPT-4o as a Medical Expert Across Different Interaction Modes

2025 · 0 citations

Open full text at the publisher

Year: 2025

Abstract

Generative Large Language Models (LLMs) such as GPT-4o are increasingly explored for use in clinical decision support, yet their real-world reliability and utility remain underexplored, especially in domains where factual accuracy is critical. This paper presents an empirical evaluation of GPT-4o in a medical expert role, focusing on its ability to provide clinically useful and reliable responses across varying degrees of external control. We evaluate three interaction strategies: Default Interaction, Hard Prompting, and Retrieval-Augmented Generation (RAG), using a curated dataset of 70 authentic medical queries authored by practicing physicians. Practicing medical professionals evaluate the generated responses along two key dimensions: factual reliability and clinical utility. Our findings reveal critical trade-offs across the control spectrum: while RAG and Hard Prompting offer more constrained and verifiable responses, the Default Interaction approach achieved the highest combined ratings for reliability and clinical utility and showed the lowest abstention rate among the evaluated strategies. This study contributes a practical evaluation framework for assessing the potential of LLMs as medical experts and offers actionable insights for their deployment in healthcare contexts where factual precision, usability, and trust are paramount.

Topics

Artificial Intelligence in Healthcare and Education; Machine Learning in Healthcare; Topic Modeling
