This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating the Reliability and Utility of GPT-4o as a Medical Expert Across Different Interaction Modes
Citations: 0
Authors: 3
Year: 2025
Abstract
Generative Large Language Models (LLMs) such as GPT-4o are increasingly explored for clinical decision support, yet their real-world reliability and utility remain underexplored, especially in domains where factual accuracy is critical. This paper presents an empirical evaluation of GPT-4o in a medical expert role, focusing on its ability to provide clinically useful and reliable responses under varying degrees of external control. We evaluate three interaction strategies: Default Interaction, Hard Prompting, and Retrieval-Augmented Generation (RAG), using a curated dataset of 70 authentic medical queries authored by practicing physicians. Practicing medical professionals evaluate the generated responses along two key dimensions: factual reliability and clinical utility. Our findings reveal critical trade-offs across the control spectrum: while RAG and hard prompting offer more constrained and verifiable responses, the default interaction approach achieved the highest combined ratings for reliability and clinical utility, along with the lowest abstention rate among the evaluated strategies. This study contributes a practical evaluation framework for assessing the potential of LLMs as medical experts and offers actionable insights for their deployment in healthcare contexts where factual precision, usability, and trust are paramount.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,292 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,143 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,539 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,452 citations