This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Human-Machine Agreement in Medical Ethics: Patient Autonomy Case-Based Evaluation of Large Language Models (Preprint)
Citations: 0
Authors: 10
Year: 2025
Abstract
<sec> <title>BACKGROUND</title> Medical ethics provides a moral framework for the practice of clinical medicine. Four principles, that is, beneficence, nonmaleficence, patient autonomy, and justice, form the cornerstones of medical ethics as it is practiced today. Of these 4 principles, patient autonomy holds a pivotal position and often takes precedence in ethical dilemmas that result from conflicts among the 4 principles. Its importance serves as a constant reminder to the clinician that the “needs of the patient come first.” With their remarkable ability to process natural language, large language models (LLMs) have recently pervaded nearly every aspect of human life, including medicine and medical ethics. Reliance on tools such as LLMs, however, poses fundamental questions in medical ethics, where human-like reasoning, emotional intelligence, and an understanding of local context and values are of utmost importance. </sec> <sec> <title>OBJECTIVE</title> While emphasizing the central role of the human factor, we undertake a bold venture to establish some confidence in LLMs as they pertain to medical ethics, not only by evaluating the status quo of foundational LLMs but also by exploring ways to improve them using patient autonomy–based hypothetical cases. The literature today is certainly lacking in such ventures, and we believe projects such as ours must be revisited frequently in the field of LLMs, which is evolving at a rapid and unprecedented pace. </sec> <sec> <title>METHODS</title> We evaluated 3 foundational LLMs (ChatGPT, LLaMA, and Gemini) on hypothetical cases in patient autonomy. We used Cohen κ to compare LLM responses to the consensus from a physician panel. The McNemar test was used during the improvement phase and to report the final significance of the improved agreement of each LLM with the physician consensus. <i>P</i> values less than .05 were considered significant. 
An agreement with κ&lt;0 was designated as poor, 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1.00 as almost perfect. </sec> <sec> <title>RESULTS</title> There was slight to fair agreement between the foundational LLMs and the physician consensus. With iterative improvement techniques, this agreement rose to substantial or better (Cohen κ of 0.73-0.82). The degree of improvement was statistically significant (<i>P</i>=.006 for ChatGPT, <i>P</i>&lt;.001 for Gemini, and <i>P</i>&lt;.001 for LLaMA). </sec> <sec> <title>CONCLUSIONS</title> Although LLMs hold great potential for use in medicine, foundational LLMs should be used with an abundance of caution in domains such as medical ethics. With adequate human oversight in testing and with established techniques, LLM responses can be better aligned with human responses, even in the domain of medical ethics. </sec>
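The METHODS section relies on two standard statistics: Cohen κ for agreement between an LLM and the physician consensus, and the McNemar test for the significance of a paired before/after change. A minimal sketch of how both would be computed is below; the rater labels in the usage example are hypothetical and not the study's data, and an exact binomial McNemar variant is shown, which may differ from the authors' implementation.

```python
from collections import Counter
from math import comb

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

def interpret_kappa(k):
    """Map kappa to the qualitative bands quoted in the abstract."""
    if k < 0:
        return "poor"
    for bound, band in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if k <= bound:
            return band
    return "almost perfect"

def mcnemar_exact_p(b, c):
    """Exact two-sided McNemar p value from the discordant-pair counts b, c."""
    n = b + c
    p = 2 * sum(comb(n, k) for k in range(min(b, c) + 1)) * 0.5 ** n
    return min(p, 1.0)
```

For example, with hypothetical labels `llm = ["respect", "override", "respect", "respect", "override", "respect"]` and `panel = ["respect", "respect", "respect", "override", "override", "respect"]`, `cohen_kappa(llm, panel)` yields 0.25, which `interpret_kappa` reports as "fair", in line with the pre-improvement range the abstract describes.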
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,324 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,189 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,588 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,470 citations