OpenAlex · Updated hourly · Last updated: 28.03.2026, 19:42

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Human-Machine Agreement in Medical Ethics: Patient Autonomy Case-Based Evaluation of Large Language Models (Preprint)

2025 · 0 citations · 10 authors · Open Access

Abstract

BACKGROUND: Medical ethics provides a moral framework for the practice of clinical medicine. Four principles, namely beneficence, nonmaleficence, patient autonomy, and justice, form the cornerstones of medical ethics as it is practiced today. Of these 4 principles, patient autonomy holds a pivotal position and often takes precedence in ethical dilemmas arising from conflicts among them. Its importance serves as a constant reminder to the clinician that the "needs of the patient come first." With their remarkable ability to process natural language, large language models (LLMs) have recently pervaded nearly every aspect of human life, including medicine and medical ethics. Reliance on tools such as LLMs, however, raises fundamental questions in medical ethics, where human-like reasoning, emotional intelligence, and an understanding of local context and values are of utmost importance.

OBJECTIVE: While emphasizing the central role of the human factor, we undertake to establish some confidence in LLMs as they pertain to medical ethics, not only by evaluating the status quo of foundational LLMs but also by exploring ways to improve them using patient autonomy–based hypothetical cases. Although the literature is currently lacking in such ventures, we believe projects such as ours must be frequently revisited in the field of LLMs, which is evolving at a pace that is both rapid and unprecedented.

METHODS: We evaluated 3 foundational LLMs (ChatGPT, LLaMA, and Gemini) on hypothetical cases in patient autonomy. We used Cohen κ to compare LLM responses to the consensus of a physician panel. The McNemar test was used during the improvement phase and to report the final significance of each LLM's improved agreement with the physician consensus. P values less than .05 were considered significant. Agreement with κ<0 was designated as poor, 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect.

RESULTS: There was slight to fair agreement between the foundational LLMs and the physician consensus. With iterative improvement techniques, this agreement rose to substantial or higher (Cohen κ of 0.73-0.82). The degree of improvement was statistically significant (P=.006 for ChatGPT, P<.001 for Gemini, and P<.001 for LLaMA).

CONCLUSIONS: Although LLMs hold great potential for use in medicine, an abundance of caution is needed when using foundational LLMs in domains such as medical ethics. With adequate human oversight in testing and the use of established techniques, LLM responses can be better aligned with human responses, even in the domain of medical ethics.
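The Cohen κ statistic and the interpretation bands used in METHODS can be sketched as follows. This is a minimal illustration, not the authors' code; the case labels below are hypothetical, and the study's actual responses and panel consensus are not public in this abstract.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of agreement.
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    pe = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (po - pe) / (1 - pe)

def interpret(kappa):
    """Interpretation bands as used in the abstract."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

# Hypothetical example: physician-panel consensus vs. one LLM on 10 cases,
# each judged "A" (respects patient autonomy) or "B" (violates it).
physicians = ["A", "A", "B", "A", "B", "B", "A", "A", "B", "A"]
llm        = ["A", "A", "B", "A", "B", "A", "A", "A", "B", "B"]
k = cohen_kappa(physicians, llm)
print(round(k, 2), interpret(k))  # → 0.58 moderate
```

Raw percent agreement here is 80%, but κ is only 0.58 because both raters use each label often enough that some agreement is expected by chance; this is why the abstract reports κ rather than raw agreement. The McNemar test used for the before/after comparison is omitted here for brevity (it is available, for example, as `statsmodels.stats.contingency_tables.mcnemar`).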
