OpenAlex · Updated hourly · Last updated: 18.03.2026, 14:50

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Reasoning-based LLMs surpass average human performance on medical social skills

2025 · 3 citations · Scientific Reports · Open Access

Citations: 3
Authors: 5
Year: 2025

Abstract

A significant portion of medical licensing examinations assesses key social skills such as communication, ethics, and professionalism, which are vital for quality patient care. Artificial intelligence (AI) has been increasingly integrated into healthcare systems in recent years, raising concerns among regulators, providers, and patients regarding AI's capacity to handle complex, human-centered scenarios. Previous work has shown that large language models (LLMs) such as GPT-3.5 and GPT-4 perform well on social skills questions from the United States Medical Licensing Examination (USMLE). However, newer models such as GPT-4o, Gemini 1.5 Pro, and o1 have since been introduced, the last of which is designed to mimic human thinking through "chain of thought" reasoning, unlike other LLMs that provide instantaneous answers. The impact of reasoning on LLMs' ability to navigate scenarios requiring social skills remains unclear. Here, we evaluate five LLMs (GPT-4, GPT-4o, Gemini 1.5 Pro, o1-preview, and its full release, o1) on forty USMLE-style social skills questions from the UWORLD question bank, covering four categories: communication & interpersonal skills, healthcare policy & economics, system-based practice & quality improvement, and medical ethics & jurisprudence. After each LLM answered, it received an "Are you sure?" follow-up prompt to test consistency. Our results show that o1, the reasoning model, ranked first with 39 of 40 correct final answers (97.5%). GPT-4o and Gemini 1.5 Pro tied for second place (87.5%), followed by o1-preview (77.5%) and, lastly, GPT-4 (75%). All LLMs surpassed the UWORLD question bank's 64% average. Domain-specific analysis revealed that, despite equal overall scores, GPT-4o and Gemini 1.5 Pro, developed by two different companies, had different strengths: GPT-4o performed best in communication & interpersonal skills and patient safety, while Gemini 1.5 Pro achieved perfect scores in healthcare policy & economics, system-based practice & quality improvement, and medical ethics & jurisprudence. Although o1-preview demonstrated strong initial performance, its inconsistency under skepticism, frequently changing answers and primarily to incorrect ones, reduced its overall ranking from second to fourth. This phenomenon was not observed in any other model, including the final o1 release, which maintained consistently high performance. These findings, together with prior work, highlight the potential of LLMs to answer knowledge-based social skills questions in a medical context effectively, sometimes surpassing average human performance. As LLMs continue to grow in size and sophistication, their performance is expected to improve further. In particular, the strong performance of reasoning-based LLMs suggests that such architectures hold significant promise for advancing AI's role in socially oriented tasks. These results demonstrate the growing potential of reasoning-based LLMs to complement and enhance clinical training, medical education, and patient care.
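The evaluation protocol the abstract describes (each model answers a multiple-choice question, is challenged with "Are you sure?", and only its final answer is scored) can be sketched as follows. This is a minimal illustration, not the authors' actual harness: `call_model` is a hypothetical stand-in for a real LLM API client, and the question format is invented.

```python
# Sketch of the described protocol: answer, "Are you sure?" challenge, score
# the final answer, and count answer changes. call_model is a hypothetical
# stub; a real harness would query an LLM API with the chat history.

def call_model(history):
    """Hypothetical LLM call: returns one answer letter for the chat history."""
    return "A"  # stub; a real client would generate this from the history

def evaluate(questions):
    """questions: list of (prompt, correct_letter) pairs.
    Returns (accuracy on final answers, number of changed answers)."""
    correct = changed = 0
    for prompt, answer_key in questions:
        history = [("user", prompt)]
        first = call_model(history)
        # Consistency check: challenge the model, then keep its final answer.
        history += [("assistant", first), ("user", "Are you sure?")]
        final = call_model(history)
        changed += int(first != final)
        correct += int(final == answer_key)
    return correct / len(questions), changed

accuracy, flips = evaluate([("Q1 ...", "A"), ("Q2 ...", "B")])
print(accuracy, flips)  # 0.5 0 with the stub above
```

Under this scheme, a model like o1-preview that flips to wrong answers after the challenge loses accuracy even when its initial answers were strong, which is exactly the ranking effect the abstract reports.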
