OpenAlex · Updated hourly · Last updated: 13.05.2026, 02:22

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comparative analysis of GPT-4o and GPT-4o1 in Internal Medicine decision-making

2026 · 0 citations · Asian Journal of Internal Medicine · Open Access

Citations: 0 · Authors: 5 · Year: 2026

Abstract

Introduction: Artificial intelligence (AI) and large language models are increasingly being explored as support tools for human decision-making. OpenAI's GPT-4o and the newer GPT-4o1 have shown value in clinical diagnostic reasoning and treatment planning, but their ability to manage complex or ethically challenging cases is uncertain.

Objective: This study was conducted to evaluate and compare GPT-4o and GPT-4o1 in simulated clinical scenarios related to internal medicine.

Methods: A comparative analysis was conducted using six standardised internal medicine prompts, ranging from acute emergencies to complex multi-morbidity to ethical dilemmas. Responses from the two AI models were evaluated across five domains: reasoning and decision-making, clinical accuracy, clarity of communication, depth of explanation, and clinical utility. The responses were independently rated by six board-certified internal medicine specialists with ≥10 years of experience on a 1–5 Likert scale. The independent t-test was used to compare the mean scores of the two models: overall scores, cumulative scores for each domain across all scenarios, and scores for each domain within each scenario. A p-value <0.05 was considered statistically significant. Qualitative feedback was analysed thematically.

Results: GPT-4o1 achieved a significantly higher overall mean score than GPT-4o (3.79 vs 3.58, p=0.020). GPT-4o1 performed better across all domains and scored significantly higher in the clinical utility domain (p=0.045) and in the emergency scenario (depth of explanation p=0.013, clinical utility p=0.025). Expert feedback highlighted that GPT-4o1 generated structured, comprehensive, and evidence-based responses. GPT-4o mostly generated competent and ethically sensitive responses, while occasionally being vague or incomplete. Both models rarely produced inaccurate responses, although some responses lacked adaptation to the Sri Lankan clinical context.

Conclusion: GPT-4o1 demonstrated incremental improvements over GPT-4o, especially in clinical utility and emergency scenarios. AI models provide reasonable responses to simulated clinical scenarios, but they require broader validation, contextual adaptation, and ongoing human oversight before use in clinical practice.
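As a rough illustration of the statistical comparison described in the Methods (not the authors' actual data or code), the independent two-sample t statistic for expert Likert ratings of the two models can be computed as follows. The rating lists below are hypothetical examples, and the sketch assumes equal variances (Student's pooled-variance form):

```python
import math
import statistics

def independent_t(a, b):
    """Student's independent two-sample t statistic (equal variances assumed).

    Returns the t statistic and the degrees of freedom; the p-value
    would then be looked up from the t distribution with df degrees
    of freedom.
    """
    na, nb = len(a), len(b)
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Pooled sample variance across both groups
    sp2 = ((na - 1) * statistics.variance(a)
           + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    t = (mean_a - mean_b) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2

# Hypothetical 1-5 Likert ratings from six raters for one scenario
ratings_gpt4o1 = [4, 4, 5, 3, 4, 4]
ratings_gpt4o = [3, 4, 3, 3, 4, 3]
t, df = independent_t(ratings_gpt4o1, ratings_gpt4o)
```

In practice one would use a statistics package (e.g. `scipy.stats.ttest_ind`) to obtain the p-value directly; the hand-rolled version above only shows where the test statistic comes from.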


Topics

Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Explainable Artificial Intelligence (XAI)