This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Comparative analysis of GPT-4o and GPT-4o1 in Internal Medicine decision-making
Citations: 0
Authors: 5
Year: 2026
Abstract
Introduction: Artificial intelligence (AI) and large language models are increasingly being explored as support tools for human decision-making. OpenAI’s GPT-4o and the newer GPT-4o1 have shown value in clinical diagnostic reasoning and treatment planning, but their ability to manage complex or ethically challenging cases remains uncertain.

Objective: To evaluate and compare GPT-4o and GPT-4o1 in simulated clinical scenarios related to internal medicine.

Methods: A comparative analysis was conducted using six standardised internal medicine prompts, ranging from acute emergencies through complex multi-morbidity to ethical dilemmas. Responses from the two AI models were assessed in five domains: reasoning and decision-making, clinical accuracy, clarity of communication, depth of explanation, and clinical utility. The responses were independently evaluated by six board-certified internal medicine specialists, each with ≥10 years of experience, using a 1–5 Likert scale. Independent t-tests were used to compare the mean scores of the two models: overall scores, cumulative scores for each domain across all scenarios, and scores for each domain within each scenario. A p-value <0.05 was considered statistically significant. Qualitative feedback was analysed thematically.

Results: GPT-4o1 achieved a significantly higher overall mean score than GPT-4o (3.79 vs 3.58; p=0.020). GPT-4o1 performed better across all domains and scored significantly higher in the clinical utility domain (p=0.045) and in the emergency scenario (depth of explanation p=0.013; clinical utility p=0.025). Expert feedback highlighted that GPT-4o1 generated structured, comprehensive and evidence-based responses, whereas GPT-4o mainly generated competent and ethically sensitive responses that were occasionally vague or incomplete. Both models rarely produced inaccurate responses, although some responses lacked adaptation to the Sri Lankan clinical context.

Conclusion: GPT-4o1 demonstrated incremental improvements over GPT-4o, especially in clinical utility and in emergency scenarios. Both models provided reasonable responses to simulated clinical scenarios, but broader validation, contextual adaptation and ongoing human oversight are required before use in clinical practice.
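As a hypothetical illustration of the statistical comparison described in the Methods, the sketch below implements an independent two-sample t-test with pooled variance (the classic form, assuming equal variances); the rating lists are placeholders introduced here for demonstration, not the study's data.

```python
# Hypothetical illustration: independent two-sample t-test (pooled variance),
# the kind of test the study used to compare mean expert ratings of the two models.
# The rating lists below are placeholders, NOT the study's data.
import math
import statistics


def independent_t_test(a, b):
    """Return the t statistic for two independent samples, assuming equal variances."""
    na, nb = len(a), len(b)
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    # Pooled variance across both samples (degrees of freedom = na + nb - 2)
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled * (1 / na + 1 / nb))


# Placeholder Likert-scale (1-5) mean ratings from six hypothetical raters
gpt4o1_ratings = [3.7, 3.9, 3.8, 4.0, 3.6, 4.1]
gpt4o_ratings = [3.2, 3.6, 3.5, 3.8, 3.4, 3.9]

t_stat = independent_t_test(gpt4o1_ratings, gpt4o_ratings)
print(f"t = {t_stat:.3f}")  # compare against the t distribution (10 df) for a p-value
```

The function returns only the t statistic; in practice the p-value would be obtained from the t distribution with na + nb − 2 degrees of freedom (e.g. via `scipy.stats.ttest_ind`, which performs this whole computation in one call).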
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,652 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,567 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,083 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,856 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations