This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Evaluating the effectiveness of large language models in medicine education: a comparison of current medicine knowledge
Citations: 2
Authors: 2
Year: 2025
Abstract
Recent advancements in artificial intelligence have led to the development of powerful large language models (LLMs) like ChatGPT-4-turbo, Gemini 2.0 Flash, DeepSeek-R1, and Qwen2.5-Max. This study evaluates their medical knowledge proficiency using multiple-choice questions (MCQs) sourced from a reputable medical textbook, with answers verified by experts. Each model was tested on its ability to select correct answers, and performance was analysed using ANOVA and Tukey's HSD tests. Results showed that while all models exhibited some proficiency, ChatGPT-4-turbo significantly outperformed Gemini 2.0 Flash and Qwen2.5-Max, with no notable difference between ChatGPT-4-turbo and DeepSeek-R1. Despite their capabilities, these models remain unreliable for medical education and assistance. Enhancing their accuracy and reliability is crucial for their effective application in healthcare, enabling medical students and professionals to utilise AI for learning and clinical decision-making. Further development is needed to improve their utility in medical practice.
Similar works
The qualitative content analysis process
2008 · 21,591 citations
Making sense of Cronbach's alpha
2011 · 13,678 citations
Standards for Reporting Qualitative Research
2014 · 10,950 citations
Health professionals for a new century: transforming education to strengthen health systems in an interdependent world
2010 · 5,686 citations
Audit and feedback: effects on professional practice and healthcare outcomes
2012 · 5,489 citations