OpenAlex · Updated hourly · Last updated: May 2, 2026, 05:57

This is an overview page with metadata about this scholarly work. The full article is available from the publisher.

Evaluating the Performance of Large Language Models (ChatGPT 4, Gemini 1.5 Pro, Command R+, and Llama 3 70B) on the Turkish Medical Specialty Training Entrance Exam (Preprint)

2024 · 0 citations · Open Access
Open full text at the publisher

0 citations · 3 authors · Year: 2024

Abstract

This study aims to evaluate the performance of advanced large language models (LLMs) - OpenAI's ChatGPT 4, Google AI's Gemini 1.5 Pro, Cohere's Command R+, and Meta AI's Llama 3 70B - on questions from the Basic Medical Sciences and Clinical Medical Sciences sections of the Turkish Medical Specialty Training Entrance Exam (2021, 1st term), and to analyze the answers for user interpretability in languages other than English. The study used ChatGPT 4, Gemini 1.5 Pro, Command R+, and Llama 3 70B to test the models' medical knowledge and ethical evaluation capabilities.

The research was conducted on the questions of the Turkish Medical Specialty Training Entrance Exam held on March 21, 2021. The exam is organized by the Student Selection and Placement Centre and comprises 240 questions, divided into two main categories: questions testing the basic medical knowledge and ethics required of a physician, and clinical scenarios covering various diseases.

In the model comparison, ChatGPT 4 was the best-performing model in most domains, with an overall accuracy of 88.75%; its strengths lie in accurately answering both basic and clinical medical science questions. Llama 3 70B followed closely with an overall accuracy of 79.17%; although it did not quite match ChatGPT 4, it still showed strong knowledge consistency across various fields. Gemini 1.5 Pro and Command R+ lagged behind, with overall accuracy rates of 78.13% and 50%, respectively.

ChatGPT 4 and Llama 3 70B achieved satisfactory results on the Turkish Medical Specialty Training Entrance Exam, demonstrating that they are safe sources for basic medical sciences and clinical medical sciences in languages other than English. ChatGPT 4 is currently the most suitable model for answering medical science questions across domains. Gemini 1.5 Pro and Command R+ show potential but need significant improvement to compete with the best-performing models. LLMs could be a helpful resource for medical education and clinical support in regions where English is not the primary language, such as Turkey.
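The accuracy figures reported in the abstract can be compared with a short sketch. This is illustrative only, not code from the study: the model names and percentages come from the abstract, while the approximate correct-answer counts are back-calculated here from the 240-question exam size and were not reported by the authors.

```python
EXAM_QUESTIONS = 240  # total questions on the exam, per the abstract

# Overall accuracy rates reported in the abstract
accuracy = {
    "ChatGPT 4": 0.8875,
    "Llama 3 70B": 0.7917,
    "Gemini 1.5 Pro": 0.7813,
    "Command R+": 0.50,
}

def rank_models(scores):
    """Return model names sorted from best to worst accuracy."""
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    for model in rank_models(accuracy):
        # Approximate correct answers implied by the reported accuracy
        approx_correct = round(accuracy[model] * EXAM_QUESTIONS)
        print(f"{model}: {accuracy[model]:.2%} (~{approx_correct}/{EXAM_QUESTIONS})")
```

Running the script ranks ChatGPT 4 first, consistent with the abstract's conclusion that it was the best-performing model in most domains.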


Topics

Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Radiomics and Machine Learning in Medical Imaging