OpenAlex · Updated hourly · Last updated: 15.03.2026, 16:14

This is an overview page with metadata about this scholarly work. The full article is available from the publisher.

Poor Performance of Large Language Models Based on the Diabetes and Endocrinology Specialty Certificate Examination of the United Kingdom

2025 · 1 citation · Cureus · Open Access

Citations: 1 · Authors: 6 · Year: 2025

Abstract

Introduction: The medical knowledge of large language models (LLMs) has been tested using several postgraduate medical examinations. However, it has rarely been examined in diabetes and endocrinology. This study aimed to evaluate the performance of LLMs in answering multiple-choice questions from the Diabetes and Endocrinology Specialty Certificate Examination (SCE) of the United Kingdom.

Methods: The official diabetes and endocrinology SCE sample questions were used to assess seven freely accessible and subscription-based commercial LLMs: ChatGPT-o1 Preview (OpenAI, USA), ChatGPT-4o (OpenAI, USA), Gemini (Google, USA), Claude-3.5 Sonnet (Anthropic, USA), Copilot (Microsoft, USA), Perplexity AI (Perplexity, USA), and Meta AI (Meta, USA). The accuracy of each LLM was calculated by comparing its outputs against the sample answers. Readability metrics, including Flesch Reading Ease (FRES) and Flesch-Kincaid Grade Level (FKGL), were calculated for each response. Eighty-three questions, three of which included photographs, were entered into the LLMs without employing any prompt engineering techniques.

Results: A total of 581 responses were generated and captured between August and October 2024. Performance differed significantly between models, with ChatGPT-o1 Preview achieving the highest accuracy (73%). None of the other LLMs achieved the historical pass mark of 65%, with Gemini achieving the lowest accuracy of 33%. Readability metrics also differed significantly between LLMs (p=0.004). LLMs performed better on questions without reference ranges (p<0.001).

Conclusions: The performance of LLMs was generally inadequate in the diabetes and endocrinology examination. Of those tested, ChatGPT-o1 Preview achieved the highest score and is likely the most useful model to aid medical education. This may be because it is an advanced reasoning model with a greater ability to solve complex problems. Nonetheless, continued research is needed to keep pace with advances in LLMs.
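The FRES and FKGL metrics used in the study are standard published formulas based on average sentence length and syllables per word. The sketch below illustrates how they could be computed; it is not the authors' code, and the vowel-group syllable counter is a naive assumption (published tools typically use dictionary-based syllable counts):

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count runs of vowels; every word has at least one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    """Return (FRES, FKGL) for a text, using the standard Flesch formulas."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences          # words per sentence
    spw = syllables / len(words)          # syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return round(fres, 1), round(fkgl, 1)

fres, fkgl = readability("The cat sat on the mat. It was happy.")
```

Higher FRES values indicate easier text (short simple sentences score above 90), while FKGL approximates the US school grade level needed to understand it.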
