OpenAlex · Updated hourly · Last updated: 15.03.2026, 06:08

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Assessment of large language models’ performances and hallucinations for Chinese postgraduate medical entrance examination

2025 · 1 citation · Discover Education · Open Access

Citations: 1
Authors: 9
Year: 2025

Abstract

This study evaluates large language models' (LLMs') performance on the Chinese Postgraduate Medical Entrance Examination (CPGMEE), as well as the hallucinations produced by LLMs, and investigates their implications for medical education. We curated 10 trials of mock CPGMEE to evaluate the performance of 4 LLMs (GPT-4.0, ChatGPT, QWen 2.1 and Ernie 4.0). Each question was input into the LLMs, and the responses were independently reviewed by three experienced graders to determine accuracy using a three-tier accuracy scale (poor, borderline, good). The hallucination rates of the LLMs' responses were also evaluated. We chose GPT-4.0 and Ernie 4.0 for further analysis, since these two LLMs achieved the best performance among the four. Ernie 4.0 outperformed GPT-4.0 in overall accuracy (76.2% vs. 69.1%, p < 0.0001), achieving higher 'good' (70.0% vs. 64.6%, p < 0.01) and lower 'poor' (25.2% vs. 32.3%, p < 0.01) ratings. Factuality hallucination was the most prevalent type of hallucination (9.7% and 14.7% for GPT-4 and Ernie 4, respectively). Ernie 4.0 exhibited lower rates of factual fabrication (6.0% vs. 7.8%, p = 0.033), instruction inconsistency (2.3% vs. 5.4%, p < 0.0001) and logical inconsistency (3.7% vs. 5.7%, p = 0.005) than GPT-4.0. Our results underscore the promising potential of both GPT-4.0 and Ernie 4.0 in assisting CPGMEE preparation and enhancing postgraduate medical education programs.
