This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Assessment of large language models’ performances and hallucinations for Chinese postgraduate medical entrance examination
Citations: 1
Authors: 9
Year: 2025
Abstract
This study evaluates the performance of large language models (LLMs) on the Chinese Postgraduate Medical Entrance Examination (CPGMEE), assesses the hallucinations the LLMs produce, and investigates the implications for medical education. We curated 10 mock CPGMEE trials to evaluate four LLMs (GPT-4.0, ChatGPT, QWen 2.1 and Ernie 4.0). Each question was entered into the LLMs, and three experienced graders independently reviewed the responses, rating accuracy on a three-tier scale (poor, borderline, good). The hallucination rates of the responses were also evaluated. GPT-4.0 and Ernie 4.0 were selected for further analysis because they performed best among the four models. Ernie 4.0 outperformed GPT-4.0 in overall accuracy (76.2% vs. 69.1%, p < 0.0001), with a higher proportion of ‘good’ ratings (70.0% vs. 64.6%, p < 0.01) and a lower proportion of ‘poor’ ratings (25.2% vs. 32.3%, p < 0.01). Factuality hallucination was the most prevalent type of hallucination (9.7% and 14.7% for GPT-4.0 and Ernie 4.0, respectively). Ernie 4.0 exhibited lower rates of factual fabrication (6.0% vs. 7.8%, p = 0.033), instruction inconsistency (2.3% vs. 5.4%, p < 0.0001) and logical inconsistency (3.7% vs. 5.7%, p = 0.005) than GPT-4.0. Our results underscore the promising potential of both GPT-4.0 and Ernie 4.0 in assisting CPGMEE preparation and enhancing postgraduate medical education programs.