This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Performance of GPT-4 and mainstream Chinese Large Language Models on the Chinese Postgraduate Examination dataset: Potential for AI-assisted Traditional Chinese Medicine (Preprint)
Citations: 0
Authors: 8
Year: 2024
Abstract
BACKGROUND: In China, the medical education system is characterized by multiple co-existing levels, and physicians with higher levels of education typically have better job prospects. Consequently, the medical master's degree examination carries greater weight in physician selection than the Chinese licensing examination. The application of Large Language Models (LLMs) in Traditional Chinese Medicine (TCM) has expanded rapidly. TCM theory carries distinct scientific significance, requiring LLMs to have advanced information processing and comprehension abilities in a Chinese-language context. LLMs have performed notably well on the medical licensing examinations of many countries, but their performance on selective examinations in TCM still requires investigation.

OBJECTIVE: The study aimed to comprehensively evaluate and compare the performance of Ernie Bot, ChatGLM, SparkDesk, and GPT-4 on the questions of the 2023 Chinese Postgraduate Examination for Traditional Chinese Medicine (TCM), and to explore their potential applications in the TCM field.

METHODS: The performance of four mainstream LLMs, namely Ernie Bot, ChatGLM, SparkDesk, and GPT-4, was evaluated using the 2023 Chinese Postgraduate Examination questions for TCM as the test dataset. We calculated exam scores, reported each LLM's performance across subjects, and evaluated the output responses on three qualitative metrics: logical reasoning, use of internal information, and use of external information.

RESULTS: Ernie Bot and ChatGLM achieved accuracy rates of 50.30% and 46.67%, respectively, both above the passing score. Performance differed significantly across test subjects for Ernie Bot, ChatGLM, and GPT-4, with the highest performance on the medical humanistic spirit module. Logical reasoning: ChatGLM and GPT-4 provided a logical explanation for the answer selection in every response, whereas Ernie Bot and SparkDesk showed logical reasoning in 98.2% and 43.6% of responses, respectively. Internal information: ChatGLM and GPT-4 incorporated internal information into all answer explanations, whereas SparkDesk used internal information in notably few responses. External information: over 60% of responses from Ernie Bot, ChatGLM, and GPT-4 included external information, and its use did not differ significantly between correct and incorrect answers. For SparkDesk, however, the percentage of correct answers differed significantly depending on whether internal or external information was present (P < .001).

CONCLUSIONS: Ernie Bot and ChatGLM's expertise in TCM surpassed the passing threshold for the postgraduate selection examination. The ability of these LLMs to reason logically and integrate background information demonstrates their substantial potential in TCM.
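The abstract reports accuracy rates per model and chi-square-style significance comparisons (e.g. the P < .001 difference for SparkDesk between answers with and without internal or external information). As a minimal sketch of how such a calculation could be reproduced, the Python snippet below computes an accuracy percentage and runs a chi-square test of independence on a 2x2 contingency table using scipy.stats.chi2_contingency. All counts in it are hypothetical placeholders, not the study's data.

# Minimal sketch of the scoring and significance testing described in the
# abstract. All counts are hypothetical placeholders, not the study's data.
from scipy.stats import chi2_contingency

def accuracy(n_correct: int, n_total: int) -> float:
    """Exam accuracy as a percentage of correctly answered items."""
    return 100.0 * n_correct / n_total

print(f"accuracy = {accuracy(100, 200):.2f}%")  # hypothetical tally

# Chi-square test of independence between answer correctness and the
# presence of external information in a model's explanation, analogous
# to the P < .001 comparison reported for SparkDesk.
#              correct  incorrect
table = [[30, 40],    # external information present (hypothetical)
         [10, 85]]    # external information absent  (hypothetical)
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")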
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,336 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,207 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,607 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,476 citations