This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Exploring Hallucination in Large Language Models
Citations: 0
Authors: 6
Year: 2025
Abstract
Large Language Models (LLMs) such as GPT-4o and GPT-4o-mini have shown significant promise in various fields. However, hallucination, in which a model generates inaccurate information, remains a critical challenge, especially in domains that require high accuracy, such as healthcare. This study investigates hallucinations in these two LLMs, focusing on the healthcare domain. Four experiments were defined to examine the two models' memorization and reasoning abilities. For each experiment, a dataset of 193,155 multiple-choice medical questions from postgraduate medical programs was split into 21 subsets according to medical topic. Each subset has two versions: one with the correct answers included and one without them. Accuracy and compliance, meaning the models' adherence to the requirements stated in the prompts, were evaluated for each model. The correlation between dataset size and accuracy was also tested. The experiments were repeated to evaluate the models' stability. Finally, human experts evaluated the models' reasoning by assessing their explanations for correct answers. The results revealed poor accuracy and compliance for both models, with rates below 70% and 75%, respectively, in most datasets; yet both models showed low uncertainty (3%) in their responses. Accuracy was not affected by the size of the dataset provided to the models, and GPT-4o-mini demonstrated greater performance stability than GPT-4o. Furthermore, the two models provided acceptable justifications for choosing the correct answer in most cases: 68.8% of the experts who completed the questionnaire agreed with both models' justifications. According to these results, neither model can be relied upon when accuracy is critical, even though GPT-4o-mini slightly outperformed GPT-4o in providing correct answers. The findings highlight the importance of improving LLM accuracy and reasoning to ensure reliability in critical fields like healthcare.
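The per-subset evaluation described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact metric definitions, so everything here is an assumption made for illustration: compliance is taken to mean that a response is exactly one option letter (A to D), accuracy is the share of all questions answered correctly, and the size-accuracy relationship is measured with a Pearson coefficient. The names `score_subset` and `pearson` are hypothetical and do not come from the paper.

```python
import re
from math import sqrt

# Assumed compliance rule: the response is exactly one option letter.
VALID_ANSWER = re.compile(r"[A-D]")


def score_subset(responses, answer_key):
    """Return (accuracy, compliance) for one subset of questions.

    Both rates are computed over all questions in the subset, so a
    non-compliant response also counts as incorrect.
    """
    if not responses or len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must be non-empty and equal in length")
    cleaned = [r.strip() for r in responses]
    compliant = [VALID_ANSWER.fullmatch(r) is not None for r in cleaned]
    correct = [ok and r == key for r, key, ok in zip(cleaned, answer_key, compliant)]
    n = len(cleaned)
    return sum(correct) / n, sum(compliant) / n


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

Under these assumptions, `score_subset` would be called once per topic subset, and `pearson` would then be applied to the list of subset sizes and the list of resulting accuracies.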
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,557 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,447 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,944 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,797 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations