This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Comparative Evaluation of a Citable RAG System and GPT-4o in Medical Education: Evidence Reliability and Usability Study (Preprint)
Citations: 0
Authors: 4
Year: 2025
Abstract
<sec> <title>BACKGROUND</title> Modern medical education requires efficient tools for knowledge acquisition, and large language models (LLMs) appear promising; however, they face challenges such as factual inaccuracies (“hallucinations”) and a lack of evidence transparency, particularly in critical fields such as medicine. Retrieval-augmented generation (RAG) systems address these limitations by grounding LLM responses in external knowledge sources, which offers a potential solution for developing reliable educational tools. </sec> <sec> <title>OBJECTIVE</title> This study aimed to examine the applicability of a citable RAG system in medical education, specifically to address the challenges of LLMs regarding hallucinations and unclear evidence, and to investigate its potential applications. </sec> <sec> <title>METHODS</title> We designed and implemented a RAG system in Python, using the LangGraph framework for system construction. The system generated responses by converting input questions into search queries, retrieving relevant information from academic databases and web sources through dedicated search agents, and then integrating this information to generate comprehensive, citable answers. GPT-4o was used for both query generation and report generation within the RAG system. We used a data set of 103 medical questions for evaluation. We compared the RAG system's answers with those generated by standalone GPT-4o. Evaluation was conducted quantitatively by LLMs (GPT-4 and Gemini Flash 2.0) using the CLEAR reliability metric (Completeness, Lack of false information, Evidence, Appropriateness, Relevance). We also performed a subjective evaluation with 40 medical and nursing students, who used a Likert scale based on the CLEAR criteria for a subset of five randomly selected questions.
Additionally, we assessed the consistency between generated text and its references using the SourceCheckup tool, which evaluated Citation URL Validity, Statement-level Support, and Response-level Support. </sec> <sec> <title>RESULTS</title> In LLM evaluations, the RAG system consistently outperformed standalone GPT-4o in "Evidence." However, student evaluations showed a different trend, with GPT-4o receiving significantly higher scores in "Completeness," "Appropriateness," and "Relevance," while RAG excelled in "Evidence." For reference evaluation, all cited URLs were valid. However, Statement-level Support was 0.516 (95% CI: 0.460, 0.571), and Response-level Support was 0.228 (95% CI: 0.158, 0.317), indicating that not all claims or full responses were directly supported by the cited sources. </sec> <sec> <title>CONCLUSIONS</title> The RAG system effectively addressed the challenges of hallucination and unclear evidence in LLMs for medical education, consistently improving the evidence base of responses. However, discrepancies between LLM and human evaluations highlighted the need for further improvements in the overall structure and natural language flow of responses for practical educational implementation. Future work should focus on enhancing the consistency between referenced information and generated output, and on improving the overall coherence and clarity of the response structure, with Self-RAG emerging as a promising approach for self-verification and improved learning support. </sec>
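The METHODS section describes a three-stage pipeline: GPT-4o rewrites the input question into search queries, dedicated search agents retrieve documents from academic databases and the web, and GPT-4o then composes a citable report. A minimal, dependency-free sketch of that control flow is below; every function body, the example corpus, and the URL are hypothetical stand-ins (the actual system used LangGraph and GPT-4o for the model-backed steps):

```python
from dataclasses import dataclass, field

@dataclass
class PipelineState:
    """State passed between pipeline stages, loosely mirroring a LangGraph state."""
    question: str
    queries: list = field(default_factory=list)
    documents: list = field(default_factory=list)
    report: str = ""

def generate_queries(state: PipelineState) -> PipelineState:
    # In the study, GPT-4o rewrote the question into search queries;
    # here we fake it with simple string variants.
    state.queries = [state.question, f"{state.question} review"]
    return state

def retrieve(state: PipelineState) -> PipelineState:
    # Stand-in for the academic-database and web search agents.
    corpus = {
        "What causes anemia?": [
            {"url": "https://example.org/anemia",
             "text": "Iron deficiency is a common cause of anemia."},
        ],
    }
    state.documents = corpus.get(state.question, [])
    return state

def write_report(state: PipelineState) -> PipelineState:
    # In the study, GPT-4o synthesized a report; here we concatenate
    # retrieved statements with numbered citation markers and a reference list.
    cited = [f"{d['text']} [{i + 1}]" for i, d in enumerate(state.documents)]
    refs = [f"[{i + 1}] {d['url']}" for i, d in enumerate(state.documents)]
    state.report = "\n".join(cited + refs)
    return state

def run_pipeline(question: str) -> str:
    """Run the three stages in sequence and return the citable report."""
    state = PipelineState(question=question)
    for step in (generate_queries, retrieve, write_report):
        state = step(state)
    return state.report
```

The per-statement citation markers are what make support checkable: a tool like SourceCheckup can test whether each marked claim is actually entailed by the document behind its numbered reference.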
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,418 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,288 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,726 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,516 citations