Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
Evaluating Artificial Intelligence-Driven Responses to Acute Liver Failure Queries: A Comparative Analysis Across Accuracy, Clarity, and Relevance
6
Zitationen
6
Autoren
2024
Jahr
Abstract
INTRODUCTION: Recent advancements in artificial intelligence (AI), particularly through the deployment of large language models (LLMs), have profoundly impacted healthcare. This study assesses 5 LLMs-ChatGPT 3.5, ChatGPT 4, BARD, CLAUDE, and COPILOT-on their response accuracy, clarity, and relevance to queries concerning acute liver failure (ALF). We subsequently compare these results with ChatGPT4 enhanced with retrieval augmented generation (RAG) technology. METHODS: Based on real-world clinical use and the American College of Gastroenterology guidelines, we formulated 16 ALF questions or clinical scenarios to explore LLMs' ability to handle different clinical questions. Using the "New Chat" functionality, each query was processed individually across the models to reduce any bias. Additionally, we employed the RAG functionality of GPT-4, which integrates external sources as references to ground the results. All responses were evaluated on a Likert scale from 1 to 5 for accuracy, clarity, and relevance by 4 independent investigators to ensure impartiality. RESULTS: ChatGPT 4, augmented with RAG, demonstrated superior performance compared with others, consistently scoring the highest (4.70, 4.89, 4.78) across all 3 domains. ChatGPT 4 exhibited notable proficiency, with scores of 3.67 in accuracy, 4.04 in clarity, and 4.01 in relevance. In contrast, CLAUDE achieved 3.04 in clarity, 3.6 in relevance, and 3.65 in accuracy. Meanwhile, BARD and COPILOT exhibited lower performance levels; BARD recorded scores of 2.01 in accuracy and 3.03 in relevance, while COPILOT obtained 2.26 in accuracy and 3.12 in relevance. DISCUSSION: The study highlights Chat GPT 4 +RAG's superior performance compared with other LLMs. By integrating RAG with LLMs, the system combines generative language skills with accurate, up-to-date information. This improves response clarity, relevance, and accuracy, making them more effective in healthcare. However, AI models must continually evolve and align with medical practices for successful healthcare integration.
Ähnliche Arbeiten
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8.646 Zit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8.554 Zit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8.071 Zit.
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6.851 Zit.
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5.781 Zit.