This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
7543 Comparative Analysis Of Language Model Systems In Endocrinology: Performance And Human Acceptability Assessment
Citations: 2
Authors: 5
Year: 2024
Abstract
Abstract Disclosure: A. Randhawa: None. A. Rahman: None. C. Zakka: None. W. Hiesinger: None. M. Sattar: None. Introduction: Large language models (LLMs) are Artificial Intelligence (AI) systems trained on extensive datasets drawn from online content, articles, and books. Patients increasingly rely on AI chatbots for medical information. However, the reliability and accuracy of these AI models, especially on open-ended medical questions, remain inadequately established. Several studies have explored the performance of LLMs in responding to patient queries, some comparing these responses to those of physicians. Leveraging AI to assist with tasks such as answering queries, triage, and screening could allow clinicians to focus on more complex responsibilities and improve public access to healthcare services. A comprehensive evaluation of the acceptability, usability, and effectiveness of these agents in healthcare is crucial for consolidating evidence. We aim to compare two common LLMs, one open-source and one commercial, to determine whether one model shows superior performance and human acceptability. We hypothesize that current advances in open-source LLM research are now on par with commercially popular LLMs in their ability to answer clinical questions. Methods: We enlisted the aid of a board-certified endocrinologist to prepare a list of 60 questions frequently asked by patients with diabetes and thyroid disorders. Each question was processed by both Mistral-7B and GPT-3.5-turbo, with each model prompted to behave as a "clinical grade endocrinology assistant". We measured human acceptability and performance along three axes: "factuality", a measure of the factual correctness of the model outputs; "completeness", a measure of how thorough the response was; and "preference", a measure of which response was superior based on subjective human assessment.
A blinded board-certified endocrinologist graded each response produced by the two LLMs to each question along these axes. Results: Chi-square tests were used to analyze the scored responses. In the blinded assessment, GPT-3.5-turbo significantly outperformed Mistral-7B along all three axes. For factuality, 51 responses from GPT-3.5-turbo were judged superior vs. 9 from Mistral-7B (χ² = 29.4; p < 0.001); for completeness, the split was 53 vs. 7 (χ² = 35.2; p < 0.001); and 54 responses from GPT-3.5-turbo were preferred vs. just 6 for Mistral-7B (χ² = 38.5; p < 0.001). Despite superficially similar performance on non-clinical benchmarks, GPT-3.5-turbo performed better than Mistral-7B on the specific task of responding to these queries. Conclusion: In our study, GPT-3.5-turbo proved superior, with more preferred responses and better factuality. Further investigation is needed to assess the accuracy and reproducibility of these findings and to identify areas for improvement. Future research should also analyze ethical concerns such as information bias, privacy, and security. Presentation: 6/3/2024
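The reported chi-square statistics can be approximately reproduced from the stated counts with a one-degree-of-freedom goodness-of-fit test against an even split. This is a sketch under our own assumption of a 50/50 null hypothesis over the 60 questions; the abstract does not specify the exact test configuration used.

```python
# Pearson chi-square goodness-of-fit against an even 30/30 split of 60 questions.
# Counts come from the abstract; the uniform null hypothesis is an assumption.

def chi_square(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# (GPT-3.5-turbo judged superior, Mistral-7B judged superior) per axis
axes = {
    "factuality": (51, 9),
    "completeness": (53, 7),
    "preference": (54, 6),
}

for name, counts in axes.items():
    n = sum(counts)
    stat = chi_square(counts, (n / 2, n / 2))
    print(f"{name}: chi2 = {stat:.1f}")
```

This yields 29.4, 35.3, and 38.4, which match the reported 29.4, 35.2, and 38.5 up to rounding; the small residual differences presumably reflect rounding in the abstract.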
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,339 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,211 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,614 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,478 citations