This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
7543 Comparative Analysis Of Language Model Systems In Endocrinology: Performance And Human Acceptability Assessment
Citations: 2
Authors: 5
Year: 2024
Abstract
Abstract Disclosure: A. Randhawa: None. A. Rahman: None. C. Zakka: None. W. Hiesinger: None. M. Sattar: None. Introduction: Large language models (LLMs) are Artificial Intelligence (AI) systems trained on extensive datasets drawn from online content, articles, and books. Patients increasingly rely on AI chatbots for medical information. However, the reliability and accuracy of these AI models, especially on open-ended medical questions, remain inadequately established. Several studies have explored the performance of LLMs in responding to patient queries, some comparing these responses to those of physicians. Leveraging AI to assist with tasks such as answering queries, triage, and screening could allow clinicians to focus on more complex responsibilities and improve public access to healthcare services. A comprehensive evaluation of the acceptability, usability, and effectiveness of these agents in healthcare is crucial for consolidating evidence. We aim to compare two common LLMs, one open-source and one commercial, to determine whether one model shows superior performance and human acceptability. We hypothesize that current advances in open-source LLM research are now on par with commercially popular LLMs in their ability to answer clinical questions. Methods: We enlisted the aid of a board-certified endocrinologist to prepare a list of 60 questions frequently asked by patients with diabetes and thyroid disorders. Each question was processed by both Mistral-7B and GPT-3.5-turbo, with each model prompted to behave as a "clinical grade endocrinology assistant". We measured human acceptability and performance along three axes: "factuality", a measure of the factual correctness of the model outputs; "completeness", a measure of how thorough the response was; and "preference", a measure of which response was superior based on subjective human assessment.
A blinded board-certified endocrinologist graded each response produced by the two LLMs to each question along these axes. Results: Chi-square tests were used to analyze the scored responses. In the blinded assessment, GPT-3.5-turbo significantly outperformed Mistral-7B along all three axes. For factuality, 51 responses from GPT-3.5-turbo were judged superior vs. 9 from Mistral-7B (χ² = 29.4; p < 0.001); for completeness, the split was 53 vs. 7 (χ² = 35.2; p < 0.001); and 54 responses from GPT-3.5-turbo were preferred vs. just 6 for Mistral-7B (χ² = 38.5; p < 0.001). Despite superficially similar performance on non-clinical benchmarks, GPT-3.5-turbo performed better than Mistral-7B on the specific task of responding to these queries. Conclusion: In our study, GPT-3.5-turbo proved superior, with more preferred responses and better factuality. Further investigation is needed to assess the accuracy and reproducibility of these findings and to identify areas for improvement. Future research should also analyze ethical concerns such as information bias, privacy, and security. Presentation: 6/3/2024
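The reported chi-square statistics can be approximately reproduced from the stated counts with a one-degree-of-freedom goodness-of-fit test against an even split. This is a sketch under our own assumption of a 50/50 null hypothesis over the 60 questions; the abstract does not specify the exact test configuration used.

```python
# Pearson chi-square goodness-of-fit against an even 30/30 split of 60 questions.
# Counts come from the abstract; the uniform null hypothesis is an assumption.

def chi_square(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# (GPT-3.5-turbo judged superior, Mistral-7B judged superior) per axis
axes = {
    "factuality": (51, 9),
    "completeness": (53, 7),
    "preference": (54, 6),
}

for name, counts in axes.items():
    n = sum(counts)
    stat = chi_square(counts, (n / 2, n / 2))
    print(f"{name}: chi2 = {stat:.1f}")
```

This yields 29.4, 35.3, and 38.4, which match the reported 29.4, 35.2, and 38.5 up to rounding; the small residual differences presumably reflect rounding in the abstract.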
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,339 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,211 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,614 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,478 citations