This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Physician and Large Language Model Chatbot Responses to Ear, Nose, and Throat Inquiries on an Online Forum: A Comparative Analysis (Preprint)
Citations: 0
Authors: 8
Year: 2024
Abstract
BACKGROUND: Large language models (LLMs) have the potential to improve the accessibility and quality of medical information for patients. Assessing the quality of LLM-generated responses in real-world clinical settings is crucial for determining their suitability and optimizing healthcare efficiency.
OBJECTIVE: This study aims to comprehensively evaluate the reliability of responses generated by an LLM-driven chatbot compared to those written by physicians, demonstrating that artificial intelligence (AI) can enhance the quality of otorhinolaryngological advice in complex, nuanced text-based workflows.
METHODS: Inquiries and verified physician responses related to otorhinolaryngology posted on a public social media forum between December 20 and 21, 2023, were extracted and anonymized. ChatGPT-4 was tasked with generating responses to each inquiry. A panel of seven board-certified otorhinolaryngologists evaluated both physician and ChatGPT-4 responses in a masked, randomized manner. The responses were assessed on six criteria: overall quality, empathy, alignment with medical consensus, accuracy or appropriateness of information, inquiry comprehension, and potential harm. Logistic regression analysis was employed to identify predictors of preference for ChatGPT-4 responses and their influence on overall quality.
RESULTS: A total of 60 question–response pairs were included in the analysis. ChatGPT-4 responses were significantly longer than physician responses (median: 162 vs 67 words; p<.0001). The expert panel preferred ChatGPT-4-generated responses in 70.7% of cases. ChatGPT-4 responses were rated higher across all six criteria. Multivariate analysis identified significant predictors of preference for ChatGPT-4 responses: alignment with medical consensus (odds ratio [OR]: 2.783), incorrect or inappropriate information (OR: 2.540), and empathy (OR: 1.362). For physician responses, alignment with medical consensus (OR: 1.477), empathy (OR: 1.089), inquiry comprehension (OR: 0.529), and word count (OR: 0.007) positively impacted overall quality. For chatbot responses, empathy (OR: 1.209), information appropriateness (OR: 0.903), and alignment with medical consensus (OR: 0.768) were significantly associated with high-quality ratings.
CONCLUSIONS: ChatGPT-4 outperformed physicians in generating high-quality responses. Therefore, integrating AI into clinical workflows may enhance the quality of physicians' responses by improving comprehension of complex inquiries and providing more detailed information, thereby enhancing perceived quality.
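The abstract reports odds ratios from a multivariable logistic regression that links panel ratings to preference for the chatbot's response. As a rough illustration of that kind of analysis, the sketch below fits a logistic model on synthetic data in Python with statsmodels; the column names, rating scales, and data are assumptions for illustration only, not the study's actual dataset or code.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical ratings table: one row per question-response pair, with panel
# scores for criteria named in the abstract (column names are illustrative).
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "consensus": rng.integers(1, 6, n),      # alignment with medical consensus
    "empathy": rng.integers(1, 6, n),        # perceived empathy
    "inaccuracy": rng.integers(1, 6, n),     # incorrect/inappropriate information
    "comprehension": rng.integers(1, 6, n),  # inquiry comprehension
    "word_count": rng.integers(30, 250, n),  # response length in words
})
# Binary outcome: did the panel prefer the ChatGPT-4 response for this pair?
df["preferred_chatgpt"] = rng.integers(0, 2, n)

# Multivariable logistic regression; exponentiated coefficients are odds ratios.
predictors = ["consensus", "empathy", "inaccuracy", "comprehension", "word_count"]
X = sm.add_constant(df[predictors])
model = sm.Logit(df["preferred_chatgpt"], X).fit(disp=0)
print(pd.DataFrame({"OR": np.exp(model.params), "p": model.pvalues}).round(3))
```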
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,393 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,259 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,688 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,502 citations