This is an overview page with metadata for this scientific article. The full article is available from the publisher.
The performance of large language models on fictional consult queries indicates favorable potential for AI-assisted vascular surgery consult handling
Citations: 8 · Authors: 3 · Year: 2024
Abstract
Type of Research: Cross-sectional study

Key Findings: Readily available large language models can identify vascular surgery emergencies with accuracy rates from 76% to 100%. The models selected the correct next most important management step in 36% to 68% of cases. 89.5% of generated free-response management answers adhered to scientific consensus, while 17.5% missed important information.

Take-home Message: Existing large language models can reliably identify vascular emergencies based on clinical vignettes. However, their ability to recommend treatment requires further fine-tuning.

Introduction
Recently, the use of large language models (LLMs) in medicine has become a prominent topic of discussion due to the rapid improvement of these tools in understanding and responding to natural language. Several models, both proprietary and open-source, are widely available to the public. We aim to evaluate the possible use of such LLMs in vascular surgery by assessing their ability to process common consult requests.

Methods
The senior author created twenty-five fictional vascular surgery consultation queries based on common consultation requests. Five attending surgeons and four LLMs (GPT-3.5, GPT-4, Bard, and Falcon 40B) were asked whether each consult was an emergency that needed immediate attention within an hour. Responders were also asked whether the next best step was an examination, additional imaging, or an urgent operation. GPT-3.5 and GPT-4 also provided free-response answers on the next best step, which attending surgeons graded for scientific accuracy, potential harm, and content completeness.

Results
The rates of accurate emergency identification were 88%, 100%, 76%, and 88% for GPT-3.5, GPT-4, Falcon 40B, and Bard, respectively. While overall accuracy was similar, GPT-3.5 had high sensitivity (100%) and Bard had high specificity (90%); GPT-4 had 100% sensitivity and specificity. The LLMs agreed with the majority surgeon opinion on the next best step in 64% (GPT-3.5), 32% (GPT-4), 68% (Falcon 40B), and 36% (Bard) of cases. Collectively, 89.5% of GPT-3.5 and GPT-4 answers adhered to the scientific consensus. Only 5% of responses were highly likely to cause clinically significant harm. While only 4% included incorrect content, 17.5% of answers missed important content. There was no significant difference between GPT-3.5 and GPT-4 in free-response grades.

Conclusion
Existing, widely available LLMs exhibited a solid ability to identify vascular emergencies, with GPT-4 agreeing with attending surgeons in 100% of cases. However, these models continue to have identifiable deficiencies in treatment recommendations, a higher-level task. Future models might help triage incoming consults and provide preliminary management suggestions. The utility of such tools in clinical practice remains to be explored.
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,336 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,207 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,607 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,476 citations