This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
Red Teaming Large Language Models in Medicine: Real-World Insights on Model Behavior
Citations: 9
Authors: 80
Year: 2024
Abstract
Background: The integration of large language models (LLMs) in healthcare offers immense opportunity to streamline healthcare tasks, but also carries risks such as inaccurate responses and the perpetuation of bias. To address this, we conducted a red-teaming exercise to assess LLMs in healthcare and developed a dataset of clinically relevant scenarios for future teams to use.

Methods: We convened 80 multi-disciplinary experts to evaluate the performance of popular LLMs across multiple medical scenarios. Teams composed of clinicians, medical and engineering students, and technical professionals stress-tested LLMs with real-world clinical use cases. Teams were given a framework comprising four categories of inappropriate responses to analyze: Safety, Privacy, Hallucinations, and Bias. Prompts were tested on GPT-3.5, GPT-4.0, and GPT-4.0 with Internet. Six medically trained reviewers subsequently reanalyzed the prompt-response pairs, with dual reviewers for each prompt and a third to resolve discrepancies. This process allowed for accurate identification and categorization of inappropriate or inaccurate content within the responses.

Results: There were a total of 382 unique prompts, with 1,146 total responses across three iterations of ChatGPT (GPT-3.5, GPT-4.0, and GPT-4.0 with Internet). Overall, 19.8% of responses were labeled inappropriate, with GPT-3.5 accounting for the highest percentage at 25.7%, while GPT-4.0 and GPT-4.0 with Internet performed comparably at 16.2% and 17.5%, respectively. Interestingly, 11.8% of responses were deemed appropriate with GPT-3.5 but inappropriate in updated models, highlighting the ongoing need to evaluate evolving LLMs.

Conclusion: The red-teaming exercise underscored the benefits of interdisciplinary efforts, as this collaborative model fosters a deeper understanding of the potential limitations of LLMs in healthcare and sets a precedent for future red-teaming events in the field. Additionally, we present all prompts and outputs as a benchmark for future LLM evaluations.

1-2 Sentence Description: As a proof of concept, we convened an interactive "red teaming" workshop in which medical and technical professionals stress-tested popular large language models (LLMs) through publicly available user interfaces on clinically relevant scenarios. Results demonstrate a significant proportion of inappropriate responses across GPT-3.5, GPT-4.0, and GPT-4.0 with Internet (25.7%, 16.2%, and 17.5%, respectively) and illustrate the valuable role that non-technical clinicians can play in evaluating models.
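The rates reported in the Results follow from a straightforward aggregation over the adjudicated prompt-response pairs. As a minimal sketch (not the authors' code), assuming a hypothetical CSV file red_team_labels.csv with columns prompt_id, model, and label, the per-model inappropriate-response rates could be computed as follows:

```python
# Minimal sketch (not the authors' code): per-model inappropriate-response rates
# from a hypothetical CSV of adjudicated prompt-response pairs.
# Assumed columns: prompt_id, model, label ("appropriate" or "inappropriate").
import csv
from collections import Counter

totals = Counter()          # total responses per model
inappropriate = Counter()   # inappropriate responses per model

with open("red_team_labels.csv", newline="") as f:  # hypothetical file name
    for row in csv.DictReader(f):
        totals[row["model"]] += 1
        if row["label"] == "inappropriate":
            inappropriate[row["model"]] += 1

for model in sorted(totals):
    rate = inappropriate[model] / totals[model]
    print(f"{model}: {inappropriate[model]}/{totals[model]} inappropriate ({rate:.1%})")
```

Applied to the 1,146 labeled responses split across the three models, an aggregation of this kind would yield the headline figures such as 25.7% for GPT-3.5.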
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,200 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,051 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,416 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,410 citations
Authors
- Crystal Chang
- Hodan Farah
- Haiwen Gui
- Shawheen J. Rezaei
- Charbel Bou-Khalil
- Ye-Jean Park
- Akshay Swaminathan
- Jesutofunmi A. Omiye
- Akaash Kolluri
- Akash Chaurasia
- Alejandro Lozano
- Alice Heiman
- Allison Sihan Jia
- Amit Kaushal
- Angela Y. Jia
- Angelica Iacovelli
- Archer Y. Yang
- Arghavan Salles
- Arpita Singhal
- Balasubramanian Narasimhan
- Benjamin Belai
- Benjamin H. Jacobson
- Binglan Li
- Celeste H. Poe
- Chandan Sanghera
- Chenming Zheng
- Conor Messer
- Damien Varid Kettud
- Deven Pandya
- Dhamanpreet Kaur
- Diana Hla
- Diba Dindoust
- Dominik Moehrle
- Ross Duncan
- Ellaine Chou
- Eric Lin
- Fateme Nateghi Haredasht
- Cheng Ge
- Irena Gao
- Jacob Chang
- Jake Silberg
- Jason Fries
- Jiapeng Xu
- J. Weston Jamison
- John Tamaresis
- Jonathan H. Chen
- Joshua Lazaro
- Juan M. Banda
- Julie Lee
- Karen Ebert Matthys
- Kirsten R. Steffner
- Lü Tian
- Luca Pegolotti
- Malathi Srinivasan
- Maniragav Manimaran
- Matthew Schwede
- Minghe Zhang
- Minh Hoai Nguyen
- Mohsen Fathzadeh
- Qian Zhao
- Rika Bajra
- Rohit Khurana
- Ruhana Azam
- R. W. Bartlett
- Sang Truong
- Scott L. Fleming
- S. Varadha Raj
- Solveig Behr
- Sonia Onyeka
- Sri Muppidi
- Tarek Bandali
- Tiffany Eulalio
- Wenyuan Chen
- Xuanyu Zhou
- Yanan Ding
- Ying Cui
- Yuqi Tan
- Yutong Liu
- Nigam H. Shah
- Roxana Daneshjou
Institutions
- Stanford Medicine (US)
- Stanford University (US)
- Thinkpath Engineering Services (Canada) (CA)
- McGill University (CA)
- Mayo Clinic (US)
- Mayo Clinic in Florida (US)
- Mayo Clinic in Arizona (US)
- WinnMed (US)
- Veterans Health Administration (US)
- Palo Alto Veterans Institute for Research (US)
- Center for Clinical Research (United States) (US)
- Stanford Health Care (US)
- Freie Universität Berlin (DE)