This is an overview page containing metadata for this scientific work. The full article is available from the publisher.
Evaluating ChatGPT's accuracy in providing screening mammography recommendations among older women: Artificial Intelligence and cancer communication
Citations: 7
Authors: 9
Year: 2024
Abstract
ChatGPT 4 (Generative Pre-trained Transformer 4)1 has captured the world's attention due to its conversational ability. However, the extent to which large language models (LLMs) can generate accurate health advice is not well established. Recent studies have demonstrated that ChatGPT can respond appropriately to basic questions regarding cancer screening and prevention.2-5 However, it remains unknown whether ChatGPT can accurately respond to complex screening questions, such as whether to screen women aged 75 and older for breast cancer, in a clinical setting where guidelines do not recommend for or against screening due to insufficient evidence.5 Therefore, we assessed the extent to which ChatGPT provided accurate advice regarding screening mammography among women aged 75+.

The study was performed in July 2023 using ChatGPT 4. Informed by prior ChatGPT studies of cancer screening,2-5 we created 13 questions and four clinical vignettes addressing key concepts of screening mammography and prevention among women aged 75+ (see Supplementary Methods S1). We submitted each question to ChatGPT three times consecutively, recording each response. A multidisciplinary panel of six experts in general internal medicine, family medicine, geriatric medicine, population health, cancer control, and radiology graded the appropriateness of the ChatGPT responses. A response was graded “unreliable” if discrepancies or conflicting information were identified among the three answers provided by ChatGPT. Appropriateness was determined by majority consensus of the panel.

ChatGPT responses were graded appropriate for 11 of 17 questions (64%; Table 1). Responses to the remaining six questions were graded inappropriate (n = 3, 18%), unreliable (n = 2, 12%), or no consensus (n = 1, 6%). Mean word counts and Flesch–Kincaid readability grade levels are reported in Table 1.
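The grading procedure described above can be summarized as a small decision rule. The sketch below is a minimal, hypothetical illustration of that logic, not code from the study; the function name, labels, and example votes are assumptions for illustration only.

```python
from collections import Counter

def grade_question(expert_votes, responses_conflict):
    """Grade one question following the study's stated procedure:
    'unreliable' if the three ChatGPT responses conflicted; otherwise
    the majority vote of the expert panel, or 'no consensus' when no
    label wins a strict majority."""
    if responses_conflict:
        return "unreliable"
    label, n = Counter(expert_votes).most_common(1)[0]
    if n > len(expert_votes) / 2:
        return label
    return "no consensus"

# Hypothetical example: six experts, consistent ChatGPT responses,
# four votes for "appropriate" versus two for "inappropriate"
print(grade_question(
    ["appropriate"] * 4 + ["inappropriate"] * 2,
    responses_conflict=False,
))  # -> appropriate
```

A 3–3 split among the six experts would return "no consensus", matching the one question the panel could not resolve.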
Specifically, the inappropriate response to “When should I stop getting screened for breast cancer if I have dense breasts?” included inaccurate advice on screening for women aged 75 and older with dense breasts (Table 2). Other inappropriate responses pertained to decision-making tools for determining the necessity of screening mammography after age 75 and to the influence of family history on when to cease screening. An example of a response graded appropriate:

“The age at which you should stop getting screened for breast cancer can vary depending on your overall health and personal risk factors. However, according to the guidelines from the American Cancer Society (ACS) as of my last update in September 2021: Women ages 40–44 should have the choice to start annual breast cancer screening with mammograms if they wish to do so. Women ages 45–54 should get mammograms every year. Women 55 and older should switch to mammograms every 2 years, or they can continue yearly screening. Screening should continue as long as a woman is in good health and is expected to live at least 10 more years.”

ChatGPT-generated answers were generally correct or close to correct, demonstrating the chatbot's potential to serve as a useful adjunct to patient education and clinical decision-making. Specifically, this study found that 64% of ChatGPT's answers regarding screening mammography among women aged 75+ were graded “appropriate” by the multidisciplinary expert panel.
However, we observed a lower proportion of appropriate responses than in recent ChatGPT studies of cancer screening, which reported accuracy estimates ranging from 76% to 88%.2-5 ChatGPT's lower accuracy in this study may reflect the topic carrying inherently more uncertainty than the fundamental screening concepts examined in previous work on breast cancer screening among younger women.2 Key strengths of our study include the multidisciplinary panel of experts that evaluated ChatGPT responses and the use of appropriateness ratings of responses to questions and clinical vignettes grounded in published research on LLM-generated cancer screening advice.

Notable limitations include that ChatGPT 4 was trained on data collected only up to September 2021. Moreover, the readability of the chatbot's responses was lower than typically recommended by health literacy advocates, as evidenced by the Flesch–Kincaid reading grade levels, suggesting that the information may be less accessible to women with lower health literacy. Additional limitations of LLMs such as ChatGPT include output that may not be grounded in factual data and may contain cancer misinformation. For example, our question on dense breasts generated a response citing American Cancer Society (ACS) guidelines on supplemental screening for women with dense breasts. Crucially, these guidelines do not exist, and supplemental screening is not recommended. Given the ACS guidelines recommending screening with magnetic resonance imaging for younger women at high (>20%–25%) lifetime risk, ChatGPT may have erroneously treated all older women with dense breasts as being at sufficiently high risk to warrant this screening modality.
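The Flesch–Kincaid grade level referenced above is a standard readability formula based on sentence and word length. The sketch below shows the standard formula; the example counts are illustrative, not values from the study's Table 1.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Standard Flesch-Kincaid grade-level formula: higher values
    indicate text requiring more years of schooling to read."""
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)

# Illustrative numbers (not from the study): a 120-word answer
# with 6 sentences and 210 syllables
print(round(flesch_kincaid_grade(120, 6, 210), 1))  # -> 12.9
```

A result near 12.9 corresponds to roughly a 12th–13th grade reading level, well above the 6th–8th grade level often recommended for patient-facing health materials.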
Due to a dearth of evidence on the effectiveness of screening mammography among women aged 75+, it is more challenging for ChatGPT to provide appropriate recommendations on topics such as screening mammography in this older population than among younger women. In conclusion, improving the accuracy of conversational artificial intelligence tools such as ChatGPT will be key to achieving their potential as patient education aids. Future research is needed to determine whether other LLMs provide accurate cancer screening advice consistently and what the ideal infrastructure might be for monitoring the accuracy of cancer screening recommendations within the artificial intelligence ecosystem.

Study concept and design: Dejana Braithwaite, Shama D. Karanth, and Joel Divaker. Acquisition of data: Joel Divaker and Dejana Braithwaite. Analysis of data: Nancy Schoenborn, Kenneth Lin, Illana Richman, Bruno Hochhegger, Suzanne O'Neill, and Mara Schonberg. Preparation of manuscript: Joel Divaker, Dejana Braithwaite, and Shama D. Karanth.

This work was supported by the University of Florida Health Cancer Center (P30CA247796). The authors declare no conflicts of interest. The sponsors had no role in the design and conduct of the study; the collection, management, analysis, and interpretation of the data; or the preparation or approval of the manuscript.

Supplementary Methods S1. Supporting Information. Please note: the publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.
Authors' institutions
- University of Florida Health (US)
- UF Health Cancer Center
- University of Florida (US)
- Florida College (US)
- University of Florida Health Science Center (US)
- Johns Hopkins University (US)
- Johns Hopkins Bayview Medical Center (US)
- Johns Hopkins Medicine (US)
- University of Pennsylvania Health System (US)
- Lancaster General Hospital (US)
- Yale University (US)
- Georgetown University (US)
- Harvard University (US)
- Beth Israel Deaconess Medical Center (US)