This is an overview page with metadata for this scientific article. The full article is available from the publisher.
ChatGPT Answers Common Patient Questions About Colonoscopy
Citations: 144
Authors: 6
Year: 2023
Abstract
See editorial on page 336.

ChatGPT (OpenAI) is a 175 billion–parameter large language model (LLM) artificial intelligence (AI) that was released in November 2022. ChatGPT is built on the generative pretrained transformer (GPT) 3.5 natural language processing technology and provides a conversational text response to a given prompt (OpenAI; https://openai.com/blog/chatgpt; accessed February 8, 2023). One potential application of ChatGPT is answering patients' medical questions. With more than 70 million procedures annually in the United States (Ladabaum U, et al. Gastroenterology 2019;157:137-148), screening colonoscopies are frequently the subject of questions in gastroenterology. In this study, we examined the quality of ChatGPT-generated answers to common questions (CQs) about colonoscopy. We retrieved 8 CQs and answers about colonoscopy from the publicly available webpages of 3 randomly selected hospitals from the top-20 list of the US News & World Report Best Hospitals for Gastroenterology and GI Surgery (Supplementary Methods). We entered these questions as prompts for ChatGPT (January 30, 2023, version) twice on the same day and recorded the ChatGPT-generated answers as AI1 and AI2, respectively. We compared the text similarity among all answers using plagiarism detection software (Supplementary Table 1). To objectively assess the quality of ChatGPT-generated answers, 4 gastroenterologists (2 senior gastroenterologists and 2 fellows) rated 36 pairs of CQs and answers, displayed in random order, on the following quality indicators using a 7-point Likert scale: (1) ease of understanding, (2) scientific adequacy, and (3) satisfaction with the answer (Table 1).
Raters were also requested to judge whether each answer was AI generated or not.

Table 1. Quality Indicators (Ease of Understanding, Scientific Adequacy, Satisfaction) for Answers From AI and From Non-AI Sources

                                                             Easy to         Scientifically   Satisfied with
                                                             understand      adequate         answers
Common question (CQ)                               Source    Mean     P      Mean     P       Mean     P
CQ1  What is a colonoscopy?                        AI        6.4      .54    5.9      .94     5.6      .78
                                                   Non-AI    5.7             5.7              5.7
CQ2  Why is a colonoscopy performed?               AI        5.9      .20    5.8      .65     5.6      .39
                                                   Non-AI    4.8             5.5              4.9
CQ3  How to prepare for a colonoscopy?             AI        5.9      .59    6.1      .72     5.8      1.0
                                                   Non-AI    5.8             5.9              5.6
CQ4  What to expect during the colonoscopy         AI        5.9      .78    5.6      .60     5.3      .66
     procedure?                                    Non-AI    5.6             6.0              5.5
CQ5  What to expect after the colonoscopy          AI        6.3      .05    5.9      .16     6.1      .12
     procedure?                                    Non-AI    5.3             5.1              5.0
CQ6  What to do after a negative colonoscopy       AI        6.4      .44    5.9      .28     6.1      .77
     result?                                       Non-AI    5.8             6.3              5.8
CQ7  What to do after a positive colonoscopy       AI        5.0      .87    5.4      .11     4.9      .83
     result?                                       Non-AI    4.6             6.0              4.8
CQ8  What to expect about complications?           AI        6.1      .40    6.5      .01     6.3      .02
                                                   Non-AI    5.6             5.4              4.8

Rated by 4 physicians on a 7-point Likert scale (7 = strongly agree, 4 = neutral, 1 = strongly disagree) for 3 statements: (1) "The answers are easy to understand," (2) "The answers are scientifically adequate," and (3) "I am satisfied with the answers." Statistical analysis by Mann-Whitney U test. After adjustment for multiple comparisons, the Bonferroni-corrected α value = 0.05/56 = .00089; P < .00089 was regarded as significant.
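The statistical procedure in the table's footnote can be sketched in a few lines. This is a minimal sketch: the Likert scores below are hypothetical stand-ins, since the per-rater data are not reproduced here.

```python
# Sketch of the Mann-Whitney U comparison with the Bonferroni-corrected
# threshold described in Table 1. The Likert scores are hypothetical.
from scipy.stats import mannwhitneyu

ai_scores = [7, 6, 6, 7]      # hypothetical 7-point Likert ratings (4 raters)
non_ai_scores = [5, 6, 5, 6]  # hypothetical

u_stat, p_value = mannwhitneyu(ai_scores, non_ai_scores, alternative="two-sided")

# Bonferroni correction for the study's 56 comparisons.
alpha = 0.05 / 56  # ~.00089
print(f"U = {u_stat}, P = {p_value:.3f}, significant: {p_value < alpha}")
```

With only 4 raters per group, almost no comparison can clear the corrected threshold, which is consistent with the paper reporting nearly all comparisons as nonsignificant.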
We found that ChatGPT answers had extremely low text similarity (0%–16%) compared with answers on hospital webpages, whereas text similarity between the 2 ChatGPT answers ranged from 28% to 77%, except for CQ7 (Supplementary Table 1). Gastroenterologists rated ChatGPT answers similarly to non-AI answers in ease of understanding (AI, 5.0–6.4 vs non-AI, 4.8–5.8; all P > .00089 comparing means after Bonferroni adjustment for 56 multiple comparisons), with the AI mean scores higher than the non-AI scores. Scientific adequacy scores were also similar (AI, 5.4–6.5 vs non-AI, 5.1–6.3; nonsignificant), with the AI mean score higher than the non-AI score 63% of the time. AI and non-AI answers received similar ratings for satisfaction with the answers (AI, 4.9–6.3 vs non-AI, 4.8–5.8; nonsignificant) (Table 1). The raters demonstrated only 48% accuracy in identifying AI-generated answers, with 41% sensitivity and 54% specificity. Three raters had an accuracy of less than 50%, and 1 (a fellow) had 81% accuracy (Supplementary Figure 1 and Supplementary Table 2). To our knowledge, this study is the first of its kind to demonstrate that a contemporary LLM-derived conversational AI program is able to provide easy-to-understand, scientifically adequate, and generally satisfactory answers to CQs about colonoscopy, as judged by gastroenterologists. One surprising finding was the low sensitivity in identifying AI-generated answers (sensitivities of 6%, 25%, and 44%).
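The accuracy, sensitivity, and specificity figures above come from scoring each rater's guesses against the true source of each answer. A minimal sketch with hypothetical labels (not the study's raw data):

```python
# Scoring one rater's guesses about answer provenance.
# True means "answer is AI generated". Labels are hypothetical.
truth = [True, True, True, True, False, False, False, False]
guess = [True, False, False, True, False, True, False, False]

tp = sum(t and g for t, g in zip(truth, guess))          # AI correctly flagged
tn = sum(not t and not g for t, g in zip(truth, guess))  # non-AI correctly identified
fp = sum(not t and g for t, g in zip(truth, guess))
fn = sum(t and not g for t, g in zip(truth, guess))

accuracy = (tp + tn) / len(truth)        # → 0.625
sensitivity = tp / (tp + fn)             # → 0.5  (share of AI answers caught)
specificity = tn / (tn + fp)             # → 0.75 (share of non-AI answers recognized)
print(accuracy, sensitivity, specificity)
```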
Heuristic feedback from the outperforming fellow revealed that "ChatGPT answers tended to be lengthy, used many colons (':') in the long list of possibilities it gave, and tended to be more of a list rather than a narrative paragraph in response." In contrast, answers from hospital webpages were "more like verbal responses to a patient as opposed to something more encyclopedic." This study suggests a potential role for conversational AI programs in optimizing communication between patients and health care providers, especially for high-volume procedures like colonoscopy. Despite similar ratings, there was little overlap or plagiarism between the AI and non-AI answers, as well as between the 2 AI answers (Supplementary Tables 1 and 3), suggesting an inherent plagiarism-avoiding design in LLMs and the capability of LLMs to create unique answers to the same question. Accumulated publications about ChatGPT in PubMed grew 10-fold from 20 on February 3, 2023, to 246 on April 14, 2023 (Supplementary Figure 2), with topics including board examinations (Gilson A, et al. JMIR Med Educ 2023;9:e45312), authorship and editorial policies (Stokel-Walker C, et al. Nature 2023;614:214-216), medical education (Mbakwe AB, et al. PLOS Digit Health 2023;2:e0000205), clinical decision support (Baumgartner C. Clin Transl Med 2023;13:e1206), an LLM assessment framework (Howard A, et al. Lancet Infect Dis 2023;23:405-406), and more. Although early in the adoption curve (Rogers EM. Diffusion of Innovations. 5th ed. Simon and Schuster; 2003), LLMs (ChatGPT, BioGPT, BARD, and others) may represent a transformative innovation in how medical information (MI) is created by physicians and consumed by patients.
Especially in the current era of shared decision making and the consumerization of health care, patients have been actively consuming MI through multiple channels and accessing providers through electronic patient portals at an exponentially growing rate, which has the potential to benefit patients but simultaneously represents a heavy burden for providers and staff. We envision that AI-generated MI, with appropriate provider oversight, accreditation, and periodic surveillance, could improve the efficiency of care and free providers for more cognitively intensive patient communications. Nevertheless, potential pitfalls have to be addressed. Currently, ChatGPT-generated MI is not constructed on the basis of clinical evidence but is created through an LLM trained on diverse Internet texts with reinforcement learning from human feedback (OpenAI; https://openai.com/blog/chatgpt; accessed February 8, 2023). LLM outputs may be sensitive and vulnerable to prompt engineering, that is, manipulation by subtle changes in input prompts, and the consistency of performance might be in "a state of constant change" (Lee P, et al. N Engl J Med 2023;388:1233-1239). Thus, there remains a large gap, in both technology and format, regarding the use of LLMs in responsible clinical care (Sackett DL, et al. BMJ 1996;312:71-72). Implicit bias is another concern, because clinical utility might differ for patients with or without resources. Furthermore, readability analyses using validated reading-level metrics (Flesch-Kincaid Grade Level, Gunning Fog Index) revealed that the AI-generated answers were written at significantly higher grade reading levels than the hospital webpages (P < .001), far exceeding the recommended eighth-grade threshold (Supplementary Table 4). This study has several limitations.
First, we did not include patient raters, the group to whom colonoscopy answers will ultimately be provided. For this study, we aimed to first critique AI-generated MI through the lens of medical professionals. Future research should explore responses to a broader sample of questions and clinical conditions, as well as the inclusion of patient raters. Second, the numbers of both hospital webpages and raters were small, which limits broad generalizability. Finally, the webpages of randomly selected top-tier hospitals may not be comprehensive. This study shows that a conversational AI program can generate credible MI in response to common patient questions. With dedicated domain training, there is meaningful potential to optimize clinical communication with patients.

Tsung-Chun Lee, MD (Conceptualization: Equal; Data curation: Equal; Formal analysis: Equal; Investigation: Equal; Methodology: Equal; Project administration: Equal; Writing – original draft: Equal; Writing – review & editing: Equal). Kyle Staller, MD (Data curation: Equal; Formal analysis: Equal; Methodology: Equal; Writing – review & editing: Equal). Vlaicu Botoman, MD (Data curation: Equal; Writing – review & editing: Equal). Mythili P. Pathipati, MD (Data curation: Equal; Writing – review & editing: Equal). Sanskriti Varma, MD (Data curation: Equal; Writing – review & editing: Equal). Braden Kuo, MD (Conceptualization: Lead; Data curation: Lead; Formal analysis: Lead; Investigation: Lead; Methodology: Lead; Project administration: Lead; Resources: Lead; Supervision: Lead; Writing – original draft: Lead; Writing – review & editing: Lead).
From the top-20 list of the US News & World Report Best Hospitals for Gastroenterology and GI Surgery (https://health.usnews.com/best-hospitals/rankings/gastroenterology-and-gi-surgery; accessed February 8, 2023), we randomly selected 1 university-affiliated teaching medical center on the East Coast, 1 in the Midwest, and 1 on the West Coast. We retrieved 8 common questions (CQs) and answers about colonoscopy from the publicly available webpages of these 3 randomly selected hospitals. These 8 CQs about colonoscopy are frequently asked by patients. We used these questions as prompts for ChatGPT (January 30, 2023, version) and recorded its answers. To assess the consistency of artificial intelligence (AI) answers, we entered the same prompt twice on the same day and recorded the answers as AI1 and AI2, respectively. We compared the text similarity of answers between AI1 and AI2, as well as between AI1 and the hospital webpages (non-AI), using plagiarism detection software (Copyleaks; https://copyleaks.com; accessed February 15, 2023); the results are shown in Supplementary Table 1. To objectively assess the quality of ChatGPT-generated answers, 4 gastroenterologists (2 senior gastroenterologists, 2 fellows) rated the answers. All 4 raters were blinded to the sources of the answers. For a total of 36 pairs of CQs and answers displayed in random order, raters scored the following quality indicators on a 7-point Likert scale: (1) ease of understanding, (2) scientific adequacy, and (3) satisfaction with the answer (Table 1). Raters were also requested to judge whether each answer was AI generated or not. Raters' performance in identifying AI-generated answers is shown in Supplementary Figure 1 and Supplementary Table 2. Example answers from AI and non-AI sources to CQ1, "What is a colonoscopy?" and CQ2, "Why is a colonoscopy performed?" are exhibited in Supplementary Table 3.
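The study measured text similarity with Copyleaks, a proprietary tool. As a rough stand-in only, the standard library's `SequenceMatcher` illustrates the idea of a pairwise similarity ratio between two answer texts; the example strings are hypothetical, not the study's answers.

```python
# Rough stand-in for a pairwise text-similarity check (the study itself
# used Copyleaks). SequenceMatcher.ratio() returns a 0-1 similarity score.
from difflib import SequenceMatcher

ai_answer = (
    "A colonoscopy is a procedure that examines the lining of the large intestine."
)
hospital_answer = (
    "Colonoscopy lets your doctor look inside your entire large intestine."
)

similarity = SequenceMatcher(None, ai_answer.lower(), hospital_answer.lower()).ratio()
print(f"similarity: {similarity:.0%}")
```

A character-level ratio like this is far cruder than commercial plagiarism detection, which matches phrases against large corpora, but it conveys how "0%–16% similarity" between two texts can be quantified.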
Medical information provided to patients should be readable such that an eighth grader could understand it (Murphy B, et al. Surgeon 2022;20:E366-E370). We measured the reading levels of all answers to the CQs using 2 objective reading-level indexes: the Flesch-Kincaid Grade Level (Kincaid JP, et al. http://stars.library.ucf.edu/istlibrary/56; accessed April 12, 2023) and the Gunning Fog Index (Avra TD, et al. J Vasc Surg 2022;76:1728-1732) (Supplementary Table 4). Both indexes are well-recognized objective measures in which an index number x represents the corresponding xth-grade reading level. Medical information given to patients ideally should have an index of 8 or less. Measurements were performed with an online readability tool (https://readable.com; accessed April 12, 2023). We searched the PubMed database with the keyword "ChatGPT" and obtained the list of publications involving ChatGPT (https://pubmed.ncbi.nlm.nih.gov; accessed April 12, 2023). The chronology of ChatGPT publications in PubMed is shown in Supplementary Figure 2. Data are shown as mean or mean (standard deviation). Comparison of quality indicators for answers from AI vs non-AI sources was performed using the Mann-Whitney U test. After adjustment for multiple comparisons, the Bonferroni-corrected α value was calculated as 0.05 divided by 56 comparisons, that is, .00089. Therefore, P < .00089 was regarded as significant in the comparison of the 3 quality indicators between answers from AI and non-AI sources.
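The two readability formulas named above are simple functions of sentence length, word length, and syllable counts. A minimal sketch (the study used https://readable.com; the syllable counter here is a crude vowel-group heuristic, so scores will differ slightly from polished tools):

```python
# Flesch-Kincaid Grade Level and Gunning Fog Index, computed from scratch.
# Syllable counting is a rough heuristic (consecutive-vowel groups).
import re

def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = len(words)
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    # Flesch-Kincaid Grade Level: 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59
    fkgl = 0.39 * (n / sentences) + 11.8 * (syllables / n) - 15.59
    # Gunning Fog Index: 0.4*[(words/sentence) + 100*(complex words/words)]
    fog = 0.4 * ((n / sentences) + 100 * (complex_words / n))
    return fkgl, fog

fkgl, fog = readability(
    "The colonoscope is inserted through the rectum and advanced to examine "
    "the entire length of the large intestine."
)
print(f"FKGL = {fkgl:.1f}, Fog = {fog:.1f}")  # both exceed the eighth-grade threshold here
```

An index of 8 corresponds to an eighth-grade reading level, the recommended ceiling for patient-facing material; long clause-heavy sentences with polysyllabic clinical terms, as in this example, push both indexes well above it.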
Each rater's performance in detecting AI-generated answers was calculated and expressed as sensitivity and specificity. For reading level, we compared the reading levels of the answers from each source against an eighth-grade reading level, and also compared the reading levels of answers from AI and non-AI sources using the Mann-Whitney U test.

Supplementary Figure 1 and Supplementary Table 2 show the performance of the 4 raters in identifying AI-generated answers. Supplementary Figure 2 shows the chronology of publications related to ChatGPT in the PubMed database as of April 14, 2023. Supplementary Table 1 compares the text similarity of answers to the 8 CQs (AI1 vs AI2, and AI vs hospital webpages); the AI answers had extremely low text similarity (0%–16%) to those from the hospital webpages. Supplementary Table 3 exhibits example answers from AI and non-AI sources to CQ1 ("What is a colonoscopy?") and CQ2 ("Why is a colonoscopy performed?"). Supplementary Table 4 reports the reading levels (Flesch-Kincaid Grade Level and Gunning Fog Index, shown as mean [standard deviation]) of answers from AI and non-AI sources, compared against an eighth-grade reading level and against each other using the Mann-Whitney U test with a Bonferroni-corrected α for multiple comparisons.
Similar works
Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries
2021 · 110,164 citations
Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries
2018 · 87,231 citations
Global cancer statistics
2011 · 54,997 citations
Cancer incidence and mortality worldwide: Sources, methods and major patterns in GLOBOCAN 2012
2014 · 28,924 citations
Global cancer statistics, 2012
2015 · 27,300 citations