OpenAlex · Updated hourly · Last updated: 13.03.2026, 12:52


Trials and Tribulations: Responses of ChatGPT to Patient Questions About Kidney Transplantation

2024 · 4 citations · Transplantation

4 citations · 22 authors · 2024

Abstract

Large language models (LLMs) such as chat generative pretrained transformer (ChatGPT) have introduced a new avenue for patients to acquire medical information.1-4 LLMs have the potential to facilitate interactions that closely resemble human communication, potentially making them attractive alternatives to conventional medical consultations or web search engines for medical information retrieval.1-5 However, the precision of medical advice offered by ChatGPT or other similar artificial intelligence (AI) systems must be systematically validated, particularly in specialized fields requiring high levels of expertise, such as kidney transplantation (KT).3,4,6 Transplantation is a multidisciplinary and rapidly evolving field. Thus, the dynamic nature of medical consensus compounds the challenge of providing timely and accurate recommendations regarding KT.3,4,6 LLMs like ChatGPT may struggle to provide accurate and comprehensive answers, potentially leading to misinformation or omissions, particularly in complex or evolving medical scenarios.4 Ensuring that LLM tools are aligned with current medical standards and are regularly updated to reflect the latest clinical guidelines is crucial for their safe implementation. Additionally, the inherent complexity of medical terminology and the subtleties of patient symptoms can sometimes result in misinterpretations by AI, potentially leading to inaccurate advice. Prior studies have assessed the accuracy and reliability of AI-generated generic medical responses using the Likert scale for completeness and accuracy with a large set of questions and a diverse group of clinicians.4,6-8 Hence, to make our results comparable, we evaluated the accuracy, completeness, and potential harm of ChatGPT-generated responses to KT-related patient queries. 
Incorporating the "potential harm" category ensures that the assessment of responses aligns with the core principles of AI ethics as outlined by the United Nations Educational, Scientific and Cultural Organization, particularly the need for proportionality and the imperative to avoid causing harm.9 In our study, clinically active transplant nephrologists and surgeons assessed ChatGPT responses to patient questions derived from diverse internet sources. For each question, we also provided recommendations and evidence grades based on Kidney Disease: Improving Global Outcomes (KDIGO) guidelines.10 Clinicians scrutinized the accuracy, comprehensiveness, and potential harm9 associated with ChatGPT-generated responses.

EXPERT EVALUATION AND RESULTS

We retrieved 20 questions frequently asked by KT patients and families from publicly available social media, including Quora (search: kidney transplant), Reddit (r/transplant, 9.5k members; r/kidneydisease, 10k members), and Facebook groups (1: Kidney Support: Dialysis, Transplants, Donors and Recipients, 32k members, >10 posts per day; 2: Kidney Disease Support Group, 32k members, >10 posts per day). A total of >830 questions and queries were reviewed by 2 researchers (J.X. and M.M.) with the assistance of 2 experienced transplant surgeons (B.E.L. and B.J.O.); based on the number of repetitions and the number of replies in threads, 20 questions were determined to be commonly asked by patients and families in these categories: KT candidate evaluation process, donor kidney selection, posttransplant immunosuppressant medications, rejection of a transplanted organ, and overall long-term care of the transplanted kidney (Table 1).8

TABLE 1. Quality of ChatGPT responses to patient questions(a)

| # | Question | KDIGO grade | Accuracy | Completeness | Possible harm |
|---|----------|-------------|----------|--------------|---------------|
| 1 | If I am older or have other health issues, does it affect my chances of getting a kidney transplant? | 1A | 4.4 (88%) | 2.4 (80%) | 4 (100%) |
| 2 | How do I start the process of getting a kidney transplant? How does the evaluation process work? | 1D/not graded | 3.9 (78%) | 2.0 (67%) | 3.7 (93%) |
| 3 | Would I have to lose weight to get a kidney transplant? | 2B/2D | 4.3 (86%) | 2.4 (80%) | 4 (100%) |
| 4 | How do I get a kidney from a living donor? | 1A | 3.9 (78%) | 1.9 (62%) | 3.1 (79%) |
| 5 | What are the financial costs to the living donor? | Not graded | 4.1 (82%) | 2.3 (76%) | 3.9 (96%) |
| 6 | How can I communicate with my donor family after kidney transplantation? | Not graded | 4.3 (86%) | 2.7 (90%) | 3.7 (93%) |
| 7 | What if my body tries to reject the new kidney? | 1C/1D/2C/2D | 3.4 (68%) | 2.1 (71%) | 3.4 (86%) |
| 8 | Can you start smoking after a kidney transplant? | Not graded | 3.4 (68%) | 2.0 (67%) | 3.1 (79%) |
| 9 | When do I start the immunosuppression medications for kidney transplant? And what are those medications? | 1A/1B/2A/2B | 3.4 (68%) | 2.0 (67%) | 3.6 (89%) |
| 10 | Can I have children after kidney transplant? | 2C/2D/not graded | 4.3 (87%) | 2.2 (72%) | 3.3 (83%) |
| 11 | How will a kidney transplant affect my sex life? | 2C/2D/not graded | 3.7 (74%) | 1.7 (56%) | 3.0 (75%) |
| 12 | When can I start exercising? What exercise is good for those who have kidney transplant? | 1C | 4.5 (90%) | 2.8 (94%) | 3.7 (92%) |
| 13 | What foods are not allowed after a kidney transplant? | 1C/not graded | 3.0 (60%) | 1.5 (50%) | 3.0 (75%) |
| 14 | Will I become a diabetic after kidney transplant? | 1C/2D | 3.9 (78%) | 2.2 (72%) | 3.7 (92%) |
| 15 | What are side effects of kidney transplant medication? | 2B/2C | 4.4 (88%) | 2.7 (89%) | 3.9 (96%) |
| 16 | Does the chance of having cancer after kidney transplant increase? | 1C/1D/not graded | 4.1 (82%) | 2.6 (86%) | 3.6 (89%) |
| 17 | How long do I have to wait until I am able to travel domestically by air or overseas after kidney transplant? I have a trip booked and really don't want to cancel it. | 2D | 3.7 (74%) | 2.0 (67%) | 3.3 (82%) |
| 18 | Do I need lifelong medication after kidney transplant? | 2B/2C/2D | 4.0 (80%) | 2.7 (90%) | 3.9 (96%) |
| 19 | Can I get a kidney from a donor with HIV? | Not graded | 4.6 (92%) | 2.7 (90%) | 3.9 (96%) |
| 20 | What happens if my kidney donor is incompatible with me? | Not graded | 4.6 (92%) | 2.7 (90%) | 3.9 (96%) |

(a) Data reported as n (%) using majority rule of annotators' scores. KDIGO, Kidney Disease: Improving Global Outcomes.

Seven experts independently rated ChatGPT responses for accuracy, completeness, and potential harm, using 3-, 4-, and 5-point Likert scales. All 7 experts practice clinically (3 transplant nephrologists and 4 transplant surgeons specialized in KT) and hold academic appointments at the NYU Grossman School of Medicine. Senior surgeons were clinically active transplant surgeons with >10 y of practice, >200 publications, and a track record of NIH-funded research. The nephrologists were all specialized and trained in KT, with an average of 15 y of experience. ChatGPT responses to the selected questions were benchmarked against the KDIGO Clinical Practice Guidelines (Evaluation and Management of Candidates for KT, Care of Kidney Transplant Recipients, and Living Kidney Donor); we chose KDIGO based on its citation counts (kidney recipients: 968 citations; kidney donors: 608 citations; kidney candidates: 447 citations).10 We analyzed responses to 20 patient questions, rated by 7 experts across 3 categories: accuracy (poor, fair, average, good, and excellent), completeness (incomplete, adequate, and comprehensive), and possible harm (very harmful, somewhat harmful, not too harmful, and not harmful at all). Overall, ChatGPT responses were rated positively for accuracy and completeness, with minimal perceived harm. In 11 of 20 questions, ChatGPT responses achieved an average accuracy score of ≥4 out of 5. Responses regarding HIV-positive donors and incompatible donors scored highest, whereas those about posttransplant care varied in assessed accuracy. Notably, responses on organ rejection and immunosuppression medications received low accuracy ratings.
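The headline counts reported for Table 1 can be re-derived from the per-question mean scores; a minimal sketch (scores transcribed from the table, with scale maxima of 5, 3, and 4 for accuracy, completeness, and harm, respectively):

```python
# Mean scores (accuracy, completeness, possible harm) for questions 1-20,
# transcribed from Table 1.
scores = [
    (4.4, 2.4, 4.0), (3.9, 2.0, 3.7), (4.3, 2.4, 4.0), (3.9, 1.9, 3.1),
    (4.1, 2.3, 3.9), (4.3, 2.7, 3.7), (3.4, 2.1, 3.4), (3.4, 2.0, 3.1),
    (3.4, 2.0, 3.6), (4.3, 2.2, 3.3), (3.7, 1.7, 3.0), (4.5, 2.8, 3.7),
    (3.0, 1.5, 3.0), (3.9, 2.2, 3.7), (4.4, 2.7, 3.9), (4.1, 2.6, 3.6),
    (3.7, 2.0, 3.3), (4.0, 2.7, 3.9), (4.6, 2.7, 3.9), (4.6, 2.7, 3.9),
]

def count_at_least(rows, index, threshold):
    """Count questions whose mean score in one category meets a threshold."""
    return sum(1 for row in rows if row[index] >= threshold)

accuracy_ge_4 = count_at_least(scores, 0, 4)      # accuracy >= 4 of 5: 11 of 20
completeness_ge_2 = count_at_least(scores, 1, 2)  # completeness >= 2 of 3: 17 of 20
harm_ge_3 = count_at_least(scores, 2, 3)          # harm >= 3 of 4: all 20
```

Note that the percentages printed in the table were evidently computed from unrounded means (e.g., questions 3 and 10 both show 4.3 for accuracy but 86% vs 87%), so they cannot be reproduced exactly from the rounded scores shown here.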
In addition, 17 of 20 questions received an average completeness score of ≥2 out of 3. Responses regarding communication with donor families, exercising after KT, and medication management scored highest, whereas those on pretransplant evaluation and posttransplant lifestyle choices scored lower in completeness. Furthermore, clinicians unanimously rated ChatGPT responses as minimally harmful, with all questions scoring ≥3 out of 4. The only question to receive a single "very harmful" rating concerned smoking after KT, whereas questions on posttransplant lifestyle and sexual health received relatively lower scores. Clinicians also provided qualitative feedback on response accuracy, completeness, and possible harm. Positive comments praised the user-friendly nature and comprehensiveness of the responses; however, some clinicians noted areas for improvement, such as providing more specific details and guidance on sensitive topics like sexual health and contraception. Overall, the evaluation suggests that although ChatGPT responses are generally accurate and comprehensive, there is room for improvement in addressing specific nuances and providing detailed guidance on selected topics.

IMPACT

There are significant implications for using LLMs like ChatGPT in patient education and communication in the context of KT. By demonstrating the potential of ChatGPT to provide accurate and comprehensive responses to KT patient queries, our study highlights the role of LLMs in augmenting healthcare delivery and improving patient engagement.7 Moreover, our identification of risks associated with ChatGPT responses underscores the importance of implementing safeguards and quality control measures to mitigate potential harm to patients.
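One form such a safeguard could take is a pre-display screening step on LLM output; the sketch below is a toy illustration only (the keyword list, disclaimer text, and function name are hypothetical, not part of the study), since production-grade screening would require clinically validated rules:

```python
# Toy quality-control safeguard: flag high-risk phrasing in an LLM answer
# and always append a referral back to the care team. The terms and message
# below are hypothetical examples, not validated clinical criteria.
HIGH_RISK_TERMS = ("stop taking", "skip your dose", "no need to see your doctor")
DISCLAIMER = "Please confirm any medication or treatment change with your transplant team."

def screen_response(answer: str) -> tuple[bool, str]:
    """Return (flagged, answer_with_disclaimer) for a candidate LLM answer."""
    flagged = any(term in answer.lower() for term in HIGH_RISK_TERMS)
    return flagged, f"{answer}\n\n{DISCLAIMER}"

flagged, safe_answer = screen_response(
    "You can stop taking tacrolimus once you feel well."
)
# A flagged answer would be routed to human review rather than shown as-is.
```

A keyword filter like this is deliberately crude; its point is only that screening and escalation can sit between the model and the patient.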
ChatGPT RESPONSES TO CONTROVERSIAL TOPICS IN KT

Topics in KT, such as the use of immunosuppressive drugs and the risk of donor-derived HIV infection, often lack consensus or have evolving evidence.11,12 Our clinicians found ChatGPT responses on these topics thorough and thoughtful, offering nuanced insights and up-to-date information. However, critiques noted the omission of key details, leading to potential misinformation or lack of clarity. This aligns with findings from a recent study of ChatGPT responses to questions about liver transplantation, which highlighted the risk of misleading information dissemination by ChatGPT on such topics.13

PATIENT EDUCATION AND COMMUNICATION

Prior studies across various medical fields have suggested that LLMs like ChatGPT can generate credible medical information, enhancing patient education and communication.6,7,13 Our study echoes these findings, indicating that ChatGPT can effectively and accurately address KT patient queries. The accessibility and timeliness of LLM-generated responses can potentially improve patient satisfaction and engagement in healthcare decision-making processes. Additional benefits, including personalized learning, improved understanding of complex medical concepts, step-by-step instructions, interactive content, and instant feedback, can all enhance patient education.6,13 Unlike traditional information retrieval systems such as web browsing or patient portal messages with providers, LLMs like ChatGPT engage in a back-and-forth dialogue with users. This conversational interaction is dynamic, with users posing follow-up questions based on previous responses; it enhances user engagement and satisfaction but also introduces complexities in evaluating the model's performance.14 Although LLMs cannot gauge patients' understanding directly, they can facilitate provider–patient communication by simplifying, translating, and summarizing complex medical information.
Additionally, they can translate medical information into multiple languages, catering to non-English-speaking patients and improving inclusivity.15

RISKS OF ChatGPT RESPONSES

Despite its benefits, the use of ChatGPT in a medical context carries inherent risks, including the potential for "hallucination," where responses lack factual accuracy or logical coherence.16 Our study identified instances of hallucination in ChatGPT responses, such as incorrect guidance on dietary restrictions after KT. Moreover, the omission of critical information, particularly regarding sensitive topics like sexual health, poses risks to patients' well-being.6,16 Furthermore, although ChatGPT can provide detailed and accurate information, it lacks the interactive capability to verify a user's understanding of that information; for instance, it cannot ask clarifying questions or adapt its responses to a user's level of comprehension.17 It is imperative to recognize and mitigate these risks to ensure the safe and effective use of ChatGPT in patient education and decision-making processes.

FUTURE IMPLICATIONS OF TRUSTWORTHY AI

Building on our findings, future transplant research should focus on developing and deploying AI systems that are reliable, transparent, ethical, and safe for use in healthcare settings—in other words, trustworthy AI.
Trustworthiness in AI encompasses several key components suggested by our results: further validation of the integration of LLMs into clinical practice, including assessment for potential biases and harm; education and training of clinicians and patients on the use and risks of LLMs; accountability and responsibility; transparency and explainability; and regulatory compliance of LLM usage in healthcare.18 Further validation studies involving larger cohorts of clinical experts and patient engagement across multiple international medical centers are needed to corroborate our findings and to refine the performance of ChatGPT in addressing KT patient queries in languages other than English, as ChatGPT can recognize and generate answers in >90 languages. Continuous refinement of LLMs and the creation of new, transplant-specific models based on real-world feedback and clinical validation are essential to enhance their accuracy and reliability. LLMs like ChatGPT can be fine-tuned on reliable data sources, such as academic papers or clinical guidelines, to provide more dependable clinical responses.17 It should be noted that these methods are still under development and remain imperfect. One possible approach involves benchmarking alongside novel methods such as fine-tuning on dedicated "safety" data sets. Medical natural language processing benchmarks, including MedQA, MedMCQA, and PubMedQA, are established standards for evaluating the performance of state-of-the-art medical LLMs.19 It is important to highlight, however, that most of these benchmarks focus primarily on performance metrics rather than safety considerations. Efforts to integrate LLMs like ChatGPT into clinical practice should focus on safe and reliable integration with existing healthcare systems. LLMs deployed for clinical use require rigorous evaluation to address risks such as hallucinations, where LLMs can produce erroneous or unsupported information.
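Evaluation on multiple-choice benchmarks of the MedQA family reduces to scoring model answers against gold labels; a minimal sketch, in which the two items and the stand-in answer function are hypothetical placeholders rather than actual benchmark data:

```python
# Sketch of a MedQA-style multiple-choice evaluation loop.
# The items below are illustrative stand-ins; a real evaluation would use
# the published benchmark splits and call an actual LLM in answer_fn.
from typing import Callable

# Each item: (question, options keyed by letter, gold answer letter)
ITEMS = [
    ("Which drug class does tacrolimus belong to?",
     {"A": "Beta blocker", "B": "Calcineurin inhibitor", "C": "Statin"}, "B"),
    ("What does KDIGO stand for?",
     {"A": "Kidney Disease: Improving Global Outcomes",
      "B": "Kidney Donor Index for Graft Outcomes",
      "C": "Kidney Dialysis Guidance Organization"}, "A"),
]

def evaluate(answer_fn: Callable[[str, dict], str]) -> float:
    """Return the fraction of items answered correctly by answer_fn."""
    correct = sum(1 for q, opts, gold in ITEMS if answer_fn(q, opts) == gold)
    return correct / len(ITEMS)

# Stand-in "model" that always picks option A, as a trivial baseline.
baseline_accuracy = evaluate(lambda q, opts: "A")
```

As the text notes, a loop like this measures performance only; safety-oriented evaluation (harm, omission of critical caveats) needs separate criteria and annotation.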
Including diverse stakeholder groups in the development process is essential, as it helps ensure that the functionalities and assessment criteria of LLMs reflect a broad spectrum of user needs and expectations. Additionally, advanced detection mechanisms are necessary to identify and correct errors in LLM outputs, from factual inaccuracies to complex diagnostic mistakes. Although ChatGPT can generate helpful medical advice, it lacks mechanisms to ensure that this advice adheres to professional standards and can be reliably attributed to a responsible party.14 Tracking responsibility and accountability in LLMs presents significant challenges owing to their reliance on extensive and diverse data sets, the frequent absence of clear documentation regarding data sources, and the intricate decision-making processes inherent in these models. These complexities make it difficult to assign responsibility to specific developers or stages of development. The opaque, "black-box" nature of many LLMs further complicates transparency and explainability, impeding the understanding and communication of how particular outputs are generated. To address these issues, it is essential to implement comprehensive documentation, conduct regular audits, develop interpretable models, and establish clear regulatory and ethical guidelines for responsible AI development and usage. For instance, in the United States, the current Food and Drug Administration approval process, which primarily compares new AI devices with preexisting ones, may not adequately address safety and biases.20 Close collaboration between policymakers, industry stakeholders, and healthcare professionals is crucial to ensure responsible and ethical use of LLMs in patient care.

SUMMARY

We demonstrated that LLMs like ChatGPT can effectively provide high-quality information on KT, with responses generally marked by accuracy, completeness, and minimal perceived harm.
However, the varying quality of responses to complex medical topics highlights the necessity for ongoing updates and stringent quality controls. Our findings support further research into the use of LLMs for improving patient education and communication within clinical settings, stressing the importance of keeping these tools updated and closely monitored to ensure safety and reliability. Before deploying LLMs in a clinical setting, however, further studies engaging patients are essential to fully understand their impact and effectiveness.
