
Evaluating artificial intelligence responses to respiratory medicine questions

2024 · 3 citations · Respirology · Open Access

Abstract

The Chat Generative Pre-trained Transformer (ChatGPT) is an advanced artificial intelligence-powered natural language processing tool. Released in November 2022, it quickly gained recognition for its ability to generate human-like textual responses in conversational settings.1, 2 Originating from the GPT-3.5 model, its successor, GPT-4, was introduced in March 2023, significantly enhancing its performance and accuracy. ChatGPT is designed to engage users in discussions on a wide array of topics, ranging from casual conversation to generating complex content such as stories or poems. It claims to provide informative responses, answer queries and foster creative expression with a remarkable semblance to human interaction. Despite its acclaim, the use of ChatGPT within the medical sector is met with scepticism. The tool has not been extensively trained on biomedical datasets, nor have its generated responses been thoroughly vetted by medical professionals for accuracy and reliability.3, 4 This has led to concerns about its capability to handle medical inquiries accurately, often resulting in responses that may appear general or lacking in depth.5 To date, there is a lack of peer-reviewed research assessing ChatGPT's effectiveness in delivering precise and detailed answers to medical questions, particularly those that are disease-specific and clinical in nature. In light of these challenges, this study aims to critically evaluate the ability of GPT-4 to address a variety of questions frequently encountered in respiratory medicine. This includes assessing its capacity to provide basic information as well as detailed explanations of clinical phenomena. It is important to note that while ChatGPT exhibits potential in various applications, it is not a substitute for professional medical advice. Users and researchers should exercise caution, acknowledging that ChatGPT should complement, not replace, the expertise of qualified healthcare professionals.
This study seeks to underscore the necessity for clear disclaimers about ChatGPT's current limitations when applied to medical inquiries, ensuring that the public maintains realistic expectations about its use as a healthcare information resource. We posed questions to ChatGPT covering a range of respiratory conditions, including asthma, chronic obstructive pulmonary disease, pneumonia, chronic bronchitis, emphysema, tuberculosis, lung cancer, pulmonary embolism, sleep apnoea, cystic fibrosis, pneumothorax and acute respiratory distress syndrome. The questions were derived from diverse sources such as textbooks, clinical consultations and real-life case scenarios, and were categorized into three themes: fundamental medical knowledge; clinical presentation and diagnosis; and treatment and management. To assess the quality of ChatGPT's responses, four experts, all holding doctoral degrees and expertise in respiratory medicine, rated each response as completely correct, incomplete/partially correct, incorrect/misleading or irrelevant. The assessors showed significant consensus on the quality of responses across all prompted questions. A total of 63 questions were presented to ChatGPT for assessment (Table 1). Three examples of the questions used in this study are shown in Table 2. Overall, ChatGPT provided completely correct answers for 63.5% of the questions and incomplete or partially correct responses for 17.5%. It gave incorrect or misleading answers to 9.5% of the questions and irrelevant responses to a further 9.5%. Notably, questions on basic medical knowledge had the highest proportion of correct answers, at 90.0%, whereas questions on the treatment and management of diseases had a higher incidence of incorrect responses, at 27.8%. It is also notable that GPT-4 was more accurate than its predecessor, GPT-3.5, which achieved a correct response rate of only 30% on the same questions.
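The reported proportions can be reproduced from per-category rating tallies. As a minimal sketch, the counts below (40/11/6/6 of 63) are assumptions reconstructed from the published percentages, since the per-question ratings are not given in this summary:

```python
from collections import Counter

# Hypothetical tallies chosen so the proportions match the reported
# percentages for the 63 questions; the actual per-question ratings
# were not published with this summary.
ratings = Counter({
    "completely correct": 40,
    "incomplete/partially correct": 11,
    "incorrect/misleading": 6,
    "irrelevant": 6,
})

total = sum(ratings.values())  # 63 questions in all
for category, count in ratings.items():
    # One decimal place, matching the precision used in the text.
    print(f"{category}: {count}/{total} = {100 * count / total:.1f}%")
```

Running this prints 63.5%, 17.5%, 9.5% and 9.5% for the four categories, consistent with the figures quoted above.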
However, limitations persist within ChatGPT's functionality. Despite the advanced capabilities of GPT-4, the model's training data consist predominantly of a diverse range of internet text available up to September 2021, a broad corpus that lacks a focus on specialized biomedical datasets. Consequently, while GPT-4 can generate responses that use appropriate medical terminology and appear convincing, these responses may contain inaccuracies, be overly general or insufficiently address specific clinical scenarios. The restricted training data also limit ChatGPT's ability to interpret highly specialized reports, such as lung function reports in respiratory diseases, or to diagnose conditions accurately without considering the full clinical picture: for example, it erroneously identified elevated carcinoembryonic antigen levels as indicative of lung cancer, without considering pathological findings. Recognizing these limitations is critical when considering ChatGPT's application in medical settings. While it shows promise in medical education and in offering preliminary responses to routine clinical inquiries, reliance on ChatGPT without expert human supervision is not advisable at this stage. Responses can vary depending on how questions are phrased and on the specificity of the query. This variability, coupled with the potential for outdated or generalized responses, underscores the necessity for medical professionals to critically evaluate the information ChatGPT provides and to supplement it with the latest medical knowledge and clinical judgement.
Apart from the accuracy and relevance of ChatGPT's responses, there are several other factors to take into account when utilizing it in respiratory medicine. The quality of the generated responses is contingent upon the quality of the prompts used. For instance, asking for a 'brief overview of asthma' may yield a general response covering basic aspects of the condition, whereas a prompt requesting 'detailed pathophysiological mechanisms of asthma exacerbation' is likely to elicit a more comprehensive and technical reply. These examples demonstrate how the precision and clarity of a prompt directly affect the usefulness and accuracy of ChatGPT's response to medical inquiries. It should also be noted that ChatGPT is non-deterministic: the same prompt can yield different responses for different users, and rephrasing a query or posing a more specific question may elicit a distinct answer. This variability is undesirable for its application as a medical reference. Users should be mindful that both GPT-3.5 and GPT-4 were trained on data only up to September 2021, meaning that their answers may not be up to date. For instance, ChatGPT's training data do not encompass the latest report of the Global Initiative for Chronic Obstructive Lung Disease (GOLD), which was released online on 14 November 2022. In light of the findings of this study, it is clear that while ChatGPT can generate useful responses in respiratory medicine, it should be used with caution. Despite the advancements of GPT-4, significant limitations remain in its ability to deliver accurate and reliable medical advice, particularly when handling complex clinical scenarios. These limitations underscore the importance of not relying solely on ChatGPT for medical decision-making or diagnosis without the oversight of qualified healthcare professionals.
The potential utility of ChatGPT in the medical field appears more suited to generating preliminary insights or as a supplementary educational tool under expert guidance. This approach ensures that the limitations of ChatGPT are acknowledged, mitigating the risks associated with its current use in clinical settings. Future research should focus on exploring methods to enhance the model's accuracy and reliability, particularly in specialized domains such as respiratory medicine, to better support medical professionals and potentially improve patient care.

Hong Luo: Conceptualization (supporting); formal analysis (lead); investigation (lead); methodology (lead); project administration (equal); software (lead); writing – original draft (lead); writing – review and editing (supporting). Jisong Yan: Formal analysis (supporting); methodology (supporting); project administration (supporting); software (supporting); writing – original draft (supporting); writing – review and editing (supporting). Xia Zhou: Conceptualization (lead); formal analysis (supporting); investigation (supporting); methodology (supporting); project administration (lead); software (lead); writing – review and editing (lead).

None declared. Not applicable.

Data sharing is not applicable to this article as no datasets were generated or analysed during the study.
