OpenAlex · Updated hourly · Last updated: 16.03.2026, 06:54

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Effects of prompt engineering on large language model performance in response to questions on common ophthalmic conditions

2024 · 5 citations · 4 authors · Taiwan Journal of Ophthalmology · Open Access
Open full text at the publisher

Abstract

Introduction

With the ability to respond to free-text queries in a human-like manner, online large language models (LLMs) such as ChatGPT have garnered great interest, especially in healthcare.[1] Since more patients, including ophthalmic patients, are seeking health information online, whether LLMs can properly serve as patient resources and provide appropriate responses to patient inquiries has become increasingly relevant. For this, the performance of LLMs must be examined and optimized to ensure patient safety. Prompt engineering, the integration of prompt information describing the specific assumed task of the model, has been shown to fine-tune LLM output.[2] This feature has become available on most LLM platforms and can be easily used by patients. We therefore asked whether prompt engineering could further improve LLM performance as an alternative resource for ophthalmic patients. To investigate the potential benefits of incorporating prompt engineering into this task, this study assessed its effects on LLM responses to questions about common ophthalmic conditions.

Methods

Using Google “People Also Ask,”[3] we collected five frequent online questions about each of the following ophthalmic conditions: myopia, cataract, glaucoma, macular degeneration, diabetic retinopathy, and presbyopia. The resulting 30 questions were submitted to GPT-4 (ChatGPT, OpenAI Inc.) twice [Table 1], with the second submission augmented by prompt engineering [Supplementary Method 1].

Table 1: Frequently asked questions by internet users regarding common ophthalmic diseases

The accuracy and readability of each response were recorded. Two experts (JHW and TN) independently assessed response accuracy on a 5-point Likert scale, initially scoring each response as “Strongly agree,” “Agree,” “Neutral,” “Disagree,” or “Strongly disagree” (from most accurate to least accurate). A response was considered “accurate” when both experts entered a score better than “Agree.” For readability, we used an online tool (https://readabilityformulas.com/) to calculate the automated readability index (ARI), which represents the grade level needed to comprehend the text [lower ARI indicates better readability; Supplementary Method 2].[4] The average response accuracy and ARI across the two submissions were analyzed using Microsoft Excel (Microsoft Corp.). Material changes in responses between the two submissions, defined as notable changes to the content not attributable to rephrasing or wordsmithing, were also assessed by consensus between the experts.[3]
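As a rough illustration of the two-pass protocol described above, the sketch below submits one question to GPT-4 twice via the OpenAI Python client, once without and once with an engineered system prompt. The prompt text shown is hypothetical; the actual instruction used in the study is given in Supplementary Method 1 and is not reproduced here.

```python
# Minimal sketch of the two-pass submission, assuming the OpenAI Python client
# (openai>=1.0) and an OPENAI_API_KEY in the environment. The system prompt below
# is illustrative only, not the study's actual prompt (see Supplementary Method 1).
from openai import OpenAI

client = OpenAI()

EXAMPLE_PROMPT = (
    "You are an ophthalmologist answering a patient's question. "
    "Give an accurate, concise answer suitable for a layperson."
)  # hypothetical engineered prompt

def ask_gpt4(question: str, system_prompt: str | None = None) -> str:
    """Submit one question, optionally prefixed by an engineered system prompt."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    return reply.choices[0].message.content

question = "What causes myopia?"             # one of the 30 collected questions
first = ask_gpt4(question)                   # first submission, no prompt engineering
second = ask_gpt4(question, EXAMPLE_PROMPT)  # second submission, with prompt engineering
```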
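The readability metric can also be reproduced locally. The sketch below implements the standard ARI formula, 4.71·(characters/words) + 0.5·(words/sentences) − 21.43; the online tool used in the study may tokenize text slightly differently, so scores can differ at the margin.

```python
# Standard automated readability index (ARI); the study used
# https://readabilityformulas.com/, whose tokenization may differ slightly.
import re

def automated_readability_index(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    characters = sum(len(w) for w in words)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)

# An ARI around 10, as reported below, corresponds roughly to a 10th-grade reading level.
```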
Results

At the first submission, the overall response accuracy was 60.0% and the average ARI was 9.9. At the second submission, with prompt engineering, the overall accuracy was 73.3% and the ARI was 10.2. Figure 1 shows the accuracy and ARI breakdown by ophthalmic condition. Examples of responses receiving different accuracy/readability gradings are shown in the Supplementary Materials: the response in Supplementary Table 1 was graded “accurate,” whereas the response in Supplementary Table 2 was determined “inaccurate” because it falsely referenced unverified data. The examples in Supplementary Tables 3 and 4 had ARIs of 6.3 (better readability) and 12 (worse readability), respectively.

Figure 1: Average accuracy rates and automated readability index of GPT-4 responses, broken down by ophthalmic condition

We observed text changes in the responses to all questions at the second submission. Material changes were found in 14 (47%) responses when comparing the content returned by the two submissions, with 9 responses showing improvement and 5 showing worsening. For instance, in the example shown in Supplementary Table 5, the GPT-4 response from the second submission provided additional information, such as the nonpathological nature of presbyopia and measures to manage the condition.

Discussion

The overall accuracy of GPT-4 responses increased after prompt engineering, with a tendency toward material improvement through more comprehensive or relevant information. Our results support further exploration of prompt engineering as a way to improve LLM performance in answering ophthalmic questions, as well as patient education on the potential benefits of this feature. Nonetheless, it is crucial to note that prompt engineering does not eliminate common LLM issues such as response inconsistency and hallucination.[5] The inherent inconsistency of LLM responses can be observed when the same question or prompt is submitted multiple times. Although prior work did not show a specific tendency of accuracy or material change between submissions, such inconsistency could cause variability in response accuracy, and prompt engineering does not necessarily guarantee improved accuracy. Compared with descriptive answers, numerical responses by LLMs are also more susceptible to errors due to variations in reference data [as shown in the example in Supplementary Table 2]. Moreover, response accuracy varied widely by ophthalmic condition, suggesting that GPT-4 may not properly provide information within certain subspecialty domains. Responses at the second submission, with prompt engineering, were overall shorter but contained long sentences more frequently, likely because of the prompt instruction to provide “concise” responses. This might explain the slightly worse readability after prompt engineering, since long sentences increase the ARI. Optimization of the prompt instruction is needed to tailor LLM outputs to the specific context of use.[1] Therefore, to provide readable medical information to a wider audience, prompt instructions specifying a language difficulty level (e.g., comprehensible to elementary or middle school students) or a format (e.g., plain language and short, simple sentences) may be beneficial; an illustrative prompt of this kind is sketched after this section. This also underscores the importance of educating users/patients not only about the benefits of prompt engineering but also about its effective use through the construction of an accurate, targeted prompt.

There are several study limitations. First, ChatGPT was not originally designed for medical use. Second, qualitative grading of accuracy and material change can be subjective, and the specialty of the experts might influence grading across different ophthalmic conditions. Last, only five questions were selected for each ophthalmic condition, so there could be sampling bias; more exhaustive question collection is required to obtain more reliable results.

To conclude, prompt engineering may improve LLM responses to questions regarding ophthalmic conditions. With appropriate prompt design and cautious use, this strategy may help realize the potential of LLM platforms as alternative patient resources.
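As an illustration of the readability-targeted instructions suggested in the Discussion, one possible phrasing is sketched below; it is a hypothetical example, not a prompt evaluated in this study.

```python
# Hypothetical readability-targeted system prompt, following the Discussion above;
# not the prompt used in the study.
READABILITY_PROMPT = (
    "You are answering a patient's question about an eye condition. "
    "Use plain language and short, simple sentences that a middle school "
    "student can understand, and keep the answer under 150 words."
)
```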
Data availability statement

All data generated or analyzed during this study are included in this published article.

Financial support and sponsorship

This work was supported by National Institutes of Health/National Eye Institute grants (P30EY022589, R01MD014850, R01EY029058, and R01EY034148), the University of California Tobacco-Related Disease Research Program (T31IP1511), and an unrestricted grant from Research to Prevent Blindness (New York, NY). The sponsors and funding organizations had no role in the design or conduct of the study.

Conflicts of interest

Takashi Nishida is a consultant for Topcon. Sasan Moghimi reported grants from the National Eye Institute. Robert N. Weinreb is a consultant for Abbvie, Alcon, Allergan, Amydis, Editas, Eyenovia, Iantrek, IOPtic, Implandata, Nicox, Santen, and Topcon. He receives grants from the National Eye Institute, the National Institute of Minority Health Disparities, and Research to Prevent Blindness. He reports patents to Toromedes and Carl Zeiss Meditec from UCSD and is a founder of Toromedes, all outside the submitted work. No other disclosures were reported.

Topics

Artificial Intelligence in Healthcare and Education · Acute Ischemic Stroke Management · Machine Learning in Healthcare