OpenAlex · Updated hourly · Last updated: 18 Mar 2026, 05:40

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Characteristics of information on inflammatory skin diseases produced by four large language models

2024 · 1 citation · International Journal of Dermatology · Open Access

Citations: 1 · Authors: 9 · Year: 2024

Abstract

There has been growing interest in ways to utilize artificial intelligence (AI) in healthcare, including the generation of patient-facing educational information.1-3 Ensuring that such materials are comprehensible, factually correct, and comprehensive is crucial. The American Medical Association and the National Institutes of Health advise that these resources be written below an eighth-grade reading level.4 This study evaluated the ability of four large language models (LLMs), Google Bard, Microsoft Bing, ChatGPT-3.5, and ChatGPT-4, to generate accurate and accessible patient information regarding inflammatory skin diseases (ISD).

Each LLM was queried with a list of 42 ISD-related terms on October 11, 2023 (Table 1). We assessed the readability of the generated content using four standardized metrics: the Gunning Fog Index (GF), Flesch–Kincaid Grade Level (FK), Automated Readability Index (ARI), and Coleman–Liau Index (CL).4 The reading grade levels from the four metrics were averaged into a single score (aRGL).4 The Wilcoxon signed-rank test was used for statistical comparisons; Python 3.11 (2022) was used to compute readability scores, and R (R Core Team, 2022) was used for data visualization and statistical analyses. Three dermatology resident physicians, blinded to which model produced each output, rated the LLMs on accuracy, completeness, safety, and educational utility, scoring six statements on a 5-point Likert scale (Figure 1). We calculated Kendall's W coefficient of concordance to assess inter-rater reliability, yielding W = 0.548 (p < 0.01): a moderate and statistically significant level of agreement among the three raters, indicating substantial consensus with some variability in their assessments of the LLM outputs.

Of the four models, Bard's outputs had the lowest aRGL (median 9.79; approximately a 10th-grade reading level) (P < 0.01), while Bing's outputs were more complex (aRGL 10.67; 11th-grade reading level) (P < 0.01, Table 1).
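All four readability indices are linear functions of simple text statistics (sentence length, word length, syllable counts). As an illustration only, a minimal pure-Python sketch of the four metrics and their average (aRGL) might look like the following; the syllable counter is a rough heuristic, and the study's actual scoring relied on established Python tooling rather than hand-rolled formulas.

```python
import re

def count_syllables(word: str) -> int:
    # Rough vowel-group heuristic; dictionary-based tools are more accurate.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1  # drop a typical silent final "e"
    return max(n, 1)

def readability(text: str) -> dict:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z]+", text)
    w = len(words)
    letters = sum(len(x) for x in words)
    syllables = sum(count_syllables(x) for x in words)
    complex_words = sum(1 for x in words if count_syllables(x) >= 3)

    # Standard published formulas (letters stand in for characters in ARI/CL).
    gf = 0.4 * (w / sentences + 100 * complex_words / w)
    fk = 0.39 * (w / sentences) + 11.8 * (syllables / w) - 15.59
    ari = 4.71 * (letters / w) + 0.5 * (w / sentences) - 21.43
    cl = 0.0588 * (100 * letters / w) - 0.296 * (100 * sentences / w) - 15.8

    scores = {"GF": gf, "FK": fk, "ARI": ari, "CL": cl}
    scores["aRGL"] = sum(scores.values()) / 4  # averaged reading grade level
    return scores
```

Each index estimates a U.S. school grade level, so averaging them into aRGL keeps the same interpretation (e.g. aRGL 9.79 reads at roughly a 10th-grade level).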
Both GPT models produced higher reading levels (GPT-4: 11.85; GPT-3.5: 13.15; P < 0.05, Table 1). Ratings of overall accuracy and completeness (Statements 1 and 2) favored the GPT models. ChatGPT-3.5 and Bard provided treatment options 100% of the time, whereas Microsoft Bing and ChatGPT-4 did so 71.4% and 97.6% of the time, respectively. For the accuracy and completeness of treatment information (Statements 3 and 4), ChatGPT-3.5 and ChatGPT-4 received higher average ratings than Bard and Bing. Regarding potentially harmful or non-evidence-based information (Statement 5), Google Bard's outputs were rated as containing the most potentially harmful information, followed by Bing, ChatGPT-3.5, and ChatGPT-4. Finally, on raters' comfort with providing the output to patients for educational purposes without modification (Statement 6), ChatGPT-3.5 and ChatGPT-4 were rated higher than Google Bard and Microsoft Bing (Figure 1).

In this study, we assessed the characteristics of ISD-related outputs generated by four LLMs. The variability we observed underscores the need to review model outputs and to approach the use of LLMs thoughtfully.5 Because our study identifies varying strengths and weaknesses of specific LLMs in accuracy, variability, and readability, we do not seek to rank one model above another but rather to demonstrate the range of outputs and the room for growth, consistent with prior studies.3 To continually evaluate and update patient education materials generated by LLMs, we recommend: (1) regular review by medical professionals to ensure ongoing accuracy and relevance; (2) patient-facing studies to understand utility and improve clarity; and (3) comparison of LLM-generated content with current clinical guidelines to ensure alignment with best practices.
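The pairwise readability comparisons above used the Wilcoxon signed-rank test, which ranks the absolute paired differences and sums the ranks by sign. As a sketch of the test statistic only (the study's analysis was done in R; a real test also needs a p-value from the signed-rank distribution, e.g. via R's wilcox.test or scipy.stats.wilcoxon):

```python
def avg_ranks(values):
    # Average ranks (1-based); tied values share the mean of their ranks.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        r = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = r
        i = j + 1
    return ranks

def wilcoxon_w(x, y):
    # Paired signed-rank statistic: drop zero differences, rank |d|,
    # and return W = min(sum of positive ranks, sum of negative ranks).
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranks = avg_ranks([abs(d) for d in diffs])
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)
```

Here the pairs would be, for example, the per-term aRGL scores of two models across the 42 queried ISD terms; a small W indicates a consistent shift in one direction.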
As AI continues to evolve and become more widely used, continued evaluation of its performance is necessary, as our study only reflects outputs from one point in time.
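For inter-rater reliability, Kendall's W rescales the spread of the items' rank sums across raters to [0, 1], where 0 means no agreement and 1 means the raters produced identical rankings. A minimal sketch, assuming a raters-by-items score matrix and ignoring the tie correction that matters for heavily tied Likert data (the reported W = 0.548 would come from the corrected computation on the full ratings):

```python
def rank_row(values):
    # Average ranks (1-based); tied values share the mean of their ranks.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        r = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = r
        i = j + 1
    return ranks

def kendalls_w(ratings):
    # ratings: m raters x n items. W = 12*S / (m^2 * (n^3 - n)), where S is
    # the sum of squared deviations of the items' rank sums from their mean.
    m, n = len(ratings), len(ratings[0])
    ranked = [rank_row(row) for row in ratings]
    totals = [sum(r[i] for r in ranked) for i in range(n)]
    mean_total = sum(totals) / n
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

With three raters scoring the same items identically, W is exactly 1; fully opposed rankings drive it toward 0, so W = 0.548 sits in the moderate-agreement range described above.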
