OpenAlex · Updated hourly · Last updated: April 30, 2026, 13:37

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluation of ChatGPT-4.0 Versus ChatGPT-Mini in Generating Guideline-Based Hypertension Content

2026 · 0 citations · 15 authors · Arquivos Brasileiros de Cardiologia · Open Access

Abstract

BACKGROUND: Artificial intelligence (AI) language models are increasingly used to generate patient education materials. However, their accuracy, completeness, and adherence to clinical guidelines remain uncertain.

OBJECTIVES: To compare ChatGPT-Mini and ChatGPT-4.0 in the generation of hypertension education content with respect to accuracy, completeness, structural quality (using the Ensuring Quality Information for Patients [EQIP] instrument), response consistency, and alignment with established guidelines.

METHODS: A standardized set of 31 hypertension-related questions was submitted to both models. Outputs were independently evaluated by 10 blinded clinicians using a modified EQIP score, a 5-point accuracy scale, and a 3-point completeness scale. Response consistency was assessed using BERTScore. Between-model comparisons were performed using the two-sided Wilcoxon rank-sum test (p < 0.05). Effect sizes were reported as Hodges-Lehmann (HL) median differences and Cliff's delta (δ), both with 95% CIs. Inter-rater reliability was estimated using the intraclass correlation coefficient (ICC; two-way random-effects model, absolute agreement).

RESULTS: Central tendency measures favored ChatGPT-4.0, although differences were small. Median scores for ChatGPT-4.0 versus ChatGPT-Mini, respectively, were: accuracy, 4.10 (3.70-4.20) versus 3.73 (3.60-4.05); completeness, 1.26 (1.17-1.41) versus 1.10 (0.96-1.23); and total EQIP score, 19.5 (18.0-25.0) versus 18.5 (16.0-23.0). HL median differences were small, with 95% CIs crossing zero (accuracy: +0.37, -0.25 to +0.50; completeness: +0.16, -0.06 to +0.36; EQIP: +1.0, -1.0 to +6.0). Cliff's δ values were consistently small and positive across primary outcomes, indicating only modest stochastic dominance of ChatGPT-4.0. Identification clarity tended to be higher with ChatGPT-4.0, whereas response consistency measured by BERTScore F1 was generally higher for ChatGPT-Mini (> 0.92 versus 0.885-0.932). Inter-rater reliability was good to excellent across all measures (ICC > 0.80).

CONCLUSIONS: ChatGPT-4.0 demonstrated small, non-significant improvements in accuracy, completeness, and structural quality compared with ChatGPT-Mini. Effect sizes were modest, and all 95% CIs included zero. ChatGPT-Mini produced more consistent responses. These findings underscore the importance of routinely reporting effect sizes with 95% CIs and support the use of standardized evaluation methods and real-time validation frameworks for AI-generated medical education content.
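The statistical methods named in the abstract are standard nonparametric comparisons. Below is a minimal Python sketch, using scipy and illustrative placeholder ratings rather than any data from the study, of how a two-sided Wilcoxon rank-sum test and Cliff's delta could be computed for a single outcome; the names `gpt4_scores` and `mini_scores` are hypothetical.

```python
# Minimal sketch of a between-model comparison for one outcome.
# The score arrays are illustrative placeholders, not data from the study.
import numpy as np
from scipy.stats import ranksums

def cliffs_delta(a: np.ndarray, b: np.ndarray) -> float:
    """Cliff's delta: P(a > b) - P(a < b) over all cross-group pairs."""
    diffs = a[:, None] - b[None, :]
    return (np.sum(diffs > 0) - np.sum(diffs < 0)) / diffs.size

# Hypothetical per-question accuracy ratings for each model.
gpt4_scores = np.array([4.2, 4.0, 3.9, 4.3, 4.1, 3.8, 4.0])
mini_scores = np.array([3.7, 3.9, 3.6, 4.0, 3.8, 3.5, 3.7])

# Two-sided Wilcoxon rank-sum test, as named in the methods.
stat, p = ranksums(gpt4_scores, mini_scores, alternative="two-sided")
delta = cliffs_delta(gpt4_scores, mini_scores)
print(f"rank-sum statistic = {stat:.2f}, p = {p:.3f}, Cliff's delta = {delta:.2f}")
```

The consistency metric can likewise be sketched with the public bert-score package, assuming repeated answers to the same question are scored as candidate/reference pairs; the strings below are placeholders, not model outputs from the study.

```python
# Sketch of a pairwise response-consistency check with the bert-score package.
from bert_score import score

answers_run1 = ["Hypertension is diagnosed when blood pressure is persistently elevated."]
answers_run2 = ["High blood pressure that stays elevated over time defines hypertension."]

# score() returns precision, recall, and F1 tensors for each candidate/reference pair.
P, R, F1 = score(answers_run1, answers_run2, lang="en")
print(f"BERTScore F1 = {F1.mean().item():.3f}")
```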
