
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Navigating artificial intelligence (AI) accuracy: A meta-analysis of hallucination incidence in large language model (LLM) responses to oncology questions.

2025 · 1 citation · Journal of Clinical Oncology

Citations: 1 · Authors: 8 · Year: 2025

Abstract

e13686

Background: Conversational AI, such as the LLM-powered ChatGPT, provides immediate, human-like responses to oncology questions. However, such LLMs are prone to generating information that is factually incorrect, or hallucinated. This raises significant ethical concerns in the medical field, where the "do no harm" principle is paramount. Our meta-analysis examines the pooled incidence of hallucination in LLM responses and explores the impact of user type (patient- vs. physician-oriented), question structure (simple vs. contextual prompts), and ChatGPT model version.

Methods: We performed a comprehensive literature search for studies on LLM-generated responses to oncology questions that included expert assessment of accuracy on an ordinal accuracy scale. Responses categorized by experts as completely inaccurate or containing incorrect information were defined as hallucinations. A meta-analysis was conducted to estimate the pooled hallucination incidence. Random-effects models were used to assess hallucination rates by user type (patient, physician), prompt type (simple, contextual), and LLM version (GPT-3.5, GPT-4). Meta-regression was performed to explore the interactions between these variables.

Results: A total of 39 studies involving 6,523 responses were included. Patient-oriented prompts were predominantly simple (n=3,756, 89%), whereas physician-oriented prompts were primarily contextual (n=1,630, 75%) (χ² = 2,658, P < 0.001). The overall hallucination rate was 23% (95% CI: 17–28%). User type, prompt type, and LLM model significantly influenced the incidence rate ratio (IRR) (P < 0.05). Physician-oriented prompts demonstrated significantly higher hallucination rates than patient-oriented prompts, and contextual prompts were associated with significantly higher hallucination rates than simple prompts.

Conclusions: This is the first meta-analysis of hallucination incidence in LLM responses to oncology questions. The finding that approximately one in five responses contains inaccurate information raises significant concerns for patient safety. Variability in hallucination rates was observed across user type, prompt type, and LLM version. This highlights the need for further standardization and research on prompt optimization to reduce AI-generated misinformation in oncology.

Category                              Hallucination rate   95% CI        IRR        P-value
Overall (n=6,523, 100%)               0.23                 [0.17–0.28]   –          –
User type
  Patients (n=4,397, 67%)             0.21                 [0.14–0.28]   Reference  –
  Physicians (n=2,250, 34%)           0.24                 [0.12–0.35]   1.51       <0.001
Prompt type
  Simple prompts (n=2,959, 45%)       0.19                 [0.12–0.25]   Reference  –
  Contextual prompts (n=1,998, 31%)   0.28                 [0.11–0.46]   1.13       0.04
LLM version
  ChatGPT-3.5 (n=3,213, 49%)          0.23                 [0.15–0.30]   Reference  –
  ChatGPT-4 (n=2,096, 32%)            0.19                 [0.11–0.28]   0.78       <0.01
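As a rough illustration of how a pooled incidence like the 23% estimate above is typically obtained, the sketch below implements DerSimonian–Laird random-effects pooling of study-level proportions on the logit scale. This is a generic example of the method class named in the Methods section, not the authors' analysis code, and the study counts used are hypothetical placeholders rather than data from the included studies.

```python
# Minimal sketch of DerSimonian–Laird random-effects pooling of proportions
# on the logit scale. Study counts below are hypothetical placeholders,
# not data from the paper.
import numpy as np

events = np.array([12, 30, 8, 45, 20])     # hallucinated responses per study (hypothetical)
totals = np.array([60, 120, 50, 200, 90])  # total responses per study (hypothetical)

# Logit-transformed proportions and their approximate variances
p = events / totals
y = np.log(p / (1 - p))
v = 1 / events + 1 / (totals - events)

# Fixed-effect weights and Cochran's Q for between-study heterogeneity
w = 1 / v
y_fe = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - y_fe) ** 2)
k = len(y)

# DerSimonian–Laird estimate of the between-study variance tau^2
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (k - 1)) / c)

# Random-effects weights, pooled logit estimate, and 95% CI
w_re = 1 / (v + tau2)
y_re = np.sum(w_re * y) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))
lo, hi = y_re - 1.96 * se_re, y_re + 1.96 * se_re

inv_logit = lambda x: 1 / (1 + np.exp(-x))  # back-transform to the proportion scale
print(f"Pooled rate: {inv_logit(y_re):.2f} "
      f"(95% CI {inv_logit(lo):.2f}-{inv_logit(hi):.2f}), tau^2 = {tau2:.3f}")
```

Pooling on the logit scale keeps the back-transformed estimate and its confidence interval inside [0, 1]. The incidence rate ratios in the table would come from a further step not shown here, comparing subgroup rates (for example, physician- vs. patient-oriented prompts) in a random-effects meta-regression.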
