This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Automation Bias in Large Language Model Assisted Diagnostic Reasoning Among AI-Trained Physicians
Citations: 3
Authors: 6
Year: 2025
Abstract

Importance: Large language models (LLMs) show promise for improving clinical reasoning, but they also risk inducing automation bias, an over-reliance on AI that can degrade diagnostic accuracy. Whether AI-trained physicians are vulnerable to this bias when LLM use is voluntary remains unknown.

Objective: To determine whether exposure to erroneous LLM recommendations degrades AI-trained physicians' diagnostic performance compared with error-free AI advice.

Design: A single-blind randomized clinical trial conducted from June 20 to August 15, 2025.

Setting: Physicians were recruited from multiple medical institutions in Pakistan and participated in person or via remote video conferencing.

Participants: Physicians registered with the Pakistan Medical and Dental Council holding MBBS degrees who had completed a 20-hour AI-literacy training covering LLM capabilities, prompt engineering, and critical evaluation of AI output.

Intervention: Participants were randomized 1:1 to diagnose 6 clinical vignettes in 75 minutes. The control group received unmodified ChatGPT-4o diagnostic recommendations; the treatment group's recommendations contained deliberate errors in 3 of the 6 vignettes. Physicians could voluntarily consult the offered ChatGPT-4o recommendations alongside conventional diagnostic resources, based on their clinical judgment.

Main Outcomes and Measures: The primary outcome was diagnostic reasoning accuracy (percentage), assessed by three blinded physicians using an expert-validated rubric evaluating differential diagnosis accuracy, appropriateness of supporting and opposing evidence, and quality of recommended diagnostic steps. The secondary outcome was top-choice diagnosis accuracy.

Results: Forty-four physicians (22 treatment, 22 control) participated. Physicians receiving error-free recommendations achieved a mean (SD) diagnostic accuracy of 84.9% (19.7%), whereas those exposed to flawed recommendations scored 73.3% (30.5%), an adjusted mean difference of -14.0 percentage points (95% CI, -19.7 to -8.3; P < .0001). Top-choice diagnosis accuracy per case was 76.1% (42.5%) in the treatment group and 90.5% (28.9%) in the control group, an adjusted difference of -18.3 percentage points (95% CI, -26.6 to -10.0; P < .0001).

Conclusions and Relevance: This trial demonstrates that erroneous LLM recommendations significantly degrade physicians' diagnostic performance by inducing automation bias, even among AI-trained physicians. Voluntary deference to flawed AI output highlights a critical patient safety risk, necessitating robust safeguards to ensure human oversight before widespread clinical deployment.

Trial Registration: ClinicalTrials.gov Identifier: NCT06963957
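The trial reports adjusted estimates from a model not described in the abstract. As a rough plausibility check of the reported effect, the unadjusted group difference and an approximate confidence interval can be recomputed from the published summary statistics alone; this is a minimal sketch assuming a simple Welch comparison of two independent means with a normal critical value, which ignores the trial's adjustment and any within-physician clustering across vignettes:

```python
import math

# Summary statistics as reported in the abstract
n_t, mean_t, sd_t = 22, 73.3, 30.5   # treatment: flawed AI recommendations
n_c, mean_c, sd_c = 22, 84.9, 19.7   # control: error-free AI recommendations

# Unadjusted mean difference in diagnostic accuracy (percentage points)
diff = mean_t - mean_c               # -11.6, vs. the adjusted -14.0 reported

# Welch standard error for a difference of two independent means
se = math.sqrt(sd_t**2 / n_t + sd_c**2 / n_c)

# Approximate 95% CI using the normal critical value 1.96
ci = (diff - 1.96 * se, diff + 1.96 * se)
print(f"diff = {diff:.1f}, 95% CI ({ci[0]:.1f}, {ci[1]:.1f})")
```

The unadjusted point estimate (-11.6 points) is close to the reported adjusted difference of -14.0; the wider interval from this naive calculation mainly reflects that the trial's adjusted analysis gains precision from covariate adjustment and its case-level analysis, details which are not recoverable from the abstract.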
Similar Works

Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,214 citations

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,071 citations

High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,429 citations

Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations

Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,418 citations