OpenAlex · Updated hourly · Last updated: March 13, 2026, 12:52

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Automation Bias in Large Language Model Assisted Diagnostic Reasoning Among AI-Trained Physicians

2025 · 3 citations · Open Access

Citations: 3 · Authors: 6 · Year: 2025

Abstract

Importance: Large language models (LLMs) show promise for improving clinical reasoning, but they also risk inducing automation bias, an over-reliance that can degrade diagnostic accuracy. Whether AI-trained physicians are vulnerable to this bias when LLM use is voluntary remains unknown.

Objective: To determine whether exposure to erroneous LLM recommendations degrades AI-trained physicians' diagnostic performance compared with error-free AI advice.

Design: A single-blind randomized clinical trial conducted from June 20 to August 15, 2025.

Setting: Physicians were recruited from multiple medical institutions in Pakistan and participated in person or via remote video conferencing.

Participants: Physicians registered with the Pakistan Medical and Dental Council holding MBBS degrees who had completed a 20-hour AI-literacy training covering LLM capabilities, prompt engineering, and critical evaluation of AI output.

Intervention: Participants were randomized 1:1 to diagnose 6 clinical vignettes in 75 minutes. The control group received unmodified ChatGPT-4o diagnostic recommendations; the treatment group's recommendations contained deliberate errors in 3 of the 6 vignettes. Physicians could voluntarily consult the offered ChatGPT-4o recommendations alongside conventional diagnostic resources, based on their clinical judgment.

Main Outcomes and Measures: The primary outcome was diagnostic reasoning accuracy (percentage), assessed by three blinded physicians using an expert-validated rubric evaluating differential diagnosis accuracy, appropriateness of supporting and opposing evidence, and quality of recommended diagnostic steps. The secondary outcome was top-choice diagnosis accuracy.

Results: Forty-four physicians (22 treatment, 22 control) participated. Physicians receiving error-free recommendations achieved a mean (SD) diagnostic accuracy of 84.9% (19.7%), whereas those exposed to flawed recommendations scored 73.3% (30.5%), an adjusted mean difference of -14.0 percentage points (95% CI, -19.7 to -8.3; P < .0001). Top-choice diagnosis accuracy per case was 76.1% (42.5%) in the treatment group and 90.5% (28.9%) in the control group, an adjusted difference of -18.3 percentage points (95% CI, -26.6 to -10.0; P < .0001).

Conclusions and Relevance: This trial demonstrates that erroneous LLM recommendations significantly degrade physicians' diagnostic performance by inducing automation bias, even in AI-trained physicians. Voluntary deference to flawed AI output highlights a critical patient safety risk, necessitating robust safeguards to ensure human oversight before widespread clinical deployment.

Trial Registration: ClinicalTrials.gov Identifier: NCT06963957
