This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Performance of the large language model in general medicine
Citations: 0
Authors: 12
Year: 2025
Abstract
Domain-specific large language models (LLMs) may aid primary care, but their reliability and safety are uncertain, and evaluations in general practice remain limited.

This observational, exam-based comparison (Feb–May 2025) used standardized items from the Chinese General Practitioner Licensing Examination. A general-practice LLM (AIGP) was evaluated against 389 physicians on the same items under deterministic decoding. Each item was run in 50 technical replicates. Outcomes were accuracy; stability (coefficient of variation, ICC(2,k)); and expert-rated potential clinical harm for incorrect outputs.

AIGP achieved 88.41% ± 1.87 accuracy vs 72.12% ± 4.97 for physicians (mean difference 16.29, 95% CI 15.56–17.01; P < .001). Stability favored AIGP (CV 2.12 vs 6.89; agreement across replicates ICC(2,k) = 0.993, 95% CI 0.991–0.995). Although AIGP made fewer errors overall, its mistakes were systematic and carried higher average harm ratings, clustering in red-flag contexts (pregnancy, pediatrics, anticoagulation, acute chest pain).

In a controlled test setting, AIGP outperformed physicians and showed excellent stability. The findings are exploratory and context-limited (exam-based, text-only) and reveal systematic, higher-harm failure modes. AIGP is a starting point for decision support, not a deployable clinical tool without prospective real-world validation and safety guardrails (human-in-the-loop review, contraindication/dose checks, uncertainty-aware refusal).

• AIGP scored 88.41% on Chinese GP licensing exam items vs 72.12% for 389 physicians.
• AIGP showed strengths in cardiology, general practice theory/practice, and obstetrics/gynecology.
• AIGP had higher response stability than clinicians (CV 2.12 vs 6.89) across diagnostic scenarios.
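The stability and effect-size figures in the abstract can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes the "±" values are standard deviations (the abstract does not say so explicitly) and that the coefficient of variation is expressed as a percentage, SD / mean × 100. The function and variable names are hypothetical and not taken from the paper.

```python
def coefficient_of_variation(mean: float, sd: float) -> float:
    """Return the coefficient of variation as a percentage: SD / mean * 100."""
    if mean == 0:
        raise ValueError("mean must be non-zero")
    return sd / mean * 100.0


# Accuracy values as reported in the abstract (percent).
# Assumption: the "±" figures are standard deviations.
aigp_mean, aigp_sd = 88.41, 1.87
phys_mean, phys_sd = 72.12, 4.97

aigp_cv = round(coefficient_of_variation(aigp_mean, aigp_sd), 2)
phys_cv = round(coefficient_of_variation(phys_mean, phys_sd), 2)
mean_diff = round(aigp_mean - phys_mean, 2)

print(f"AIGP CV: {aigp_cv}")            # compare with the reported 2.12
print(f"Physician CV: {phys_cv}")       # compare with the reported 6.89
print(f"Mean difference: {mean_diff}")  # compare with the reported 16.29
```

Under these assumptions the recomputed values agree with the reported CV and mean difference. The confidence interval and ICC(2,k) cannot be reproduced from the abstract alone, since they depend on per-item data that is not given here.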