OpenAlex · Updated hourly · Last updated: 13 March 2026, 15:37

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Performance of the large language model in general medicine

2025 · 0 citations · Global Transitions · Open Access
Citations: 0 · Authors: 12 · Year: 2025

Abstract

Domain-specific large language models (LLMs) may aid primary care, but their reliability and safety are uncertain, and evaluations in general practice remain limited. This observational, exam-based comparison (Feb–May 2025) used standardized items from the Chinese General Practitioner Licensing Examination. A general-practice LLM (AIGP) was evaluated against 389 physicians on the same items under deterministic decoding. Each item was run in 50 technical replicates. Outcomes were accuracy; stability (coefficient of variation [CV], ICC(2,k)); and expert-rated potential clinical harm for incorrect outputs. AIGP achieved 88.41 % ± 1.87 accuracy vs 72.12 % ± 4.97 for physicians (mean difference 16.29, 95 % CI 15.56–17.01; P < .001). Stability favored AIGP (CV 2.12 vs 6.89; agreement across replicates ICC(2,k) = 0.993, 95 % CI 0.991–0.995). Although AIGP made fewer errors overall, its mistakes were systematic and carried higher average harm ratings, clustering in red-flag contexts (pregnancy, pediatrics, anticoagulation, acute chest pain). In a controlled test setting, AIGP outperformed physicians and showed excellent stability. The findings are exploratory and context-limited (exam-based, text-only) and reveal systematic higher-harm failure modes. AIGP is a starting point for decision support, not a deployable clinical tool without prospective real-world validation and safety guardrails (human-in-the-loop review, contraindication/dose checks, uncertainty-aware refusal).

• AIGP scored 88.41% on Chinese GP licensing exam items vs 72.12% for 389 physicians.
• AIGP showed strengths in cardiology, general practice theory/practice, and obstetrics/gynecology.
• AIGP had higher response stability than clinicians (CV 2.12 vs 6.89) across diagnostic scenarios.
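
The summary statistics reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that the "±" values are standard deviations and that the coefficient of variation is computed as SD / mean × 100; the abstract does not state either point explicitly, and the function name is illustrative only.

```python
def coefficient_of_variation(mean: float, sd: float) -> float:
    """Return the coefficient of variation as a percentage (SD / mean * 100)."""
    return sd / mean * 100.0


# Values taken from the abstract; treating "±" as a standard deviation is an assumption.
aigp_mean, aigp_sd = 88.41, 1.87
phys_mean, phys_sd = 72.12, 4.97

aigp_cv = coefficient_of_variation(aigp_mean, aigp_sd)
phys_cv = coefficient_of_variation(phys_mean, phys_sd)
mean_diff = aigp_mean - phys_mean

print(f"AIGP CV: {aigp_cv:.2f}")              # 2.12
print(f"Physician CV: {phys_cv:.2f}")         # 6.89
print(f"Mean difference: {mean_diff:.2f}")    # 16.29
```

Under these assumptions the computed values match the reported CVs (2.12 and 6.89) and the reported mean difference (16.29), which suggests the CV figures are expressed in percent.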

Similar works