OpenAlex · Updated hourly · Last updated: April 7, 2026, 17:17

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Multi-Model Clinical Validation of an AI-Powered Biomarker Analysis Framework: A Cross-Vendor Benchmark on 4,018 NHANES Patients

2026 · 0 citations · 1 author · medRxiv · Open Access

Abstract

Background: Large language models (LLMs) show promise for clinical decision support, yet most validation studies evaluate single models, leaving questions about generalizability and vendor dependence unanswered. We assessed whether a standardized biomarker analysis framework maintains clinical-grade accuracy across multiple LLMs from independent providers.

Methods: We developed a structured prompt-based framework for detecting eight clinical patterns (insulin resistance, diabetes, cardiovascular disease risk, chronic kidney disease risk, systemic inflammation, nutrient deficiency, liver risk, and anemia) from laboratory biomarkers. We evaluated five LLMs from four providers, Grok-3 (xAI), GPT-4o and GPT-4o-mini (OpenAI), Claude Haiku 4.5 (Anthropic), and Gemini 2.0 Flash (Google), using identical system prompts and inputs on 4,018 adults from the CDC NHANES 2017–2018 cycle. Ground truth was established using published clinical criteria (ADA, AHA, KDIGO, WHO). Performance was measured by F1 score with 95% confidence intervals, sensitivity, specificity, and positive predictive value.

Results: All five models achieved clinical-grade performance (F1 > 0.86) on the eight evaluable patterns. Mean F1 scores ranged from 0.865 (95% CI: 0.799–0.931) for GPT-4o-mini to 0.963 (95% CI: 0.930–0.996) for Grok-3. Flagship models significantly outperformed economy-tier models (mean F1: 0.940 vs 0.881; paired t-test p=0.004). Grok-3 achieved near-perfect scores on liver risk (F1=1.000), anemia (0.999), and nutrient deficiency (0.997). Cardiovascular disease risk was the most challenging pattern (F1 range: 0.853–0.885). JSON parse rates exceeded 99.9% for all models. The total benchmark cost was approximately $59 USD.

Conclusions: A standardized prompt-based framework achieves clinical-grade accuracy across five LLMs from four independent providers, demonstrating model-agnostic generalizability. These findings support the feasibility of vendor-independent clinical AI systems that can leverage multiple models without requiring framework revalidation.
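The abstract describes two mechanisms worth sketching: structured JSON output from each model, and F1 scores with 95% confidence intervals plus a paired t-test between model tiers. The paper's code and output schema are not shown on this page, so the two Python sketches below are hypothetical illustrations: the pattern key names, the percentile-bootstrap CI method, and all numeric inputs are assumptions, not the authors' implementation.

First, a minimal validator for the kind of JSON reply such a framework might request, where a reply failing either check would count against the reported >99.9% parse rate:

```python
import json

# Hypothetical output keys; the abstract does not publish the framework's schema.
PATTERNS = [
    "insulin_resistance", "diabetes", "cvd_risk", "ckd_risk",
    "systemic_inflammation", "nutrient_deficiency", "liver_risk", "anemia",
]

def parse_reply(raw: str) -> dict | None:
    """Return {pattern: bool} if the reply is valid JSON with all eight
    expected boolean keys; otherwise None (a failed parse)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not all(isinstance(data.get(p), bool) for p in PATTERNS):
        return None
    return {p: data[p] for p in PATTERNS}
```

Second, the reported metrics, assuming percentile-bootstrap confidence intervals (the abstract does not state how the CIs were computed) and placeholder per-pattern F1 values in place of the paper's results:

```python
import numpy as np
from scipy import stats

def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """F1 for binary labels: 2*TP / (2*TP + FP + FN)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap 95% CI for F1 (an assumed method, not the paper's)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = [f1_score(y_true[idx], y_pred[idx])
              for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Paired t-test across the eight patterns, flagship vs economy tier; these
# per-pattern F1 values are illustrative placeholders, not the paper's figures.
flagship_f1 = np.array([0.96, 0.95, 0.88, 0.94, 0.93, 0.99, 1.00, 0.99])
economy_f1  = np.array([0.90, 0.89, 0.85, 0.88, 0.87, 0.92, 0.93, 0.92])
t_stat, p_value = stats.ttest_rel(flagship_f1, economy_f1)
print(f"mean F1 {flagship_f1.mean():.3f} vs {economy_f1.mean():.3f}, p={p_value:.4f}")
```

The abstract's other metrics follow from the same confusion counts: sensitivity TP/(TP+FN), specificity TN/(TN+FP), and positive predictive value TP/(TP+FP).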

Topics

Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Artificial Intelligence in Healthcare