This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Multi-Model Clinical Validation of an AI-Powered Biomarker Analysis Framework: A Cross-Vendor Benchmark on 4,018 NHANES Patients
Citations: 0
Authors: 1
Year: 2026
Abstract
Background
Large language models (LLMs) show promise for clinical decision support, yet most validation studies evaluate single models, leaving questions about generalizability and vendor dependence unanswered. We assessed whether a standardized biomarker analysis framework maintains clinical-grade accuracy across multiple LLMs from independent providers.

Methods
We developed a structured prompt-based framework for detecting eight clinical patterns (insulin resistance, diabetes, cardiovascular disease risk, chronic kidney disease risk, systemic inflammation, nutrient deficiency, liver risk, and anemia) from laboratory biomarkers. We evaluated five LLMs from four providers—Grok-3 (xAI), GPT-4o and GPT-4o-mini (OpenAI), Claude Haiku 4.5 (Anthropic), and Gemini 2.0 Flash (Google)—using identical system prompts and inputs on 4,018 adults from the CDC NHANES 2017–2018. Ground truth was established using published clinical criteria (ADA, AHA, KDIGO, WHO). Performance was measured by F1 score with 95% confidence intervals, sensitivity, specificity, and positive predictive value.

Results
All five models achieved clinical-grade performance (F1 > 0.86) on the eight evaluable patterns. Mean F1 scores ranged from 0.865 (95% CI: 0.799–0.931) for GPT-4o-mini to 0.963 (95% CI: 0.930–0.996) for Grok-3. Flagship models significantly outperformed economy-tier models (mean F1: 0.940 vs 0.881; paired t-test p=0.004). Grok-3 achieved near-perfect scores on liver risk (F1=1.000), anemia (0.999), and nutrient deficiency (0.997). Cardiovascular disease risk was the most challenging pattern (F1 range: 0.853–0.885). JSON parse rates exceeded 99.9% for all models. Total benchmark cost was approximately $59 USD.

Conclusions
A standardized prompt-based framework achieves clinical-grade accuracy across five LLMs from four independent providers, demonstrating model-agnostic generalizability. These findings support the feasibility of vendor-independent clinical AI systems that can leverage multiple models without requiring framework revalidation.
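For readers unfamiliar with the evaluation metrics named in the Methods section, the following minimal sketch shows how F1, sensitivity, specificity, and positive predictive value are derived from a binary confusion matrix. The counts in the usage example are made up for illustration and do not come from the paper.

```python
# Sketch of the standard binary classification metrics used in the abstract:
# sensitivity (recall), specificity, PPV (precision), and F1 score.
def clinical_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of PPV and sensitivity
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "f1": f1}

# Illustrative counts only (hypothetical, not from the study):
m = clinical_metrics(tp=90, fp=10, fn=10, tn=890)
```

With these counts, sensitivity and PPV are both 0.9, so F1 is 0.9. The paper additionally reports 95% confidence intervals for F1, which are typically obtained by bootstrap resampling over patients.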
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,400 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,261 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,695 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,506 citations