This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Continuous Evaluation Frameworks for Retrospective Evaluation of Clinical Machine Learning Models
Citations: 0
Authors: 5
Year: 2026
Abstract
Background: Clinical machine learning (ML) models are increasingly used to continuously predict patient risk for future adverse events, such as sepsis. Traditionally, these models are evaluated using performance metrics computed at a single time point, such as the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall curve (AUPRC). While these metrics are widely accepted, they are not comprehensive enough to capture the performance and behavior of ML models executing continuously in real-world clinical settings. Moreover, single-time-point metrics primarily assess model discriminability, which may not directly reflect a model's clinical utility.
Methods: We propose and implement three novel frameworks for continuously evaluating clinical ML models on retrospective datasets: (1) zone-based, (2) predictive Activity Monitoring Operating Characteristic (AMOC) curve-based, and (3) notification-based. We applied these frameworks to sepsis prediction models using two publicly available clinical datasets, continuously monitoring model outputs and comparing the results against traditional single-time-point metrics.
Results: The three continuous evaluation frameworks reveal multiple dimensions of model performance not captured by conventional metrics and enable a more comprehensive comparison of different sepsis prediction models. Across both datasets, models with high AUROC demonstrate markedly different performance when evaluated continuously.
Conclusions: The proposed zone-based, predictive AMOC-based, and notification-based frameworks offer a more comprehensive and realistic assessment of continuously operating clinical ML models. By uncovering temporal and behavioral aspects of model performance, these methods enable more informed model selection, deployment, and monitoring in healthcare settings. Implementing continuous evaluation can enhance the trust, transparency, and clinical applicability of ML systems beyond what single-time-point metrics provide.
Trial registration: Not applicable.
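To make the abstract's core contrast concrete, the sketch below compares a single-time-point view of continuous risk scores with a simple alert-counting view. This is an illustrative assumption, not the paper's actual frameworks: the `auroc` and `notification_count` functions, the threshold, and the toy trajectories are all hypothetical, meant only to show how two patients with the same peak risk can produce very different alerting behavior over time.

```python
def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation:
    the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def notification_count(trajectory, threshold):
    """Count rising-edge threshold crossings in a score trajectory;
    each crossing would fire one alert during continuous operation."""
    fired, count = False, 0
    for s in trajectory:
        if s >= threshold and not fired:
            count += 1
            fired = True
        elif s < threshold:
            fired = False
    return count

# Two hypothetical patients with identical peak risk (so they look
# alike at a single time point) but different continuous behavior.
stable   = [0.2, 0.3, 0.8, 0.8, 0.8]   # one sustained alert
flapping = [0.2, 0.8, 0.2, 0.8, 0.2]   # repeated on/off alerts

print(notification_count(stable, 0.5))    # -> 1
print(notification_count(flapping, 0.5))  # -> 2
```

A single-time-point metric taken at the moment of peak risk cannot distinguish these two trajectories, whereas a notification-style view immediately surfaces the repeated alerting of the second patient.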
Related works
Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study
2020 · 28,965 citations
The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3)
2016 · 26,833 citations
APACHE II
1985 · 13,503 citations
Definitions for Sepsis and Organ Failure and Guidelines for the Use of Innovative Therapies in Sepsis
1992 · 13,152 citations
The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure
1996 · 11,407 citations