This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Continuous Evaluation Frameworks for Retrospective Evaluation of Clinical Machine Learning Models
Citations: 0
Authors: 5
Year: 2026
Abstract
Background: Clinical machine learning (ML) models are increasingly used to continuously predict patient risk for future adverse events, such as sepsis. Traditionally, these models are evaluated using performance metrics computed at a single time point, such as the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall curve (AUPRC). While these metrics are widely accepted, they are not comprehensive enough to capture the performance and behavior of ML models executing continuously in real-world clinical settings. Moreover, single-time-point metrics primarily assess model discriminability, which may not directly reflect a model's clinical utility.
Methods: We propose and implement three novel frameworks for continuously evaluating clinical ML models on retrospective datasets: (1) zone-based, (2) predictive Activity Monitoring Operating Characteristic (AMOC) curve-based, and (3) notification-based. We applied these frameworks to sepsis prediction models using two publicly available clinical datasets, continuously monitoring model outputs and comparing the results against traditional single-time-point metrics.
Results: The three continuous evaluation frameworks reveal multiple dimensions of model performance not captured by conventional metrics and enable a more comprehensive comparison of different sepsis prediction models. Across both datasets, models with high AUROC demonstrate markedly different performance when evaluated continuously.
Conclusions: The proposed zone-based, predictive AMOC-based, and notification-based frameworks offer a more comprehensive and realistic assessment of continuously operating clinical ML models. By uncovering temporal and behavioral aspects of model performance, these methods enable more informed model selection, deployment, and monitoring in healthcare settings. Implementing continuous evaluation can enhance the trust, transparency, and clinical applicability of ML systems beyond what single-time-point metrics provide.
Trial registration: Not applicable.
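To make the abstract's core contrast concrete, the sketch below compares a single-time-point view of continuous risk scores with a simple alert-counting view. This is an illustrative assumption, not the paper's actual frameworks: the `auroc` and `notification_count` functions, the threshold, and the toy trajectories are all hypothetical, meant only to show how two patients with the same peak risk can produce very different alerting behavior over time.

```python
def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation:
    the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def notification_count(trajectory, threshold):
    """Count rising-edge threshold crossings in a score trajectory;
    each crossing would fire one alert during continuous operation."""
    fired, count = False, 0
    for s in trajectory:
        if s >= threshold and not fired:
            count += 1
            fired = True
        elif s < threshold:
            fired = False
    return count

# Two hypothetical patients with identical peak risk (so they look
# alike at a single time point) but different continuous behavior.
stable   = [0.2, 0.3, 0.8, 0.8, 0.8]   # one sustained alert
flapping = [0.2, 0.8, 0.2, 0.8, 0.2]   # repeated on/off alerts

print(notification_count(stable, 0.5))    # -> 1
print(notification_count(flapping, 0.5))  # -> 2
```

A single-time-point metric taken at the moment of peak risk cannot distinguish these two trajectories, whereas a notification-style view immediately surfaces the repeated alerting of the second patient.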
Related works
Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study
2020 · 28,965 citations
The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3)
2016 · 26,833 citations
APACHE II
1985 · 13,503 citations
Definitions for Sepsis and Organ Failure and Guidelines for the Use of Innovative Therapies in Sepsis
1992 · 13,152 citations
The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure
1996 · 11,407 citations