OpenAlex · Updated hourly · Last updated: 2026-03-16, 00:29

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Machine Learning in Emergency Medicine: Keys to Future Success

2020 · 24 citations · Academic Emergency Medicine · Open Access
Open full text at publisher

24 citations · 2 authors · 2020

Abstract

An era for artificial intelligence has arrived for emergency medicine. In the systematic review by Kareemi et al.1 published in this issue of Academic Emergency Medicine, the authors evaluate the performance of machine learning (ML) models versus standard care (e.g., clinical decision rules, provider judgment) in emergency medicine across a variety of clinical scenarios and outcomes. The systematic review concludes that ML has superior performance in almost all tasks, but also calls attention to several widespread shortcomings, including limited adherence to reporting guidelines and the lack of evaluation through interventional trials. These findings highlight the need for a new phase in clinical decision support (CDS) for emergency care, with research and practice focused on integrated, ML-driven CDS systems that are usable, interpretable, and effective.

In this commentary, we review key concept areas for enhancing the performance, promoting the adoption, and studying the impact of ML within emergency medicine (Figure 1). We also discuss the interpretation and application of ML studies and projects, dividing key concepts into two domains: intrinsic—elements of the model and its task-based performance—and extrinsic—the ability of the model to achieve a desired objective with respect to patient care.

Here, we first introduce readers to core principles that define ML and contrast them with traditional statistical approaches. We focus specifically on “supervised” ML, where an algorithm is trained to recognize relationships between an outcome (e.g., mortality, length of stay) and any number of independent variables (“covariates” in statistical language, “features” in ML). This contrasts with “unsupervised” learning, where data patterns are learned without attention to outcome prediction; unsupervised learning is outside the scope of this piece. One of the most common ways to investigate associations is via regression techniques, which include linear and logistic methods.
Logistic regression has become ubiquitous because it efficiently weighs the relative contributions of multiple variables in an easily interpretable way.2 Logistic regression is used in classification problems (i.e., predicting an outcome) and is a bridge between traditional statistics and ML approaches. Supervised ML algorithms all address the same fundamental prediction task, but use different underlying methods to derive their predictions. ML expands on traditional statistical approaches by using general-purpose learning algorithms, with minimal assumptions about the data-generating process, to find patterns in rich and unwieldy data. ML models are therefore often employed when the focus is model accuracy; when the data are multimodal or nontraditional (e.g., text, images, connectivity graphs); when computational constraints exist; or when the goal is to capture complex, nonlinear relationships. Critical assessment of ML algorithms requires an understanding of model strengths and weaknesses, transparency in presentation, and methods to maintain deployed algorithms over time. The success of ML algorithms generally depends on automated methods of data representation that can uncover the different explanatory factors of variation behind the data. Conceptually, ML “features” and statistical “covariates” are synonymous when using traditional data types like continuous variables representing laboratory values. In these scenarios, however, ML models are readily able to expand on the logistic regression rule of thumb of limiting modeling to approximately one feature per 10 outcomes because they are much more resilient to overfitting.2, 3 Rapid growth in the number of variables highlights the need for parallel advances in interpretability and implementation, both of which are discussed later. Unlike traditional statistical regression models, ML models are able to look for nonlinear relationships in data.
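To make that interpretability concrete: each fitted logistic regression weight corresponds to an odds ratio via exponentiation. The coefficients and feature names below are invented for illustration; they are not taken from the article or any fitted model.

```python
import math

# Hypothetical fitted logistic regression weights for a binary outcome
# (e.g., hospital admission). These numbers are made up for illustration.
weights = {"age_per_decade": 0.18, "abnormal_vitals": 0.95, "prior_visit": -0.22}

for feature, beta in weights.items():
    # exp(beta) is the multiplicative change in the odds of the outcome
    # per one-unit increase in the feature, holding the others fixed.
    odds_ratio = math.exp(beta)
    print(f"{feature}: beta={beta:+.2f} -> odds ratio {odds_ratio:.2f}")
```

A positive weight (odds ratio above 1) raises the predicted odds and a negative weight lowers them, which is the "easily interpretable" property referenced above.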
This advance is critical because not all interactions can be captured by repeatedly adding weighted variables together. For example, decision trees enable “OR” logic, like “does this patient have known CAD OR a smoking history.” Exploration of nonlinear relationships is likely to highlight underappreciated clinical variables for further study. Beyond using traditional covariates to directly make predictions, ML can also achieve one further level of abstraction called feature or “representation” learning, where the algorithm is asked to make predictions after first learning how best to look at the data. Representation learning has facilitated key advances in clinical computer vision (e.g., finding blood on head CTs) and in massive-scale medical note processing (e.g., predicting hospital readmission from discharge notes). ML models thus enable researchers to break out of traditional approaches that use a small number of human-selected features and instead cast a wider net that may enable enhanced predictive capabilities and the discovery of new relational and potentially causal factors for outcomes. Adopting state-of-the-art models and methods for their development will enhance the performance and generalizability of ML solutions in emergency care. ML now comprises a dizzying array of steps, including preprocessing, exploratory data analysis, feature selection, training, model selection, validation, testing, and hyperparameter searches, which, as noted by Kareemi et al., complicates critical assessment. Expansion of and adherence to reporting guidelines and standards will play a key role for ML going forward.
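The tree-versus-weighted-sum distinction can be sketched in a few lines. The feature names below are hypothetical, and the code is only an illustration of why some interactions (XOR being the textbook case) resist a single weighted-sum cutoff:

```python
# A tree-style rule expresses "OR" logic as explicit branches, while a
# linear model must recover it from a weighted sum plus a cutoff.
# Feature names (known_cad, smoker) are illustrative, not the article's.

def tree_rule(known_cad: bool, smoker: bool) -> bool:
    """High risk if the patient has known CAD OR a smoking history."""
    if known_cad:
        return True
    return smoker

def linear_rule(known_cad: bool, smoker: bool) -> bool:
    """The same OR gate recovered from a weighted sum and a threshold."""
    score = 1.0 * known_cad + 1.0 * smoker  # booleans behave as 0/1
    return score >= 1.0

def xor_tree(a: bool, b: bool) -> bool:
    """XOR-style interaction: expressible as branches, but not as any
    single weighted-sum cutoff over a and b alone."""
    if a:
        return not b
    return b
```

For plain OR the two formulations agree on every input; the XOR case is the canonical interaction that no single linear cutoff can represent, while a tree handles it with one extra branch.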
Recently, the EQUATOR network announced a planned expansion of the “Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis” (TRIPOD) guideline for ML.4 Studies reporting artificial intelligence interventions should adhere to CONSORT-AI or SPIRIT-AI guidelines.5 The MI-CLAIM checklist highlights metrics for evaluating algorithm performance in addition to those for model clinical utility.6 For the former task, the area under the receiver operating characteristic curve (AUROC, otherwise known as the c-statistic), which is synonymous with the area under a model-derived sensitivity versus 1 − specificity curve, is most commonly used, but it is important to recall caveats to its use. For example, in studies where one of the outcomes (e.g., mortality) is rare, it is more useful to present data from precision (i.e., positive predictive value)–recall curves.7 Both of these benchmarks capture model “discrimination,” which requires that patients at higher risk for an outcome receive a higher predicted risk. In contrast, model “calibration” provides information on the reliability of risk estimates for individuals, matters significantly for patient decision making, and is often not reported.8, 9 Model clinical utility also depends on benchmarks that can be applied when interpreting results for individual patients, namely, positive/negative predictive values, numbers needed to treat/harm, sensitivity, and specificity.6 Research has revealed troubling examples in which the reality of algorithmic decision making falls short of our expectations of impartiality and freedom from bias. Some algorithms have been shown to replicate and even amplify human biases, particularly those affecting protected groups.10 In some cases, the clinical implementation of race is explicit, such as in GFR estimation.11 Elsewhere, bias can be an emergent property of missing types of data, such as the inability to control for socioeconomic or other health determinants.
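The AUROC-versus-precision caveat for rare outcomes can be made concrete with a small, purely synthetic example. The rank-based AUROC and the threshold PPV below are standard textbook formulations, not code from the article:

```python
# Sketch: why AUROC can look reassuring while precision (PPV) stays low
# when the outcome is rare. Pure-Python metrics; the data are synthetic.

def auroc(scores, labels):
    """AUROC via the rank (Mann-Whitney) formulation: the probability
    that a randomly chosen positive outranks a randomly chosen negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ppv_at(scores, labels, threshold):
    """Positive predictive value (precision) at a decision threshold."""
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    return sum(flagged) / len(flagged)

# 5 true positives scored high among 995 negatives, with a cluster of
# moderately scored negatives that end up above the decision threshold:
labels = [1] * 5 + [0] * 995
scores = [0.9, 0.8, 0.8, 0.7, 0.6] + [0.65] * 45 + [0.1] * 950

print(auroc(scores, labels))        # high: positives outrank most negatives
print(ppv_at(scores, labels, 0.6))  # low: flags are dominated by negatives
```

Here discrimination is excellent, yet only a small fraction of flagged patients actually have the outcome, which is exactly the situation where a precision-recall view is more informative than the AUROC alone.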
Finally, the use of proxy outcomes, like cost of care standing in for health needs, can lead to unintended and deeply problematic consequences.12 Subgroup testing (e.g., by gender and/or race) should be a core component of model assessment, as should careful consideration of outcome choices. As ML models transition from individual benchmarking studies to longitudinally deployed tools, new standards for algorithmic stewardship will be needed.13 There is mounting evidence that algorithms may experience “calibration drift” as features and outcomes shift in distribution over time.14 Emergency medicine researchers, data scientists, and others interested in implementing ML models in the real world must begin to plan for the full life cycle of models, which will require funding, expertise, and periodic model reevaluation. A focus on model parsimony, that is, minimizing the number of features required to execute the prediction task, is likely to help this aim because a simpler model is easier to maintain, but it must be balanced against the desire for high performance. The success of ML in emergency care requires implementation, measurement of impact on patient care, and dissemination to the larger community of health care providers. Linear regression models have stood the test of time in part because they are easily interpreted and have known limitations. With the complexity of ML approaches comes complexity in interpretation—model predictions themselves are simple, but how the model arrived at them is not. There is a fundamental difference between accepting the analysis of a field expert (e.g., a teleneuroradiologist interpreting an MRI) and doing the same for an algorithm, even though the requesting practitioner may find the study itself opaque. Interpretability is the goal of opening the “black box” of artificial intelligence.
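One simple way to operationalize the calibration concerns above is a binned comparison of mean predicted risk against observed event rates; rerun periodically on fresh data, the same table can help flag calibration drift. This is a generic sketch on synthetic data, not a method from the article:

```python
# Binned calibration check: group predictions into equal-width risk bins
# and compare mean predicted risk with the observed event rate per bin.

def calibration_table(preds, outcomes, n_bins=4):
    """Return (mean predicted risk, observed event rate, n) per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    table = []
    for contents in bins:
        if not contents:
            continue  # skip empty risk bins
        mean_pred = sum(p for p, _ in contents) / len(contents)
        obs_rate = sum(y for _, y in contents) / len(contents)
        table.append((round(mean_pred, 3), round(obs_rate, 3), len(contents)))
    return table

# A well-calibrated toy model: predicted 0.1 risk -> ~10% observed events,
# predicted 0.9 risk -> ~90% observed events.
preds    = [0.1] * 10 + [0.9] * 10
outcomes = [1] + [0] * 9 + [1] * 9 + [0]
print(calibration_table(preds, outcomes))
```

When the observed rate in a bin drifts away from the mean predicted risk on newer data, that is the "calibration drift" signal prompting model reevaluation.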
Most ML tools have a mechanism to indicate the importance of features in determining a model outcome, which provides some sense of the algorithm’s underpinnings. Unfortunately, these rarely offer the tangibility found in simple regressions. Significant leaps have been made in complex model interpretation, enabling both population-level and patient-level interpretation of the key drivers of risk.15, 16 The MI-CLAIM checklist includes requirements regarding model feasibility and interpretability.6 Interpretability also plays a key role in implementation. Dissemination and implementation (D&I) sciences are a growing force in emergency medicine just as they begin to contend with a new wave of ML tools.17, 18 D&I science serves as a reminder of the significant effort required beyond model development. ML will need a meaningful point of entry into the clinical workflow and accessible interpretation, while demanding neither too much attention nor a monopoly on computational resources. There is currently a significant barrier to entry for ML deployment, as health systems’ financial incentives must align with those of EHR vendors to purchase ML packages or build out their own data infrastructure. More fundamentally, ML will confront “the realities of entrenched institutions,” defined in part by the decision-making authority of medical providers within health care systems.18 D&I sciences provide a framework for the engagement of provider, patient, organizational, and policy stakeholders. Kareemi et al. find that ML outperforms usual care, as defined by clinician judgment, clinical decision tools, and triage-based scores. As with prior studies, they find that CDS tools are rarely compared to provider judgment.19 We propose that new models be designed with eventual comparison to provider judgment in mind. Kareemi et al. also note a paucity of randomized trials rigorously assessing for meaningful differences in patient-centered outcomes.
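Feature-importance mechanisms of the kind described above can be model-agnostic. Permutation importance, for example, shuffles one feature at a time and measures the resulting drop in accuracy. The toy model and features below are illustrative assumptions, not the article's:

```python
# Permutation importance sketch: shuffle one feature column and measure
# the accuracy drop. The toy "model" and data are purely illustrative.
import random

random.seed(0)  # deterministic shuffles for reproducibility

def model(row):
    """Toy classifier: predicts 1 when feature 0 exceeds 0.5; ignores feature 1."""
    return 1 if row[0] > 0.5 else 0

def accuracy(rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(rows, labels, feature_idx):
    """Accuracy drop when the chosen feature column is shuffled."""
    base = accuracy(rows, labels)
    shuffled_col = [r[feature_idx] for r in rows]
    random.shuffle(shuffled_col)
    perturbed = [list(r) for r in rows]
    for r, v in zip(perturbed, shuffled_col):
        r[feature_idx] = v
    return base - accuracy(perturbed, labels)

rows = [[x / 10, random.random()] for x in range(10)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]

# Feature 0 drives the model; feature 1 is noise, so its importance is 0.
print(permutation_importance(rows, labels, 0))
print(permutation_importance(rows, labels, 1))
```

As the text notes, a single importance score is still far less tangible than a regression coefficient: it says a feature matters, not how or in which direction.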
Multiple examples of interventional trials for CDS and AI, however, have emerged.20, 21 We anticipate rapid expansion of interventional trials as barriers to implementation fall and novel automated methods of EHR randomization emerge. Open science, a movement that promotes sharing of both primary data and source code, is a promising initiative to enhance reproducibility. Part of the open science movement is the recognition that data management and analytic decisions have critical implications for interpretation and that computing workflows need to follow the same practices as lab projects and notebooks, with organized data and documented steps. Adherence to these principles is critical for the success and transparency of ML in emergency medicine. Unfortunately, few studies published in widely accessed emergency medicine journals meet this bar.22 Researchers, funders, and journals will need to align on standards and expectations for data and algorithm sharing.6 Kareemi et al. highlight the promise of ML. Realizing ML’s potential in emergency care is a multifactorial challenge motivated by the potential for higher-quality, more efficient patient care. ML tools leverage many of the same core principles as traditional statistical approaches, while relaxing limitations on the number of variables under study, the varieties of input data, and the types of relationships between variables. EM, a field defined by high volume, acuity, and flexibility, is poised to participate in this paradigm shift, informed both by intrinsic factors (features, models, bias, maintenance) and by extrinsic ones (interpretability, implementation, evaluation, open science).


Topics

Artificial Intelligence in Healthcare and Education · Sepsis Diagnosis and Treatment · Medical Coding and Health Information