Seeing the Forest for the Trees: Random Forest Models for Predicting Survival in Kidney Transplant Recipients
2019 · 2 authors · 33 citations
Abstract
Risk prediction plays an important role in clinical transplantation research. Traditionally, most risk models have been based on regression models.1 Although useful to help understand relationships between predictors and outcomes, these statistical methods can typically evaluate only a small number of predictors, which are assumed to affect everyone in the same way, and uniformly throughout the participants' lifespan. These methods have several limitations,2 including the inability to analyze nonlinear relationships, the requirement of setting a level of binary significance, impracticality for analyzing large datasets, and vulnerability to bias secondary to variable selection and/or omission of relevant confounders.

With the emergence of P4 (Predictive, Preventive, Personalized, and Participatory) and Precision Medicine, artificial intelligence and machine learning methods have come to attention as methods aimed at solving the challenges in analysis not well addressed by regression approaches. Machine learning methods provide algorithms to understand patterns from large, complex, and heterogeneous data.3 Of the machine learning methods, recursive partitioning, and especially random forests, can deal with large numbers of predictor variables even in the presence of complex interactions.2,4 These methods have been applied successfully in genetics, clinical research, and bioinformatics.

In this issue of Transplantation, Scheffner et al report on the development and internal validation of a random forest prediction model for patient survival.5 Random forest models are composed of a collection of decision trees. In the process of building each decision tree, different random subsets of the variables from the training dataset are selected to establish how best to partition the dataset at each node.6 Random forest models are considered less vulnerable to overfitting the training dataset given the large number of trees built, making each tree an independent model.
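The ensemble construction described above — many trees, each grown on a bootstrap resample with a random subset of variables considered at each node — can be sketched with scikit-learn's `RandomForestClassifier`. This is a minimal illustration on synthetic data, not the survival model of Scheffner et al; all parameter choices here are illustrative.

```python
# Minimal random forest sketch on synthetic data, assuming scikit-learn
# is available. Illustrative only -- not the model of Scheffner et al.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic "cohort": 500 patients, 10 candidate predictors.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,     # many trees, so no single tree dominates
    max_features="sqrt",  # random subset of variables tried at each node
    bootstrap=True,       # each tree sees a bootstrap resample of the data
    random_state=0,
)
forest.fit(X_train, y_train)
print(round(forest.score(X_test, y_test), 2))
```

The `max_features` and `bootstrap` arguments correspond directly to the two sources of randomness described in the text: random variable subsets per split and bootstrapped training samples per tree.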
The lower likelihood of bias is a result of bootstrapping several trees over randomly selected subsets of variables and subsamples of data.6 Random forest models require little preprocessing of data; the data need not be normalized; and the approach is resilient to outliers. While missing data can be a challenge when drawing clinical inferences from standard statistical models, machine learning methods tend to make fewer assumptions about the underlying data and, thus, are less vulnerable to the challenges associated with violation of those assumptions. Relying on fewer assumptions than regression analysis, machine learning methods have been shown to deliver more robust predictions.

Scheffner and colleagues5 split a retrospective cohort of kidney transplant recipients with posttransplantation protocol biopsies into training and validation datasets (Figure 2A and B). Using all pretransplant and 3- and 12-months posttransplant variables, the obtained models showed good performance to predict death (concordance index: 0.77–0.78). Validation showed a concordance index of 0.76 and good discrimination of risks by the models, despite substantial differences in clinical variables and the derivation dataset representing an earlier era (2000–2007) than the validation dataset (2008–2013).
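The concordance index quoted above is, for right-censored survival data, the fraction of comparable patient pairs whose predicted risks are correctly ordered (Harrell's C). A minimal pure-Python sketch follows; the toy data and variable names are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C for right-censored data: among comparable pairs
    (the earlier time must be an observed event), count pairs where
    the higher predicted risk belongs to the patient with the shorter
    survival time. Tied risks count as 0.5."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable only if patient i has an observed event
            # strictly before patient j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy example: follow-up times (months), event indicators
# (1 = death, 0 = censored), and model-predicted risk scores.
times  = [6, 12, 24, 36, 60]
events = [1, 1, 0, 1, 0]
risks  = [0.9, 0.7, 0.4, 0.5, 0.2]
print(concordance_index(times, events, risks))  # 1.0: risks perfectly ordered
```

A C-index of 0.5 corresponds to random ordering and 1.0 to perfect discrimination, so the 0.76–0.78 values reported by Scheffner et al indicate good, though not perfect, risk ranking.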
To contrast with outputs of multivariable regression models using the same datasets, see Tables 2 and 3 and nomograms predicting mortality risk using estimators from multivariable Cox models (Figure 3) in Abeling et al.7 Random survival forests also inform on the importance of descriptive variables.6 Scheffner et al found the potentially modifiable (and highly correlated) graft rejection treatment and urinary tract infection to be important predictors of patient survival in addition to established factors like age, cardiovascular disease, diabetes, and graft function (Figure 3A and B).5 Many of the predictors retained in multivariable regression models7 were also deemed important in random forest survival analyses.5 To validate selected predictors and model construction, it is important to pursue external validation with independent datasets. Random survival forests may complement regression analyses when handling highly correlated complex survival data. Opportunities for application (and limitations) of each of the regression and random survival forests for prediction are summarized in Table 1.

TABLE 1. Regression and random survival forests for survival analysis

Predictive models in transplantation and donation help risk stratify patients and could improve quality of healthcare delivery as well as patient outcomes. The increasing interest in these tools warrants a better understanding of their challenges and limitations.8 First, highly predictive variables may not necessarily be causally related to the outcomes of interest. Second, the success of machine learning models depends on the relationship between predictors and outcome being represented in training/validation datasets, the number of observations and features, selection and parameterization of features, and the algorithm chosen for the model. Careful variable definition (eg, urinary tract infection) is necessary.
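The variable-importance ranking mentioned above can be illustrated with scikit-learn's `feature_importances_` (mean impurity decrease across trees). The sketch below uses synthetic data in which only two of five predictors drive the outcome; the predictor names are hypothetical and chosen only to echo the editorial's examples.

```python
# Sketch of random forest variable importance on synthetic data,
# assuming scikit-learn is available. Predictor names are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
# Only "age" and "graft_function" actually determine the outcome.
age = rng.normal(50, 12, n)
graft_function = rng.normal(60, 15, n)
noise = rng.normal(0, 1, (n, 3))        # three uninformative predictors
X = np.column_stack([age, graft_function, noise])
y = ((age > 55) | (graft_function < 45)).astype(int)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
names = ["age", "graft_function", "noise1", "noise2", "noise3"]
for name, imp in sorted(zip(names, forest.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")
```

The two informative predictors dominate the ranking; as the editorial cautions, high importance indicates predictive value, not causation.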
Presence of highly correlated linear and nonlinear relationships between independent variables may warrant mechanisms for removal of the correlated variables. Model performance may also be compromised when studying rare outcomes.4 Inevitably, generalizability of machine learning models may be limited when the clinical context, local factors (including patient/physician preferences, health systems, and care standards), and therapeutic strategies vary.

To enable assessment of model validity, correct interpretation of model outputs, replication, and future knowledge synthesis, it is vital that the transplantation and donation community promote adherence to guidelines on the dissemination and reporting of machine learning models.8,9 Authors should be encouraged to report all model parameters, transformations applied to raw data, sampling methods, and random number generator seeds. Whenever possible, algorithms and associated code should be released in public software archive domains. There is a need for new models of health data ownership with rights to the individual, highly secure data repositories, government legislation for data sharing, and usage policies to ensure privacy and data security. Moreover, with wide uptake of machine learning and artificial intelligence tools, the scale of iatrogenic risks and liabilities related to their application, in contrast to the implications of a single doctor's mistake for a given patient, also warrant assessment.10

Most practice guidelines are geared toward the "average patient." Machine learning tools can capture the complexity of individual patients' characteristics and aid transplant clinicians with patient-specific care decisions. As these tools become more prevalent, it is important to develop best practice guidelines and ensure there is regulatory oversight on their development and application.
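The reporting recommendation above (all model parameters, data transformations, and random number generator seeds) can be made concrete by serializing a model's full configuration alongside the code release. A minimal sketch using scikit-learn's `get_params()`; the report structure is illustrative, not a prescribed standard.

```python
# Sketch: capture every hyperparameter (including the seed) of a model
# for reproducible reporting, assuming scikit-learn is available.
import json
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                               random_state=42)

# Illustrative report; releasing this with the code and sampling
# description enables exact replication of the analysis.
report = {
    "model_class": type(model).__name__,
    "parameters": model.get_params(),
}
print(json.dumps(report, indent=2, default=str))
```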