Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
Exploring the Utility of Anonymized EHR Datasets in Machine Learning Experiments in the Context of the MODELHealth Project
2
Zitationen
5
Autoren
2022
Jahr
Abstract
The object of this paper was the application of machine learning to a clinical dataset that was anonymized using the Mondrian algorithm. (1) Background: The preservation of patient privacy is a necessity rising from the increasing digitization of health data; however, the effect of data anonymization on the performance of machine learning models remains to be explored. (2) Methods: The original EHR derived dataset was subjected to anonymization by applying the Mondrian algorithm for various k values and quasi identifier (QI) set attributes. The logistic regression, decision trees, k-nearest neighbors, Gaussian naive Bayes and support vector machine models were applied to the different dataset versions. (3) Results: The classifiers demonstrated different degrees of resilience to the anonymization, with the decision tree and the KNN models showing remarkably stable performance, as opposed to the Gaussian naïve Bayes model. The choice of the QI set attributes and the generalized information loss value played a more important role than the size of the QI set or the k value. (4) Conclusions: Data anonymization can reduce the performance of certain machine learning models, although the appropriate selection of classifier and parameter values can mitigate this effect.
Ähnliche Arbeiten
k-ANONYMITY: A MODEL FOR PROTECTING PRIVACY
2002 · 8.397 Zit.
Calibrating Noise to Sensitivity in Private Data Analysis
2006 · 6.878 Zit.
Deep Learning with Differential Privacy
2016 · 5.604 Zit.
Communication-Efficient Learning of Deep Networks from Decentralized\n Data
2016 · 5.592 Zit.
Large-Scale Machine Learning with Stochastic Gradient Descent
2010 · 5.569 Zit.