This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
From One-hot Encoding to Privacy-preserving Synthetic Electronic Health Records Embedding
Citations: 11
Authors: 6
Year: 2020
Abstract
Categorical encoding, typically one-hot encoding, plays a central role in training machine learning models. This classic approach is the most prevalent strategy because of its simplicity. However, as the number of categories grows large and the data become sparse, training becomes infeasible because one-hot encoding creates high-dimensional vectors, which also risk revealing private information and breaking the data's underlying structure. We propose using intermediate representation learning (embedding) to overcome these limitations. Instead of representing data with a high-cardinality one-hot vector, an embedding is a lower-dimensional dense vector in which each cell can contain any number, while also capturing the latent hierarchical structure of the features. Sharing embeddings can also be assumed to be safer than releasing raw one-hot encoded data, in which the presence of a particular feature is represented by the value 1 and its absence by 0. A Generative Adversarial Network (GAN) further alleviates the leakage of sensitive information by creating synthetic data for modeling. Our results suggest that, although even embedded features may pose some privacy risk, deploying a GAN will make a wider variety of medical datasets available by retaining their relative utility while preserving data privacy, an approach that has been identified as promising for medical machine learning and prediction.
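The contrast the abstract draws between a one-hot vector and a dense embedding can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the four-entry code list, the embedding dimension of 2, and the random table values, which stand in for weights that would normally be learned.

```python
import random

# Hypothetical toy vocabulary of categorical codes (illustrative only).
codes = ["I10", "E11", "J45", "N18"]
index = {code: i for i, code in enumerate(codes)}


def one_hot(code):
    """Return a sparse 0/1 vector with a single 1 at the code's position."""
    vec = [0.0] * len(codes)
    vec[index[code]] = 1.0
    return vec


# Assumed embedding table: one dense row per code. In a real model these
# rows would be learned; random placeholders are used here.
embedding_dim = 2
rng = random.Random(0)
embedding_table = [
    [rng.gauss(0.0, 1.0) for _ in range(embedding_dim)] for _ in codes
]


def embed(code):
    """Look up the dense vector for a code."""
    return embedding_table[index[code]]


def one_hot_times_table(vec):
    """Multiply a one-hot row vector by the table, which selects one row."""
    return [
        sum(v * row[j] for v, row in zip(vec, embedding_table))
        for j in range(embedding_dim)
    ]


x = one_hot("E11")   # length grows with the number of categories
e = embed("E11")     # length stays at embedding_dim
```

The one-hot vector has one cell per category and shows directly which category is present, whereas the embedding has a fixed, smaller length and real-valued cells. The lookup is equivalent to multiplying the one-hot vector by the table, which is why an embedding layer can replace a one-hot input without changing what the model can express.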
Similar works
k-ANONYMITY: A MODEL FOR PROTECTING PRIVACY
2002 · 8,400 citations
Calibrating Noise to Sensitivity in Private Data Analysis
2006 · 6,884 citations
Deep Learning with Differential Privacy
2016 · 5,608 citations
Communication-Efficient Learning of Deep Networks from Decentralized Data
2016 · 5,592 citations
Large-Scale Machine Learning with Stochastic Gradient Descent
2010 · 5,570 citations