Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
Exploring the Privacy-Preserving Properties of Word Embeddings: Algorithmic Validation Study
24
Zitationen
4
Autoren
2020
Jahr
Abstract
BACKGROUND: Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models. OBJECTIVE: This paper aims to demonstrate that traditional word embeddings created on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information. METHODS: We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each. RESULTS: We found that if publicly released embeddings are trained from a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.5% (n=411/600) of the full names that remain in the deidentified corpus and associated sensitive information to specific patients in the corpus from which the embeddings were created. We also found that the distance between the word vector representation of a patient's name and a diagnostic billing code is informative and differs significantly from the distance between the name and a code not billed for that patient. CONCLUSIONS: Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy. If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms. A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws. Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.
Ähnliche Arbeiten
"Why Should I Trust You?"
2016 · 14.688 Zit.
Coding Algorithms for Defining Comorbidities in ICD-9-CM and ICD-10 Administrative Data
2005 · 10.544 Zit.
A Comprehensive Survey on Graph Neural Networks
2020 · 8.925 Zit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8.504 Zit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8.025 Zit.