This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Inside the Black Box: Detecting Data Leakage in Pre-Trained Language Encoders
Citations: 1
Authors: 7
Year: 2024
Abstract
Despite being prevalent in the general field of Natural Language Processing (NLP), pre-trained language models inherently carry privacy and copyright concerns due to their nature of training on large-scale web-scraped data. In this paper, we pioneer a systematic exploration of such risks associated with pre-trained language encoders, specifically focusing on the membership leakage of pre-training data exposed through downstream models adapted from pre-trained language encoders, an aspect largely overlooked in existing literature. Our study encompasses comprehensive experiments across four types of pre-trained encoder architectures, three representative downstream tasks, and five benchmark datasets. Intriguingly, our evaluations reveal, for the first time, the existence of membership leakage even when only the black-box output of the downstream model is exposed, highlighting a privacy risk far greater than previously assumed. Alongside, we present in-depth analysis and insights toward guiding future researchers and practitioners in addressing the privacy considerations in developing pre-trained language models.
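The abstract does not spell out the attack procedure, but the black-box setting it describes is commonly illustrated with a confidence-thresholding membership inference baseline: query the downstream model with a candidate text, observe only its output probabilities, and flag unusually confident predictions as suspected training (here, pre-training) members. The sketch below is a minimal generic illustration of that idea, not the paper's actual method; the function names (`confidence_attack`, `predict_proba`), the threshold value, and the toy model are assumptions made for this example.

```python
import numpy as np
from typing import Callable, Sequence

def confidence_attack(
    predict_proba: Callable[[str], np.ndarray],  # black-box access: text -> class probabilities
    candidates: Sequence[str],
    threshold: float = 0.9,
) -> list[bool]:
    """Flag a candidate as a suspected member when the downstream model's
    top-class confidence exceeds a threshold (generic baseline, not the paper's method)."""
    guesses = []
    for text in candidates:
        probs = predict_proba(text)  # only the black-box output is observed
        guesses.append(float(np.max(probs)) >= threshold)
    return guesses

if __name__ == "__main__":
    # Hypothetical toy "model" that is overconfident on strings it has "seen".
    toy = lambda t: np.array([0.97, 0.03]) if "seen" in t else np.array([0.55, 0.45])
    print(confidence_attack(toy, ["a seen sentence", "an unseen sentence"]))  # [True, False]
```

In practice, such baselines are often refined with per-example calibration or shadow models, but the core signal remains the model's output confidence alone.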
Related works
Determining Sample Size for Research Activities
1970 · 17,647 citations
Scale Development: Theory and Applications
1991 · 14,735 citations
Online Learning: A Panacea in the Time of COVID-19 Crisis
2020 · 4,915 citations
Systematic review of research on artificial intelligence applications in higher education – where are the educators?
2019 · 4,423 citations
Blended learning: Uncovering its transformative potential in higher education
2004 · 4,405 citations