This is an overview page with metadata for this scientific work. The full article is available from the publisher.
What Hard Tokens Reveal: Exploiting Low-confidence Tokens for Membership Inference Attacks against Large Language Models
Citations: 0
Authors: 3
Year: 2026
Abstract
With the widespread adoption of Large Language Models (LLMs) and increasingly stringent privacy regulations, protecting data privacy in LLMs has become essential, especially for privacy-sensitive applications. Membership Inference Attacks (MIAs) attempt to determine whether a specific data sample was included in a model's training or fine-tuning dataset, posing serious privacy risks. However, most existing MIA techniques against LLMs rely on sequence-level aggregated prediction statistics, which fail to distinguish prediction improvements caused by generalization from those caused by memorization, leading to low attack effectiveness. To address this limitation, we propose HT-MIA, a novel membership inference approach that captures token-level probabilities at low-confidence (hard) tokens, where membership signals are more pronounced. By comparing token-level probability improvements at hard tokens between a fine-tuned target model and a pre-trained reference model, HT-MIA isolates strong and robust membership signals that are obscured in prior MIA approaches. Extensive experiments on both domain-specific medical datasets and general-purpose benchmarks demonstrate that HT-MIA consistently outperforms seven state-of-the-art MIA baselines. We further investigate differentially private training as an effective defense mechanism against MIAs in LLMs. Overall, our HT-MIA framework establishes hard-token-based analysis as a state-of-the-art foundation for advancing membership inference attacks and defenses for LLMs.
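A minimal sketch of the hard-token scoring idea described in the abstract, assuming Hugging Face causal language models. The threshold tau, the mean aggregation, and all function names are illustrative assumptions, since the abstract does not specify the paper's exact formulation.

```python
# Hedged sketch of a hard-token membership score: compare per-token
# log-probabilities between a fine-tuned target model and its
# pre-trained reference, restricted to tokens the reference finds hard.
# tau and the mean aggregation are assumptions, not the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_logprobs(model, input_ids):
    """Per-token log-probabilities of a sequence under a causal LM."""
    with torch.no_grad():
        logits = model(input_ids).logits            # (1, T, V)
    logp = torch.log_softmax(logits[:, :-1], dim=-1)  # predict token t+1 from prefix
    targets = input_ids[:, 1:]                        # (1, T-1)
    return logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (1, T-1)

def ht_mia_score(text, target, reference, tokenizer, tau=-4.0):
    """Mean log-prob improvement on 'hard' tokens, i.e. tokens the
    pre-trained reference assigns low probability (log p < tau).
    Higher scores suggest the target memorized the sample."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    lp_ref = token_logprobs(reference, ids)
    lp_tgt = token_logprobs(target, ids)
    hard = lp_ref < tau                 # low-confidence under the reference
    if hard.sum() == 0:
        return 0.0
    return (lp_tgt[hard] - lp_ref[hard]).mean().item()
```

Here `target` would be the fine-tuned model and `reference` its pre-trained base; a sample is flagged as a likely training member when the score exceeds a calibrated decision threshold.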
Related Works
Rethinking the Inception Architecture for Computer Vision
2016 · 30,638 citations
MobileNetV2: Inverted Residuals and Linear Bottlenecks
2018 · 24,884 citations
CBAM: Convolutional Block Attention Module
2018 · 21,747 citations
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2020 · 21,478 citations
Xception: Deep Learning with Depthwise Separable Convolutions
2017 · 18,662 citations