This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Ethical treatment of language models against harmful inference-time interventions
Citations: 0
Authors: 4
Year: 2026
Abstract
Open-weights large language models and low-cost steering methods are democratising the crafting of custom artificial-intelligence-based assistants. This benefit comes with the side effect of expanding the potential risks associated with harmful, toxic, or otherwise undesired uses of neural language models. Language model immunisation is a novel research area that seeks to mitigate these risks: immunised models are pre-trained models whose weights are hard to fine-tune toward harmful or dual-use tasks. While existing work on immunisation focuses on resistance against full-parameter or parameter-efficient fine-tuning, this paper proposes a candidate strategy to neutralise models against low-cost attacks based on inference-time interventions (ITI). The proposed approach, called Ethical Treatment (E.T.), consists of training layer-wise low-rank adaptors to locally neutralise attacks at the decoder-block level of Transformer-based models. (The term 'Ethical Treatment' refers to the technical process of immunising models, not to solving normative ethical questions.) Pilot experiments on Llama-3-8B-Instruct demonstrate E.T.'s effectiveness in reducing ITI-attack success rates while preserving utility on general-purpose tasks. Evaluation across the TinyBenchmarks suite shows that E.T. maintains strong performance on commonsense reasoning and world knowledge, with primary degradation limited to mathematical reasoning. While not solving the broader immunisation challenge, these results position E.T. as a promising step toward structurally robust open-weight models. (Code, reproducibility scripts, and compute-requirement specifications are available at https://github.com/DISTA-HCAI/ET.)
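The abstract's core idea, training a layer-wise low-rank adaptor so that a decoder block's output under an inference-time steering attack is pulled back toward its clean output, can be sketched numerically. Everything below (the dimensions, the fixed steering vector standing in for an ITI attack, and the plain-NumPy training loop) is an illustrative assumption for intuition only, not the paper's actual method or code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 32, 4, 256  # hidden size, adaptor rank, sampled hidden states

# Hypothetical attack: an ITI-style intervention adds a fixed steering
# vector to the decoder block's hidden state at inference time.
steer = rng.normal(size=d)

# Sampled hidden states with a dominant mean component (as Transformer
# activations typically have); the adaptor exploits this to cancel the shift.
mu = rng.normal(size=d)
H = mu + 0.3 * rng.normal(size=(n, d))

# LoRA-style low-rank adaptor: delta(h) = h @ A @ B, with rank r << d.
A = 0.1 * rng.normal(size=(d, r))
B = np.zeros((r, d))

lr = 0.01
for _ in range(2000):
    # Attacked block output plus the adaptor's correction; the training
    # target is the clean output H, so the error reduces to steer + H@A@B.
    err = (H + steer + H @ A @ B) - H
    grad_A = (H.T @ err @ B.T) / n
    grad_B = (A.T @ H.T @ err) / n
    A -= lr * grad_A
    B -= lr * grad_B

# Fraction of the injected steering signal that survives adaptation.
rel = np.linalg.norm(H + steer + H @ A @ B - H) / np.linalg.norm(
    np.broadcast_to(steer, (n, d)))
print(f"residual attack norm after adaptation: {rel:.2f} of original")
```

The adaptor cancels most of the injected shift while touching only 2·d·r parameters per block, which mirrors why per-layer low-rank correction is cheap enough to apply at every decoder block.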
Similar Works
Rethinking the Inception Architecture for Computer Vision
2016 · 30,439 citations
MobileNetV2: Inverted Residuals and Linear Bottlenecks
2018 · 24,589 citations
CBAM: Convolutional Block Attention Module
2018 · 21,484 citations
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2020 · 21,362 citations
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
2015 · 18,544 citations