This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Ethical treatment of language models against harmful inference-time interventions
Citations: 0
Authors: 4
Year: 2026
Abstract
Open-weights large language models and low-cost steering methods are democratising the crafting of custom artificial-intelligence-based assistants. This benefit comes with the side effect of expanding the potential risks associated with harmful, toxic, or otherwise undesired uses of neural language models. Language model immunisation is a novel research area that seeks to mitigate these risks: immunised models are pre-trained models whose weights are hard to fine-tune toward harmful or dual-use tasks. While existing work on immunisation focuses on resistance against full-parameter or parameter-efficient fine-tuning, this paper proposes a candidate strategy to neutralise models against low-cost attacks based on inference-time interventions (ITI). The proposed approach, called Ethical Treatment (E.T.), consists of training layer-wise low-rank adaptors to locally neutralise attacks at the decoder-block level of Transformer-based models. (The term 'Ethical Treatment' refers to the technical process of immunising models, not to solving normative ethical questions.) Pilot experiments on Llama-3-8B-Instruct demonstrate E.T.'s effectiveness in reducing ITI-attack success rates while preserving utility on general-purpose tasks. Evaluation across the TinyBenchmarks suite shows that E.T. maintains strong performance on commonsense reasoning and world knowledge, with primary degradation limited to mathematical reasoning. While not solving the broader immunisation challenge, these results position E.T. as a promising step toward structurally robust open-weight models. (Code, reproducibility scripts, and compute-requirement specifications are available at https://github.com/DISTA-HCAI/ET.)
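The abstract's core idea, training a layer-wise low-rank adaptor so that a decoder block's output under an inference-time steering attack is pulled back toward its clean output, can be sketched numerically. Everything below (the dimensions, the fixed steering vector standing in for an ITI attack, and the plain-NumPy training loop) is an illustrative assumption for intuition only, not the paper's actual method or code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 32, 4, 256  # hidden size, adaptor rank, sampled hidden states

# Hypothetical attack: an ITI-style intervention adds a fixed steering
# vector to the decoder block's hidden state at inference time.
steer = rng.normal(size=d)

# Sampled hidden states with a dominant mean component (as Transformer
# activations typically have); the adaptor exploits this to cancel the shift.
mu = rng.normal(size=d)
H = mu + 0.3 * rng.normal(size=(n, d))

# LoRA-style low-rank adaptor: delta(h) = h @ A @ B, with rank r << d.
A = 0.1 * rng.normal(size=(d, r))
B = np.zeros((r, d))

lr = 0.01
for _ in range(2000):
    # Attacked block output plus the adaptor's correction; the training
    # target is the clean output H, so the error reduces to steer + H@A@B.
    err = (H + steer + H @ A @ B) - H
    grad_A = (H.T @ err @ B.T) / n
    grad_B = (A.T @ H.T @ err) / n
    A -= lr * grad_A
    B -= lr * grad_B

# Fraction of the injected steering signal that survives adaptation.
rel = np.linalg.norm(H + steer + H @ A @ B - H) / np.linalg.norm(
    np.broadcast_to(steer, (n, d)))
print(f"residual attack norm after adaptation: {rel:.2f} of original")
```

The adaptor cancels most of the injected shift while touching only 2·d·r parameters per block, which mirrors why per-layer low-rank correction is cheap enough to apply at every decoder block.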
Similar Works
Rethinking the Inception Architecture for Computer Vision
2016 · 30,439 citations
MobileNetV2: Inverted Residuals and Linear Bottlenecks
2018 · 24,589 citations
CBAM: Convolutional Block Attention Module
2018 · 21,484 citations
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2020 · 21,362 citations
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
2015 · 18,544 citations