This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

The Effectiveness of Style Vectors for Steering Large Language Models: A Human Evaluation

2025 · 0 citations · IEEE Access · Open Access

Abstract

Controlling the behavior of large language models (LLMs) at inference time is essential for aligning outputs with human abilities and safety requirements. Activation steering provides a lightweight alternative to prompt engineering and fine-tuning by directly modifying internal activations to guide generation. This research advances the literature in three directions. First, while previous work demonstrated the technical feasibility of steering emotional tone using automated classifiers, this paper presents the first human evaluation of activation steering applied to the emotional tone of LLM outputs, collecting over 7,000 crowd-sourced ratings from 190 participants via Prolific. These ratings assess both perceived emotional intensity and overall text quality. Second, we find strong alignment between human and model-based quality ratings (mean r = 0.776, range 0.157–0.985), indicating that automatic scoring can serve as a proxy for perceived quality. Moderate steering strengths (λ ≈ 0.15) reliably amplify target emotions while preserving comprehensibility, with the strongest effects for disgust (ηp² = 0.616) and fear (ηp² = 0.540) and minimal effects for surprise (ηp² = 0.042). Finally, upgrading from Alpaca to LLaMA-3 yielded more consistent steering, with significant effects across emotions and strengths (all p < 0.001). Inter-rater reliability was high (ICC = 0.71–0.87), underscoring the robustness of the findings. Overall, the results support activation-based control as a scalable method for steering LLM behavior across affective dimensions.
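
The abstract describes activation steering only at a high level. The following is a minimal sketch of the core arithmetic, assuming the common formulation in which a style vector v is added to a layer's hidden state h, scaled by the steering strength λ (h' = h + λ·v), and assuming v is built as the difference between mean activations on emotion-laden and neutral texts. The function names (`mean_vector`, `style_vector`, `steer`) and the toy activations are hypothetical; a real setup would read and modify activations inside the model at a chosen layer, which is not shown here.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(activations: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    if not activations:
        raise ValueError("need at least one activation vector")
    dim = len(activations[0])
    if any(len(a) != dim for a in activations):
        raise ValueError("all activation vectors must have the same length")
    n = len(activations)
    return [sum(a[i] for a in activations) / n for i in range(dim)]


def style_vector(
    target_acts: Sequence[Sequence[float]],
    neutral_acts: Sequence[Sequence[float]],
) -> Vector:
    # Assumed (hypothetical) mean-difference construction: average activation
    # on texts with the target emotion minus average activation on neutral texts.
    target_mean = mean_vector(target_acts)
    neutral_mean = mean_vector(neutral_acts)
    if len(target_mean) != len(neutral_mean):
        raise ValueError("target and neutral activations differ in size")
    return [t - n for t, n in zip(target_mean, neutral_mean)]


def steer(hidden: Sequence[float], v: Sequence[float], lam: float) -> Vector:
    """Shift one hidden state along the style vector: h' = h + lam * v."""
    if len(hidden) != len(v):
        raise ValueError("hidden state and style vector must have the same length")
    return [h + lam * vi for h, vi in zip(hidden, v)]


# Toy 3-dimensional activations (made-up numbers for illustration only).
fear_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
neutral_acts = [[0.0, 0.0, 1.0], [0.0, 2.0, 1.0]]

v_fear = style_vector(fear_acts, neutral_acts)
h = [0.5, 0.5, 0.5]

# Compare no steering, a moderate strength, and a strong one.
for lam in (0.0, 0.15, 1.0):
    print(lam, steer(h, v_fear, lam))
```

With λ = 0 the hidden state is unchanged, and larger values push it further along v. The abstract reports that moderate strengths around λ ≈ 0.15 amplified the target emotion while keeping text comprehensible; the toy numbers here do not reproduce that finding.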

Topics: Topic Modeling; Artificial Intelligence in Healthcare and Education; Text Readability and Simplification
