This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
Citations: 0
Authors: 29
Year: 2026
Abstract
Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at https://github.com/rattlesnakey/Awesome-Actionable-MI-Survey.
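The abstract frames MI as a "Locate, Steer, and Improve" pipeline of diagnosis followed by intervention. As a rough illustration of what a locate-then-steer intervention can look like in practice, the sketch below derives a contrastive steering direction from hidden activations and adds it back into the residual stream during generation. This is a generic activation-steering pattern, not the survey's specific method; the model name, layer index, prompts, and scaling factor are illustrative assumptions.

```python
# Minimal locate-and-steer sketch (assumed setup: GPT-2 via Hugging Face Transformers).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # placeholder model, not specified by the survey
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
layer = model.transformer.h[6]           # hypothetical intervention site in the residual stream

def last_token_activation(prompt: str) -> torch.Tensor:
    """Locate: capture the block output for the final prompt token."""
    cache = {}
    def grab(_module, _inputs, output):
        cache["h"] = output[0][:, -1, :].detach()
    handle = layer.register_forward_hook(grab)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cache["h"]

# Steering direction from a contrastive prompt pair (illustrative prompts).
direction = last_token_activation("I love this movie.") - last_token_activation("I hate this movie.")

def add_direction(_module, _inputs, output):
    """Steer: add the scaled direction to the block's hidden states at every position."""
    return (output[0] + 4.0 * direction,) + output[1:]

handle = layer.register_forward_hook(add_direction)
inputs = tok("The film was", return_tensors="pt")
with torch.no_grad():
    out_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tok.decode(out_ids[0], skip_special_tokens=True))
handle.remove()                          # remove the hook to restore unsteered behavior
```

The scaling factor 4.0 and the mid-depth layer are arbitrary choices for illustration; in practice both are tuned per model and per behavior being steered.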
Related Works
- Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization (2017 · 20,676 citations)
- Generative Adversarial Nets (2023 · 19,895 citations)
- Visualizing and Understanding Convolutional Networks (2014 · 15,318 citations)
- "Why Should I Trust You?" (2016 · 14,522 citations)
- On a Method to Measure Supervised Multiclass Model’s Interpretability: Application to Degradation Diagnosis (Short Paper) (2024 · 13,191 citations)
Authors
- Hengyuan Zhang
- Zhihao Zhang
- Mingyang Wang
- Zunhai Su
- Yiwei Wang
- Qianli Wang
- Shuzhou Yuan
- Ercong Nie
- Xufeng Duan
- Qibo Xue
- Zeping Yu
- Chenming Shang
- Xiao Liang
- Jing Xiong
- Hui Shen
- Chaofan Tao
- Zhengwu Liu
- Senjie Jin
- Dongdong Zhang
- Sophia Ananiadou
- Tao Gui
- Ruobing Xie
- Hayden Kwok-Hay So
- Hinrich Schütze
- Xuanjing Huang
- Qi Zhang
- Ngai Wong