This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Towards Uncovering How Large Language Models Work: An Interpretability Perspective
0
Citations
6
Authors
2025
Year
Abstract
Large language models (LLMs) have shown remarkable performance in tackling natural language tasks, yet the internal mechanisms that enable their impressive generalization and reasoning abilities remain opaque. This lack of transparency presents significant challenges in fundamentally eliminating undesirable behaviors such as hallucinations and toxicity, hindering the safe and beneficial deployment of LLMs. This survey paper aims to uncover the internal working mechanisms underlying LLM functionality through the lens of explainability. First, we review how knowledge is encoded within LLMs via mechanistic interpretability techniques. Then, we summarize what knowledge is embedded in LLM representations by leveraging probing techniques and representation engineering. Additionally, we investigate the training dynamics to explore models' generalization abilities through grokking and memorization. Finally, we explore how the insights gained from these explanations can further enhance LLM performance through model editing, improve efficiency through pruning, and better align with human values.
Similar Works
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
2017 · 20,639 citations
Generative Adversarial Nets
2023 · 19,894 citations
Visualizing and Understanding Convolutional Networks
2014 · 15,312 citations
"Why Should I Trust You?"
2016 · 14,486 citations
On a Method to Measure Supervised Multiclass Model’s Interpretability: Application to Degradation Diagnosis (Short Paper)
2024 · 13,181 citations