OpenAlex · Updated hourly · Last updated: 19.03.2026, 21:06

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

A High-Performance Knowledge Distillation Framework Based on Temperature Decoupling

2025 · 0 citations
Open full text at the publisher

0 citations

3 authors

2025 (year)

Abstract

Knowledge Distillation (KD) has become a widely used model compression technique for large language models (LLMs). Most mainstream KD methods adopt a temperature-sharing mechanism, where both teacher and student models use a common softmax temperature to smooth predictions. However, this shared-temperature setting often results in distribution misalignment between the two models, limiting distillation effectiveness. To address this issue, we propose Temperature-Decoupled Knowledge Distillation (TDKD), a distillation framework that allows independent temperature control for teacher and student models. We theoretically justify this decoupling using the Lagrange principle, and introduce ExpStep-TS, a generalized exponential-stepwise temperature schedule that enhances flexibility in temperature tuning. Furthermore, we propose an Asymmetric Temperature Correction (ATC) mechanism to analyze the impact of the temperature scaling factor on forward and reverse KL divergences, and develop the TDKL loss accordingly. Experiments conducted on the GPT-2 model family across four datasets (Dolly, Self-Instruct, Sinst, and Vicuna) demonstrate that our framework is compatible with multiple loss functions and achieves consistent ROUGE-L improvements of 6%–15%, validating its effectiveness in natural language generation and summarization.
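This page does not include the paper's exact TDKD/TDKL formulation, the ExpStep-TS schedule, or the ATC mechanism, so the sketch below only illustrates the core idea named in the abstract: giving teacher and student independent softmax temperatures in a KL-based distillation loss, instead of one shared temperature. It is a minimal PyTorch sketch under that assumption; the function name tdkd_loss, the default temperatures, and the choice of forward KL are illustrative and are not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def tdkd_loss(student_logits: torch.Tensor,
              teacher_logits: torch.Tensor,
              t_student: float = 1.0,
              t_teacher: float = 4.0) -> torch.Tensor:
    """Forward-KL distillation loss with decoupled softmax temperatures:
    teacher and student distributions are smoothed independently."""
    # Teacher distribution smoothed with its own temperature.
    teacher_probs = F.softmax(teacher_logits / t_teacher, dim=-1)
    # Student log-distribution smoothed with a separate temperature.
    student_log_probs = F.log_softmax(student_logits / t_student, dim=-1)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Example: distill next-token distributions over a GPT-2-sized vocabulary.
student_logits = torch.randn(8, 50257)
teacher_logits = torch.randn(8, 50257)
loss = tdkd_loss(student_logits, teacher_logits, t_student=2.0, t_teacher=4.0)
```

With t_student equal to t_teacher this reduces to a standard shared-temperature KD objective; the paper's ExpStep-TS schedule would vary the temperatures during training and its ATC mechanism would adjust the forward/reverse KL terms, but their exact forms are not given on this page.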

Related works

Authors

Institutions

Topics

Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare