Similar to 2605.30833 · 20 results
The paper introduces Trajectory-aware OPD (TOPD), a method that uses near-future trajectory information to improve On-Policy Distillation by accurately identifying and guiding true reasoning divergenc…
Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang +5 more
The paper introduces Canonical-Context On-Policy Distillation (CCOPD) to improve multi-turn language model performance by mitigating 'self-anchored drift,' ensuring consistent answers regardless of wh…
The paper introduces Trust-Region behavior Blending (TRB), a warmup method that improves on-policy distillation by replacing poor early student rollouts with teacher-aligned behavior policies, leading…
Kun Liang, Chenming Tang, Clive Bai, Weijie Liu +2 more
ADWIN introduces an adaptive window framework for on-policy distillation (OPD) that efficiently manages the supervision horizon by training on short, teacher-anchored prefixes while using delayed full…
Jiazhen Huang, Xiao Chen, Xiao Luo, Yong Dai +2 more
The paper proposes Skill-Conditioned Gated Self-Distillation (SGSD), a novel framework that uses retrieved, potentially noisy skills to guide LLM reasoning, achieving state-of-the-art performance on m…
Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang +4 more
OmniOPD introduces a logit-free, chunk-level distillation framework that improves on standard On-Policy Distillation by using semantic similarity and peak-entropy scheduling, achieving state-of-the-ar…
The paper identifies a linear predictive law linking the initial performance gap in on-policy self-distillation (OPSD) to the final performance improvement, allowing researchers to anticipate and tune…
Gaetan Narozniak, Gérard Biau, Rémi Munos, Ahmad Rammal +1 more
The paper introduces Feedback Distillation, a novel training method that uses a language model's privileged feedback to provide token-level supervision, significantly improving complex reasoning tasks…
Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li +1 more
The paper introduces Trust Region On-Policy Distillation (TrOPD), a robust method that stabilizes the on-policy distillation of large language models by restricting training to regions where teacher s…
Max Hartman, Vidhata Jayaraman, Moulik Choraria, Yash Savani +1 more
The paper introduces TraceGuard, a detectability-aware antidistillation method that identifies and poisons 'thought anchors'—sparsely critical sentences—to degrade student model learning without makin…
Can Jin, Jiakang Li, Rui Wu, Eddy Zhang +1 more
The paper introduces Weak-Critic Strong Oversight, a method where a weak model guides a strong model's self-improvement by providing non-misleading revision directions, leading to scalable oversight.
Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang +1 more
The OISD framework improves language model reasoning by distilling on-policy predictive signals from the final output layer to intermediate representations, leading to substantial improvements on math…
The paper proposes Distribution-Aligned Self-Distillation (DASD) to improve self-distillation by dynamically filtering high-perplexity tokens, thereby preserving useful logical knowledge while suppres…
This paper investigates the phenomenon of 'copying' in Distribution Matching Distillation (DMD), finding that high-dimensional distillation causes student models to spontaneously reproduce the teacher…
Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun +2 more
The paper introduces Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a novel method that internalizes temperature-based policy reheating into model parameters to combat entropy collapse in r…
ThinkSwitch introduces a low-compute co-training procedure that distills the reasoning benefit of large language models into weights, significantly improving performance on specific reasoning tasks.
Yang Li, Gongle Xue, Yijia Guo, Yuheng Yuan +2 more
The paper proposes CAST, an answer-free self-distillation method that enhances Group Relative Policy Optimization (GRPO) for verifiable rewards, allowing token-level advantage signals even when all sa…
Yaocheng Zhang, Jiajun Chai, Yuqian Fu, Songjun Tu +6 more
This paper proposes two horizon-control strategies, Progressive OPD (POPD) and Truncated OPD (TOPD), demonstrating that full rollouts are often unnecessary for On-Policy Distillation, leading to signi…
The paper introduces and evaluates bounded behavioral indistinguishability, showing that while LoRA distillation improves semantic similarity, it does not guarantee that the student model is behaviora…