~ similar to 2606.01249· 20 results
The paper introduces Trust-Region behavior Blending (TRB), a warmup method that improves on-policy distillation by replacing poor early student rollouts with teacher-aligned behavior policies, leading…
Hanyang Zhao, Haoxian Chen, Han Lin, Genta Indra Winata +2 more
The paper introduces OPD+, a corrected on-policy distillation framework that mathematically proves the bias of standard stop-gradient methods and improves the stability and performance of knowledge tr…
The paper introduces Trajectory-aware OPD (TOPD), a method that uses near-future trajectory information to improve On-Policy Distillation by accurately identifying and guiding true reasoning divergenc…
Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang +4 more
OmniOPD introduces a logit-free, chunk-level distillation framework that improves on standard On-Policy Distillation by using semantic similarity and peak-entropy scheduling, achieving state-of-the-ar…
Yaocheng Zhang, Jiajun Chai, Yuqian Fu, Songjun Tu +6 more
This paper proposes two horizon-control strategies, Progressive OPD (POPD) and Truncated OPD (TOPD), demonstrating that full rollouts are often unnecessary for On-Policy Distillation, leading to signi…
Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang +1 more
The OISD framework improves language model reasoning by distilling on-policy predictive signals from the final output layer to intermediate representations, leading to substantial improvements on math…
The paper identifies a linear predictive law linking the initial performance gap in on-policy self-distillation (OPSD) to the final performance improvement, allowing researchers to anticipate and tune…
Kun Liang, Chenming Tang, Clive Bai, Weijie Liu +2 more
ADWIN introduces an adaptive window framework for on-policy distillation (OPD) that efficiently manages the supervision horizon by training on short, teacher-anchored prefixes while using delayed full…
Can Jin, Jiakang Li, Rui Wu, Eddy Zhang +1 more
The paper introduces Weak-Critic Strong Oversight, a method where a weak model guides a strong model's self-improvement by providing non-misleading revision directions, leading to scalable oversight.
Yanjiang Liu, Jie Lou, Xinyan Guan, Yuqiu Ji +6 more
The paper introduces Lookahead Group Reward (&) to combat Supervision Fidelity Decay (SFD) in on-policy distillation, significantly improving student model performance on long reasoning tasks.
Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun +2 more
The paper introduces Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a novel method that internalizes temperature-based policy reheating into model parameters to combat entropy collapse in r…
Gaetan Narozniak, Gérard Biau, Rémi Munos, Ahmad Rammal +1 more
The paper introduces Feedback Distillation, a novel training method that uses a language model's privileged feedback to provide token-level supervision, significantly improving complex reasoning tasks…
Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang +5 more
The paper introduces Canonical-Context On-Policy Distillation (CCOPD) to improve multi-turn language model performance by mitigating 'self-anchored drift,' ensuring consistent answers regardless of wh…
Yang Li, Gongle Xue, Yijia Guo, Yuheng Yuan +2 more
The paper proposes CAST, an answer-free self-distillation method that enhances Group Relative Policy Optimization (GRPO) for verifiable rewards, allowing token-level advantage signals even when all sa…
Xiaohang Tang, Keyue Jiang, Che Liu, Qifang Zhao +3 more
The paper proposes Guided Denoiser Self-Distillation (GDSD), a novel method that bypasses the use of likelihood surrogates (like ELBO) in RL for diffusion language models, achieving state-of-the-art p…
The paper demonstrates that using on-policy distillation from a strong teacher model significantly improves the performance of compact Automatic Speech Recognition (ASR) models, achieving competitive…
Jiazhen Huang, Xiao Chen, Xiao Luo, Yong Dai +2 more
The paper proposes Skill-Conditioned Gated Self-Distillation (SGSD), a novel framework that uses retrieved, potentially noisy skills to guide LLM reasoning, achieving state-of-the-art performance on m…
Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom +5 more
The paper proposes Visual Gradient Steering (VGS), a method that decomposes the distillation loss into language and visual components and steers the optimization to prioritize visual grounding, signif…
DASH introduces a dual-branch distillation framework to effectively compress class-conditional diffusion models by independently supervising both score branches, significantly preserving guidance fide…
Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li +2 more
The paper introduces SAVE, a framework that uses on-policy feedback and the value function to self-supervise and improve reward models, significantly enhancing RLHF performance across multiple benchma…