Papers similar to 2606.00305

~ similar to 2606.00305· 20 results

cs.LGcs.CLRecentMay 31, 2026

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang +4 more

OmniOPD introduces a logit-free, chunk-level distillation framework that improves on standard On-Policy Distillation by using semantic similarity and peak-entropy scheduling, achieving state-of-the-ar…

View →

cs.LGcs.AIRecentMay 31, 2026

OPD+: Rethinking the Advantage Design for On-Policy Distillation

Hanyang Zhao, Haoxian Chen, Han Lin, Genta Indra Winata +2 more

The paper introduces OPD+, a corrected on-policy distillation framework that mathematically proves the bias of standard stop-gradient methods and improves the stability and performance of knowledge tr…

View →

cs.CVcs.CLRecentMay 30, 2026

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom +5 more

The paper proposes Visual Gradient Steering (VGS), a method that decomposes the distillation loss into language and visual components and steers the optimization to prioritize visual grounding, signif…

View →

cs.CLcs.AIRecentMay 28, 2026

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang +5 more

The paper introduces Canonical-Context On-Policy Distillation (CCOPD) to improve multi-turn language model performance by mitigating 'self-anchored drift,' ensuring consistent answers regardless of wh…

View →

cs.LGcs.AIRecentMay 27, 2026

ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

Kun Liang, Chenming Tang, Clive Bai, Weijie Liu +2 more

ADWIN introduces an adaptive window framework for on-policy distillation (OPD) that efficiently manages the supervision horizon by training on short, teacher-anchored prefixes while using delayed full…

View →

cs.AIRecentMay 29, 2026

Distilling LLM Feedback for Lean Theorem Proving

Gaetan Narozniak, Gérard Biau, Rémi Munos, Ahmad Rammal +1 more

The paper introduces Feedback Distillation, a novel training method that uses a language model's privileged feedback to provide token-level supervision, significantly improving complex reasoning tasks…

View →

cs.AIRecentMay 29, 2026

CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

Yang Li, Gongle Xue, Yijia Guo, Yuheng Yuan +2 more

The paper proposes CAST, an answer-free self-distillation method that enhances Group Relative Policy Optimization (GRPO) for verifiable rewards, allowing token-level advantage signals even when all sa…

View →

cs.LGcs.CLRecentMay 31, 2026

Trust Region On-Policy Distillation

Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li +1 more

The paper introduces Trust Region On-Policy Distillation (TrOPD), a robust method that stabilizes the on-policy distillation of large language models by restricting training to regions where teacher s…

View →

cs.LGcs.AIcs.CVRecentMay 27, 2026

OISD: On-Policy Internal Self-Distillation of Language Models

Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang +1 more

The OISD framework improves language model reasoning by distilling on-policy predictive signals from the final output layer to intermediate representations, leading to substantial improvements on math…

View →

cs.CLcs.AIRecentMay 29, 2026

Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

Yanjiang Liu, Jie Lou, Xinyan Guan, Yuqiu Ji +6 more

The paper introduces Lookahead Group Reward (&) to combat Supervision Fidelity Decay (SFD) in on-policy distillation, significantly improving student model performance on long reasoning tasks.

View →

cs.CLcs.AIRecentMay 27, 2026

Skill-Conditioned Gated Self-Distillation for LLM Reasoning

Jiazhen Huang, Xiao Chen, Xiao Luo, Yong Dai +2 more

The paper proposes Skill-Conditioned Gated Self-Distillation (SGSD), a novel framework that uses retrieved, potentially noisy skills to guide LLM reasoning, achieving state-of-the-art performance on m…

View →

cs.LGcs.AIRecentMay 28, 2026

LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation

Tianrun Yu, Kaixiang Zhao, Chih-Chun Chen, Amanda Hughes +4 more

LARK introduces a novel learnability-grounded approach for selecting reasoning trajectories, significantly improving the efficiency of reasoning distillation by prioritizing trajectories that the stud…

View →

cs.LGcs.AIRecentMay 28, 2026

A Predictive Law for On-Policy Self-Distillation From World Feedback

Tommy He, Jerome Sieber, Matteo Saponati

The paper identifies a linear predictive law linking the initial performance gap in on-policy self-distillation (OPSD) to the final performance improvement, allowing researchers to anticipate and tune…

View →

cs.CLRecentMay 30, 2026

Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation

Ruiqi Zhang, Lingxiang Wang, Hainan Zhang Zhiming Zheng

The paper proposes Distribution-Aligned Self-Distillation (DASD) to improve self-distillation by dynamically filtering high-perplexity tokens, thereby preserving useful logical knowledge while suppres…

View →

cs.LGcs.AIRecentMay 31, 2026

ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks

Dhruv Saini, Rohan Pandey

ThinkSwitch introduces a low-compute co-training procedure that distills the reasoning benefit of large language models into weights, significantly improving performance on specific reasoning tasks.

View →

cs.CLRecentMay 29, 2026

Are Full Rollouts Necessary for On-Policy Distillation?

Yaocheng Zhang, Jiajun Chai, Yuqian Fu, Songjun Tu +6 more

This paper proposes two horizon-control strategies, Progressive OPD (POPD) and Truncated OPD (TOPD), demonstrating that full rollouts are often unnecessary for On-Policy Distillation, leading to signi…

View →

cs.LGcs.AIRecentMay 29, 2026

Trust-Region Behavior Blending for On-Policy Distillation

Daniil Plyusov, Alexey Gorbatovski, Alexey Malakhov, Nikita Balagansky +3 more

The paper introduces Trust-Region behavior Blending (TRB), a warmup method that improves on-policy distillation by replacing poor early student rollouts with teacher-aligned behavior policies, leading…

View →

cs.AIcs.CLcs.CRRecentMay 27, 2026

Robust and Efficient Guardrails with Latent Reasoning

Siddharth Sai, Xiaofei Wen, Muhao Chen

The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving state-of-the-art safety performance with massiv…

View →

cs.AIcs.CLcs.CRRecentMay 27, 2026

Robust and Efficient Guardrails with Latent Reasoning

Siddharth Sai, Xiaofei Wen, Muhao Chen

The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving high safety performance with massive improvemen…

View →

cs.CLcs.LGRecentMay 30, 2026

Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning

Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun +2 more

The paper introduces Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a novel method that internalizes temperature-based policy reheating into model parameters to combat entropy collapse in r…

View →