ArXivCSExplorer
Built by Teycir Ben Soltane

~ similar to 2605.28791 · 20 results

cs.AI · Recent · May 29, 2026

Distilling LLM Feedback for Lean Theorem Proving

Gaetan Narozniak, Gérard Biau, Rémi Munos, Ahmad Rammal +1 more

The paper introduces Feedback Distillation, a novel training method that uses a language model's privileged feedback to provide token-level supervision, significantly improving complex reasoning tasks…

cs.LG cs.AI cs.CV · Recent · May 27, 2026

OISD: On-Policy Internal Self-Distillation of Language Models

Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang +1 more

The OISD framework improves language model reasoning by distilling on-policy predictive signals from the final output layer to intermediate representations, leading to substantial improvements on math…

cs.AI · Recent · May 29, 2026

CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

Yang Li, Gongle Xue, Yijia Guo, Yuheng Yuan +2 more

The paper proposes CAST, an answer-free self-distillation method that enhances Group Relative Policy Optimization (GRPO) for verifiable rewards, allowing token-level advantage signals even when all sa…

cs.CL · Recent · May 30, 2026

Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation

Ruiqi Zhang, Lingxiang Wang, Hainan Zhang, Zhiming Zheng

The paper proposes Distribution-Aligned Self-Distillation (DASD) to improve self-distillation by dynamically filtering high-perplexity tokens, thereby preserving useful logical knowledge while suppres…

cs.CL cs.AI · Recent · May 29, 2026

Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

Yanjiang Liu, Jie Lou, Xinyan Guan, Yuqiu Ji +6 more

The paper introduces Lookahead Group Reward to combat Supervision Fidelity Decay (SFD) in on-policy distillation, significantly improving student model performance on long reasoning tasks.

cs.CL cs.AI · Recent · May 28, 2026

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang +5 more

The paper introduces Canonical-Context On-Policy Distillation (CCOPD) to improve multi-turn language model performance by mitigating 'self-anchored drift,' ensuring consistent answers regardless of wh…

cs.CV cs.CL · Recent · May 30, 2026

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom +5 more

The paper proposes Visual Gradient Steering (VGS), a method that decomposes the distillation loss into language and visual components and steers the optimization to prioritize visual grounding, signif…

cs.CL cs.LG · Recent · May 30, 2026

Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning

Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun +2 more

The paper introduces Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a novel method that internalizes temperature-based policy reheating into model parameters to combat entropy collapse in r…

cs.CL cs.AI · Recent · May 29, 2026

Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

Yuxuan Jiang, Francis Ferraro

The paper introduces Trajectory-aware OPD (TOPD), a method that uses near-future trajectory information to improve On-Policy Distillation by accurately identifying and guiding true reasoning divergenc…

cs.LG cs.AI cs.CL · Recent · Jun 3, 2026

Reinforcement Learning from Rich Feedback with Distributional DAgger

Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad

This paper proposes DistIL, a new imitation learning algorithm that uses distributional feedback to obtain stronger policy-improvement and regret guarantees.

cs.LG cs.CL · Recent · May 31, 2026

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang +4 more

OmniOPD introduces a logit-free, chunk-level distillation framework that improves on standard On-Policy Distillation by using semantic similarity and peak-entropy scheduling, achieving state-of-the-ar…

cs.LG cs.AI · Recent · May 29, 2026

Trust-Region Behavior Blending for On-Policy Distillation

Daniil Plyusov, Alexey Gorbatovski, Alexey Malakhov, Nikita Balagansky +3 more

The paper introduces Trust-Region Behavior Blending (TRB), a warmup method that improves on-policy distillation by replacing poor early student rollouts with teacher-aligned behavior policies, leading…

cs.AI · Recent · May 27, 2026

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

Linas Nasvytis, Simon Jerome Han, Ben Prystawski, Satchel Grant +2 more

The paper introduces Contrastive Reflection (CORE), a novel non-parametric method that rapidly improves language model reasoning by distilling contrasts between successful and unsuccessful problem att…

cs.LG cs.AI · Recent · May 27, 2026

ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

Kun Liang, Chenming Tang, Clive Bai, Weijie Liu +2 more

ADWIN introduces an adaptive window framework for on-policy distillation (OPD) that efficiently manages the supervision horizon by training on short, teacher-anchored prefixes while using delayed full…

cs.LG cs.AI · Recent · May 28, 2026

A Predictive Law for On-Policy Self-Distillation From World Feedback

Tommy He, Jerome Sieber, Matteo Saponati

The paper identifies a linear predictive law linking the initial performance gap in on-policy self-distillation (OPSD) to the final performance improvement, allowing researchers to anticipate and tune…

cs.AI cs.CL cs.LG · Recent · May 28, 2026

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

Yang Ouyang, Shuhang Lin, Jung-Eun Kim

DenseSteer is a training-free inference-time framework that improves the math reasoning capabilities of small language models by steering their internal representations toward a 'Dense Reasoning' patt…

cs.AI · Recent · May 30, 2026

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

Jiakang Li, Guanyu Zhu, Can Jin, Chenxi Huang +7 more

The paper introduces Latent Reward Steering (LRS), an adaptive inference-time framework that implicitly improves the reasoning ability of LLMs by guiding the model's internal latent states based on a…

cs.AI cs.CL · Recent · May 28, 2026

GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents

Johannes Moll, Jean-Philippe Corbeil, Jiazhen Pan, Martin Hadamitzky +3 more

GRASP introduces a gated, regression-aware framework for improving LLM agents by ensuring that every proposed skill edit improves performance on a balanced probe without degrading previously learned c…

cs.CL · Recent · May 31, 2026

On the Generalization Gap in Self-Evolving Language Model Reasoning

Zhenting Qi, Susanna Maria Baby, Stefanie Anna Baby, Kan Yuan +4 more

The paper investigates the limits of self-evolution in LLM reasoning under closed-loop settings, finding that while self-improvement is significant, it consistently falls short of perfect oracle super…
