~ similar to 2605.31455· 20 results
DriftQL introduces a novel, efficient offline RL method that combines a drift-based behavioral regularizer with critic-driven policy improvement, achieving state-of-the-art performance while maintaini…
The paper introduces Drifting Preference Optimization (DrPO), an efficient online method for preference finetuning one-step text-to-image generators that avoids complex gradient calculations and model…
This paper introduces Anchored Weight Decay (AWD), a regularization technique that effectively prevents prior-task forgetting during LLM fine-tuning with Evolution Strategies (ES), positioning ES as a…
Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li +2 more
The paper introduces SAVE, a framework that uses on-policy feedback and the value function to self-supervise and improve reward models, significantly enhancing RLHF performance across multiple benchma…
Jinhe Bi, Aniri, Minglai Yang, Xingcheng Zhou +8 more
EchoRL proposes a lightweight module to exploit valuable learning signals from advantage-degenerated rollouts in Reinforcement Learning with Verifiable Rewards (RLVR), significantly improving LLM post…
The paper introduces Q-ALIGN DT, a novel framework that improves conditioned sequence models by enforcing alignment between the input return-to-go (RTG) signal and the output policy's expected Q-value…
Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng +4 more
The paper proposes EKSFT, a selective fine-tuning method that masks high-entropy or high-KL divergence tokens during Supervised Fine-Tuning (SFT) to prevent distribution shift and improve subsequent R…
SCOPE introduces a data-free self-play framework that co-evolves a task-generating Challenger and a document-answering Solver, significantly improving open-ended performance on language models without…
The paper proposes In-Context Reward Adaptation, a transformer-based framework that uses in-context learning and auxiliary signals (like human response time) to robustly model diverse and unseen human…
Daize Dong, Junlin Chen, Haolong Jia, Jiawei Wu +8 more
The paper proposes Predictive Routing Replay (PR2) to stabilize reinforcement learning on Mixture of Experts (MoE) LLMs by predicting and incorporating short-horizon router evolution during training a…
Ran Liu, Min Yu, Mingqi Liu, Jianguo Jiang +6 more
The paper introduces AdvCL, a framework that repurposes adversarial perturbations as a geometric control signal to stabilize continual learning in large language models, significantly reducing forgett…
Xiaohang Tang, Keyue Jiang, Che Liu, Qifang Zhao +3 more
The paper proposes Guided Denoiser Self-Distillation (GDSD), a novel method that bypasses the use of likelihood surrogates (like ELBO) in RL for diffusion language models, achieving state-of-the-art p…
Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more
The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…
The paper introduces NaRA, a noise-aware LoRA technique that dynamically adapts fine-tuning parameters based on the noise level during diffusion, significantly improving the performance of Diffusion L…
Yixiu Mao, Yun Qu, Qi Wang, Heming Zou +1 more
The paper introduces Group Prioritized Off-Policy Optimization (POPO), a novel framework that efficiently accelerates RL finetuning for LLM reasoning by leveraging effective off-policy training batche…
The paper introduces and evaluates five parameter alignment strategies that significantly mitigate catastrophic forgetting when continually pretraining multilingual expert language models across multi…
The paper proposes a novel, explicitly exploratory iterative Nash Learning from Human Feedback (NLHF) algorithm that achieves strong regret bounds for optimizing LLMs based on complex, non-scalar huma…
Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao +10 more
MLEvolve is a novel self-evolving multi-agent framework that enables LLM agents to discover and optimize machine learning algorithms for complex, long-horizon tasks.
Yuchen Zhu, Jing Shi, Chongjian Ge, Hao Tan +8 more
FLARE is a systematic conversion framework that enables a single checkpoint to support both autoregressive (AR) and diffusion-style parallel decoding for hybrid-attention large language models, achiev…
The paper introduces CoRP, a gradient-free operator that consolidates the benefits of ensemble-based post-training methods into a single, deployable model update, significantly improving performance w…