Similar to 2605.28293 · 19 results
Yifei He, Rui Yang, Hao Bai, Tong Zhang +1 more
PRO-CUA introduces a process-reward optimization framework that enables efficient, step-level reinforcement learning for training computer-use agents by decoupling environment interaction from policy…
Zihang Li, Rui Zhou, Yingcheng Shi, Wenhan Yu +7 more
ESPO is a novel reinforcement learning algorithm that detects trajectory failure in large language models and terminates rollouts early, significantly improving performance on mathematical reasoning b…
Yuecheng Li, Zeyu Song, Jing Yao, Chi Lu +2 more
Taiji is a novel LLM-as-Enhancer framework that optimizes recommender systems by addressing the challenges of generating high-quality reasoning data and balancing semantic and ID-based rewards.
The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…
Weizhi Zhang, Wooseong Yang, Yuxin Cui, Zhaohui Guo +8 more
The paper advocates for integrating explicit contextual feedback (like reviews and comments) into LLM-based recommender systems to achieve more personalized, transparent, and semantically aligned reco…
Yuchen Liu, Yingjie Feng, Lixiong Qin, Jiasi Chen +4 more
The paper introduces Graph-Distance Contribution Reward (GDCR) and Step Advantage Policy Optimization (SAPO) to provide fine-grained, step-level credit assignment for agentic search by modeling world…
Johanna Menn, Miriam Kober, Paul Brunzema, David Stenger +1 more
The paper introduces local Preferential Bayesian Optimization (PBO) methods that adapt high-dimensional Bayesian Optimization techniques, such as trust-region and derivative-informed local search, to…
Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen +6 more
The paper introduces PARL, a framework that learns personalized evaluation rubrics directly from raw user interaction histories to accurately assess how well LLM outputs align with subjective, user-sp…
Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang +3 more
The paper proposes SAAS, a novel RL framework that equips LLM agents with self-awareness to precisely regulate search behavior, significantly mitigating costly over-search without sacrificing accuracy…
The paper introduces Prompted Policy Optimization (PromptPO), an LLM-based method that successfully optimizes policies for various sequential RL tasks, demonstrating that LLMs can replace classical RL…
The paper introduces a quotient-DAG view to accurately estimate unordered slate propensities for off-policy evaluation, solving the nuisance variance and computational gap inherent in standard importa…
This paper analyzes Best-of-$N$ preference data, deriving explicit reward targets for independent-reference variants and establishing design principles for choosing $N$ and the base distribution to op…
Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng +1 more
SPADER is a novel reinforcement learning framework that addresses the challenges of Multi-Answer Question Answering by improving credit assignment and promoting diverse exploration during long-horizon…
Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing +1 more
The paper introduces 'reward bias substitution,' demonstrating that single-axis mitigations of reward model biases merely shift optimization pressure to correlated proxies, and proposes augmenting eva…
Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li +2 more
The paper introduces SAVE, a framework that uses on-policy feedback and the value function to self-supervise and improve reward models, significantly enhancing RLHF performance across multiple benchma…
Bin Chen, Xinye Liao, Yiming Liu, Xin Liao +1 more
The paper proposes Credit-Attenuated Privileged Feedback (CAPF), a training-time mechanism that uses verifier-side information to guide LLM search agents, significantly improving their performance on…
Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more
The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…
Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai +6 more
The paper introduces OpenWebRL, an open framework that enables training visual web agents using online multi-turn Reinforcement Learning directly on live websites, achieving state-of-the-art performan…
Daize Dong, Junlin Chen, Haolong Jia, Jiawei Wu +8 more
The paper proposes Predictive Routing Replay (PR2) to stabilize reinforcement learning on Mixture of Experts (MoE) LLMs by predicting and incorporating short-horizon router evolution during training a…