~ similar to 2605.30903· 20 results
Christian Scherer, Joe Watson, Theo Gruner, Daniel Palenicek +2 more
The paper proposes a coherent inverse reinforcement learning (IRL) method to improve large behavior models for robotic control, achieving superior sample efficiency and performance on complex sparse m…
Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas +6 more
The paper proposes a novel RL framework that naturally induces diverse agent behavior by reformulating the objective to treat the reward as a distribution over functions, making diversity a rational r…
The paper introduces Prompted Policy Optimization (PromptPO), an LLM-based method that successfully optimizes policies for various sequential RL tasks, demonstrating that LLMs can replace classical RL…
The paper proposes DIBS, a decoupled behavioral cloning approach that stabilizes inductive generalization in RL by separating task-specific policy learning from the evolution function, leading to impr…
This paper analyzes Best-of-$N$ preference data, deriving explicit reward targets for independent-reference variants and establishing design principles for choosing $N$ and the base distribution to op…
The paper proposes a novel, explicitly exploratory iterative Nash Learning from Human Feedback (NLHF) algorithm that achieves strong regret bounds for optimizing LLMs based on complex, non-scalar huma…
The paper introduces Safe Equilibrium Policy Optimization (σepo{}) to train language models for multi-agent strategic tasks, achieving improved safety and robustness across various game domains.
The paper introduces the Terminal Representation (TR), a novel, lower-dimensional, and structurally distinct formulation for encoding reward-weighted trajectories in RL that bypasses the need for eige…
The paper introduces Posterior Hybrid Bayesian Belief (PhyB), a novel framework that reformulates policy optimization in Bayesian Offline RL by approximating expectations as a convex combination over…
The paper introduces ReMax, a novel objective function that naturally encourages stochastic exploration in policy gradient reinforcement learning by evaluating expected maximum returns over multiple s…
The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…
Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang +9 more
The paper proposes Skill-RM, a unified framework that treats reward modeling as an agentic task to consistently integrate diverse evaluation criteria, achieving superior performance over traditional m…
The paper proposes Hysteretic Policy Optimization (HPO) and its adaptive variant (A-HPO) to stabilize reinforcement learning training in sparse-reward environments by better balancing positive and neg…
This paper proposes a new imitation learning algorithm called DistIL that uses distributional feedback to improve policy improvement and regret guarantees.
Zizhe Chen, Jiqian Dong, Yizhou Tian, Garry Yang +3 more
This paper introduces Numca and Hista, two novel techniques that significantly improve state value estimation for LLM reinforcement learning, addressing the instability of standard critic approaches.
Youting Wang, Yuan Tang, Bowen Liu, Xuan Liu +1 more
The paper introduces a diagnostic-driven iterative refinement process for improving LLM-generated reward functions in sparse, structured reinforcement learning tasks, significantly boosting agent perf…
The paper introduces Cross-Model Entropy (CME), a novel label-free reward signal that uses an independent verifier model to assess the quality of a generator's output, significantly improving LLM perf…
The paper proposes a scalable, distributed approach for constrained Multi-Agent Reinforcement Learning by using local consensus over dual variables to ensure global constraint satisfaction without cen…
The paper proposes a novel framework combining behavior-invariant task representation learning and a Transformer-based world model to achieve robust generalization in offline meta-reinforcement learni…
Yuecheng Li, Zeyu Song, Jing Yao, Chi Lu +2 more
Taiji is a novel LLM-as-Enhancer framework that optimizes recommender systems by addressing the challenges of generating high-quality reasoning data and balancing semantic and ID-based rewards.