~ similar to 2606.00151· 20 results
Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more
The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…
The paper introduces a learned 'rerooter' mechanism to improve subgoal-based policy tree search, allowing scalable search in complex environments without the overhead of explicit subgoal generation.
The paper analyzes the performance of an annealed softmax policy in a Bayesian bandit setting, proving that under specific prior conditions, it achieves near-optimal regret rates by effectively sampli…
The paper introduces Prompted Policy Optimization (PromptPO), an LLM-based method that successfully optimizes policies for various sequential RL tasks, demonstrating that LLMs can replace classical RL…
The paper introduces REFT, a novel method that diversifies rollouts by sampling the first token after the reasoning marker, significantly improving performance in Reinforcement Learning with Verifiabl…
Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas +6 more
The paper proposes a novel RL framework that naturally induces diverse agent behavior by reformulating the objective to treat the reward as a distribution over functions, making diversity a rational r…
The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…
The paper develops an optimistic maximum-likelihood algorithm that achieves $ ilde{O}(\sqrt{T})$ policy regret for sequential decision-making in partially observable Markov games against adaptive oppo…
The paper proposes a feasible-reward-set framework to perform Inverse Reinforcement Learning (IRL) when data comes from multiple imperfect demonstrators, providing theoretical guarantees and practical…
Zihang Li, Rui Zhou, Yingcheng Shi, Wenhan Yu +7 more
ESPO is a novel reinforcement learning algorithm that detects trajectory failure in large language models and terminates rollouts early, significantly improving performance on mathematical reasoning b…
The paper introduces Safe Equilibrium Policy Optimization (σepo{}) to train language models for multi-agent strategic tasks, achieving improved safety and robustness across various game domains.
The paper proposes Hysteretic Policy Optimization (HPO) and its adaptive variant (A-HPO) to stabilize reinforcement learning training in sparse-reward environments by better balancing positive and neg…
The paper proposes DIBS, a decoupled behavioral cloning approach that stabilizes inductive generalization in RL by separating task-specific policy learning from the evolution function, leading to impr…
The paper introduces an uncertainty-aware framework that uses regulated expert advice to guide safe and efficient exploration for autonomous driving policies, significantly improving performance in co…
The paper introduces a novel shielding framework for Robust MDPs (RMDPs) that guarantees safety under worst-case transition probabilities, enabling safe reinforcement learning even when transition dyn…
Zelin He, Haotian Lin, Boran Han, Wei Zhu +5 more
ReSkill is an RL-in-the-loop framework that reconciles skill creation and policy optimization by automatically creating, testing, and refining modular skills alongside the agent's policy learning, lea…
Junyu Zhang, Feihong Yang, Jian Wang, Chao Wang +1 more
The paper introduces Global PSRO, a novel deep reinforcement learning framework that efficiently approximates Nash equilibria in large two-player zero-sum games by intelligently expanding the strategy…
The paper introduces Posterior Hybrid Bayesian Belief (PhyB), a novel framework that reformulates policy optimization in Bayesian Offline RL by approximating expectations as a convex combination over…
This paper proposes a new imitation learning algorithm called DistIL that uses distributional feedback to improve policy improvement and regret guarantees.
Chenhao Bai, Liqin Lu, Kaijun Wang, Hui Chen +4 more
This paper studies how to scale robust robot policies by expanding physical domains in a recoverable way.