20 results for “policy learning”
CS papers onlyHybrid search: Keyword + semantic, ranked by combined score.ⓘ
Want pure semantic search? Try claim verification →
This paper develops a policy-learning framework to optimally assign prediction tasks to multiple agents, considering individual agent expertise and capacity constraints, achieving systematic performan…
Zelin He, Haotian Lin, Boran Han, Wei Zhu +5 more
ReSkill is an RL-in-the-loop framework that reconciles skill creation and policy optimization by automatically creating, testing, and refining modular skills alongside the agent's policy learning, lea…
Christian Scherer, Joe Watson, Theo Gruner, Daniel Palenicek +2 more
The paper proposes a coherent inverse reinforcement learning (IRL) method to improve large behavior models for robotic control, achieving superior sample efficiency and performance on complex sparse m…
The paper proposes a novel framework combining behavior-invariant task representation learning and a Transformer-based world model to achieve robust generalization in offline meta-reinforcement learni…
Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more
The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…
The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…
SPAR introduces a novel framework that rectifies action policies by performing local fine-tuning in a residual space anchored to a pure behavior cloning policy, achieving state-of-the-art performance…
The paper proposes DIBS, a decoupled behavioral cloning approach that stabilizes inductive generalization in RL by separating task-specific policy learning from the evolution function, leading to impr…
Renhao Zhang, Haotian Fu, Mingxi Jia, George Konidaris +2 more
The Parameterized Diffusion Policy (PDP) framework transforms diffusion models from general stochastic generators into precise, steerable tools for learning and adapting complex robotic behaviors by e…
The paper introduces Posterior Hybrid Bayesian Belief (PhyB), a novel framework that reformulates policy optimization in Bayesian Offline RL by approximating expectations as a convex combination over…
Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu +8 more
The paper proposes PaW, a co-training framework that uses standard RL rollouts to provide auxiliary world model supervision directly during policy training, significantly improving language agent perf…
The paper introduces a learned 'rerooter' mechanism to improve subgoal-based policy tree search, allowing scalable search in complex environments without the overhead of explicit subgoal generation.
DriftQL introduces a novel, efficient offline RL method that combines a drift-based behavioral regularizer with critic-driven policy improvement, achieving state-of-the-art performance while maintaini…
Junyu Zhang, Feihong Yang, Jian Wang, Chao Wang +1 more
The paper introduces Global PSRO, a novel deep reinforcement learning framework that efficiently approximates Nash equilibria in large two-player zero-sum games by intelligently expanding the strategy…
Can Jin, Jiakang Li, Rui Wu, Eddy Zhang +1 more
The paper introduces Weak-Critic Strong Oversight, a method where a weak model guides a strong model's self-improvement by providing non-misleading revision directions, leading to scalable oversight.
Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang +1 more
The OISD framework improves language model reasoning by distilling on-policy predictive signals from the final output layer to intermediate representations, leading to substantial improvements on math…
The paper identifies a linear predictive law linking the initial performance gap in on-policy self-distillation (OPSD) to the final performance improvement, allowing researchers to anticipate and tune…
The paper develops an optimistic maximum-likelihood algorithm that achieves $ ilde{O}(\sqrt{T})$ policy regret for sequential decision-making in partially observable Markov games against adaptive oppo…
The paper introduces Latent Policy Guardrail (LPG), a novel framework that efficiently enforces dynamic safety policies for LLMs by compressing complex policy deliberation into a small set of latent t…
Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li +2 more
The paper introduces SAVE, a framework that uses on-policy feedback and the value function to self-supervise and improve reward models, significantly enhancing RLHF performance across multiple benchma…