~ similar to 2606.02218· 20 results
Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more
The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…
The paper proposes Hysteretic Policy Optimization (HPO) and its adaptive variant (A-HPO) to stabilize reinforcement learning training in sparse-reward environments by better balancing positive and neg…
Yixiu Mao, Yun Qu, Qi Wang, Heming Zou +1 more
The paper introduces Group Prioritized Off-Policy Optimization (POPO), a novel framework that efficiently accelerates RL finetuning for LLM reasoning by leveraging effective off-policy training batche…
Yaocheng Zhang, Jiajun Chai, Yuqian Fu, Songjun Tu +6 more
This paper proposes two horizon-control strategies, Progressive OPD (POPD) and Truncated OPD (TOPD), demonstrating that full rollouts are often unnecessary for On-Policy Distillation, leading to signi…
Saeid Jamshidi, Negar Shahabi, Foutse Khomh, Carol Fung +1 more
The paper proposes a two-timescale governance framework using a multi-agent LLM to safely update and guide RL agents for SDN-IoT defense, significantly improving performance and stability under advers…
Yujie Wang, Siwei Chen, Longzan Luo, Xinyi Liu +3 more
The paper proposes DARTS, a distribution-aware active rollout trajectory shaping method that fundamentally accelerates LLM reinforcement learning by actively shaping the long-tail response distributio…
This paper demonstrates that a complex deep reinforcement learning policy for power grid control can be successfully distilled into a lightweight, auditable decision tree and random forest surrogate t…
The paper proposes a scalable, distributed approach for constrained Multi-Agent Reinforcement Learning by using local consensus over dual variables to ensure global constraint satisfaction without cen…
The paper develops an optimistic maximum-likelihood algorithm that achieves $ ilde{O}(\sqrt{T})$ policy regret for sequential decision-making in partially observable Markov games against adaptive oppo…
Zelin He, Haotian Lin, Boran Han, Wei Zhu +5 more
ReSkill is an RL-in-the-loop framework that reconciles skill creation and policy optimization by automatically creating, testing, and refining modular skills alongside the agent's policy learning, lea…
Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang +1 more
The OISD framework improves language model reasoning by distilling on-policy predictive signals from the final output layer to intermediate representations, leading to substantial improvements on math…
The paper introduces Prompted Policy Optimization (PromptPO), an LLM-based method that successfully optimizes policies for various sequential RL tasks, demonstrating that LLMs can replace classical RL…
The paper introduces REFT, a novel method that diversifies rollouts by sampling the first token after the reasoning marker, significantly improving performance in Reinforcement Learning with Verifiabl…
Minghui Zheng, Hongxu Chen, Huimin Ren, Hongsheng Xin +7 more
HMPO introduces a single-stage, cost-effective reinforcement learning framework that achieves significant token compression of Chain-of-Thought reasoning with minimal loss of accuracy, applicable acro…
The paper introduces Safe Equilibrium Policy Optimization (σepo{}) to train language models for multi-agent strategic tasks, achieving improved safety and robustness across various game domains.
The paper introduces Posterior Hybrid Bayesian Belief (PhyB), a novel framework that reformulates policy optimization in Bayesian Offline RL by approximating expectations as a convex combination over…
This paper investigates how on-policy Reinforcement Learning (RL) affects LLM safety, finding that safety training modulates harmful misalignment, but the direction of this effect is highly dependent…
Abhijit Chakraborty, Suddhasvatta Das, Yash Shah, Vivek Gupta +1 more
TIMEGATE introduces a resource-aware policy layer that manages continual ML adaptation by dynamically budgeting time and evaluation resources, achieving significant compute and energy savings without…
Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun +2 more
The paper introduces Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a novel method that internalizes temperature-based policy reheating into model parameters to combat entropy collapse in r…
Christian Scherer, Joe Watson, Theo Gruner, Daniel Palenicek +2 more
The paper proposes a coherent inverse reinforcement learning (IRL) method to improve large behavior models for robotic control, achieving superior sample efficiency and performance on complex sparse m…