Similar to 2606.00561 · 20 results
This paper proposes an Explainable Deep Reinforcement Learning (XRL) framework to optimize energy management in complex buildings, demonstrating that on-policy algorithms provide superior cost reducti…
The paper introduces a learned 'rerooter' mechanism to improve subgoal-based policy tree search, allowing scalable search in complex environments without the overhead of explicit subgoal generation.
Saeid Jamshidi, Negar Shahabi, Foutse Khomh, Carol Fung +1 more
The paper proposes a two-timescale governance framework using a multi-agent LLM to safely update and guide RL agents for SDN-IoT defense, significantly improving performance and stability under advers…
Yaocheng Zhang, Jiajun Chai, Yuqian Fu, Songjun Tu +6 more
This paper proposes two horizon-control strategies, Progressive OPD (POPD) and Truncated OPD (TOPD), demonstrating that full rollouts are often unnecessary for On-Policy Distillation, leading to signi…
Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun +2 more
The paper introduces Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a novel method that internalizes temperature-based policy reheating into model parameters to combat entropy collapse in r…
Yifei He, Rui Yang, Hao Bai, Tong Zhang +1 more
PRO-CUA introduces a process-reward optimization framework that enables efficient, step-level reinforcement learning for training computer use agents by decoupling environment interaction from policy…
Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang +1 more
The OISD framework improves language model reasoning by distilling on-policy predictive signals from the final output layer to intermediate representations, leading to substantial improvements on math…
The paper introduces Trust-Region behavior Blending (TRB), a warmup method that improves on-policy distillation by replacing poor early student rollouts with teacher-aligned behavior policies, leading…
The paper proposes a scalable, distributed approach for constrained Multi-Agent Reinforcement Learning by using local consensus over dual variables to ensure global constraint satisfaction without cen…
The paper proposes S3TS, a novel tree search algorithm that simultaneously handles both non-linear system models and explicit uncertainties (scenarios) for advanced energy planning, achieving near-opt…
Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji +3 more
SafeMCP is a server-side defense plugin that uses look-ahead reasoning to proactively filter and constrain tool acquisition for LLM agents, thereby mitigating catastrophic risks associated with expand…
Hanyang Zhao, Haoxian Chen, Han Lin, Genta Indra Winata +2 more
The paper introduces OPD+, a corrected on-policy distillation framework that mathematically proves the bias of standard stop-gradient methods and improves the stability and performance of knowledge tr…
Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more
The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…
Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li +1 more
The paper introduces Trust Region On-Policy Distillation (TrOPD), a robust method that stabilizes the on-policy distillation of large language models by restricting training to regions where teacher s…
EnergyMamba proposes an uncertainty-aware, graph-enhanced selective state space model to significantly improve both the accuracy and reliability of energy consumption prediction by explicitly modeling…
The paper introduces C-MADF, a causally constrained multi-agent framework that significantly reduces false positives in autonomous cyber defense by restricting response actions to structurally consist…
Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang +4 more
OmniOPD introduces a logit-free, chunk-level distillation framework that improves on standard On-Policy Distillation by using semantic similarity and peak-entropy scheduling, achieving state-of-the-ar…
The paper proposes a theoretical framework, called constraint-coupled reasoning, to make AI models less susceptible to knowledge distillation by coupling high-level capabilities to internal stability…
Zihang Li, Rui Zhou, Yingcheng Shi, Wenhan Yu +7 more
ESPO is a novel reinforcement learning algorithm that detects trajectory failure in large language models and terminates rollouts early, significantly improving performance on mathematical reasoning b…
The paper introduces Aethelgard, a novel four-layer adaptive governance framework that enforces least privilege by learning the minimum necessary capabilities for autonomous AI agents based on their i…