~ similar to 2605.28273· 20 results
The paper introduces Safe Equilibrium Policy Optimization (σepo{}) to train language models for multi-agent strategic tasks, achieving improved safety and robustness across various game domains.
Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more
The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…
The paper introduces a learned 'rerooter' mechanism to improve subgoal-based policy tree search, allowing scalable search in complex environments without the overhead of explicit subgoal generation.
The paper proposes DNQ, a scalable solver-in-the-loop framework for training agents in multi-turn simultaneous bidding games by leveraging pairwise payoff estimation to approximate complex equilibrium…
The paper introduces NASimJax, a GPU-accelerated framework that significantly speeds up network simulation for reinforcement learning, enabling large-scale, realistic training for penetration testing.
The paper proposes Hysteretic Policy Optimization (HPO) and its adaptive variant (A-HPO) to stabilize reinforcement learning training in sparse-reward environments by better balancing positive and neg…
The paper proposes a unified framework that maps the geometry of games to effective solver dynamics, suggesting that solvability is governed by continuous structural properties rather than discrete cl…
Xiwen Chen, Wenhui Zhu, Jingjing Wang, Peijie Qiu +12 more
S-SPPO introduces a dual-space semantic calibration framework to stabilize Self-Play Preference Optimization (SPPO), preventing policy degeneration when preference oracles assign overly confident wins…
The paper develops an optimistic maximum-likelihood algorithm that achieves $ ilde{O}(\sqrt{T})$ policy regret for sequential decision-making in partially observable Markov games against adaptive oppo…
Dongdong Hua, Yifei Sun, Renhong Huang, Feng Gao +2 more
The paper introduces PTCG-Bench, a new benchmark using the Pokémon TCG to evaluate LLM agents' strategic decision-making and ability to self-evolve, finding that sustained self-evolution remains chall…
The paper proposes D-BOS, a novel differentiable method that shapes opponent behavior by directly manipulating the opponent's inferred belief state, outperforming existing techniques in multi-agent ga…
The paper introduces an outer-loop AI agent that autonomously redesigns LLM policy-synthesis pipelines for multi-agent social dilemmas, demonstrating that the optimal pipeline structure depends critic…
This paper demonstrates that a complex deep reinforcement learning policy for power grid control can be successfully distilled into a lightweight, auditable decision tree and random forest surrogate t…
The paper introduces Prompted Policy Optimization (PromptPO), an LLM-based method that successfully optimizes policies for various sequential RL tasks, demonstrating that LLMs can replace classical RL…
Philip Huff, Dakota Dale, Harshith Guduru, Rohan Singh +1 more
The paper proposes a system that operationalizes cybersecurity governance frameworks by integrating them with attack-path modeling and Deep Reinforcement Learning to generate practical, resource-const…
The paper proposes DIBS, a decoupled behavioral cloning approach that stabilizes inductive generalization in RL by separating task-specific policy learning from the evolution function, leading to impr…
Yibo Wang, Nikki Lijing Kuang, Philip S. Yu, Zhewei Yao +1 more
The paper proposes MERIT, a dual-level, multi-horizon memory retrieval framework that significantly improves the performance of interactive text-to-SQL agents by providing both global and local memory…
Yi Wang, Haojie Lu, Zhaofan Zhang, Li Chen +1 more
This paper introduces MCTS-Guided Group Relative Policy Optimization (M-GRPO) to enhance LLM spatial reasoning by improving the decomposition of complex tasks into optimal sub-tasks.
Matthias Cosler, Cas Cremers, Bernd Finkbeiner, Mohamed Ghanem +1 more
The paper introduces a reinforcement learning framework, inspired by AlphaZero, to automate and improve the proof search process within the Tamarin protocol analysis tool, resulting in shorter and mor…
This paper analyzes differential privacy auditing as a bilevel game, showing that naive audit designs fail to detect true harm when developers strategically respond, and proposes an optimal, single-le…