Papers similar to 2605.28273

Similar to 2605.28273 · 20 results

cs.MA, cs.AI · Recent · May 29, 2026

Safe Equilibrium Policy Optimization for Strategic Agent Policies

Karthika Arumugam, Kiran Kumar Manku, Amit Dhanda

The paper introduces Safe Equilibrium Policy Optimization (σepo) to train language models for multi-agent strategic tasks, achieving improved safety and robustness across various game domains.

cs.LG, cs.AI · Recent · May 29, 2026

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more

The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…

cs.AI · Recent · May 28, 2026

Structure-Induced Information for Rerooting Levin Tree Search

Jake Tuero, Michael Buro, Laurent Orseau, Levi H. S. Lelis

The paper introduces a learned 'rerooter' mechanism to improve subgoal-based policy tree search, allowing scalable search in complex environments without the overhead of explicit subgoal generation.

cs.GT, cs.LG · Recent · Jun 4, 2026

DNQ: Deep Nash Q-Network for Partially Observable n-Player Games

Qintong Xie, Edward Koh, Xavier Cadet, Peter Chin

The paper proposes DNQ, a scalable solver-in-the-loop framework for training agents in multi-turn simultaneous bidding games by leveraging pairwise payoff estimation to approximate complex equilibrium…

cs.LG, cs.CR · Recent · Mar 20, 2026

NASimJax: GPU-Accelerated Policy Learning Framework for Penetration Testing

Raphael Simon, José Carrasquel, Wim Mees, Pieter Libin

The paper introduces NASimJax, a GPU-accelerated framework that significantly speeds up network simulation for reinforcement learning, enabling large-scale, realistic training for penetration testing.

cs.LG, cs.AI · Recent · May 28, 2026

HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed +1 more

The paper proposes Hysteretic Policy Optimization (HPO) and its adaptive variant (A-HPO) to stabilize reinforcement learning training in sparse-reward environments by better balancing positive and neg…

cs.AI, cs.MA · Recent · May 28, 2026

On the Geometry of Games and their Solvers

Yaqi Sun, Julian Ma, David Mguni

The paper proposes a unified framework that maps the geometry of games to effective solver dynamics, suggesting that solvability is governed by continuous structural properties rather than discrete cl…

cs.AI, cs.LG · Recent · Jun 1, 2026

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

Xiwen Chen, Wenhui Zhu, Jingjing Wang, Peijie Qiu +12 more

S-SPPO introduces a dual-space semantic calibration framework to stabilize Self-Play Preference Optimization (SPPO), preventing policy degeneration when preference oracles assign overly confident wins…

cs.LG, stat.ML · Recent · Jun 1, 2026

Minimax-Optimal Policy Regret in Partially Observable Markov Games

Raman Arora

The paper develops an optimistic maximum-likelihood algorithm that achieves $\tilde{O}(\sqrt{T})$ policy regret for sequential decision-making in partially observable Markov games against adaptive oppo…

cs.AI · Recent · May 28, 2026

PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game?

Dongdong Hua, Yifei Sun, Renhong Huang, Feng Gao +2 more

The paper introduces PTCG-Bench, a new benchmark using the Pokémon TCG to evaluate LLM agents' strategic decision-making and ability to self-evolve, finding that sustained self-evolution remains chall…

cs.AI, cs.LG · Recent · May 27, 2026

Differentiable Belief-based Opponent Shaping

Aarav G Sane, Karthik Sivachandran, Rohan Paleja

The paper proposes D-BOS, a novel differentiable method that shapes opponent behavior by directly manipulating the opponent's inferred belief state, outperforming existing techniques in multi-agent ga…

cs.MA, cs.AI, cs.LG · Recent · May 28, 2026

Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas

Víctor Gallego

The paper introduces an outer-loop AI agent that autonomously redesigns LLM policy-synthesis pipelines for multi-agent social dilemmas, demonstrating that the optimal pipeline structure depends critic…

cs.LG, cs.AI · Recent · May 30, 2026

Interpretable Policy Distillation for Power Grid Topology Control

Aleksandra Dmitruka, Karlis Freivalds

This paper demonstrates that a complex deep reinforcement learning policy for power grid control can be successfully distilled into a lightweight, auditable decision tree and random forest surrogate t…

cs.LG, cs.AI · Recent · May 29, 2026

When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

Stephane Hatgis-Kessell, Emma Brunskill

The paper introduces Prompted Policy Optimization (PromptPO), an LLM-based method that successfully optimizes policies for various sequential RL tasks, demonstrating that LLMs can replace classical RL…

cs.CR · Recent · May 10, 2026

Operationalizing Cybersecurity Governance for Mitigation Planning with Attack-Path Modeling and Reinforcement Learning

Philip Huff, Dakota Dale, Harshith Guduru, Rohan Singh +1 more

The paper proposes a system that operationalizes cybersecurity governance frameworks by integrating them with attack-path modeling and Deep Reinforcement Learning to generate practical, resource-const…

cs.AI · Recent · May 30, 2026

Decoupled Behavioral Cloning for Scalable Inductive Generalization in RL from Specifications

Vignesh Subramanian, Subhajit Roy, Suguman Bansal

The paper proposes DIBS, a decoupled behavioral cloning approach that stabilizes inductive generalization in RL by separating task-specific policy learning from the evolution function, leading to impr…

cs.CL · Recent · May 30, 2026

Learning to Retrieve: Dual-Level Long-Term Memory for Text-to-SQL Agents

Yibo Wang, Nikki Lijing Kuang, Philip S. Yu, Zhewei Yao +1 more

The paper proposes MERIT, a dual-level, multi-horizon memory retrieval framework that significantly improves the performance of interactive text-to-SQL agents by providing both global and local memory…

cs.AI · Recent · May 27, 2026

Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning

Yi Wang, Haojie Lu, Zhaofan Zhang, Li Chen +1 more

This paper introduces MCTS-Guided Group Relative Policy Optimization (M-GRPO) to enhance LLM spatial reasoning by improving the decomposition of complex tasks into optimal sub-tasks.

cs.CR, cs.LG · Recent · May 22, 2026

Less Effort, Shorter Proofs: Reinforcement Learning for Security Protocol Analysis in Tamarin

Matthias Cosler, Cas Cremers, Bernd Finkbeiner, Mohamed Ghanem +1 more

The paper introduces a reinforcement learning framework, inspired by AlphaZero, to automate and improve the proof search process within the Tamarin protocol analysis tool, resulting in shorter and mor…

cs.GT, cs.CR, cs.LG · Recent · May 8, 2026

Differentially Private Auditing Under Strategic Response

Florian A. D. Burnat

This paper analyzes differential privacy auditing as a bilevel game, showing that naive audit designs fail to detect true harm when developers strategically respond, and proposes an optimal, single-le…
