ArXivCSExplorer
Built by Teycir Ben Soltane

Similar to 2606.00970 · 19 results

cs.LG · cs.AI · Recent · May 29, 2026

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

Jonathan Colaço Carr, Prakash Panangaden, Doina Precup, Benjamin Van Roy

The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…

cs.MA · cs.AI · Recent · May 29, 2026

Safe Equilibrium Policy Optimization for Strategic Agent Policies

Karthika Arumugam, Kiran Kumar Manku, Amit Dhanda

The paper introduces Safe Equilibrium Policy Optimization (SEPO) to train language models for multi-agent strategic tasks, achieving improved safety and robustness across various game domains.

cs.LG · cs.AI · Recent · May 28, 2026

On Distributional Reinforcement Learning in Chaotic Dynamical Systems

James Rudd-Jones, Mirco Musolesi, María Pérez-Ortiz

The paper proposes using distributional Reinforcement Learning (RL) to stabilize learning in chaotic dynamical systems by optimizing the smooth evolution of the return distribution rather than individ…

cs.LG · cs.AI · Recent · May 29, 2026

Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

Soichiro Nishimori, Paavo Parmas, Sotetsu Koyamada, Tadashi Kozuno +3 more

The paper introduces ReMax, a novel objective function that naturally encourages stochastic exploration in policy gradient reinforcement learning by evaluating expected maximum returns over multiple s…

cs.LG · stat.ML · Recent · Jun 1, 2026

Minimax-Optimal Policy Regret in Partially Observable Markov Games

Raman Arora

The paper develops an optimistic maximum-likelihood algorithm that achieves $\tilde{O}(\sqrt{T})$ policy regret for sequential decision-making in partially observable Markov games against adaptive oppo…

stat.ML · cs.LG · math.ST · Recent · Jun 3, 2026

Bayesian learning for the stochastic shortest path problem

Chon Wai Ho, Sumeetpal S. Singh, Jiaqi Guo

The paper proposes a novel Bayesian framework to learn the optimal decision strategy for the stochastic shortest path problem by directly constructing the posterior beliefs for the action-value functi…

cs.AI · Recent · May 27, 2026

Global Policy-Space Response Oracles for Two-Player Zero-Sum Games

Junyu Zhang, Feihong Yang, Jian Wang, Chao Wang +1 more

The paper introduces Global PSRO, a novel deep reinforcement learning framework that efficiently approximates Nash equilibria in large two-player zero-sum games by intelligently expanding the strategy…

cs.AI · cs.LG · Recent · May 27, 2026

Differentiable Belief-based Opponent Shaping

Aarav G Sane, Karthik Sivachandran, Rohan Paleja

The paper proposes D-BOS, a novel differentiable method that shapes opponent behavior by directly manipulating the opponent's inferred belief state, outperforming existing techniques in multi-agent ga…

cs.CY · cs.AI · Recent · May 28, 2026

AI Loss of Control Incident Management: Response & Resilience

Ross Gruetzemacher

This paper introduces a foundational framework and taxonomy for managing catastrophic AI loss of control (LOC) incidents, providing a proportional guide for response based on the severity and recovera…

cs.AI · cs.LG · cs.LO · Recent · May 29, 2026

Robust Shielding for Safe Reinforcement Learning

Edwin Hamel-De le Court, Thom Badings, Alessandro Abate, Francesco Belardinelli +1 more

The paper introduces a novel shielding framework for Robust MDPs (RMDPs) that guarantees safety under worst-case transition probabilities, enabling safe reinforcement learning even when transition dyn…

cs.LG · cs.CR · Recent · Apr 14, 2026

Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design

Leon Eshuijs, Shihan Wang, Antske Fokkens

This paper investigates how on-policy Reinforcement Learning (RL) affects LLM safety, finding that safety training modulates harmful misalignment, but the direction of this effect is highly dependent…

cs.MA · cs.AI · cs.GT · Recent · May 28, 2026

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

Francisco León Zúñiga Bolívar

The study extends cooperative bias testing across diverse, next-generation LLMs, finding that provider identity is a stronger predictor of cooperative equilibrium than model generation, and that noise…

cs.CR · cs.MA · Recent · May 26, 2026

Control Physiology: An Agent-Based Model of FAIR-CAM Dynamics

Jack Jones, Laura Voicu

This paper introduces the first agent-based model for the FAIR-CAM framework, demonstrating that complex, dynamic control degradation and resource constraints lead to emergent security vulnerabilities…

stat.ML · cs.AI · cs.LG · Recent · May 28, 2026

Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles

Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar

This paper analyzes Best-of-$N$ preference data, deriving explicit reward targets for independent-reference variants and establishing design principles for choosing $N$ and the base distribution to op…

q-fin.GN · cs.CR · Recent · Apr 30, 2026

The Satoshi Overhang: Why the Bear Case is Bounded

Karl T. Ulrich

The paper analyzes the potential market impact of a large, unknown Bitcoin holder (the Satoshi overhang) and concludes that the mechanical downside risk is bounded, suggesting the terminal states are…

cs.CE · Recent · May 29, 2026

When Certainty Is Not Worth It: Capital Lock-Up and Settlement Discounting in Prediction Markets

Jonas Gebele, Florian Matthes

This paper shows that the pricing of outcomes in prediction markets is significantly influenced by the financial friction of delayed settlement, quantifying this effect using an annualized settlement…

cs.LG · cs.AI · Recent · May 29, 2026

Annealed Softmax Greedy in Many-Armed Bayesian Bandits

William Overman, Mohsen Bayati

The paper analyzes the performance of an annealed softmax policy in a Bayesian bandit setting, proving that under specific prior conditions, it achieves near-optimal regret rates by effectively sampli…

cs.LG · cs.AI · Recent · May 29, 2026

Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach

Kihyun Kim, Shripad Deshmukh, Nikos Vlassis, Jiawei Zhang

The paper proposes a feasible-reward-set framework to perform Inverse Reinforcement Learning (IRL) when data comes from multiple imperfect demonstrators, providing theoretical guarantees and practical…

cs.LG · cs.AI · Recent · Jun 2, 2026

Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas +6 more

The paper proposes a novel RL framework that naturally induces diverse agent behavior by reformulating the objective to treat the reward as a distribution over functions, making diversity a rational r…
