ArXivCSExplorer
☆☆Bookmarks🏆RSSHow to UseFAQ
Built with and by Teycir Ben Soltane•
How to Use•FAQ•GitHub•arXiv.org•
Share:

~ similar to 2605.27811· 19 results

cs.GTcs.LGRecentJun 4, 2026

DNQ: Deep Nash Q-Network for Partially Observable n-Player Games

Qintong Xie, Edward Koh, Xavier Cadet, Peter Chin

The paper proposes DNQ, a scalable solver-in-the-loop framework for training agents in multi-turn simultaneous bidding games by leveraging pairwise payoff estimation to approximate complex equilibrium…

View →
cs.LGmath.OCmath.PREmpiricalRecentJun 9, 2026

Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides

Rahul Roy, Nur Sunar, Jayashankar M. Swaminathan

This paper studies a dynamic assortment problem on a two-sided service platform with incomplete information and heterogeneous customers, and develops a data-driven algorithm to learn parameters and op…

View →
cs.LGmath.OCmath.PREmpiricalRecentJun 9, 2026

Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides

Rahul Roy, Nur Sunar, Jayashankar M. Swaminathan

This paper studies a dynamic assortment problem on a two-sided service platform with incomplete information and heterogeneous customers, and develops a data-driven algorithm to learn parameters and op…

View →
cs.LGcs.AIRecentMay 29, 2026

The Terminal Representation in Reinforcement Learning

Amir Esterhuysen, Anders Jonsson

The paper introduces the Terminal Representation (TR), a novel, lower-dimensional, and structurally distinct formulation for encoding reward-weighted trajectories in RL that bypasses the need for eige…

View →
stat.MLcs.LGRecentJun 2, 2026

Resource-Constrained Adaptive Inference for Sequential Pricing

Ruicheng Ao, Jiashuo Jiang, David Simchi-Levi

The paper addresses the failure of fixed-price inference in resource-constrained pricing controllers by developing a target-aware controller that tracks local densities and provides certified, shrinki…

View →
cs.MAcs.AIRecentMay 28, 2026

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

Wenwu Li, Yuran Song, Mingze Zhao, Bo Jin +1 more

The paper proposes a novel temporal and structural credit assignment framework to efficiently optimize multi-agent LLM systems by decomposing the error signal and using targeted, discrete gradient upd…

View →
cs.LGcs.AIRecentMay 28, 2026

Scalable Constrained Multi-Agent Reinforcement Learning via State Augmentation and Consensus for Separable Dynamics

Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson

The paper proposes a scalable, distributed approach for constrained Multi-Agent Reinforcement Learning by using local consensus over dual variables to ensure global constraint satisfaction without cen…

View →
cs.IRcs.AIRecentMay 27, 2026

Fine-Tuned LLM as a Complementary Predictor Improving Ads System

Hui Yang, Daiwei He, Kevin Jiang, Taejin Park +19 more

The paper introduces a novel paradigm where a fine-tuned LLM acts as an ancillary predictor to forecast likely advertisers, significantly improving ad recommendation systems by augmenting candidate ge…

View →
cs.AIRecentMay 27, 2026

Global Policy-Space Response Oracles for Two-Player Zero-Sum Games

Junyu Zhang, Feihong Yang, Jian Wang, Chao Wang +1 more

The paper introduces Global PSRO, a novel deep reinforcement learning framework that efficiently approximates Nash equilibria in large two-player zero-sum games by intelligently expanding the strategy…

View →
stat.MLcs.AIcs.LGRecentMay 28, 2026

Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles

Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar

This paper analyzes Best-of-$N$ preference data, deriving explicit reward targets for independent-reference variants and establishing design principles for choosing $N$ and the base distribution to op…

View →
cs.CLcs.AIRecentJun 1, 2026

AlphaToken: Decoupling Adaptation and Stability for Path-Aware Response Token Valuation in LLM Post-Training

Liu Qing, Ou Wu, Yi Du

AlphaToken is a novel response token valuation framework that improves LLM post-training by decoupling token selection into task-specific adaptation and stability preservation, leading to better perfo…

View →
cs.LGcs.AIRecentMay 27, 2026

Learning Theory of the SVRG: Generalization and Convergence Analysis

Yunwen Lei, Zimeng Wang, Xiaoming Yuan

This paper provides the first non-vacuous generalization analysis for the Stochastic Variance Reduced Gradient (SVRG) method by establishing sharp, data-dependent algorithmic stability bounds, thereby…

View →
cs.LGcs.AIRecentMay 29, 2026

DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning

Yujie Wang, Siwei Chen, Longzan Luo, Xinyi Liu +3 more

The paper proposes DARTS, a distribution-aware active rollout trajectory shaping method that fundamentally accelerates LLM reinforcement learning by actively shaping the long-tail response distributio…

View →
stat.MLcs.LGmath.STRecentJun 3, 2026

Bayesian learning for the stochastic shortest path problem

Chon Wai Ho, Sumeetpal S. Singh, Jiaqi Guo

The paper proposes a novel Bayesian framework to learn the optimal decision strategy for the stochastic shortest path problem by directly constructing the posterior beliefs for the action-value functi…

View →
cs.LGcs.AIRecentMay 29, 2026

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

Jonathan Colaço Carr, Prakash Panangaden, Doina Precup, Benjamin Van Roy

The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…

View →
cs.AIcs.LGecon.THRecentMay 31, 2026

Prospect-Theory Behavior from Bellman Optimality in MDPs with Catastrophic States

Yujiao Chen

This paper shows that standard optimal control in Markov Decision Processes (MDPs) with an absorbing catastrophic state naturally generates behavioral signatures mimicking prospect theory, even withou…

View →
cs.LGcs.AIRecentMay 29, 2026

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more

The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…

View →
cs.AIRecentMay 27, 2026

AlphaTransit: Learning to Design City-scale Transit Routes

Bibek Poudel, Sai Swaminathan, Weizi Li

AlphaTransit introduces a novel search-based planning framework that combines Monte Carlo Tree Search (MCTS) with a neural policy-value network to efficiently design high-quality, city-scale bus trans…

View →
cs.LGcs.AIRecentMay 29, 2026

When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

Stephane Hatgis-Kessell, Emma Brunskill

The paper introduces Prompted Policy Optimization (PromptPO), an LLM-based method that successfully optimizes policies for various sequential RL tasks, demonstrating that LLMs can replace classical RL…

View →