ArXivCSExplorer
Built by Teycir Ben Soltane

20 results for “Bandit algorithms”

CS papers only

Hybrid search: keyword + semantic, ranked by combined score.


cs.LG • math.OC • math.PR • Empirical • Recent • Jun 9, 2026

Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides

Rahul Roy, Nur Sunar, Jayashankar M. Swaminathan

This paper studies a dynamic assortment problem on a two-sided service platform with incomplete information and heterogeneous customers, and develops a data-driven algorithm to learn parameters and op…

cs.LG • cs.AI • stat.ML • Recent • May 28, 2026

The Sample Complexity of Multiclass and Sparse Contextual Bandits

Liad Erez, Fan Chen, Alon Cohen, Tomer Koren +3 more

The paper analyzes the sample complexity of contextual bandits in the $s$-sparse setting, achieving optimal sample bounds for identifying an $\epsilon$-optimal policy.

math.OC • cs.AI • cs.LG • Recent • Jun 1, 2026

MINTS: Minimalist Thompson Sampling

Kaizheng Wang

The paper introduces MINTS, a minimalist Bayesian framework that simplifies sequential decision-making by placing priors only on the optimum location, allowing for the incorporation of structural cons…

cs.CR • cs.LG • Recent • May 6, 2026

Differential Privacy in the Extensive-Form Bandit Problem

Stephen Pasteris, Rahul Savani, Theodore Turocy

The paper proposes an algorithm for the extensive-form bandit problem that achieves $\tilde{O}(\frac{\text{total actions} \times \text{strategies} \times \text{trials}}{\text{epsilon}})$ regret while satisfyi…

cs.LG • cs.AI • Recent • May 29, 2026

Annealed Softmax Greedy in Many-Armed Bayesian Bandits

William Overman, Mohsen Bayati

The paper analyzes the performance of an annealed softmax policy in a Bayesian bandit setting, proving that under specific prior conditions, it achieves near-optimal regret rates by effectively sampli…

cs.LG • stat.ML • Recent • Jun 1, 2026

Minimax-Optimal Policy Regret in Partially Observable Markov Games

Raman Arora

The paper develops an optimistic maximum-likelihood algorithm that achieves $\tilde{O}(\sqrt{T})$ policy regret for sequential decision-making in partially observable Markov games against adaptive oppo…

cs.CL • cs.AI • cs.LG • Recent • May 28, 2026

Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits

Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang +3 more

The paper proposes BaSE, a multi-armed bandit approach, to optimally allocate a fixed budget of LLM calls across parallel evolutionary search trajectories, significantly improving mean fitness and rel…

cs.LG • cs.AI • Recent • May 29, 2026

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

Jonathan Colaço Carr, Prakash Panangaden, Doina Precup, Benjamin Van Roy

The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…

cs.LG • stat.ML • Recent • Jun 1, 2026

Local Preferential Bayesian Optimization

Johanna Menn, Miriam Kober, Paul Brunzema, David Stenger +1 more

The paper introduces local Preferential Bayesian Optimization (PBO) methods that adapt high-dimensional Bayesian Optimization techniques, such as trust-region and derivative-informed local search, to…

cs.LG • cs.AI • Recent • Jun 1, 2026

Two-Fidelity Best-Action Identification for Stochastic Minimax Tree

Peter Chen, Xi Chen

The paper proposes 2FFS, a two-fidelity tree-search algorithm that efficiently identifies the best action in stochastic minimax trees by adaptively combining cheap, biased heuristic evaluations with e…

stat.ML • cs.AI • cs.LG • Recent • May 28, 2026

Improved Distribution Estimation in $\ell_\infty$

Doron Cohen, Aryeh Kontorovich, Yonatan Livshitz

This paper improves the theoretical bounds for estimating discrete probability distributions using the $\ell_\infty$ norm, resolving several open questions in the field.

cs.LG • cs.AI • cs.CL • Recent • May 29, 2026

OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning

Zhenghua Bao, Fengya Tian, Chris Zhang, Zhenjun Chen +2 more

OrcaRouter is a production-ready LLM router that uses a hybrid offline-online learning approach to efficiently select the best large language model for an incoming query, achieving high accuracy at lo…

stat.ML • cs.AI • cs.LG • Recent • May 28, 2026

Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles

Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar

This paper analyzes Best-of-$N$ preference data, deriving explicit reward targets for independent-reference variants and establishing design principles for choosing $N$ and the base distribution to op…

cs.LG • cs.AI • Recent • May 29, 2026

Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

Soichiro Nishimori, Paavo Parmas, Sotetsu Koyamada, Tadashi Kozuno +3 more

The paper introduces ReMax, a novel objective function that naturally encourages stochastic exploration in policy gradient reinforcement learning by evaluating expected maximum returns over multiple s…

cs.LG • cs.AI • Recent • May 29, 2026

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more

The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…

cs.AI • cs.LG • Recent • May 28, 2026

Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk

Tim Woydt, Paul-David Zuercher

The paper introduces Nested Contextual Causal Bandits (NCCBs) to model multi-timescale sequential decisions and proposes a certified policy optimization method, NCTS, that provides quantifiable risk b…

cs.LG • cs.AI • cs.GT • Recent • Jun 4, 2026

Regret Minimization with Adaptive Opponents in Repeated Games

Mingyang Liu, Asuman Ozdaglar, Tiancheng Yu, Kaiqing Zhang

This paper introduces Repeated Policy Regret (RP-Regret), a novel game-theoretic metric for analyzing regret in repeated games with adaptive opponents, and proposes algorithms to minimize it.

stat.ML • cs.LG • math.ST • Recent • Jun 3, 2026

Bayesian learning for the stochastic shortest path problem

Chon Wai Ho, Sumeetpal S. Singh, Jiaqi Guo

The paper proposes a novel Bayesian framework to learn the optimal decision strategy for the stochastic shortest path problem by directly constructing the posterior beliefs for the action-value functi…

cs.HC • cs.AI • Recent • May 27, 2026

Learning to Assign Prediction Tasks to Agents with Capacity Constraints

Shang Wu, Saatvik Kher, Padhraic Smyth

This paper develops a policy-learning framework to optimally assign prediction tasks to multiple agents, considering individual agent expertise and capacity constraints, achieving systematic performan…

cs.LG • stat.ML • Recent • Jun 2, 2026

Online Learning with Gradient-Variation Interval Regret

Yan-Feng Xie, Shuche Wang, Peng Zhao, Zhi-Hua Zhou

The paper proposes a novel online learning algorithm that achieves an interval regret bound scaling with gradient variation, providing strong theoretical guarantees for non-stationary environments.
