20 results for “Bandit algorithms”
CS papers only. Hybrid search: keyword + semantic, ranked by combined score.
This paper studies a dynamic assortment problem on a two-sided service platform with incomplete information and heterogeneous customers, and develops a data-driven algorithm to learn parameters and op…
Liad Erez, Fan Chen, Alon Cohen, Tomer Koren +3 more
The paper analyzes the sample complexity of contextual bandits in the $s$-sparse setting, achieving optimal sample bounds for identifying an $\epsilon$-optimal policy.
The paper introduces MINTS, a minimalist Bayesian framework that simplifies sequential decision-making by placing priors only on the optimum location, allowing for the incorporation of structural cons…
The paper proposes an algorithm for the extensive-form bandit problem that achieves $\tilde{O}(\frac{\text{total actions} \times \text{strategies} \times \text{trials}}{\text{epsilon}})$ regret while satisfyi…
The paper analyzes the performance of an annealed softmax policy in a Bayesian bandit setting, proving that under specific prior conditions, it achieves near-optimal regret rates by effectively sampli…
The paper develops an optimistic maximum-likelihood algorithm that achieves $\tilde{O}(\sqrt{T})$ policy regret for sequential decision-making in partially observable Markov games against adaptive oppo…
Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang +3 more
The paper proposes BaSE, a multi-armed bandit approach, to optimally allocate a fixed budget of LLM calls across parallel evolutionary search trajectories, significantly improving mean fitness and rel…
The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…
Johanna Menn, Miriam Kober, Paul Brunzema, David Stenger +1 more
The paper introduces local Preferential Bayesian Optimization (PBO) methods that adapt high-dimensional Bayesian Optimization techniques, such as trust-region and derivative-informed local search, to…
The paper proposes 2FFS, a two-fidelity tree-search algorithm that efficiently identifies the best action in stochastic minimax trees by adaptively combining cheap, biased heuristic evaluations with e…
This paper improves the theoretical bounds for estimating discrete probability distributions using the $\ell_\infty$ norm, resolving several open questions in the field.
Zhenghua Bao, Fengya Tian, Chris Zhang, Zhenjun Chen +2 more
OrcaRouter is a production-ready LLM router that uses a hybrid offline-online learning approach to efficiently select the best large language model for an incoming query, achieving high accuracy at lo…
This paper analyzes Best-of-$N$ preference data, deriving explicit reward targets for independent-reference variants and establishing design principles for choosing $N$ and the base distribution to op…
The paper introduces ReMax, a novel objective function that naturally encourages stochastic exploration in policy gradient reinforcement learning by evaluating expected maximum returns over multiple s…
Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more
The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…
The paper introduces Nested Contextual Causal Bandits (NCCBs) to model multi-timescale sequential decisions and proposes a certified policy optimization method, NCTS, that provides quantifiable risk b…
This paper introduces Repeated Policy Regret (RP-Regret), a novel game-theoretic metric for analyzing regret in repeated games with adaptive opponents, and proposes algorithms to minimize it.
The paper proposes a novel Bayesian framework to learn the optimal decision strategy for the stochastic shortest path problem by directly constructing the posterior beliefs for the action-value functi…
This paper develops a policy-learning framework to optimally assign prediction tasks to multiple agents, considering individual agent expertise and capacity constraints, achieving systematic performan…
The paper proposes a novel online learning algorithm that achieves an interval regret bound scaling with gradient variation, providing strong theoretical guarantees for non-stationary environments.