20 results for “non-stationary multi-armed bandit”
CS papers onlyHybrid search: Keyword + semantic, ranked by combined score.ⓘ
Want pure semantic search? Try claim verification →
This paper studies a dynamic assortment problem on a two-sided service platform with incomplete information and heterogeneous customers, and develops a data-driven algorithm to learn parameters and op…
Liad Erez, Fan Chen, Alon Cohen, Tomer Koren +3 more
The paper analyzes the sample complexity of contextual bandits in the $s$-sparse setting, achieving optimal sample bounds for identifying an $\epsilon$-optimal policy.
The paper introduces MINTS, a minimalist Bayesian framework that simplifies sequential decision-making by placing priors only on the optimum location, allowing for the incorporation of structural cons…
The paper proposes an algorithm for the extensive-form bandit problem that achieves $ ilde{O}(rac{ ext{total actions} imes ext{strategies} imes ext{trials}}{ ext{epsilon}})$ regret while satisfyi…
The paper analyzes the performance of an annealed softmax policy in a Bayesian bandit setting, proving that under specific prior conditions, it achieves near-optimal regret rates by effectively sampli…
The paper develops an optimistic maximum-likelihood algorithm that achieves $ ilde{O}(\sqrt{T})$ policy regret for sequential decision-making in partially observable Markov games against adaptive oppo…
The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…
The paper proposes a novel online learning algorithm that achieves an interval regret bound scaling with gradient variation, providing strong theoretical guarantees for non-stationary environments.
Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang +3 more
The paper proposes BaSE, a multi-armed bandit approach, to optimally allocate a fixed budget of LLM calls across parallel evolutionary search trajectories, significantly improving mean fitness and rel…
The paper proposes 2FFS, a two-fidelity tree-search algorithm that efficiently identifies the best action in stochastic minimax trees by adaptively combining cheap, biased heuristic evaluations with e…
The paper introduces Nested Contextual Causal Bandits (NCCBs) to model multi-timescale sequential decisions and proposes a certified policy optimization method, NCTS, that provides quantifiable risk b…
Johanna Menn, Miriam Kober, Paul Brunzema, David Stenger +1 more
The paper introduces local Preferential Bayesian Optimization (PBO) methods that adapt high-dimensional Bayesian Optimization techniques, such as trust-region and derivative-informed local search, to…
This paper introduces Repeated Policy Regret (RP-Regret), a novel game-theoretic metric for analyzing regret in repeated games with adaptive opponents, and proposes algorithms to minimize it.
The paper introduces the Terminal Representation (TR), a novel, lower-dimensional, and structurally distinct formulation for encoding reward-weighted trajectories in RL that bypasses the need for eige…
Daize Dong, Junlin Chen, Haolong Jia, Jiawei Wu +8 more
The paper proposes Predictive Routing Replay (PR2) to stabilize reinforcement learning on Mixture of Experts (MoE) LLMs by predicting and incorporating short-horizon router evolution during training a…
Yike Zhao, Onno Eberhard, Malek Khammassi, Ali H. Sayed +1 more
This paper theoretically justifies the strong performance of linear recurrent neural networks as memory units in partially observable reinforcement learning by constructing specific linear filters tha…
The paper introduces a new anytime-valid inference method to correct split selection in online decision trees, providing robust statistical guarantees for streaming data that existing methods lack.
AdaKoop introduces an efficient streaming algorithm that models complex nonlinear dynamics from nonstationary data streams by leveraging the Koopman operator theory, achieving state-of-the-art accurac…
The paper analyzes a new class of asynchronous adaptive first-order optimization methods and proves their stochastic convergence rate is O(1/sqrt{t}) for non-convex functions.
Zhi Zhou, Ming Yang, Shi-Yu Tian, Kun-Yang Yu +2 more
The paper establishes the first theoretical framework for analyzing the learnability of Test-Time Adaptation (TTA) under non-stationary data streams by introducing Recovery Complexity, which quantifie…