~ similar to 2605.29645· 19 results
The paper introduces MINTS, a minimalist Bayesian framework that simplifies sequential decision-making by placing priors only on the optimum location, allowing for the incorporation of structural cons…
The paper proposes an algorithm for the extensive-form bandit problem that achieves $ ilde{O}(rac{ ext{total actions} imes ext{strategies} imes ext{trials}}{ ext{epsilon}})$ regret while satisfyi…
The paper proposes a novel online learning algorithm that achieves an interval regret bound scaling with gradient variation, providing strong theoretical guarantees for non-stationary environments.
The paper introduces Nested Contextual Causal Bandits (NCCBs) to model multi-timescale sequential decisions and proposes a certified policy optimization method, NCTS, that provides quantifiable risk b…
This paper introduces survey sampling techniques to estimate or minimize empirical pairwise loss functions, showing that targeting informative pairs significantly reduces computational cost while main…
The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…
This paper analyzes Best-of-$N$ preference data, deriving explicit reward targets for independent-reference variants and establishing design principles for choosing $N$ and the base distribution to op…
The paper analyzes the performance of an annealed softmax policy in a Bayesian bandit setting, proving that under specific prior conditions, it achieves near-optimal regret rates by effectively sampli…
The paper develops an optimistic maximum-likelihood algorithm that achieves $ ilde{O}(\sqrt{T})$ policy regret for sequential decision-making in partially observable Markov games against adaptive oppo…
Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more
The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…
The paper proposes 2FFS, a two-fidelity tree-search algorithm that efficiently identifies the best action in stochastic minimax trees by adaptively combining cheap, biased heuristic evaluations with e…
This paper studies a dynamic assortment problem on a two-sided service platform with incomplete information and heterogeneous customers, and develops a data-driven algorithm to learn parameters and op…
This paper studies a dynamic assortment problem on a two-sided service platform with incomplete information and heterogeneous customers, and develops a data-driven algorithm to learn parameters and op…
This paper introduces Repeated Policy Regret (RP-Regret), a novel game-theoretic metric for analyzing regret in repeated games with adaptive opponents, and proposes algorithms to minimize it.
Zakk Heile, Hayden McTavish, Varun Babbar, Margo Seltzer +1 more
The paper introduces PRAXIS, a novel algorithm that efficiently approximates the computation of 'Rashomon sets' for decision trees, significantly reducing memory and runtime complexity.
Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang +3 more
The paper proposes BaSE, a multi-armed bandit approach, to optimally allocate a fixed budget of LLM calls across parallel evolutionary search trajectories, significantly improving mean fitness and rel…
The paper provides a tight, transparent, and closed-form analysis of the trade-off function for Differentially Private SGD using random shuffling, significantly improving upon previous methods and est…
The paper proposes a novel Bayesian framework to learn the optimal decision strategy for the stochastic shortest path problem by directly constructing the posterior beliefs for the action-value functi…
Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas +6 more
The paper proposes a novel RL framework that naturally induces diverse agent behavior by reformulating the objective to treat the reward as a distribution over functions, making diversity a rational r…