~ similar to 2606.01708· 19 results
The paper introduces a learned 'rerooter' mechanism to improve subgoal-based policy tree search, allowing scalable search in complex environments without the overhead of explicit subgoal generation.
The paper introduces a new anytime-valid inference method to correct split selection in online decision trees, providing robust statistical guarantees for streaming data that existing methods lack.
The paper proposes a novel Bayesian framework to learn the optimal decision strategy for the stochastic shortest path problem by directly constructing the posterior beliefs for the action-value functi…
The paper introduces Nested Contextual Causal Bandits (NCCBs) to model multi-timescale sequential decisions and proposes a certified policy optimization method, NCTS, that provides quantifiable risk b…
Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang +3 more
The paper proposes BaSE, a multi-armed bandit approach, to optimally allocate a fixed budget of LLM calls across parallel evolutionary search trajectories, significantly improving mean fitness and rel…
The paper proposes an objective-wise reputation-market mechanism to dynamically calibrate and gate LLM-generated expert priors in multi-objective Bayesian optimization, showing that dynamic calibratio…
Johanna Menn, Miriam Kober, Paul Brunzema, David Stenger +1 more
The paper introduces local Preferential Bayesian Optimization (PBO) methods that adapt high-dimensional Bayesian Optimization techniques, such as trust-region and derivative-informed local search, to…
The paper analyzes the performance of an annealed softmax policy in a Bayesian bandit setting, proving that under specific prior conditions, it achieves near-optimal regret rates by effectively sampli…
Liad Erez, Fan Chen, Alon Cohen, Tomer Koren +3 more
The paper analyzes the sample complexity of contextual bandits in the $s$-sparse setting, achieving optimal sample bounds for identifying an $\epsilon$-optimal policy.
The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…
Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more
The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…
The paper introduces MINTS, a minimalist Bayesian framework that simplifies sequential decision-making by placing priors only on the optimum location, allowing for the incorporation of structural cons…
The paper develops an optimistic maximum-likelihood algorithm that achieves $ ilde{O}(\sqrt{T})$ policy regret for sequential decision-making in partially observable Markov games against adaptive oppo…
Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more
The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…
Zakk Heile, Hayden McTavish, Varun Babbar, Margo Seltzer +1 more
The paper introduces PRAXIS, a novel algorithm that efficiently approximates the computation of 'Rashomon sets' for decision trees, significantly reducing memory and runtime complexity.
The paper introduces Entropy-Cut Metropolis-Hastings, an efficient sampling method that uses next-token entropy to identify and resample from critical decision points in a reasoning trace, significant…
The paper introduces Tree structured Injection for Payloads (TIP), a novel black-box attack framework that reliably generates stealthy injection payloads to seize control of LLM agents utilizing the M…
Matthias Cosler, Cas Cremers, Bernd Finkbeiner, Mohamed Ghanem +1 more
The paper introduces a reinforcement learning framework, inspired by AlphaZero, to automate and improve the proof search process within the Tamarin protocol analysis tool, resulting in shorter and mor…
Ting Xu, Xu He, Yupu Lu, Jiankai Sun +3 more
The paper analyzes the entropy dynamics of Chain-of-Thought (CoT) reasoning, identifying a transition from an exploratory Uncertainty Region to a stable Confidence Region, which enables superior early…