Papers similar to 2606.02351

~ similar to 2606.02351· 18 results

cs.AIcs.LGRecentJun 1, 2026

Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization

The paper proposes an objective-wise reputation-market mechanism to dynamically calibrate and gate LLM-generated expert priors in multi-objective Bayesian optimization, showing that dynamic calibratio…

View →

cs.AIcs.LGRecentMay 30, 2026

Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief

Hongqiang Lin, Pengfei Wang, Nenggan Zheng

The paper introduces Posterior Hybrid Bayesian Belief (PhyB), a novel framework that reformulates policy optimization in Bayesian Offline RL by approximating expectations as a convex combination over…

View →

cs.LGcs.CVRecentJun 1, 2026

Drifting Preference Optimization for One-Step Generative Models

Zhou Jiang, Yandong Wen, Zhen Liu

The paper introduces Drifting Preference Optimization (DrPO), an efficient online method for preference finetuning one-step text-to-image generators that avoids complex gradient calculations and model…

View →

cs.LGcs.AIRecentMay 29, 2026

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

Jonathan Colaço Carr, Prakash Panangaden, Doina Precup, Benjamin Van Roy

The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…

View →

cs.CLcs.AIcs.LGRecentMay 28, 2026

Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits

Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang +3 more

The paper proposes BaSE, a multi-armed bandit approach, to optimally allocate a fixed budget of LLM calls across parallel evolutionary search trajectories, significantly improving mean fitness and rel…

View →

stat.MLcs.AIcs.LGRecentMay 28, 2026

Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles

Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar

This paper analyzes Best-of-$N$ preference data, deriving explicit reward targets for independent-reference variants and establishing design principles for choosing $N$ and the base distribution to op…

View →

cs.AIcs.LGstat.MLRecentMay 31, 2026

Transferring Information Across Interventions in Causal Bayesian Optimization

Mohammad Ali Javidian

The paper proposes graph-coupled causal Bayesian optimization, a method that improves efficiency by sharing information across related interventions through a shared set of causal parameters.

View →

cs.LGcs.AIRecentMay 29, 2026

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more

The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…

View →

stat.MLcs.LGmath.STRecentJun 3, 2026

Bayesian learning for the stochastic shortest path problem

Chon Wai Ho, Sumeetpal S. Singh, Jiaqi Guo

The paper proposes a novel Bayesian framework to learn the optimal decision strategy for the stochastic shortest path problem by directly constructing the posterior beliefs for the action-value functi…

View →

cs.LGcs.AIRecentMay 27, 2026

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Hongru Hou, Tiehua Mei, Denghui Geng, Jinhui Huang +4 more

The paper proposes ProRL, an effective Reinforcement Learning framework that rectifies gradient estimation deficiencies to optimize proactive recommendation paths, significantly outperforming existing…

View →

cs.IRcs.AIcs.CLRecentJun 2, 2026

Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation

Yuecheng Li, Zeyu Song, Jing Yao, Chi Lu +2 more

Taiji is a novel LLM-as-Enhancer framework that optimizes recommender systems by addressing the challenges of generating high-quality reasoning data and balancing semantic and ID-based rewards.

View →

cs.LGstat.MLRecentJun 2, 2026

Online Learning with Gradient-Variation Interval Regret

Yan-Feng Xie, Shuche Wang, Peng Zhao, Zhi-Hua Zhou

The paper proposes a novel online learning algorithm that achieves an interval regret bound scaling with gradient variation, providing strong theoretical guarantees for non-stationary environments.

View →

cs.LGcs.AIcs.NERecentJun 3, 2026

ParetoPilot: Zero-Surrogate Offline Multi-Objective Optimization via Infer-Perturb-Guide Diffusion

Ruiqing Sun, Sen Yang, Dawei Feng, Bo Ding +2 more

ParetoPilot introduces a novel zero-surrogate diffusion framework for offline multi-objective optimization, achieving state-of-the-art performance by directly guiding the generation process without re…

View →

cs.LGcs.AIRecentMay 31, 2026

Efficient Exploration for Iterative Nash Preference Optimization

Tianlong Nan, Xiaopeng Li, Christian Kroer, Tianyi Lin

The paper proposes a novel, explicitly exploratory iterative Nash Learning from Human Feedback (NLHF) algorithm that achieves strong regret bounds for optimizing LLMs based on complex, non-scalar huma…

View →

cs.LGcs.AIRecentMay 28, 2026

HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed +1 more

The paper proposes Hysteretic Policy Optimization (HPO) and its adaptive variant (A-HPO) to stabilize reinforcement learning training in sparse-reward environments by better balancing positive and neg…

View →

cs.LGcs.AIcs.DCRecentMay 29, 2026

Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences

Jabin Koo, Hoyoung Kim, Minwoo Jang, Jungseul Ok

The paper proposes FedVPA-GP, a federated learning framework that uses a Gumbel-Softmax prior and orthogonal loss to personalize LLM alignment by disentangling conflicting user preferences while maint…

View →

cs.LGcs.AIRecentJun 1, 2026

Two-Fidelity Best-Action Identification for Stochastic Minimax Tree

Peter Chen, Xi Chen

The paper proposes 2FFS, a two-fidelity tree-search algorithm that efficiently identifies the best action in stochastic minimax trees by adaptively combining cheap, biased heuristic evaluations with e…

View →

cs.LGcs.AIRecentMay 29, 2026

Annealed Softmax Greedy in Many-Armed Bayesian Bandits

William Overman, Mohsen Bayati

The paper analyzes the performance of an annealed softmax policy in a Bayesian bandit setting, proving that under specific prior conditions, it achieves near-optimal regret rates by effectively sampli…

View →