ArXivCSExplorer
☆☆Bookmarks🏆RSSHow to UseFAQ
Built with and by Teycir Ben Soltane•
How to Use•FAQ•GitHub•arXiv.org•
Share:

~ similar to 2605.30619· 19 results

cs.LGcs.AIRecentMay 29, 2026

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

Jonathan Colaço Carr, Prakash Panangaden, Doina Precup, Benjamin Van Roy

The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…

View →
stat.MLcs.LGRecentJun 1, 2026

Doing well with less! On Sampling Techniques for Empirical Pairwise Loss Estimation/Minimization

Louise Davy, Stephan Clémençon, Charlotte Laclau

This paper introduces survey sampling techniques to estimate or minimize empirical pairwise loss functions, showing that targeting informative pairs significantly reduces computational cost while main…

View →
cs.LGmath.OCmath.PREmpiricalRecentJun 9, 2026

Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides

Rahul Roy, Nur Sunar, Jayashankar M. Swaminathan

This paper studies a dynamic assortment problem on a two-sided service platform with incomplete information and heterogeneous customers, and develops a data-driven algorithm to learn parameters and op…

View →
cs.LGmath.OCmath.PREmpiricalRecentJun 9, 2026

Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides

Rahul Roy, Nur Sunar, Jayashankar M. Swaminathan

This paper studies a dynamic assortment problem on a two-sided service platform with incomplete information and heterogeneous customers, and develops a data-driven algorithm to learn parameters and op…

View →
cs.LGcs.AIRecentMay 28, 2026

In-Context Reward Adaptation for Robust Preference Modeling

Zhenyu Sun, Zheng Xu, Ermin Wei

The paper proposes In-Context Reward Adaptation, a transformer-based framework that uses in-context learning and auxiliary signals (like human response time) to robustly model diverse and unseen human…

View →
cs.LGstat.MLRecentJun 1, 2026

Local Preferential Bayesian Optimization

Johanna Menn, Miriam Kober, Paul Brunzema, David Stenger +1 more

The paper introduces local Preferential Bayesian Optimization (PBO) methods that adapt high-dimensional Bayesian Optimization techniques, such as trust-region and derivative-informed local search, to…

View →
stat.MLcs.LGRecentJun 1, 2026

ShaplEIG: Bayesian Experimental Design for Shapley Value Estimation

David Rundel, Fabian Fumagalli, Maximilian Muschalik, Bernd Bischl +1 more

ShaplEIG introduces a Bayesian experimental design framework to efficiently and adaptively estimate Shapley values by minimizing the number of required costly function evaluations.

View →
cs.LGcs.AIstat.MLRecentMay 28, 2026

The Sample Complexity of Multiclass and Sparse Contextual Bandits

Liad Erez, Fan Chen, Alon Cohen, Tomer Koren +3 more

The paper analyzes the sample complexity of contextual bandits in the $s$-sparse setting, achieving optimal sample bounds for identifying an $\epsilon$-optimal policy.

View →
cs.LGcs.AIRecentMay 31, 2026

Efficient Exploration for Iterative Nash Preference Optimization

Tianlong Nan, Xiaopeng Li, Christian Kroer, Tianyi Lin

The paper proposes a novel, explicitly exploratory iterative Nash Learning from Human Feedback (NLHF) algorithm that achieves strong regret bounds for optimizing LLMs based on complex, non-scalar huma…

View →
cs.CLRecentMay 28, 2026

Auditing LLM Benchmarks with Item Response Theory

Sander Land, Daniel M. Bikel

The paper introduces an Item Response Theory (IRT)-based indicator that effectively identifies likely mislabeled items in existing LLM benchmarks, revealing systematic errors in labeling and model spe…

View →
cs.LGcs.AIRecentMay 29, 2026

Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach

Kihyun Kim, Shripad Deshmukh, Nikos Vlassis, Jiawei Zhang

The paper proposes a feasible-reward-set framework to perform Inverse Reinforcement Learning (IRL) when data comes from multiple imperfect demonstrators, providing theoretical guarantees and practical…

View →
cs.LGcs.CVRecentJun 1, 2026

Drifting Preference Optimization for One-Step Generative Models

Zhou Jiang, Yandong Wen, Zhen Liu

The paper introduces Drifting Preference Optimization (DrPO), an efficient online method for preference finetuning one-step text-to-image generators that avoids complex gradient calculations and model…

View →
cs.LGcs.AIstat.MLRecentMay 28, 2026

Calibrated Preference Learning: The Case of Label Ranking

Santo M. A. R. Thies, Viktor Bengs, Timo Kaufmann, Sebastian J. Vollmer +1 more

The paper formalizes the concept of calibration for probabilistic label ranking, demonstrating that popular models are often poorly calibrated and that calibration captures a meaningful quality dimens…

View →
cs.LGcs.AIRecentJun 2, 2026

Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas +6 more

The paper proposes a novel RL framework that naturally induces diverse agent behavior by reformulating the objective to treat the reward as a distribution over functions, making diversity a rational r…

View →
math.OCcs.AIcs.LGRecentJun 1, 2026

MINTS: Minimalist Thompson Sampling

Kaizheng Wang

The paper introduces MINTS, a minimalist Bayesian framework that simplifies sequential decision-making by placing priors only on the optimum location, allowing for the incorporation of structural cons…

View →
cs.LGstat.MLRecentJun 2, 2026

Online Learning with Gradient-Variation Interval Regret

Yan-Feng Xie, Shuche Wang, Peng Zhao, Zhi-Hua Zhou

The paper proposes a novel online learning algorithm that achieves an interval regret bound scaling with gradient variation, providing strong theoretical guarantees for non-stationary environments.

View →
cs.AIRecentMay 27, 2026

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more

The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…

View →
cs.AIRecentMay 27, 2026

Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing +1 more

The paper introduces 'reward bias substitution,' demonstrating that single-axis mitigations of reward model biases merely shift optimization pressure to correlated proxies, and proposes augmenting eva…

View →