~ similar to 2606.01382· 20 results
Xiwen Chen, Wenhui Zhu, Jingjing Wang, Peijie Qiu +12 more
S-SPPO introduces a dual-space semantic calibration framework to stabilize Self-Play Preference Optimization (SPPO), preventing policy degeneration when preference oracles assign overly confident wins…
The paper proposes In-Context Reward Adaptation, a transformer-based framework that uses in-context learning and auxiliary signals (like human response time) to robustly model diverse and unseen human…
Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li +2 more
The paper introduces SAVE, a framework that uses on-policy feedback and the value function to self-supervise and improve reward models, significantly enhancing RLHF performance across multiple benchma…
The paper proposes FedVPA-GP, a federated learning framework that uses a Gumbel-Softmax prior and orthogonal loss to personalize LLM alignment by disentangling conflicting user preferences while maint…
The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…
The paper introduces Drifting Preference Optimization (DrPO), an efficient online method for preference finetuning one-step text-to-image generators that avoids complex gradient calculations and model…
The paper introduces the Triangulated Preference Shift score, an automated, curation-free metric to quantify systematic lexical biases introduced into Large Language Models during the preference-learn…
Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen +6 more
The paper introduces PARL, a framework that learns personalized evaluation rubrics directly from raw user interaction histories to accurately assess how well LLM outputs align with subjective, user-sp…
Jian Mu, Tianyi Lin, Chengwei Qin, Zhongxiang Dai +1 more
DRIFT proposes a novel framework that efficiently optimizes LLMs for multi-turn interactions by decoupling rollout from optimization, allowing the use of weighted supervised fine-tuning to match the p…
Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng +1 more
SPADER is a novel reinforcement learning framework that addresses the challenges of Multi-Answer Question Answering by improving credit assignment and promoting diverse exploration during long-horizon…
Johanna Menn, Miriam Kober, Paul Brunzema, David Stenger +1 more
The paper introduces local Preferential Bayesian Optimization (PBO) methods that adapt high-dimensional Bayesian Optimization techniques, such as trust-region and derivative-informed local search, to…
The paper introduces Safe Equilibrium Policy Optimization (σepo{}) to train language models for multi-agent strategic tasks, achieving improved safety and robustness across various game domains.
The paper introduces Prompted Policy Optimization (PromptPO), an LLM-based method that successfully optimizes policies for various sequential RL tasks, demonstrating that LLMs can replace classical RL…
The paper introduces Q-ALIGN DT, a novel framework that improves conditioned sequence models by enforcing alignment between the input return-to-go (RTG) signal and the output policy's expected Q-value…
Yuecheng Li, Zeyu Song, Jing Yao, Chi Lu +2 more
Taiji is a novel LLM-as-Enhancer framework that optimizes recommender systems by addressing the challenges of generating high-quality reasoning data and balancing semantic and ID-based rewards.
The paper introduces DPPrefSyn, a novel algorithm that generates differentially private synthetic preference data, enabling privacy-preserving alignment of large language models.
The paper introduces DPPrefSyn, a novel algorithm that generates differentially private synthetic preference data, enabling privacy-preserving alignment of large language models.
Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing +1 more
The paper introduces 'reward bias substitution,' demonstrating that single-axis mitigations of reward model biases merely shift optimization pressure to correlated proxies, and proposes augmenting eva…
The paper proposes an objective-wise reputation-market mechanism to dynamically calibrate and gate LLM-generated expert priors in multi-objective Bayesian optimization, showing that dynamic calibratio…
This paper analyzes Best-of-$N$ preference data, deriving explicit reward targets for independent-reference variants and establishing design principles for choosing $N$ and the base distribution to op…