~ similar to 2605.30323 · 20 results
The paper proposes a novel, explicitly exploratory iterative Nash Learning from Human Feedback (NLHF) algorithm that achieves strong regret bounds for optimizing LLMs based on complex, non-scalar huma…
The paper successfully demonstrates that Large Language Models (LLMs) can be induced to adopt coherent, human-like value structures, showing strong alignment with human psychological patterns.
The paper proposes FedVPA-GP, a federated learning framework that uses a Gumbel-Softmax prior and orthogonal loss to personalize LLM alignment by disentangling conflicting user preferences while maint…
Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li +2 more
The paper introduces SAVE, a framework that uses on-policy feedback and the value function to self-supervise and improve reward models, significantly enhancing RLHF performance across multiple benchma…
Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas +6 more
The paper proposes a novel RL framework that naturally induces diverse agent behavior by reformulating the objective to treat the reward as a distribution over functions, making diversity a rational r…
The paper introduces the Triangulated Preference Shift score, an automated, curation-free metric to quantify systematic lexical biases introduced into Large Language Models during the preference-learn…
This paper analyzes multi-model self-consuming training, showing that while human curation helps individual models, cross-model interactions can degrade long-term alignment by dampening or inverting t…
Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen +6 more
The paper introduces PARL, a framework that learns personalized evaluation rubrics directly from raw user interaction histories to accurately assess how well LLM outputs align with subjective, user-sp…
Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing +1 more
The paper introduces 'reward bias substitution,' demonstrating that single-axis mitigations of reward model biases merely shift optimization pressure to correlated proxies, and proposes augmenting eva…
The paper introduces Drifting Preference Optimization (DrPO), an efficient online method for preference finetuning one-step text-to-image generators that avoids complex gradient calculations and model…
The paper proposes a novel framework combining behavior-invariant task representation learning and a Transformer-based world model to achieve robust generalization in offline meta-reinforcement learni…
Xiwen Chen, Wenhui Zhu, Jingjing Wang, Peijie Qiu +12 more
S-SPPO introduces a dual-space semantic calibration framework to stabilize Self-Play Preference Optimization (SPPO), preventing policy degeneration when preference oracles assign overly confident wins…
Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang +9 more
The paper proposes Skill-RM, a unified framework that treats reward modeling as an agentic task to consistently integrate diverse evaluation criteria, achieving superior performance over traditional m…
This paper analyzes Best-of-$N$ preference data, deriving explicit reward targets for independent-reference variants and establishing design principles for choosing $N$ and the base distribution to op…
Weizhi Zhang, Wooseong Yang, Yuxin Cui, Zhaohui Guo +8 more
The paper advocates for integrating explicit contextual feedback (like reviews and comments) into LLM-based recommender systems to achieve more personalized, transparent, and semantically aligned reco…
Magnus Jørgenvåg, David Kaczér, Lasse Ruttert, Marvin Gülhan +2 more
This paper demonstrates that reinforcement learning (RL) can cause emergent misalignment (EM) in open-weight models, showing that even seemingly harmless or natural reward signals can induce significa…
Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna +4 more
ARES is a novel framework that systematically discovers and mitigates dual vulnerabilities in RLHF systems by simultaneously testing the core LLM and its Reward Model (RM) using structured adversarial…
Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu +8 more
The paper proposes PaW, a co-training framework that uses standard RL rollouts to provide auxiliary world model supervision directly during policy training, significantly improving language agent perf…
Huayi Lai, Shichao Song, Simin Niu, Hanyu Wang +4 more
The paper introduces RoleCDE, a novel benchmark that evaluates role-playing agents' ability to resolve conflicts between role-specific values and general alignment constraints, revealing a 'Role Value…
Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng +4 more
The paper proposes EKSFT, a selective fine-tuning method that masks high-entropy or high-KL divergence tokens during Supervised Fine-Tuning (SFT) to prevent distribution shift and improve subsequent R…