Similar to 2605.27996 · 20 results
The paper introduces BiAxisAudit, a novel framework that evaluates LLM bias by analyzing bias scores across multiple prompt formats and within the internal inconsistency of model responses, revealing…
Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li +2 more
The paper introduces SAVE, a framework that uses on-policy feedback and the value function to self-supervise and improve reward models, significantly enhancing RLHF performance across multiple benchma…
Xiwen Chen, Wenhui Zhu, Jingjing Wang, Peijie Qiu +12 more
S-SPPO introduces a dual-space semantic calibration framework to stabilize Self-Play Preference Optimization (SPPO), preventing policy degeneration when preference oracles assign overly confident wins…
The paper introduces an Item Response Theory (IRT)-based indicator that effectively identifies likely mislabeled items in existing LLM benchmarks, revealing systematic errors in labeling and model spe…
Jinhe Bi, Aniri, Minglai Yang, Xingcheng Zhou +8 more
EchoRL proposes a lightweight module to exploit valuable learning signals from advantage-degenerated rollouts in Reinforcement Learning with Verifiable Rewards (RLVR), significantly improving LLM post…
The paper introduces Rate Matching Consistency Training (RMCT), a novel method that improves model robustness against extraneous input cues without forcing the model to ignore those cues, thus preserv…
Can Jin, Jiakang Li, Rui Wu, Eddy Zhang +1 more
The paper introduces Weak-Critic Strong Oversight, a method where a weak model guides a strong model's self-improvement by providing non-misleading revision directions, leading to scalable oversight.
The paper introduces the Triangulated Preference Shift score, an automated, curation-free metric to quantify systematic lexical biases introduced into Large Language Models during the preference-learn…
The paper proposes Hysteretic Policy Optimization (HPO) and its adaptive variant (A-HPO) to stabilize reinforcement learning training in sparse-reward environments by better balancing positive and neg…
The paper proposes Detector Evasion Policy Optimization (DEPO), a constrained reinforcement learning method that effectively evades AI text detectors while strictly maintaining the original text's sem…
Zihan Chen, Yiming Zhang, Wenxiang Geng, Zenghui Ding +1 more
The paper theoretically explains that optimizing LLMs solely on outcomes leads to brittle reasoning (Reward-Induced Manifold Collapse) by favoring low-complexity shortcuts, and proposes process-based…
Hamidreza Hasani Balyani, Seyed Pouyan Mousavi Davoudi, Alireza Amiri-Margavi, Amin Gholami Davodi +1 more
The paper establishes a benchmark based on the cheap-talk model to test LLM honesty when the model's incentives conflict with the user's, finding that models consistently over-reveal information regardless…
The paper introduces Multi-Response Training (MRT) to combat the 'mode lottery' problem in language model fine-tuning, showing that retaining multiple valid responses significantly improves distributi…
Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai +8 more
The paper proposes Deep Research as Rubric (DR-rubric), a novel evidence-driven framework that treats rubric construction itself as a research problem to generate fine-grained, scalable reward signals…
Zihang Li, Rui Zhou, Yingcheng Shi, Wenhan Yu +7 more
ESPO is a novel reinforcement learning algorithm that detects trajectory failure in large language models and terminates rollouts early, significantly improving performance on mathematical reasoning b…
Youting Wang, Yuan Tang, Bowen Liu, Xuan Liu +1 more
The paper introduces a diagnostic-driven iterative refinement process for improving LLM-generated reward functions in sparse, structured reinforcement learning tasks, significantly boosting agent perf…
Taein Lim, Seongyong Ju, Munhyeok Kim, Hyunjun Kim +1 more
The paper introduces CyBiasBench, a comprehensive benchmark that quantifies the inherent, agent-specific bias in LLM agents' attack selection patterns in cybersecurity scenarios.
Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang +9 more
The paper proposes Skill-RM, a unified framework that treats reward modeling as an agentic task to consistently integrate diverse evaluation criteria, achieving superior performance over traditional m…
The paper proposes In-Context Reward Adaptation, a transformer-based framework that uses in-context learning and auxiliary signals (like human response time) to robustly model diverse and unseen human…
Xiqi Hao, Zengqing Wu, Yu-Xuan Qiu, Chuan Xiao +3 more
The paper decomposes LLM debate convergence into three mechanisms (instability, conformity, persuasion) and finds that much observed convergence is harmful social compliance rather than genuine reason…