ArXivCSExplorer
Built by Teycir Ben Soltane

~ similar to 2605.28293 · 19 results

cs.AI · Recent · May 27, 2026

PRO-CUA: Process-Reward Optimization for Computer Use Agents

Yifei He, Rui Yang, Hao Bai, Tong Zhang +1 more

PRO-CUA introduces a process-reward optimization framework that enables efficient, step-level reinforcement learning for training computer use agents by decoupling environment interaction from policy…

cs.LG, cs.AI · Recent · May 28, 2026

ESPO: Early-Stopping Proximal Policy Optimization

Zihang Li, Rui Zhou, Yingcheng Shi, Wenhan Yu +7 more

ESPO is a novel reinforcement learning algorithm that detects trajectory failure in large language models and terminates rollouts early, significantly improving performance on mathematical reasoning b…

cs.IR, cs.AI, cs.CL · Recent · Jun 2, 2026

Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation

Yuecheng Li, Zeyu Song, Jing Yao, Chi Lu +2 more

Taiji is a novel LLM-as-Enhancer framework that optimizes recommender systems by addressing the challenges of generating high-quality reasoning data and balancing semantic and ID-based rewards.

cs.LG, cs.AI · Recent · May 29, 2026

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

Jonathan Colaço Carr, Prakash Panangaden, Doina Precup, Benjamin Van Roy

The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…

cs.IR, cs.AI · Recent · May 27, 2026

Toward User Preference Alignment in LLM Recommendation via Explicit Context Feedback

Weizhi Zhang, Wooseong Yang, Yuxin Cui, Zhaohui Guo +8 more

The paper advocates for integrating explicit contextual feedback (like reviews and comments) into LLM-based recommender systems to achieve more personalized, transparent, and semantically aligned reco…

cs.AI · Recent · May 28, 2026

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling

Yuchen Liu, Yingjie Feng, Lixiong Qin, Jiasi Chen +4 more

The paper introduces Graph-Distance Contribution Reward (GDCR) and Step Advantage Policy Optimization (SAPO) to provide fine-grained, step-level credit assignment for agentic search by modeling world…

cs.LG, stat.ML · Recent · Jun 1, 2026

Local Preferential Bayesian Optimization

Johanna Menn, Miriam Kober, Paul Brunzema, David Stenger +1 more

The paper introduces local Preferential Bayesian Optimization (PBO) methods that adapt high-dimensional Bayesian Optimization techniques, such as trust-region and derivative-informed local search, to…

cs.CL · Recent · May 29, 2026

Preference-Aware Rubric Learning for Personalized Evaluation

Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen +6 more

The paper introduces PARL, a framework that learns personalized evaluation rubrics directly from raw user interaction histories to accurately assess how well LLM outputs align with subjective, user-sp…

cs.AI, cs.CL, cs.LG · Recent · May 28, 2026

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang +3 more

The paper proposes SAAS, a novel RL framework that equips LLM agents with self-awareness to precisely regulate search behavior, significantly mitigating costly over-search without sacrificing accuracy…

cs.LG, cs.AI · Recent · May 29, 2026

When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

Stephane Hatgis-Kessell, Emma Brunskill

The paper introduces Prompted Policy Optimization (PromptPO), an LLM-based method that successfully optimizes policies for various sequential RL tasks, demonstrating that LLMs can replace classical RL…

cs.LG, cs.AI · Recent · May 28, 2026

Quotient DAGs for Off-Policy Evaluation: Forward-Flow Importance Sampling and Exact Slate Propensities

Ziwen Xie, Shaowen Xiang, Hongyu He, Dianbo Liu

The paper introduces a quotient-DAG view to accurately estimate unordered slate propensities for off-policy evaluation, solving the nuisance variance and computational gap inherent in standard importa…

stat.ML, cs.AI, cs.LG · Recent · May 28, 2026

Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles

Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar

This paper analyzes Best-of-$N$ preference data, deriving explicit reward targets for independent-reference variants and establishing design principles for choosing $N$ and the base distribution to op…

cs.CL, cs.AI · Recent · May 30, 2026

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng +1 more

SPADER is a novel reinforcement learning framework that addresses the challenges of Multi-Answer Question Answering by improving credit assignment and promoting diverse exploration during long-horizon…

cs.AI · Recent · May 27, 2026

Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing +1 more

The paper introduces 'reward bias substitution,' demonstrating that single-axis mitigations of reward model biases merely shift optimization pressure to correlated proxies, and proposes augmenting eva…

cs.CL · Recent · May 29, 2026

The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li +2 more

The paper introduces SAVE, a framework that uses on-policy feedback and the value function to self-supervise and improve reward models, significantly enhancing RLHF performance across multiple benchma…

cs.AI · Recent · Jun 1, 2026

CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

Bin Chen, Xinye Liao, Yiming Liu, Xin Liao +1 more

The paper proposes Credit-Attenuated Privileged Feedback (CAPF), a training-time mechanism that uses verifier-side information to guide LLM search agents, significantly improving their performance on…

cs.LG, cs.AI · Recent · May 29, 2026

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more

The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…

cs.LG, cs.AI, cs.CL · Recent · Jun 1, 2026

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai +6 more

The paper introduces OpenWebRL, an open framework that enables training visual web agents using online multi-turn Reinforcement Learning directly on live websites, achieving state-of-the-art performan…

cs.LG, cs.AI · Recent · May 29, 2026

PR2: Predictive Routing Replay for MoE-Based LLM Reinforcement Learning

Daize Dong, Junlin Chen, Haolong Jia, Jiawei Wu +8 more

The paper proposes Predictive Routing Replay (PR2) to stabilize reinforcement learning on Mixture of Experts (MoE) LLMs by predicting and incorporating short-horizon router evolution during training a…
