Similar to 2605.28168 · 20 results
The paper introduces PIRS, a physics-informed reward shaping method that replaces ad-hoc comfort proxies with the ISO 7730 PMV formulation, enabling deep reinforcement learning agents to achieve energ…
This paper proposes an Explainable Deep Reinforcement Learning (XRL) framework to optimize energy management in complex buildings, demonstrating that on-policy algorithms provide superior cost reducti…
Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas +6 more
The paper proposes a novel RL framework that naturally induces diverse agent behavior by reformulating the objective to treat the reward as a distribution over functions, making diversity a rational r…
Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang +9 more
The paper proposes Skill-RM, a unified framework that treats reward modeling as an agentic task to consistently integrate diverse evaluation criteria, achieving superior performance over traditional m…
Youting Wang, Yuan Tang, Bowen Liu, Xuan Liu +1 more
The paper introduces a diagnostic-driven iterative refinement process for improving LLM-generated reward functions in sparse, structured reinforcement learning tasks, significantly boosting agent perf…
The paper demonstrates that using Reinforcement Learning from Verifiable Rewards (RLVR) significantly improves small language models' functional correctness in code generation, particularly when combi…
The paper successfully demonstrates that Large Language Models (LLMs) can be induced to adopt coherent, human-like value structures, showing strong alignment with human psychological patterns.
Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng +1 more
SPADER is a novel reinforcement learning framework that addresses the challenges of Multi-Answer Question Answering by improving credit assignment and promoting diverse exploration during long-horizon…
CARE-RL introduces a framework combining protocol-aware reward generation and capability-aware optimization to effectively mitigate cross-domain conflicts in multi-domain reinforcement learning for LL…
Yujie Wang, Siwei Chen, Longzan Luo, Xinyi Liu +3 more
The paper proposes DARTS, a distribution-aware active rollout trajectory shaping method that fundamentally accelerates LLM reinforcement learning by actively shaping the long-tail response distributio…
The paper demonstrates that explicit gender cues systematically affect LLM value trade-offs, causing decision flips that are often masked or misattributed by the models themselves.
The paper proposes In-Context Reward Adaptation, a transformer-based framework that uses in-context learning and auxiliary signals (like human response time) to robustly model diverse and unseen human…
Yuchen Liu, Yingjie Feng, Lixiong Qin, Jiasi Chen +4 more
The paper introduces Graph-Distance Contribution Reward (GDCR) and Step Advantage Policy Optimization (SAPO) to provide fine-grained, step-level credit assignment for agentic search by modeling world…
The paper introduces the Configurable Safety Reward Model (CSRM), a novel reward model that can be jointly optimized for calibrated safety compliance and reward modeling, significantly improving LLM s…
Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more
The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…
Kuan Li, Shuo Zhang, Huacan Wang, Fangzhou Yu +11 more
The paper introduces SMH-Bench, a comprehensive benchmark built on a simulator to rigorously test LLM agents' ability to perform complex, environment-grounded reasoning and actions in realistic smart-…
Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang +3 more
The paper proposes BaSE, a multi-armed bandit approach, to optimally allocate a fixed budget of LLM calls across parallel evolutionary search trajectories, significantly improving mean fitness and rel…
Yuecheng Li, Zeyu Song, Jing Yao, Chi Lu +2 more
Taiji is a novel LLM-as-Enhancer framework that optimizes recommender systems by addressing the challenges of generating high-quality reasoning data and balancing semantic and ID-based rewards.
Jiakang Li, Guanyu Zhu, Can Jin, Chenxi Huang +7 more
The paper introduces Latent Reward Steering (LRS), an adaptive inference-time framework that implicitly improves the reasoning ability of LLMs by guiding the model's internal latent states based on a…
Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more
The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…