~ similar to 2606.00609· 19 results
The paper demonstrates that using Reinforcement Learning from Verifiable Rewards (RLVR) significantly improves small language models' functional correctness in code generation, particularly when combi…
Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna +4 more
ARES is a novel framework that systematically discovers and mitigates dual vulnerabilities in RLHF systems by simultaneously testing the core LLM and its Reward Model (RM) using structured adversarial…
Ya-Qi Yu, Hao Wang, Fangyu Hong, Xiangyang Qu +14 more
The paper introduces $ ext{RLR}^3$, a novel framework that extends verifiable rewards in Reinforcement Learning to handle partially verifiable, multi-criteria vision-language tasks by integrating robu…
The paper introduces Aethelgard, a novel four-layer adaptive governance framework that enforces least privilege by learning the minimum necessary capabilities for autonomous AI agents based on their i…
The paper introduces a novel shielding framework for Robust MDPs (RMDPs) that guarantees safety under worst-case transition probabilities, enabling safe reinforcement learning even when transition dyn…
Bin Chen, Xinye Liao, Yiming Liu, Xin Liao +1 more
The paper proposes Credit-Attenuated Privileged Feedback (CAPF), a training-time mechanism that uses verifier-side information to guide LLM search agents, significantly improving their performance on…
Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang +9 more
The paper proposes Skill-RM, a unified framework that treats reward modeling as an agentic task to consistently integrate diverse evaluation criteria, achieving superior performance over traditional m…
Matthias Cosler, Cas Cremers, Bernd Finkbeiner, Mohamed Ghanem +1 more
The paper introduces a reinforcement learning framework, inspired by AlphaZero, to automate and improve the proof search process within the Tamarin protocol analysis tool, resulting in shorter and mor…
The paper introduces REFT, a novel method that diversifies rollouts by sampling the first token after the reasoning marker, significantly improving performance in Reinforcement Learning with Verifiabl…
Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang +7 more
QUBRIC introduces a co-design framework that simultaneously optimizes queries and rubrics, overcoming the bottleneck of vague rubrics derived from open-ended questions, leading to significant gains in…
The paper introduces the Configurable Safety Reward Model (CSRM), a novel reward model that can be jointly optimized for calibrated safety compliance and reward modeling, significantly improving LLM s…
Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji +3 more
SafeMCP is a server-side defense plugin that uses look-ahead reasoning to proactively filter and constrain tool acquisition for LLM agents, thereby mitigating catastrophic risks associated with expand…
Haoyan Yang, Reza Shirkavand, Yukai Jin, Jiawei Zhou +2 more
This paper introduces Capability Self-Assessment (CSA), a crucial ability for LLMs to recognize their limitations, and demonstrates that reinforcement learning is an effective method for teaching this…
Saeid Jamshidi, Negar Shahabi, Foutse Khomh, Carol Fung +1 more
The paper proposes a two-timescale governance framework using a multi-agent LLM to safely update and guide RL agents for SDN-IoT defense, significantly improving performance and stability under advers…
Jiasheng Zheng, Boxi Cao, Boxi Yu, Yuzhong Zhang +5 more
The paper introduces Atomic Decomposition and Recombination (ADR), a novel framework that generates genuinely novel and challenging verifiable code tasks, significantly improving the scalability of Re…
The paper introduces Safe Equilibrium Policy Optimization (σepo{}) to train language models for multi-agent strategic tasks, achieving improved safety and robustness across various game domains.
The paper proposes a two-stage post-training pipeline to create a small, local LLM agent (PrivEsc-LLM) capable of performing Linux privilege escalation, achieving high success rates while drastically…
The paper proposes the Policy-Execution-Authorization (PEA) architecture, a separation-of-powers system designed to structurally enforce goal integrity in AI agents, moving safety from a probabilistic…
The paper proposes a local perturbation theory showing that cross-domain interference in multi-domain RL occurs via a low-dimensional shared conflict subspace, which can be selectively mitigated by sh…