Reinforcement Learning
RL algorithms, policy gradient, Q-learning, RLHF
20 papers indexed
Interpretable Policy Distillation for Power Grid Topology Control
This paper demonstrates that a complex deep reinforcement learning policy for power grid control can be successfully distilled into a lightweight, auditable decision tree and random forest surrogate t…
Multi-Agent LLM Governance for Safe Two-Timescale Reinforcement Learning in SDN-IoT Defense
Saeid Jamshidi, Negar Shahabi, Foutse Khomh, Carol Fung +1 more
The paper proposes a two-timescale governance framework using a multi-agent LLM to safely update and guide RL agents for SDN-IoT defense, significantly improving performance and stability under advers…
Privacy-Aware Machine Unlearning with SISA for Reinforcement Learning-Based Ransomware Detection
The paper proposes a privacy-aware machine unlearning framework using SISA training to efficiently remove the influence of specific training data from RL-based ransomware detectors with minimal perfor…
Sequential Data Poisoning in LLM Post-Training
Jack Sanderson, Yihan Wang, Xiaoqian Lu, Gautam Kamath +1 more
The paper introduces the threat model of sequential data poisoning, demonstrating that multiple, collaborating attackers can exploit compound vulnerabilities in LLM post-training pipelines that are in…
Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation
The paper introduces AgenticRL, a self-refining reinforcement learning framework that uses a multimodal GPT agent to automatically design, refine, and deploy reward functions for complex UAV navigatio…
Dynamic Multi-Pair Trading Strategy in Cryptocurrency Markets with Deep Reinforcement Learning
The paper develops and validates a novel Deep Reinforcement Learning (DRL) framework to enhance pair trading in volatile cryptocurrency markets, demonstrating statistically significant outperformance…
Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning
Zizhe Chen, Jiqian Dong, Yizhou Tian, Garry Yang +3 more
This paper introduces Numca and Hista, two novel techniques that significantly improve state value estimation for LLM reinforcement learning, addressing the instability of standard critic approaches.
Differentiable Belief-based Opponent Shaping
The paper proposes D-BOS, a novel differentiable method that shapes opponent behavior by directly manipulating the opponent's inferred belief state, outperforming existing techniques in multi-agent ga…
Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying
The paper introduces ReMax, a novel objective function that naturally encourages stochastic exploration in policy gradient reinforcement learning by evaluating expected maximum returns over multiple s…
ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System
Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna +4 more
ARES is a novel framework that systematically discovers and mitigates dual vulnerabilities in RLHF systems by simultaneously testing the core LLM and its Reward Model (RM) using structured adversarial…
Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning
Xuekang Wang, Zhuoyuan Hao, Shuo Hou, Hao Peng +2 more
This paper introduces CHERRL, a controllable hacking environment for rubric-based reinforcement learning to study and mitigate reward hacking.
Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning
Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas +6 more
The paper proposes a novel RL framework that naturally induces diverse agent behavior by reformulating the objective to treat the reward as a distribution over functions, making diversity a rational r…
PR2: Predictive Routing Replay for MoE-Based LLM Reinforcement Learning
Daize Dong, Junlin Chen, Haolong Jia, Jiawei Wu +8 more
The paper proposes Predictive Routing Replay (PR2) to stabilize reinforcement learning on Mixture of Experts (MoE) LLMs by predicting and incorporating short-horizon router evolution during training a…
ESPO: Early-Stopping Proximal Policy Optimization
Zihang Li, Rui Zhou, Yingcheng Shi, Wenhan Yu +7 more
ESPO is a novel reinforcement learning algorithm that detects trajectory failure in large language models and terminates rollouts early, significantly improving performance on mathematical reasoning b…
Policy and World Modeling Co-Training for Language Agents
Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu +8 more
The paper proposes PaW, a co-training framework that uses standard RL rollouts to provide auxiliary world model supervision directly during policy training, significantly improving language agent perf…
Agentic Transformers Provably Learn to Search via Reinforcement Learning
This paper demonstrates that transformer-based policies can provably learn complex tree search mechanisms, such as depth-first search, purely through reinforcement learning in a stochastic environment…
When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL
Youting Wang, Yuan Tang, Bowen Liu, Xuan Liu +1 more
The paper introduces a diagnostic-driven iterative refinement process for improving LLM-generated reward functions in sparse, structured reinforcement learning tasks, significantly boosting agent perf…
Beyond Static Sandboxing: Learned Capability Governance for Autonomous AI Agents
The paper introduces Aethelgard, a novel four-layer adaptive governance framework that enforces least privilege by learning the minimum necessary capabilities for autonomous AI agents based on their i…
Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards
Christian Scherer, Joe Watson, Theo Gruner, Daniel Palenicek +2 more
The paper proposes a coherent inverse reinforcement learning (IRL) method to improve large behavior models for robotic control, achieving superior sample efficiency and performance on complex sparse m…
Decoupled Behavioral Cloning for Scalable Inductive Generalization in RL from Specifications
The paper proposes DIBS, a decoupled behavioral cloning approach that stabilizes inductive generalization in RL by separating task-specific policy learning from the evolution function, leading to impr…