Reinforcement Learning

RL algorithms, policy gradient, Q-learning, RLHF

20 papers indexed

cs.LG · cs.AI · cs.CV · Empirical · Recent · Jul 8, 2026

Selective Timestep Weighting and Advantage-Based Replay for Sample-Efficient Diffusion RLHF

Eric Zhu, Abhinav Shrivastava, Soumik Mukhopadhyay

This paper proposes two strategies to improve the feedback efficiency of reinforcement learning from human feedback (RLHF) in diffusion models.

cs.LG · cs.AI · Recent · May 30, 2026

Interpretable Policy Distillation for Power Grid Topology Control

Aleksandra Dmitruka, Karlis Freivalds

This paper demonstrates that a complex deep reinforcement learning policy for power grid control can be successfully distilled into a lightweight, auditable decision tree and random forest surrogate t…

cs.CR · Recent · Apr 1, 2026

Multi-Agent LLM Governance for Safe Two-Timescale Reinforcement Learning in SDN-IoT Defense

Saeid Jamshidi, Negar Shahabi, Foutse Khomh, Carol Fung +1 more

The paper proposes a two-timescale governance framework using a multi-agent LLM to safely update and guide RL agents for SDN-IoT defense, significantly improving performance and stability under advers…

cs.CR · Recent · Apr 18, 2026

Privacy-Aware Machine Unlearning with SISA for Reinforcement Learning-Based Ransomware Detection

Jannatul Ferdous, Rafiqul Islam, Md Zahidul Islam

The paper proposes a privacy-aware machine unlearning framework using SISA training to efficiently remove the influence of specific training data from RL-based ransomware detectors with minimal perfor…

cs.LG · cs.CR · Recent · Jun 3, 2026

Sequential Data Poisoning in LLM Post-Training

Jack Sanderson, Yihan Wang, Xiaoqian Lu, Gautam Kamath +1 more

The paper introduces the threat model of sequential data poisoning, demonstrating that multiple, collaborating attackers can exploit compound vulnerabilities in LLM post-training pipelines that are in…

cs.RO · cs.AI · Recent · Jun 2, 2026

Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

Roohan Ahmed Khan, Yasheerah Yaqoot, Muhammad Ahsan Mustafa, Dzmitry Tsetserukou

The paper introduces AgenticRL, a self-refining reinforcement learning framework that uses a multimodal GPT agent to automatically design, refine, and deploy reward functions for complex UAV navigatio…

cs.LG · cs.NE · q-fin.ST · Recent · Jun 3, 2026

Dynamic Multi-Pair Trading Strategy in Cryptocurrency Markets with Deep Reinforcement Learning

Damian Lebiedź, Robert Ślepaczuk

The paper develops and validates a novel Deep Reinforcement Learning (DRL) framework to enhance pair trading in volatile cryptocurrency markets, demonstrating statistically significant outperformance…

cs.LG · cs.AI · Empirical · Recent · Jul 10, 2026

Semantic Pareto-DQN: A Multi-Objective Reinforcement Learning Framework for Financial Anomaly Detection

Cláudio Lúcio do Val Lopes, Lucca Machado da Silva

The paper proposes Semantic Pareto-DQN, a multi-objective reinforcement learning framework for financial anomaly detection using large language models and natural-language narratives, achieving superi…

cs.LG · cs.AI · cs.CL · Recent · May 28, 2026

Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning

Zizhe Chen, Jiqian Dong, Yizhou Tian, Garry Yang +3 more

This paper introduces Numca and Hista, two novel techniques that significantly improve state value estimation for LLM reinforcement learning, addressing the instability of standard critic approaches.

cs.AI · cs.LG · Recent · May 27, 2026

Differentiable Belief-based Opponent Shaping

Aarav G Sane, Karthik Sivachandran, Rohan Paleja

The paper proposes D-BOS, a novel differentiable method that shapes opponent behavior by directly manipulating the opponent's inferred belief state, outperforming existing techniques in multi-agent ga…

cs.LG · cs.AI · Recent · May 29, 2026

Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

Soichiro Nishimori, Paavo Parmas, Sotetsu Koyamada, Tadashi Kozuno +3 more

The paper introduces ReMax, a novel objective function that naturally encourages stochastic exploration in policy gradient reinforcement learning by evaluating expected maximum returns over multiple s…

cs.AI · cs.CR · cs.LG · Recent · Apr 20, 2026

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna +4 more

ARES is a novel framework that systematically discovers and mitigates dual vulnerabilities in RLHF systems by simultaneously testing the core LLM and its Reward Model (RM) using structured adversarial…

cs.LG · cs.AI · cs.CL · Recent · Jun 3, 2026

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Xuekang Wang, Zhuoyuan Hao, Shuo Hou, Hao Peng +2 more

This paper introduces CHERRL, a controllable hacking environment for rubric-based reinforcement learning to study and mitigate reward hacking.

cs.LG · cs.AI · Recent · Jun 2, 2026

Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas +6 more

The paper proposes a novel RL framework that naturally induces diverse agent behavior by reformulating the objective to treat the reward as a distribution over functions, making diversity a rational r…

cs.LG · cs.AI · Recent · Jun 1, 2026

Policy and World Modeling Co-Training for Language Agents

Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu +8 more

The paper proposes PaW, a co-training framework that uses standard RL rollouts to provide auxiliary world model supervision directly during policy training, significantly improving language agent perf…

cs.LG · cs.AI · Recent · May 29, 2026

PR2: Predictive Routing Replay for MoE-Based LLM Reinforcement Learning

Daize Dong, Junlin Chen, Haolong Jia, Jiawei Wu +8 more

The paper proposes Predictive Routing Replay (PR2) to stabilize reinforcement learning on Mixture of Experts (MoE) LLMs by predicting and incorporating short-horizon router evolution during training a…

cs.LG · cs.AI · Recent · May 28, 2026

ESPO: Early-Stopping Proximal Policy Optimization

Zihang Li, Rui Zhou, Yingcheng Shi, Wenhan Yu +7 more

ESPO is a novel reinforcement learning algorithm that detects trajectory failure in large language models and terminates rollouts early, significantly improving performance on mathematical reasoning b…

cs.LG · cs.AI · math.OC · Recent · May 29, 2026

Agentic Transformers Provably Learn to Search via Reinforcement Learning

Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi

This paper demonstrates that transformer-based policies can provably learn complex tree search mechanisms, such as depth-first search, purely through reinforcement learning in a stochastic environment…

cs.LG · cs.AI · cs.IR · Recent · May 27, 2026

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

Youting Wang, Yuan Tang, Bowen Liu, Xuan Liu +1 more

The paper introduces a diagnostic-driven iterative refinement process for improving LLM-generated reward functions in sparse, structured reinforcement learning tasks, significantly boosting agent perf…

cs.AI · cs.CL · cs.CV · Empirical · Recent · Jun 22, 2026

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

Haoling Li, Kai Zheng, Jie Wu, Can Xu +3 more

This paper proposes VeriEvol, a framework for scaling reinforcement learning for visual mathematical reasoning by decoupling prompt difficulty from answer reliability and by verifying data construction.
