ArXivCSExplorer
☆☆Bookmarks🏆RSSHow to UseFAQ
Built with and by Teycir Ben Soltane•
How to Use•FAQ•GitHub•arXiv.org•
Share:

~ similar to 2605.30160· 19 results

cs.LGcs.AIcs.CLRecentJun 3, 2026

Reinforcement Learning from Rich Feedback with Distributional DAgger

Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad

This paper proposes a new imitation learning algorithm called DistIL that uses distributional feedback to improve policy improvement and regret guarantees.

View →
cs.LGcs.AIRecentMay 29, 2026

DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning

Yujie Wang, Siwei Chen, Longzan Luo, Xinyi Liu +3 more

The paper proposes DARTS, a distribution-aware active rollout trajectory shaping method that fundamentally accelerates LLM reinforcement learning by actively shaping the long-tail response distributio…

View →
cs.AIcs.LGRecentMay 29, 2026

From Noise to Control: Parameterized Diffusion Policies

Renhao Zhang, Haotian Fu, Mingxi Jia, George Konidaris +2 more

The Parameterized Diffusion Policy (PDP) framework transforms diffusion models from general stochastic generators into precise, steerable tools for learning and adapting complex robotic behaviors by e…

View →
cs.AIcs.LGecon.THRecentMay 31, 2026

Prospect-Theory Behavior from Bellman Optimality in MDPs with Catastrophic States

Yujiao Chen

This paper shows that standard optimal control in Markov Decision Processes (MDPs) with an absorbing catastrophic state naturally generates behavioral signatures mimicking prospect theory, even withou…

View →
cs.LGcs.AImath.DSRecentMay 27, 2026

The Hamilton-Jacobi Theory of Deep Learning

Jose Marie Antonio Miñoza, Erika Fille T. Legara, Christopher P. Monterola

This paper establishes an exact mathematical correspondence between training and inference in deep learning and the solution of Hamilton-Jacobi partial differential equations, unifying multiple theore…

View →
cs.LGcs.AIRecentMay 29, 2026

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

Jonathan Colaço Carr, Prakash Panangaden, Doina Precup, Benjamin Van Roy

The paper introduces the Markov decision contest, a new framework for reinforcement learning using pairwise preferences, and proves that stationary Markov policies are optimal and solvable efficiently…

View →
cs.AIcs.LGRecentMay 30, 2026

Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief

Hongqiang Lin, Pengfei Wang, Nenggan Zheng

The paper introduces Posterior Hybrid Bayesian Belief (PhyB), a novel framework that reformulates policy optimization in Bayesian Offline RL by approximating expectations as a convex combination over…

View →
cs.LGcs.AIRecentMay 29, 2026

Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

Soichiro Nishimori, Paavo Parmas, Sotetsu Koyamada, Tadashi Kozuno +3 more

The paper introduces ReMax, a novel objective function that naturally encourages stochastic exploration in policy gradient reinforcement learning by evaluating expected maximum returns over multiple s…

View →
cs.LGcs.AIRecentMay 29, 2026

The Terminal Representation in Reinforcement Learning

Amir Esterhuysen, Anders Jonsson

The paper introduces the Terminal Representation (TR), a novel, lower-dimensional, and structurally distinct formulation for encoding reward-weighted trajectories in RL that bypasses the need for eige…

View →
cs.AIcs.LGRecentMay 27, 2026

Adaptive Reservoir Computing for Multi-Scenario Chaotic System Forecasting

Shadmehr Zaregarizi, Khashayar Yavari

The paper introduces an adaptive reservoir computing framework that tailors Echo State Networks (ESNs) to specific evaluation scenarios, achieving a high score on the CTF-4-Science Lorenz benchmark fo…

View →
cond-mat.dis-nnquant-phstat.MLRecentJun 4, 2026

Nonreversible Gauge Fields in Fokker--Planck Dynamics: Supersymmetric Hamiltonians and Learned Finite Forces

Masayuki Ohzeki

The paper reformulates nonreversible perturbations of Fokker--Planck dynamics as gauge fields, providing a unified operator viewpoint to analyze relaxation processes and develop methods for learning o…

View →
cs.LGcs.AIRecentMay 28, 2026

HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed +1 more

The paper proposes Hysteretic Policy Optimization (HPO) and its adaptive variant (A-HPO) to stabilize reinforcement learning training in sparse-reward environments by better balancing positive and neg…

View →
cs.LGcs.AIRecentJun 2, 2026

Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas +6 more

The paper proposes a novel RL framework that naturally induces diverse agent behavior by reformulating the objective to treat the reward as a distribution over functions, making diversity a rational r…

View →
cs.CLcs.LGRecentMay 30, 2026

Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning

Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun +2 more

The paper introduces Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a novel method that internalizes temperature-based policy reheating into model parameters to combat entropy collapse in r…

View →
cs.LGRecentJun 1, 2026

Why Are DMD Students Lazy? Understanding the Copying Behavior in Few-Step Distillation

Shucheng Li, Iolo Jones, Alexander Tong, Michael M. Bronstein

This paper investigates the phenomenon of 'copying' in Distribution Matching Distillation (DMD), finding that high-dimensional distillation causes student models to spontaneously reproduce the teacher…

View →
cs.LGcs.AIRecentMay 30, 2026

Interpretable Policy Distillation for Power Grid Topology Control

Aleksandra Dmitruka, Karlis Freivalds

This paper demonstrates that a complex deep reinforcement learning policy for power grid control can be successfully distilled into a lightweight, auditable decision tree and random forest surrogate t…

View →
stat.MLcs.AIcs.LGRecentMay 28, 2026

Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles

Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar

This paper analyzes Best-of-$N$ preference data, deriving explicit reward targets for independent-reference variants and establishing design principles for choosing $N$ and the base distribution to op…

View →
cs.AIRecentMay 27, 2026

Global Policy-Space Response Oracles for Two-Player Zero-Sum Games

Junyu Zhang, Feihong Yang, Jian Wang, Chao Wang +1 more

The paper introduces Global PSRO, a novel deep reinforcement learning framework that efficiently approximates Nash equilibria in large two-player zero-sum games by intelligently expanding the strategy…

View →
cs.AIRecentMay 27, 2026

PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management

Shadmehr Zaregarizi, Khashayar Yavari

The paper introduces PIRS, a physics-informed reward shaping method that replaces ad-hoc comfort proxies with the ISO 7730 PMV formulation, enabling deep reinforcement learning agents to achieve energ…

View →