ArXivCSExplorer
Built by Teycir Ben Soltane

Similar to 2606.00838 · 18 results

cs.LG · Recent · Jun 1, 2026

Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

Christian Scherer, Joe Watson, Theo Gruner, Daniel Palenicek +2 more

The paper proposes a coherent inverse reinforcement learning (IRL) method to improve large behavior models for robotic control, achieving superior sample efficiency and performance on complex sparse m…

cs.AI · Recent · May 30, 2026

Certificate-Guided Evaluation of Reinforcement Learning Generalization

Vignesh Subramanian, Đorđe Žikelić, Suguman Bansal

The paper introduces a logic-driven framework using a neural certificate function to rigorously evaluate and benchmark the generalization capabilities of reinforcement learning algorithms on unseen ta…

cs.LG, cs.AI, cs.CV · Recent · May 27, 2026

OISD: On-Policy Internal Self-Distillation of Language Models

Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang +1 more

The OISD framework improves language model reasoning by distilling on-policy predictive signals from the final output layer to intermediate representations, leading to substantial improvements on math…

cs.AI, cs.LG · Recent · May 29, 2026

From Noise to Control: Parameterized Diffusion Policies

Renhao Zhang, Haotian Fu, Mingxi Jia, George Konidaris +2 more

The Parameterized Diffusion Policy (PDP) framework transforms diffusion models from general stochastic generators into precise, steerable tools for learning and adapting complex robotic behaviors by e…

cs.LG, cs.AI · Recent · May 30, 2026

Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning

Fuyuan Qian, Menglong Zhang, Song Wang, Quanying Liu

The paper proposes a novel framework combining behavior-invariant task representation learning and a Transformer-based world model to achieve robust generalization in offline meta-reinforcement learni…

cs.RO, cs.AI, cs.CV · Recent · May 27, 2026

Turning Video Models into Generalist Robot Policies

Sizhe Lester Li, Evan Kim, Xingjian Bai, Tong Zhao +3 more

The paper proposes VERA, a decoupled policy that uses an action-free video world model combined with an embodiment-specific Inverse Dynamics Model (IDM) to achieve generalizable, zero-shot robot contr…

cs.AI, cs.LG, stat.ML · Recent · Jun 1, 2026

ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

Zelin He, Haotian Lin, Boran Han, Wei Zhu +5 more

ReSkill is an RL-in-the-loop framework that reconciles skill creation and policy optimization by automatically creating, testing, and refining modular skills alongside the agent's policy learning, lea…

cs.RO · Recent · Jun 3, 2026

HORIZON: Recoverability-Governed Curriculum for Physical-Domain Scaling

Chenhao Bai, Liqin Lu, Kaijun Wang, Hui Chen +4 more

This paper studies how to scale robust robot policies by expanding physical domains in a recoverable way.

cs.LG, cs.AI · Recent · May 29, 2026

EchoRL: Reinforcement Learning via Rollout Echoing

Jinhe Bi, Aniri, Minglai Yang, Xingcheng Zhou +8 more

EchoRL proposes a lightweight module to exploit valuable learning signals from advantage-degenerated rollouts in Reinforcement Learning with Verifiable Rewards (RLVR), significantly improving LLM post…

cs.LG, cs.AI · Recent · May 30, 2026

Task diversity produces systematic transfer but inhibits continual reinforcement learning

Purab Seth, Neil Shah, Kunal Jha, Samuel J. Gershman +2 more

The paper introduces Banyan, a new continual reinforcement learning benchmark, demonstrating that while task diversity enables local transfer across distribution shifts, it does not guarantee sustaine…

cs.CL · Recent · May 29, 2026

SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

Wai-Chung Kwan, Aryo Pradipta Gema, Joshua Ong Jun Leang, Pasquale Minervini

SCOPE introduces a data-free self-play framework that co-evolves a task-generating Challenger and a document-answering Solver, significantly improving open-ended performance on language models without…

cs.AI · Recent · May 27, 2026

SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang

SkillC introduces a Contrastive Skill Credit Assignment (CSCA) framework to enable LLM agents to autonomously internalize skills during training, significantly outperforming existing methods without r…

cs.CL · Recent · May 31, 2026

On the Generalization Gap in Self-Evolving Language Model Reasoning

Zhenting Qi, Susanna Maria Baby, Stefanie Anna Baby, Kan Yuan +4 more

The paper investigates the limits of self-evolution in LLM reasoning under closed-loop settings, finding that while self-improvement is significant, it consistently falls short of perfect oracle super…

cs.LG, cs.AI, cs.CL · Recent · Jun 3, 2026

Reinforcement Learning from Rich Feedback with Distributional DAgger

Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad

This paper proposes a new imitation learning algorithm called DistIL that uses distributional feedback to strengthen policy improvement and regret guarantees.

cs.LG, cs.AI · Recent · May 29, 2026

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more

The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…

cs.RO, cs.AI · Recent · May 31, 2026

Implicit Drifting Policy: One-Step Action Generation via Conditional Expert Geometry

Zemin Yang, Yaoyu He, Yiming Zhong, Yuhao Zhang +4 more

The Implicit Drifting Policy (IDP) is a novel one-step action generation framework that implicitly enforces trajectory correction constraints by analyzing local expert action geometry, overcoming the…

cs.LG, cs.AI · Recent · May 29, 2026

Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach

Kihyun Kim, Shripad Deshmukh, Nikos Vlassis, Jiawei Zhang

The paper proposes a feasible-reward-set framework to perform Inverse Reinforcement Learning (IRL) when data comes from multiple imperfect demonstrators, providing theoretical guarantees and practical…

cs.CL · Recent · May 31, 2026

Deep Research as Rubric for Reinforcement Learning

Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai +8 more

The paper proposes Deep Research as Rubric (DR-rubric), a novel evidence-driven framework that treats rubric construction itself as a research problem to generate fine-grained, scalable reward signals…
