ArXivCSExplorer
Built by Teycir Ben Soltane

20 results for “Reinforce learning”

CS papers only

Hybrid search: keyword + semantic, ranked by combined score.

cs.HC, cs.AI • Recent • May 27, 2026

Learning to Assign Prediction Tasks to Agents with Capacity Constraints

Shang Wu, Saatvik Kher, Padhraic Smyth

This paper develops a policy-learning framework to optimally assign prediction tasks to multiple agents, considering individual agent expertise and capacity constraints, achieving systematic performan…

cs.CL, cs.AI • Empirical • Recent • Jun 11, 2026

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

Zilin Xiao, Qi Ma, Chun-cheng Jason Chen, Xintao Chen +3 more

This paper proposes a post-training framework called Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT) to teach language models to reason by analogy.

cs.AI • Recent • May 27, 2026

SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang

SkillC introduces a Contrastive Skill Credit Assignment (CSCA) framework to enable LLM agents to autonomously internalize skills during training, significantly outperforming existing methods without r…

cs.LG, cs.CL • Recent • May 31, 2026

Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher

Arda Uzunoglu, Alvin Zhang, Daniel Khashabi

The paper introduces trust functions to filter weak supervision labels, enabling near-lossless weak-to-strong generalization by selectively training a strong student using only the most reliable weak…

cs.AI • Recent • May 30, 2026

Decoupled Behavioral Cloning for Scalable Inductive Generalization in RL from Specifications

Vignesh Subramanian, Subhajit Roy, Suguman Bansal

The paper proposes DIBS, a decoupled behavioral cloning approach that stabilizes inductive generalization in RL by separating task-specific policy learning from the evolution function, leading to impr…

cs.CL, cs.AI • Recent • Jun 2, 2026

QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang +7 more

QUBRIC introduces a co-design framework that simultaneously optimizes queries and rubrics, overcoming the bottleneck of vague rubrics derived from open-ended questions, leading to significant gains in…

cs.CL, cs.AI • Recent • May 28, 2026

CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation

Wenhan Xiao, Ziwei Zhang, Chuanyue Yu, Xingcheng Fu +3 more

CRITIC-R1 introduces a structured critic framework that treats RAG critique as an explicit error diagnosis problem using reinforcement learning, significantly improving answer quality over strong RAG…

cs.CL, cs.AI • Recent • Jun 1, 2026

A Primer in Post-Training Reasoning Data: What We Know About How It Works

Yaoming Li, Guangxiang Zhao, Qilong Shi, Lin Sun +2 more

This paper synthesizes over 150 scattered studies and reports to provide the first comprehensive primer on post-training reasoning data, organizing the field around data objects, utility, construction…

cs.AI • Recent • May 29, 2026

Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight

Can Jin, Jiakang Li, Rui Wu, Eddy Zhang +1 more

The paper introduces Weak-Critic Strong Oversight, a method where a weak model guides a strong model's self-improvement by providing non-misleading revision directions, leading to scalable oversight.

cs.LG, cs.AI • Recent • Jun 2, 2026

Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories

Ali Behrouz, Farnoosh Hashemi, Vahab Mirrokni

This paper introduces a 'Sleep' paradigm for machine learning models to continually learn and transfer knowledge.

cs.CL • Recent • May 31, 2026

Deep Research as Rubric for Reinforcement Learning

Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai +8 more

The paper proposes Deep Research as Rubric (DR-rubric), a novel evidence-driven framework that treats rubric construction itself as a research problem to generate fine-grained, scalable reward signals…

cs.AI • Recent • Jun 1, 2026

Beyond One-shot: AI Agents for Learning in Field Experiments

Junjie Luo, Ritu Agarwal, Gordon Gao

The paper demonstrates that tool-augmented agentic AI can learn from prior field experiment data to automatically generate superior, domain-specific interventions, transforming one-shot A/B testing in…

cs.LG, cs.AI, cs.CL • Recent • Jun 3, 2026

Reinforcement Learning from Rich Feedback with Distributional DAgger

Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad

This paper proposes a new imitation learning algorithm called DistIL that uses distributional feedback to strengthen policy improvement and regret guarantees.

cs.LG, cs.AI • Recent • May 29, 2026

EchoRL: Reinforcement Learning via Rollout Echoing

Jinhe Bi, Aniri, Minglai Yang, Xingcheng Zhou +8 more

EchoRL proposes a lightweight module to exploit valuable learning signals from advantage-degenerated rollouts in Reinforcement Learning with Verifiable Rewards (RLVR), significantly improving LLM post…

cs.CL, cs.AI • Recent • May 29, 2026

Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

Yanjiang Liu, Jie Lou, Xinyan Guan, Yuqiu Ji +6 more

The paper introduces Lookahead Group Reward to combat Supervision Fidelity Decay (SFD) in on-policy distillation, significantly improving student model performance on long reasoning tasks.

cs.CL • Recent • May 29, 2026

The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li +2 more

The paper introduces SAVE, a framework that uses on-policy feedback and the value function to self-supervise and improve reward models, significantly enhancing RLHF performance across multiple benchma…

cs.AI • Recent • May 30, 2026

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

Jiakang Li, Guanyu Zhu, Can Jin, Chenxi Huang +7 more

The paper introduces Latent Reward Steering (LRS), an adaptive inference-time framework that implicitly improves the reasoning ability of LLMs by guiding the model's internal latent states based on a…

cs.LG, cs.CL • Recent • Jun 2, 2026

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang +9 more

The paper proposes Skill-RM, a unified framework that treats reward modeling as an agentic task to consistently integrate diverse evaluation criteria, achieving superior performance over traditional m…

cs.AI • Recent • May 27, 2026

Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing +1 more

The paper introduces 'reward bias substitution,' demonstrating that single-axis mitigations of reward model biases merely shift optimization pressure to correlated proxies, and proposes augmenting eva…
