Papers similar to 2605.30227

~ similar to 2605.30227· 20 results

cs.AIRecentMay 27, 2026

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

Chusen Li, Zhou Liu, Shuigeng Zhou, Wentao Zhang

TRACER introduces a novel turn-level reinforcement framework that enables cooperative multi-LLM reasoning by separating decision-making into a regret-matching controller and a generation-credit layer.

View →

cs.AIRecentMay 28, 2026

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling

Yuchen Liu, Yingjie Feng, Lixiong Qin, Jiasi Chen +4 more

The paper introduces Graph-Distance Contribution Reward (GDCR) and Step Advantage Policy Optimization (SAPO) to provide fine-grained, step-level credit assignment for agentic search by modeling world…

View →

cs.HCcs.AIRecentMay 27, 2026

Learning to Assign Prediction Tasks to Agents with Capacity Constraints

Shang Wu, Saatvik Kher, Padhraic Smyth

This paper develops a policy-learning framework to optimally assign prediction tasks to multiple agents, considering individual agent expertise and capacity constraints, achieving systematic performan…

View →

cs.LGcs.AIRecentMay 29, 2026

When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

Stephane Hatgis-Kessell, Emma Brunskill

The paper introduces Prompted Policy Optimization (PromptPO), an LLM-based method that successfully optimizes policies for various sequential RL tasks, demonstrating that LLMs can replace classical RL…

View →

cs.LGcs.AIEmpiricalRecentJun 10, 2026

APPO: Agentic Procedural Policy Optimization

Xucong Wang, Ziyu Ma, Yong Wang, Yuxiang Ji +4 more

This paper proposes a new method for agentic Reinforcement Learning called Agentic Procedural Policy Optimization (APPO) that improves tool-use capabilities by assigning credit to fine-grained decisio…

View →

cs.CLcs.AIcs.LGRecentMay 27, 2026

Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text

Tianyang Zhou, Wenbo Chen, Pierre Jinghong Liang, Leman Akoglu

The paper introduces eXTC, a novel framework that combines structured prompt optimization, knowledge distillation, and reinforcement learning to create a highly performant and fully interpretable text…

View →

cs.AIRecentMay 30, 2026

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

Jiakang Li, Guanyu Zhu, Can Jin, Chenxi Huang +7 more

The paper introduces Latent Reward Steering (LRS), an adaptive inference-time framework that implicitly improves the reasoning ability of LLMs by guiding the model's internal latent states based on a…

View →

cs.AIRecentMay 28, 2026

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

Ziyan Liu, Zhezheng Hao, Yeqiu Chen, Hong Wang +6 more

The paper introduces Metacognitive Memory Policy Optimization (MMPO), a novel memory training approach that optimizes LLM memory not based on final task success, but on minimizing epistemic uncertaint…

View →

cs.CLRecentMay 31, 2026

Deep Research as Rubric for Reinforcement Learning

Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai +8 more

The paper proposes Deep Research as Rubric (DR-rubric), a novel evidence-driven framework that treats rubric construction itself as a research problem to generate fine-grained, scalable reward signals…

View →

cs.LGcs.AIEmpiricalRecentJun 4, 2026

RREDCoT: Segment-Level Reward Redistribution for Reasoning Models

Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, Sepp Hochreiter

This paper introduces RREDCoT, a method for approximating optimal reward redistribution in Chain-of-Thought reasoning language models without additional generation.

View →

cs.LGcs.AIEmpiricalRecentJun 4, 2026

RREDCoT: Segment-Level Reward Redistribution for Reasoning Models

Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, Sepp Hochreiter

This paper introduces RREDCoT, a method for approximating optimal reward redistribution in Chain-of-Thought reasoning language models without additional generation.

View →

cs.AIRecentMay 30, 2026

FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search

Md Nakhla Rafi, Md Ahasanuzzaman, Dong Jae Kim, Zhijie Wang +1 more

FALAT is a diagnostic framework that treats failure attribution in complex LLM agent trajectories as a dependency-guided search problem, successfully identifying both the responsible agent and the dec…

View →

cs.AIRecentMay 27, 2026

SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang

SkillC introduces a Contrastive Skill Credit Assignment (CSCA) framework to enable LLM agents to autonomously internalize skills during training, significantly outperforming existing methods without r…

View →

cs.LGcs.AIRecentMay 29, 2026

ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate

Rodney Lafuente-Mercado

The paper introduces ARCA, a novel credit assignment method that measures token salience directly from the adapter's residual hidden state, addressing the degeneracy of standard intrinsic signals when…

View →

cs.AIRecentMay 29, 2026

LinTree: Improving LLM Reasoning with Explicitly Structured Search Histories

Liwei Kang, Yee Whye Teh, Wee Sun Lee

The paper introduces LinTree, a method that explicitly structures the search history of LLM reasoning traces using parent pointers, significantly improving task performance and search efficiency compa…

View →

cs.AIRecentMay 27, 2026

Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning

Yi Wang, Haojie Lu, Zhaofan Zhang, Li Chen +1 more

This paper introduces MCTS-Guided Group Relative Policy Optimization (M-GRPO) to enhance LLM spatial reasoning by improving the decomposition of complex tasks into optimal sub-tasks.

View →

cs.AIRecentMay 27, 2026

Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement

Jyotirmoy Nath, Neeraj Kumar, Brejesh Lall

Prompt Codebooks (PCO) introduces a compositional framework that treats prompt optimization as discrete learning over reusable instruction units, significantly improving LLM performance while drastica…

View →

cs.LGcs.AIRecentMay 30, 2026

The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

Zihan Chen, Yiming Zhang, Wenxiang Geng, Zenghui Ding +1 more

The paper theoretically explains that optimizing LLMs solely on outcomes leads to brittle reasoning (Reward-Induced Manifold Collapse) by favoring low-complexity shortcuts, and proposes process-based…

View →

cs.AIRecentMay 27, 2026

Plan Before Search: Search Agents Need Plan

Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen +6 more

The paper introduces Plan, a structured agentic behavior that decomposes multi-hop questions into ordered sub-questions before retrieval, and proposes a self-bootstrapping paradigm to train it without…

View →

cs.AIRecentMay 27, 2026

Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning

Zhenyu Cui, Xiangzhong Luo

The paper investigates how LLMs allocate their internal computational depth during multi-turn agentic planning, finding that agents progressively recruit deeper layers and shift toward corrective upda…

View →