~ similar to 2606.12384· 20 results
Liuji Chen, Dianxing Tang, Xing Shi, Dingshuo Chen +3 more
The paper proposes EAPO, a framework that enables agentic models to learn when to forgo using external tools, thereby mitigating tool abuse while maintaining high reasoning accuracy.
Tong Liu, Cheng Qian, Matej Cief, Yuan He +3 more
This paper analyzes tool-calling in LLM agents, demonstrating that evaluation results are highly sensitive to implementation details and proposing new techniques to significantly improve the efficienc…
Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more
The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…
Wenwu Li, Yuran Song, Mingze Zhao, Bo Jin +1 more
The paper proposes a novel temporal and structural credit assignment framework to efficiently optimize multi-agent LLM systems by decomposing the error signal and using targeted, discrete gradient upd…
Aditya Kumar, Zhihan Lei, Jerry Yan, Joshua W. Momo +5 more
The paper proposes a modular agent framework and novel learning methods to design and optimize practical, cost-effective, and controllable LLM-based agentic systems.
MOSAIC introduces a structured agentic framework that treats automated data science as a staged, context-grounded model selection problem, improving performance and traceability over traditional AutoM…
The paper proposes Hysteretic Policy Optimization (HPO) and its adaptive variant (A-HPO) to stabilize reinforcement learning training in sparse-reward environments by better balancing positive and neg…
The paper introduces an outer-loop AI agent that autonomously redesigns LLM policy-synthesis pipelines for multi-agent social dilemmas, demonstrating that the optimal pipeline structure depends critic…
Yifei He, Rui Yang, Hao Bai, Tong Zhang +1 more
PRO-CUA introduces a process-reward optimization framework that enables efficient, step-level reinforcement learning for training computer use agents by decoupling environment interaction from policy…
Chishui Chen, Jiaye Lin, Te Sun, Junxi Wang +5 more
SelSkill introduces a dual-granularity preference learning framework that treats skill use as a 'skill-or-skip' decision, significantly improving agent performance and execution precision in complex a…
The paper introduces Prompted Policy Optimization (PromptPO), an LLM-based method that successfully optimizes policies for various sequential RL tasks, demonstrating that LLMs can replace classical RL…
Shenghao Ye, Yu Guo, Zhengheng Li, Shuangwu Chen +1 more
The paper proposes RoRo, a rubric-guided process reward framework that improves stepwise model routing by evaluating the quality of intermediate reasoning steps, leading to better performance and cost…
Zelin He, Haotian Lin, Boran Han, Wei Zhu +5 more
ReSkill is an RL-in-the-loop framework that reconciles skill creation and policy optimization by automatically creating, testing, and refining modular skills alongside the agent's policy learning, lea…
Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai +8 more
The paper proposes Deep Research as Rubric (DR-rubric), a novel evidence-driven framework that treats rubric construction itself as a research problem to generate fine-grained, scalable reward signals…
The paper introduces a learned 'rerooter' mechanism to improve subgoal-based policy tree search, allowing scalable search in complex environments without the overhead of explicit subgoal generation.
Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu +7 more
ExpGraph is a model-agnostic framework that uses a self-evolving experience graph to enable LLM agents to reuse past successful strategies and failure lessons, significantly improving performance acro…
Prompt Codebooks (PCO) introduces a compositional framework that treats prompt optimization as discrete learning over reusable instruction units, significantly improving LLM performance while drastica…
Dongdong Hua, Yifei Sun, Renhong Huang, Feng Gao +2 more
The paper introduces PTCG-Bench, a new benchmark using the Pokémon TCG to evaluate LLM agents' strategic decision-making and ability to self-evolve, finding that sustained self-evolution remains chall…
Mingju Chen, Can Lv, Guibin Zhang, Heng Chang +1 more
HarnessForge introduces a meta-adaptive framework that jointly evolves the execution structure (harness) and the reasoning policy of LLM agents, significantly improving overall system performance acro…
The paper introduces Calibrated Collective Oversight (CCO), a novel framework that uses aggregated auxiliary scoring functions and Conformal Decision Theory to provide statistically guaranteed, scalab…