Similar to 2605.27995 · 20 results
Tong Liu, Cheng Qian, Matej Cief, Yuan He +3 more
This paper analyzes tool-calling in LLM agents, demonstrating that evaluation results are highly sensitive to implementation details and proposing new techniques to significantly improve the efficienc…
Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more
The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…
The paper introduces new benchmarks for complex asynchronous planning and demonstrates that general constraint satisfaction formalizers (like CP-SAT) significantly outperform direct LLM planning or tr…
The paper introduces FORGE, a feedback-driven execution system that improves LLM-based binary analysis by interleaving reasoning and tool interaction, achieving high-quality vulnerability discovery on…
Xuancheng Zhu, Yang Yue, Shuaibing Wan, Zihan Dou +3 more
The paper introduces TaskWeave, a hierarchical agentic framework that successfully simulates long-horizon organizational dynamics by treating coordination as a memory-centered problem, demonstrating t…
The paper proposes Multi-Agent Computer Use (MACU) systems, which significantly improve performance on complex, long-horizon tasks by enabling parallel execution and dynamic task decomposition compare…
Agent libOS introduces a library-OS-inspired runtime substrate that treats LLM agents as schedulable processes, providing explicit capability control and robust auditing for long-running, stateful age…
The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…
Lu Yi, Runlin Lei, Liuyi Yao, Yuexiang Xie +5 more
The paper introduces Adaptive Context Management (AdaCoM), an external context manager that uses reinforcement learning to improve the performance of frozen LLM agents on long-horizon tasks by intelli…
Garvin Guo, Donglei Yu, Yu Chen, Xiang Wang +5 more
The paper argues that observed gains in multimodal agents using tools may be due to learning tool-calling patterns rather than genuine capability expansion, finding that tool access provides little co…
The paper introduces MAVEN, a lightweight symbolic reasoning scaffold that significantly improves the generalization and end-to-end success rate of large language models in complex, multi-step tool-ca…
The paper introduces Momento, a new benchmark that evaluates agentic AI's ability to maintain state and reason across multiple, disconnected sessions, revealing that current agents struggle with integ…
Yunqi Liu, Tong Niu, Zitong Wang, Zhenlong Dai +3 more
The paper introduces EgoBench, the first interactive multimodal benchmark designed to jointly evaluate advanced AI agents' capabilities in visual perception, multi-hop reasoning, and dynamic tool usag…
Wenhang Shi, Jinhao Dong, Yiren Chen, Zhe Zhao +3 more
The paper introduces Grounded Agentic Interaction Synthesis (GAIS), a framework that generates high-quality, diverse, and complex agentic training data by anchoring tasks to real-world protocols, sign…
The paper introduces a self-healing agentic orchestrator that significantly improves the reliability of tool-augmented LLM systems by treating failure as a bounded runtime control problem, achieving h…
Aditya Kumar, Zhihan Lei, Jerry Yan, Joshua W. Momo +5 more
The paper proposes a modular agent framework and novel learning methods to design and optimize practical, cost-effective, and controllable LLM-based agentic systems.
Chishui Chen, Jiaye Lin, Te Sun, Junxi Wang +5 more
SelSkill introduces a dual-granularity preference learning framework that treats skill use as a 'skill-or-skip' decision, significantly improving agent performance and execution precision in complex a…
The paper introduces AGENTCL, a rigorous evaluation framework that uses controlled task streams to accurately measure an agent's ability to accumulate and reuse knowledge across multiple tasks, thereb…
Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu +1 more
The paper introduces BenchTrace, a novel benchmark designed to rigorously evaluate the self-evolution and reflection capabilities of LLM agents, revealing that current models struggle with accurate fa…
Tool Forge is a validation-carrying toolchain that converts natural language capability intent into governed, sandbox-verified tool artifacts, significantly improving agent efficiency and reliability.