Papers similar to 2605.27995

~ similar to 2605.27995· 20 results

cs.LGcs.AIRecentMay 28, 2026

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

Tong Liu, Cheng Qian, Matej Cief, Yuan He +3 more

This paper analyzes tool-calling in LLM agents, demonstrating that evaluation results are highly sensitive to implementation details and proposing new techniques to significantly improve the efficienc…

View →

cs.AIRecentMay 27, 2026

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more

The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…

View →

cs.CLRecentMay 31, 2026

Robust Asynchronous Planning via Auto-Formalization

Jiayi Zhang, Jianing Yin, Ben Zhou, Li Zhang

The paper introduces new benchmarks for complex asynchronous planning and demonstrates that general constraint satisfaction formalizers (like CP-SAT) significantly outperform direct LLM planning or tr…

View →

cs.CRRecentApr 16, 2026

Feedback-Driven Execution for LLM-Based Binary Analysis

XiangRui Zhang, Qiang Li, Haining Wang

The paper introduces FORGE, a feedback-driven execution system that improves LLM-based binary analysis by interleaving reasoning and tool interaction, achieving high-quality vulnerability discovery on…

View →

cs.AIRecentMay 31, 2026

Can LLM Agents Sustain Long-Horizon Organizational Dynamics?

Xuancheng Zhu, Yang Yue, Shuaibing Wan, Zihan Dou +3 more

The paper introduces TaskWeave, a hierarchical agentic framework that successfully simulates long-horizon organizational dynamics by treating coordination as a memory-centered problem, demonstrating t…

View →

cs.MAcs.CLcs.LGRecentJun 1, 2026

Multi-Agent Computer Use

Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried

The paper proposes Multi-Agent Computer Use (MACU) systems, which significantly improve performance on complex, long-horizon tasks by enabling parallel execution and dynamic task decomposition compare…

View →

cs.OScs.AIcs.CRRecentJun 2, 2026

Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents

Yingqi Zhang

Agent libOS introduces a library-OS-inspired runtime substrate that treats LLM agents as schedulable processes, providing explicit capability control and robust auditing for long-running, stateful age…

View →

cs.AIRecentMay 31, 2026

"Skill issues'': data-centric optimization of lakehouse agents

Nicole Rose Schneider, Davide Ghilardi, Giacomo Piccinini, Jacopo Tagliabue

The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…

View →

cs.AIRecentMay 29, 2026

Learning Agent-Compatible Context Management for Long-Horizon Tasks

Lu Yi, Runlin Lei, Liuyi Yao, Yuexiang Xie +5 more

The paper introduces Adaptive Context Management (AdaCoM), an external context manager that uses reinforcement learning to improve the performance of frozen LLM agents on long-horizon tasks by intelli…

View →

cs.CVcs.AIRecentJun 1, 2026

Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

Garvin Guo, Donglei Yu, Yu Chen, Xiang Wang +5 more

The paper argues that observed gains in multimodal agents using tools may be due to learning tool-calling patterns rather than genuine capability expansion, finding that tool access provides little co…

View →

cs.AIRecentMay 29, 2026

MAVEN: Improving Generalization in Agentic Tool Calling

Omkar Ghugarkar, Vishvesh Bhat, Muhammad Ahmed Mohsin, Asad Aali

The paper introduces MAVEN, a lightweight symbolic reasoning scaffold that significantly improves the generalization and end-to-end success rate of large language models in complex, multi-step tool-ca…

View →

cs.CLRecentMay 30, 2026

Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations

Adril Putra Merin, David Anugraha, Ayu Purwarianti, Genta Indra Winata

The paper introduces Momento, a new benchmark that evaluates agentic AI's ability to maintain state and reason across multiple, disconnected sessions, revealing that current agents struggle with integ…

View →

cs.AIRecentMay 27, 2026

EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

Yunqi Liu, Tong Niu, Zitong Wang, Zhenlong Dai +3 more

The paper introduces EgoBench, the first interactive multimodal benchmark designed to jointly evaluate advanced AI agents' capabilities in visual perception, multi-hop reasoning, and dynamic tool usag…

View →

cs.CLRecentJun 1, 2026

Scaling Agentic Capabilities via Grounded Interaction Synthesis

Wenhang Shi, Jinhao Dong, Yiren Chen, Zhe Zhao +3 more

The paper introduces Grounded Agentic Interaction Synthesis (GAIS), a framework that generates high-quality, diverse, and complex agentic training data by anchoring tasks to real-world protocols, sign…

View →

cs.AIRecentMay 31, 2026

Self-Healing Agentic Orchestrators for Reliable Tool-Augmented Large Language Model Systems

Rahul Suresh Babu, Adarsh Agrawal

The paper introduces a self-healing agentic orchestrator that significantly improves the reliability of tool-augmented LLM systems by treating failure as a bounded runtime control problem, achieving h…

View →

cs.LGcs.AIRecentMay 29, 2026

Learning to Construct Practical Agentic Systems

Aditya Kumar, Zhihan Lei, Jerry Yan, Joshua W. Momo +5 more

The paper proposes a modular agent framework and novel learning methods to design and optimize practical, cost-effective, and controllable LLM-based agentic systems.

View →

cs.CLcs.AIRecentMay 30, 2026

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

Chishui Chen, Jiaye Lin, Te Sun, Junxi Wang +5 more

SelSkill introduces a dual-granularity preference learning framework that treats skill use as a 'skill-or-skip' decision, significantly improving agent performance and execution precision in complex a…

View →

cs.AIcs.CLRecentJun 1, 2026

AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao +2 more

The paper introduces AGENTCL, a rigorous evaluation framework that uses controlled task streams to accurately measure an agent's ability to accumulate and reuse knowledge across multiple tasks, thereb…

View →

cs.AIRecentMay 28, 2026

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu +1 more

The paper introduces BenchTrace, a novel benchmark designed to rigorously evaluate the self-evolution and reflection capabilities of LLM agents, revealing that current models struggle with accurate fa…

View →

cs.SEcs.AIRecentMay 27, 2026

Tool Forge: A Validation-Carrying Toolchain for Governed Agentic Execution

Swanand Rao

Tool Forge is a validation-carrying toolchain that converts natural language capability intent into governed, sandbox-verified tool artifacts, significantly improving agent efficiency and reliability.

View →