~ similar to 2606.00189· 20 results
The paper argues that current 'on-the-fly' AI agent design lacks necessary software engineering rigor and proposes an 'AI Workflow Store' to provide hardened, reusable, and reliable agent workflows.
Zhezheng Hao, Tianfu Wang, Huanshuo Dong, Ziyan Liu +6 more
The paper proposes Meta-Team, an experience-driven framework that enables multi-agent systems (MAS) to collaboratively self-evolve by transforming complex execution experiences into reusable improveme…
Tong Liu, Cheng Qian, Matej Cief, Yuan He +3 more
This paper analyzes tool-calling in LLM agents, demonstrating that evaluation results are highly sensitive to implementation details and proposing new techniques to significantly improve the efficienc…
Wenhang Shi, Jinhao Dong, Yiren Chen, Zhe Zhao +3 more
The paper introduces Grounded Agentic Interaction Synthesis (GAIS), a framework that generates high-quality, diverse, and complex agentic training data by anchoring tasks to real-world protocols, sign…
MOSAIC introduces a structured agentic framework that treats automated data science as a staged, context-grounded model selection problem, improving performance and traceability over traditional AutoM…
Mingju Chen, Can Lv, Guibin Zhang, Heng Chang +1 more
HarnessForge introduces a meta-adaptive framework that jointly evolves the execution structure (harness) and the reasoning policy of LLM agents, significantly improving overall system performance acro…
This paper investigates the scaling behavior of homogeneous LLM-driven Multi-Agent Systems (MAS) and finds that performance exhibits diminishing returns due to coordination overhead, rather than scali…
Yujie Luo, Xiangyuan Ru, Jingsheng Zheng, Jingjing Wang +9 more
The paper introduces Autonomous Agentic Data Engineering, demonstrating that LLMs can autonomously plan and optimize end-to-end data curation pipelines, leading to substantial performance gains in spe…
The paper introduces MAVEN, a lightweight symbolic reasoning scaffold that significantly improves the generalization and end-to-end success rate of large language models in complex, multi-step tool-ca…
Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more
The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…
Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo +7 more
The paper introduces a unified framework to fairly evaluate LLM agentic capabilities by standardizing diverse benchmarks and separating the effects of the LLM model from the surrounding framework and…
Ruiyin Li, Yiran Zhang, Xiyu Zhou, Yangxiao Cai +5 more
The paper introduces MAAD, a multi-agent framework that autonomously transforms software requirements into comprehensive, multi-view architectural blueprints, significantly improving completeness and…
Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao +10 more
MLEvolve is a novel self-evolving multi-agent framework that enables LLM agents to discover and optimize machine learning algorithms for complex, long-horizon tasks.
Xucong Wang, Ziyu Ma, Yong Wang, Yuxiang Ji +4 more
This paper proposes a new method for agentic Reinforcement Learning called Agentic Procedural Policy Optimization (APPO) that improves tool-use capabilities by assigning credit to fine-grained decisio…
This paper systematically analyzes the complex design space of hybrid multi-agent systems combining on-device and cloud AI models, finding that the optimal architecture is highly task-dependent and th…
The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…
The paper experimentally evaluates 12 multi-agent LLM collaboration topologies for software design, finding that structural adversarial prompting and cross-model review are the most effective approach…
Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao +4 more
This paper presents EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery.
The paper introduces CRAB-Bench and RUSE, a rigorous evaluation framework that tests LLM agents on complex, interdependent tasks with realistic human user interactions, revealing significant performan…
Yuxuan Liu, Zhaochen Su, Lingyun Xie, Yuhao Zhang +10 more
SkillRevise is an execution-grounded framework that iteratively refines initial, imperfect LLM agent skills by diagnosing defects from execution evidence and applying empirically validated edits, sign…