Similar to 2605.28158 · 20 results
The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…
Haoyang Liu, Jie Wang, Boxuan Niu, Xiongwei Han +7 more
The paper introduces Opt-Verifier, a novel LLM-based framework that significantly improves the accuracy of automated optimization model generation by implementing dual-side verification from both stru…
Wentao Hu, Zhendong Chu, Yiming Zhang, Junda Wu +5 more
The paper introduces SkillBrew, a multi-objective framework that treats skill bank curation as a constrained optimization problem to build efficient and well-curated skill repositories for LLM agents.
Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li +8 more
The paper introduces Harness-Bench, a diagnostic benchmark that measures how different system 'harnesses' affect LLM agent performance in realistic workflows, showing that agent capability must be rep…
Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more
The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…
MOSAIC introduces a structured agentic framework that treats automated data science as a staged, context-grounded model selection problem, improving performance and traceability over traditional AutoM…
Minyang Hu, Bo Yang, Zhinuo Zhou, Jiachen Liang +3 more
The paper introduces RedundancyBench, a new benchmark for detecting unnecessary steps in LLM agent trajectories, finding that this detection task is difficult to solve.
RACE-Sched is an asynchronous agentic framework that successfully integrates low-latency, real-time scheduling decisions with advanced, long-horizon reasoning provided by Large Language Models.
The BEAMS initiative establishes comprehensive benchmarks and evaluates AI tools for modeling and simulation, finding that current AI tools excel at qualitative discussion tasks but struggle with comp…
This paper introduces the first LLM-generated, domain-independent heuristics for symbolic AI planning, using evolutionary search to surpass the performance of hand-engineered state-of-the-art methods.
Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more
The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…
This paper empirically demonstrates that the choice of plan representation (e.g., checklist vs. narrative) significantly impacts the robustness and success rate of LLM-based web agents.
The paper evaluates dynamic coordination strategy selection for enterprise multi-agent systems, finding that a calibrated default routing approach is effective, even if a deterministic winner-selectio…
Aditya Kumar, Zhihan Lei, Jerry Yan, Joshua W. Momo +5 more
The paper proposes a modular agent framework and novel learning methods to design and optimize practical, cost-effective, and controllable LLM-based agentic systems.
Qi Hu, Yifeng Tang, Qinghua Wang, Lanyang Zhao +6 more
The paper introduces SABER, a new benchmark that evaluates the operational safety of LLM coding agents in complex, stateful project environments, finding that current models have a high rate of harmfu…
The paper introduces CRAB-Bench and RUSE, a rigorous evaluation framework that tests LLM agents on complex, interdependent tasks with realistic human user interactions, revealing significant performan…
Xuancheng Zhu, Yang Yue, Shuaibing Wan, Zihan Dou +3 more
The paper introduces TaskWeave, a hierarchical agentic framework that successfully simulates long-horizon organizational dynamics by treating coordination as a memory-centered problem, demonstrating t…
The paper introduces Hyperparam, a set of lightweight JavaScript libraries designed to enable direct, model-aware querying of unstructured data (such as agent traces) within client-side AI applications.
Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more
The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…
Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo +7 more
The paper introduces a unified framework to fairly evaluate LLM agentic capabilities by standardizing diverse benchmarks and separating the effects of the LLM from the surrounding framework and…