~ similar to 2605.28258· 20 results
Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei +9 more
The paper introduces MMG2Skill, a closed-loop framework that converts noisy, human-oriented web guides into editable, executable skills, significantly improving agent performance across diverse tasks.
This paper proposes a multi-agent framework using LLMs to improve collaborative story generation, demonstrating that an iterative Writer-Editor process significantly enhances narrative quality for you…
The study found that while multi-agent LLM code generation architectures significantly affect code complexity, the added complexity does not translate into better functional correctness, suggesting ar…
Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong +4 more
The paper introduces 3DCodeBench, a systematic benchmark and platform for evaluating Vision-Language Model (VLM) agents' ability to generate procedural 3D models from text and images using code.
The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…
Dongdong Hua, Yifei Sun, Renhong Huang, Feng Gao +2 more
The paper introduces PTCG-Bench, a new benchmark using the Pokémon TCG to evaluate LLM agents' strategic decision-making and ability to self-evolve, finding that sustained self-evolution remains chall…
The paper introduces AGENTCL, a rigorous evaluation framework that uses controlled task streams to accurately measure an agent's ability to accumulate and reuse knowledge across multiple tasks, thereb…
Xujun Li, Kehan Zheng, Mingyuan Zhao, Yize Geng +6 more
The paper proposes HiSME, a lightweight hierarchical skill meta-evolving solution that jointly optimizes skills and the skill evolving strategy by learning meta-skills from task execution traces, lead…
Xiaoyi Chen, Yifei Gao, Yang Xu, Xingxing Song +2 more
The paper introduces GUITestScape, a comprehensive benchmark for exploratory GUI testing, and GUIJudge, an open-set evaluator that significantly improves the assessment of AI agents' defect detection…
Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu +1 more
The paper introduces BenchTrace, a novel benchmark designed to rigorously evaluate the self-evolution and reflection capabilities of LLM agents, revealing that current models struggle with accurate fa…
Tianjie Ju, Yueqing Sun, Zheng Wu, Wei Zhang +6 more
The paper introduces MineExplorer, a new benchmark in Minecraft, to evaluate the sustained open-world exploration capabilities of MLLM agents, finding that long-horizon coordination remains a signific…
Yuxuan Liu, Zhaochen Su, Lingyun Xie, Yuhao Zhang +10 more
SkillRevise is an execution-grounded framework that iteratively refines initial, imperfect LLM agent skills by diagnosing defects from execution evidence and applying empirically validated edits, sign…
Haoyue Yang, Zhangxiao Shen, Fan Ding, Hangting Lou +7 more
The paper introduces Cookie-Bench, a novel, autonomous, and reference-free evaluation framework that significantly improves the assessment of interactive web generation capabilities for frontier LLMs.
The paper introduces GTA, a scalable framework for generating realistic, multi-hop web-agent tasks with dense, executable trajectories, addressing the current lack of process-level supervision in web…
The paper introduces 'layered mutability,' a framework for analyzing how persistent self-modifying AI agents drift away from intended behavior due to the accumulation of locally reasonable, uncoordina…
Yanchao Li, Wanhao Liu, Ben Gao, Jiaqing Xie +4 more
SkillsInjector proposes a two-stage adaptive method to dynamically optimize skill selection, quantity, and presentation for LLM agents, significantly improving task performance over static injection m…
Kevin Eykholt, Dhilung Kirat, Xiaokui Shu, Jiyong Jang +2 more
The paper reports on penetration tests conducted on proprietary, large-scale AI agent systems, finding that security vulnerabilities persist despite stricter development standards.
Tianyi Zhou, Dongrui Liu, Leitao Yuan, Jing Shao +1 more
COLLEAGUE.SKILL introduces an automated system that distills heterogeneous traces of human expertise and role-specific knowledge into portable, inspectable, and usable AI skill packages.
The paper introduces I-WebGenBench, a framework and benchmark that converts static scientific papers into executable, interactive web systems, allowing users to dynamically explore the paper's mechani…