~ similar to 2606.01912· 20 results
Yi Gu, Huacan Wang, Shuo Zhang, Yuqing Hou +9 more
The paper introduces HomeFlow, a verifiable data flywheel that procedurally generates high-quality, multi-turn training data for smart home agents, achieving state-of-the-art performance on smart home…
Haowei Han, Kexin Hu, Weiwei Cai, Debiao Zhang +5 more
The paper introduces MiCU, a domain-specific LLM that significantly improves smart home command understanding, especially for ambiguous commands, by synthesizing training data and optimizing the model…
Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang +8 more
The paper introduces MCP-Persona, a novel benchmark designed to evaluate LLM agents' performance on real-world, personalized applications using the Model Context Protocol (MCP), revealing that current…
The paper introduces CRAB-Bench and RUSE, a rigorous evaluation framework that tests LLM agents on complex, interdependent tasks with realistic human user interactions, revealing significant performan…
Yunqi Liu, Tong Niu, Zitong Wang, Zhenlong Dai +3 more
The paper introduces EgoBench, the first interactive multimodal benchmark designed to jointly evaluate advanced AI agents' capabilities in visual perception, multi-hop reasoning, and dynamic tool usag…
Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more
The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…
Yansong Ning, Mianpeng Liu, Jingwen Ye, Weidong Zhang +1 more
The paper introduces HRBench, a unified and comprehensive evaluation framework for systematically benchmarking and comparing various thinking-mode switching strategies in hybrid-reasoning LLMs.
Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo +7 more
The paper introduces a unified framework to fairly evaluate LLM agentic capabilities by standardizing diverse benchmarks and separating the effects of the LLM model from the surrounding framework and…
Han Zhang, Zihao Tang, Xin Yu, Xiao Liu +7 more
The paper introduces RHELM, a new benchmark designed to test LLMs' long-term memory by simulating realistic, complex, and evolving dialogues that integrate multiple heterogeneous data sources.
Zhengyang Tang, Yuxuan Liu, Xin Lai, Junyi Li +20 more
The paper introduces PhoneWorld, a scalable pipeline that automatically converts real-world GUI trajectories and screenshots into controllable, reproducible phone-use environments, significantly impro…
Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao +4 more
The paper introduces a multilingual benchmark (MentalMap) to test if LLMs build internal spatial world models from text, finding a universal 'L3 reasoning cliff' suggesting that text-only working memo…
Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more
The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…
Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more
The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…
The paper introduces HIDBench, a new benchmark for evaluating LLMs' ability to perform host-based intrusion detection using complex, noisy system logs, finding that model performance degrades signific…
This paper systematically analyzes the complex design space of hybrid multi-agent systems combining on-device and cloud AI models, finding that the optimal architecture is highly task-dependent and th…
The paper challenges the conclusion that LLMs lack reasoning by demonstrating that reported performance drops on GSM-Symbolic are often statistically weak and partially attributable to dataset biases,…
The paper introduces CosmicFish-HRM, a compact language model that achieves adaptive reasoning by dynamically allocating computational effort through a Hierarchical Reasoning Module (HRM), showing tha…
The paper introduces VibeSearchBench, a new benchmark designed to evaluate long-horizon, proactive search capabilities, demonstrating that current state-of-the-art LLM agents are still significantly i…
The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…
Qi Hu, Yifeng Tang, Qinghua Wang, Lanyang Zhao +6 more
The paper introduces SABER, a new benchmark that evaluates the operational safety of LLM coding agents in complex, stateful project environments, finding that current models have a high rate of harmfu…