Papers similar to 2605.30434

~ similar to 2605.30434· 20 results

cs.CLcs.AIRecentMay 31, 2026

TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li +6 more

The paper introduces TimeSage-MT, a comprehensive multi-turn benchmark designed to rigorously test an LLM agent's ability to perform complex, evolving time series analysis, revealing critical gaps in…

View →

cs.AIRecentMay 31, 2026

Can LLM Agents Sustain Long-Horizon Organizational Dynamics?

Xuancheng Zhu, Yang Yue, Shuaibing Wan, Zihan Dou +3 more

The paper introduces TaskWeave, a hierarchical agentic framework that successfully simulates long-horizon organizational dynamics by treating coordination as a memory-centered problem, demonstrating t…

View →

cs.AIcs.LGRecentMay 30, 2026

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

Yifan Bao, Xinyu Xi, Xinyu Liu, Wen Ge +7 more

MOSAIC introduces a structured agentic framework that treats automated data science as a staged, context-grounded model selection problem, improving performance and traceability over traditional AutoM…

View →

cs.AIcs.CLRecentJun 1, 2026

AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao +2 more

The paper introduces AGENTCL, a rigorous evaluation framework that uses controlled task streams to accurately measure an agent's ability to accumulate and reuse knowledge across multiple tasks, thereb…

View →

cs.CLRecentJun 1, 2026

Scaling Agentic Capabilities via Grounded Interaction Synthesis

Wenhang Shi, Jinhao Dong, Yiren Chen, Zhe Zhao +3 more

The paper introduces Grounded Agentic Interaction Synthesis (GAIS), a framework that generates high-quality, diverse, and complex agentic training data by anchoring tasks to real-world protocols, sign…

View →

cs.CLRecentMay 30, 2026

Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations

Adril Putra Merin, David Anugraha, Ayu Purwarianti, Genta Indra Winata

The paper introduces Momento, a new benchmark that evaluates agentic AI's ability to maintain state and reason across multiple, disconnected sessions, revealing that current agents struggle with integ…

View →

cs.AIRecentMay 31, 2026

"Skill issues'': data-centric optimization of lakehouse agents

Nicole Rose Schneider, Davide Ghilardi, Giacomo Piccinini, Jacopo Tagliabue

The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…

View →

cs.AIRecentMay 29, 2026

TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories

Junjie Nian, Kang Chen, Ge Zhang, Yixin Cao +1 more

TraceGraph introduces a graph-based framework to map agent decision-making across pooled trajectories, revealing hidden differences in agent behavior and improving performance by targeting known failu…

View →

cs.AIRecentMay 28, 2026

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu +1 more

The paper introduces BenchTrace, a novel benchmark designed to rigorously evaluate the self-evolution and reflection capabilities of LLM agents, revealing that current models struggle with accurate fa…

View →

cs.AIRecentMay 27, 2026

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more

The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…

View →

cs.AIRecentMay 31, 2026

The Case for Model Science: Verify, Explore, Steer, Refine

Przemyslaw Biecek, Luca Longo, Jianlong Zhou, Thomas Fel +2 more

The paper advocates for the establishment of Model Science, a systematic discipline that moves beyond simple benchmarking to deeply analyze AI models' internal workings and failure modes.

View →

cs.CLcs.AIcs.LGRecentMay 29, 2026

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

LongTraceRL addresses long-context reasoning challenges by generating highly challenging training data and introducing a fine-grained rubric reward, significantly improving evidence-grounded reasoning…

View →

cs.AIRecentMay 27, 2026

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

Kou Shi, Ziao Zhang, Shiting Huang, Avery Nie +6 more

The paper introduces AsyncTool, a new benchmark designed to evaluate LLM agents' ability to handle multiple, concurrent tasks with delayed tool feedback, demonstrating that asynchronous coordination i…

View →

cs.MAcs.CLcs.LGRecentJun 1, 2026

Multi-Agent Computer Use

Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried

The paper proposes Multi-Agent Computer Use (MACU) systems, which significantly improve performance on complex, long-horizon tasks by enabling parallel execution and dynamic task decomposition compare…

View →

cs.CRcs.LGRecentApr 24, 2026

Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems

Jun He, Deying Yu

The paper introduces Sovereign Agentic Loops (SAL), a control-plane architecture that decouples LLM reasoning from system execution to enhance safety and reliability in real-world AI agents.

View →

cs.SEcs.AIcs.CLRecentMay 29, 2026

BlueFin: Benchmarking LLM Agents on Financial Spreadsheets

Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more

The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…

View →

cs.AIcs.CLRecentMay 28, 2026

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

Lorenz Kutschka, Bernhard Geiger

This study benchmarks token-optimized formats (TOON and TRON) against JSON in end-to-end agentic AI systems, finding that TRON significantly reduces token overhead with minimal performance degradation…

View →

cs.AIRecentMay 27, 2026

Verifiable Benchmarking of Long-Horizon Spatial Biology

Ian Diks, Harihara Muralidharan, Tim Proctor, Kenny Workman

The paper introduces SpatialBench-Long, a comprehensive benchmark designed to test AI agents' ability to perform end-to-end scientific reasoning and derive biological claims from complex, raw spatial…

View →

cs.AIRecentMay 28, 2026

Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories

Minyang Hu, Bo Yang, Zhinuo Zhou, Jiachen Liang +3 more

The paper introduces RedundancyBench, a new benchmark for detecting unnecessary steps in LLM agent trajectories, finding that this task is highly complex and difficult to solve.

View →

cs.CLRecentMay 30, 2026

Learning to Retrieve: Dual-Level Long-Term Memory for Text-to-SQL Agents

Yibo Wang, Nikki Lijing Kuang, Philip S. Yu, Zhewei Yao +1 more

The paper proposes MERIT, a dual-level, multi-horizon memory retrieval framework that significantly improves the performance of interactive text-to-SQL agents by providing both global and local memory…

View →