Papers similar to 2606.01046

~ similar to 2606.01046· 20 results

cs.AIRecentMay 27, 2026

VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora

Yuting Xu, Jiayi Tian, Jian Liang, Xin Xiong +3 more

The paper introduces VeriTrip, a new verifiable benchmark that evaluates travel planning agents' ability to perform evidence-grounded reasoning over complex, unstructured, and multimodal web data, rev…

View →

cs.CLcs.AIcs.LGRecentMay 28, 2026

Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

Alejandra Zambrano, Sara Vera Marjanovic, Imene Kerboua, Xing Han Lù +1 more

This paper empirically demonstrates that the choice of plan representation (e.g., checklist vs. narrative) significantly impacts the robustness and success rate of LLM-based web agents.

View →

cs.LGcs.AIRecentJun 1, 2026

CityTrajBench: A Unified Benchmark for City-Scale Vehicle Trajectory Generation

Shibo Zhu, Xiaodan Shi, Dayin Chen, Yuntian Chen +3 more

The paper introduces CityTrajBench, a unified benchmark framework that standardizes the evaluation of city-scale vehicle trajectory generation, demonstrating that no single generation model dominates…

View →

cs.CLcs.AIRecentMay 31, 2026

TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li +6 more

The paper introduces TimeSage-MT, a comprehensive multi-turn benchmark designed to rigorously test an LLM agent's ability to perform complex, evolving time series analysis, revealing critical gaps in…

View →

cs.CLRecentJun 1, 2026

CARTE: A Benchmark for Mapping Language Model Knowledge Across France

Sarah Almeida Carneiro, Christos Xypolopoulos, Xiao Fei, Yang Zhang +1 more

The paper introduces CARTE, a new benchmark designed to test how well large language models understand fine-grained, regionally differentiated knowledge across the 13 metropolitan regions of France, r…

View →

cs.CLRecentJun 1, 2026

CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation

Danqing Wang, Akshay Sivaraman, Lei Li

The paper introduces CRAB-Bench and RUSE, a rigorous evaluation framework that tests LLM agents on complex, interdependent tasks with realistic human user interactions, revealing significant performan…

View →

cs.AIRecentMay 30, 2026

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more

The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…

View →

cs.AIRecentMay 27, 2026

Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao +4 more

The paper introduces a multilingual benchmark (MentalMap) to test if LLMs build internal spatial world models from text, finding a universal 'L3 reasoning cliff' suggesting that text-only working memo…

View →

cs.AIRecentMay 28, 2026

GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation

Yifan Liu, Yanling Sang, Xishun Liao, Morgan Sun +5 more

The paper proposes a novel four-stage simulation framework that uses GPS-derived seasonal spatial priors and LLMs to generate demographically accurate, synthetic tourist mobility schedules for urban p…

View →

cs.AIRecentMay 31, 2026

Large Language Models in Transportation Systems Management and Operations: From Text Reasoning to Multi-modal Decision Support

Siyan Li, Zehao Wang, Jiachen Li, Kanok Boriboonsomsin +2 more

This survey reviews how Large and Multi-modal Language Models (LLMs/MM-LLMs) are being applied to integrate diverse data sources for enhanced decision support in transportation systems management and…

View →

cs.AIRecentJun 1, 2026

Bridging the Last Mile of Time Series Forecasting with LLM Agents

Yuhua Liao, Zetian Wang, Qiangqiang Nie, Zhenhua Zhang

The paper introduces an LLM-agent framework to solve the 'last-mile forecasting' problem, bridging the gap between raw statistical predictions and business-ready forecasts by incorporating weakly stru…

View →

cs.CLcs.AIRecentMay 29, 2026

The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning

Xudong Zhang, Jian Yang, Shengkai Wang, Jiangpeng Tian +4 more

The paper proposes a dual-interventional framework to characterize how linguistic structures and contextual cues influence LLMs' spatial reasoning for navigation, finding that topological information…

View →

cs.AIRecentMay 27, 2026

Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning

Yi Wang, Haojie Lu, Zhaofan Zhang, Li Chen +1 more

This paper introduces MCTS-Guided Group Relative Policy Optimization (M-GRPO) to enhance LLM spatial reasoning by improving the decomposition of complex tasks into optimal sub-tasks.

View →

cs.AIRecentMay 31, 2026

Can LLM Agents Sustain Long-Horizon Organizational Dynamics?

Xuancheng Zhu, Yang Yue, Shuaibing Wan, Zihan Dou +3 more

The paper introduces TaskWeave, a hierarchical agentic framework that successfully simulates long-horizon organizational dynamics by treating coordination as a memory-centered problem, demonstrating t…

View →

cs.CLcs.IRRecentMay 29, 2026

Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory

Han Zhang, Zihao Tang, Xin Yu, Xiao Liu +7 more

The paper introduces RHELM, a new benchmark designed to test LLMs' long-term memory by simulating realistic, complex, and evolving dialogues that integrate multiple heterogeneous data sources.

View →

cs.AIRecentMay 27, 2026

A Unified Framework for the Evaluation of LLM Agentic Capabilities

Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo +7 more

The paper introduces a unified framework to fairly evaluate LLM agentic capabilities by standardizing diverse benchmarks and separating the effects of the LLM model from the surrounding framework and…

View →

cs.CLcs.AIRecentMay 28, 2026

Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment

Ruoxi Su, Yuhan Liu, Jingyu Hu

The paper introduces an adaptive interview framework to gather rich persona context, demonstrating that LLMs improve decision alignment in moral dilemmas only when they selectively ground their decisi…

View →

cs.AIcs.CLRecentJun 1, 2026

AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao +2 more

The paper introduces AGENTCL, a rigorous evaluation framework that uses controlled task streams to accurately measure an agent's ability to accumulate and reuse knowledge across multiple tasks, thereb…

View →

cs.AIRecentJun 1, 2026

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang +8 more

The paper introduces MCP-Persona, a novel benchmark designed to evaluate LLM agents' performance on real-world, personalized applications using the Model Context Protocol (MCP), revealing that current…

View →

cs.CLcs.AIRecentMay 27, 2026

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

Xiaohongshu Inc

The paper introduces VibeSearchBench, a new benchmark designed to evaluate long-horizon, proactive search capabilities, demonstrating that current state-of-the-art LLM agents are still significantly i…

View →