The paper introduces TravelEval, a comprehensive, six-dimensional benchmarking framework that evaluates LLM-powered travel plans using realistic spatio-temporal simulation, revealing that current LLMs struggle with globally-optimized, multi-dimensional planning.
Large Language Models (LLMs) have significantly improved travel planning applications, yet evaluating them is hampered by three limitations of existing benchmarks: 1) an overemphasis on constraint compliance that neglects multi-dimensional qualities such as spatio-temporal cost; 2) datasets that lack real-world authenticity and coverage in key areas (e.g., lodging, transport); and 3) isolated daily plan assessments that miss details critical to evaluating the entire plan (e.g., the impact of daily accommodation and visit pacing). To address this gap, we introduce TravelEval, a realistic and comprehensive benchmark. TravelEval features 1) a novel six-dimensional evaluation framework that holistically assesses plans across accuracy, compliance, temporality, spatiality, economy, and utility; 2) a highly realistic data sandbox with precise accommodation pricing and authentic intercity transportation data; and 3) a simulation-based global evaluation method that emulates complete travel plans with API-integrated geographic information and fine-grained queuing times. Evaluating 12 mainstream approaches with TravelEval yields several insights, for example that LLMs struggle with globally-optimized multi-dimensional planning (especially in spatio-temporal reasoning and budget compliance) and that agentic reasoning strategies offer no consistent improvement. In short, TravelEval supports travel plan evaluation through grounded spatio-temporal emulation and comprehensive metrics, providing a robust foundation for advancing LLM-powered travel planning research and applications.
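As a rough illustration of why whole-plan simulation can surface problems that isolated per-day checks miss, the Python sketch below walks through a multi-day plan and accumulates travel, queuing, and visit time together with cost. It is not TravelEval's implementation: the `Stop` fields, the `simulate_plan` function, the fixed daily time window, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stop:
    """One visit in a day plan (all fields are illustrative assumptions)."""
    name: str
    travel_min: int   # travel time from the previous location
    queue_min: int    # assumed queuing time at the venue
    visit_min: int    # time spent at the venue
    cost: float       # ticket or activity cost


def simulate_plan(days, day_start_min=9 * 60, day_end_min=21 * 60, budget=None):
    """Step through every day of a plan and collect whole-trip totals."""
    total_cost = 0.0
    total_travel = 0
    total_queue = 0
    overrun_days = []  # indices of days that finish after day_end_min
    for day_index, stops in enumerate(days):
        clock = day_start_min  # minutes since midnight
        for stop in stops:
            clock += stop.travel_min + stop.queue_min + stop.visit_min
            total_travel += stop.travel_min
            total_queue += stop.queue_min
            total_cost += stop.cost
        if clock > day_end_min:
            overrun_days.append(day_index)
    return {
        "total_cost": total_cost,
        "total_travel_min": total_travel,
        "total_queue_min": total_queue,
        "overrun_days": overrun_days,
        # The budget is checked against the whole trip, not each day.
        "within_budget": budget is None or total_cost <= budget,
    }


if __name__ == "__main__":
    plan = [
        [Stop("Museum", 30, 20, 120, 15.0), Stop("Tower", 45, 60, 90, 25.0)],
        [Stop("Park", 20, 0, 800, 0.0)],
    ]
    print(simulate_plan(plan, budget=30.0))
```

In this toy plan each day's spending looks modest on its own, but the trip-level total exceeds the budget, and the second day runs past the assumed daily window. A real evaluator would presumably draw travel and queuing times from geographic and venue data rather than fixed constants.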
VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora
The paper introduces VeriTrip, a new verifiable benchmark that evaluates travel…
Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots
The paper introduces Honeyval, a comprehensive evaluation framework, to rigorous…
Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents
This paper empirically demonstrates that the choice of plan representation (e.g.…
From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs
The paper proposes HTP, a novel framework that leverages Large Language Models (…
Planning with the Views via Scene Self-Exploration
The paper addresses the challenge of multi-turn view planning for VLMs by propos…
An LLM-Based Assistance System for Intuitive and Flexible Capability-Based Planning
The paper proposes a hybrid LLM-based assistance system that enhances traditiona…
A Multi-dimensional Framework for Evaluating Generalization in EEG Foundation Models
The paper proposes a multi-dimensional evaluation framework to assess EEG founda…
Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
The paper introduces LongJudgeBench, a new benchmark designed to evaluate the re…