TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents

VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora

The paper introduces VeriTrip, a new verifiable benchmark that evaluates travel…

Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots

The paper introduces Honeyval, a comprehensive evaluation framework, to rigorous…

Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

This paper empirically demonstrates that the choice of plan representation (e.g.…

From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs

The paper proposes HTP, a novel framework that leverages Large Language Models (…

Planning with the Views via Scene Self-Exploration

The paper addresses the challenge of multi-turn view planning for VLMs by propos…

An LLM-Based Assistance System for Intuitive and Flexible Capability-Based Planning

The paper proposes a hybrid LLM-based assistance system that enhances traditiona…

A Multi-dimensional Framework for Evaluating Generalization in EEG Foundation Models

The paper proposes a multi-dimensional evaluation framework to assess EEG founda…

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

The paper introduces LongJudgeBench, a new benchmark designed to evaluate the re…