Agents
Autonomous agents, planning, tool use, and agentic workflows
20 papers indexed
Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning
The paper investigates how LLMs allocate their internal computational depth during multi-turn agentic planning, finding that agents progressively recruit deeper layers and shift toward corrective upda…
AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks?
The paper introduces AgenticVBench, a comprehensive benchmark of 100 real-world video post-production tasks, and finds that even the best AI agents perform significantly worse than human experts on th…
Acting with AI: An Interaction-Based Framework for Agentic Tort Liability
The paper proposes an interaction-based legal framework for assigning tort liability when autonomous AI systems cause harm, categorizing liability based on the nature of the human-AI interaction.
Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation
Chenghao Zhang, Guanting Dong, Yufan Liu, Tong Zhao +1 more
The paper introduces extsc{Ptah}, a multi-agent harness designed to improve verifiable multimodal deep research by orchestrating the entire report generation process, ensuring factual grounding and v…
Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security
Jinhu Qi, Muzhi Li, Jiahong Liu, Yuqin Shu +8 more
This survey provides a comprehensive, practical guide to ensuring the trustworthiness of complex, autonomous agentic AI systems by focusing on safety, robustness, privacy, and system security.
Large Language Models for Agentic NetOps and AIOps: Architectures, Evaluation, and Safety
Muhammad Bilal, Jon Crowcroft, Ruizhi Wang, Xiaolong Xu +1 more
The paper surveys the use of LLMs for agentic NetOps and AIOps, arguing that operational reliability depends not on the model itself, but on robust surrounding machinery and workflow-centered evaluati…
Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations
The paper introduces Momento, a new benchmark that evaluates agentic AI's ability to maintain state and reason across multiple, disconnected sessions, revealing that current agents struggle with integ…
Characterization of Multi-Model Agentic AI Systems on General Tasks via Trace-Driven Simulation
Donghwan Kim, Prakhar Singh, Younghoon Min, Jongryool Kim +2 more
The paper introduces GAIATrace, a comprehensive token-level dataset, and Vidur-Agent, a simulator, to enable reproducible and detailed system-level characterization of complex multi-model agentic AI s…
When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents
Shi Liu, Xuehai Tang, Xikang Yang, Liang Lin +3 more
This paper introduces a new benchmark to test Tool Description Poisoning (TDP) attacks on LLM agents, demonstrating that even advanced models like GPT-4o are highly vulnerable and that current defense…
VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora
Yuting Xu, Jiayi Tian, Jian Liang, Xin Xiong +3 more
The paper introduces VeriTrip, a new verifiable benchmark that evaluates travel planning agents' ability to perform evidence-grounded reasoning over complex, unstructured, and multimodal web data, rev…
Architecture Matters for Multi-Agent Security
This paper empirically demonstrates that the architectural design of multi-agent systems significantly impacts their security, finding that coordination mechanisms can introduce vulnerabilities greate…
DynaTree: Dynamic Agentic Retrieval Tree for Time-Sensitive News Retrieval
Siyuan Qi, Xinyuan Wang, Yingxuan Yang, Haochuan Guo +4 more
DynaTree introduces a two-stage framework that pre-constructs a reusable retrieval tree offline using coordinated agents, allowing for efficient, structure-aware, and highly effective time-sensitive n…
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
Shihao Weng, Yang Feng, Jinrui Zhang, Xiaofei Xie +2 more
The paper introduces ARGUS, a defense mechanism that uses provenance-aware decision auditing to protect LLM agents from sophisticated, context-aware prompt injection attacks, significantly reducing th…
SUDP: Secret-Use Delegation Protocol for Agentic Systems
The paper proposes the Secret-Use Delegation Protocol (SUDP) to solve the Agent Secret Use (ASU) problem, ensuring that autonomous agents can perform user-authorized operations without gaining reusabl…
AutoMedBench: Towards Medical AutoResearch with Agentic AI Models
Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao +11 more
The paper introduces AutoMedBench, a novel workflow-aware benchmark that evaluates autonomous medical-AI agents across a five-stage research process, revealing that agents struggle most with validatio…
Engineering Robustness into Personal Agents with the AI Workflow Store
The paper argues that current 'on-the-fly' AI agent design lacks necessary software engineering rigor and proposes an 'AI Workflow Store' to provide hardened, reusable, and reliable agent workflows.
Self-Healing Agentic Orchestrators for Reliable Tool-Augmented Large Language Model Systems
The paper introduces a self-healing agentic orchestrator that significantly improves the reliability of tool-augmented LLM systems by treating failure as a bounded runtime control problem, achieving h…
An LLM-Based Assistance System for Intuitive and Flexible Capability-Based Planning
The paper proposes a hybrid LLM-based assistance system that enhances traditional capability-based planning by providing natural language interaction, interpretability, and flexible knowledge model ad…
STRIATUM-CTF: A Protocol-Driven Agentic Framework for General-Purpose CTF Solving
The paper introduces STRIATUM-CTF, a modular agentic framework that uses a standardized context protocol to enable LLMs to perform multi-step, stateful reasoning for general-purpose CTF solving, achie…
COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents
Wenkai Shen, Pengyang Zhou, Jiahe Xu, Jiaming Qian +4 more
COMPASS introduces a Cognitive MCTS-Guided Process Alignment framework to ensure robust safety for LLM search agents by identifying and supervising risky intermediate steps in multi-step reasoning.