~ similar to 2605.29534 · 20 results
Yutao Luo, Haotian Zhu, Shuchao Pang, Zhigang Lu +3 more
The paper introduces AgentRAE, a novel backdoor attack that successfully forces mobile GUI agents to execute remote actions using visually natural triggers found in system notifications, achieving hig…
Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu +7 more
ExpGraph is a model-agnostic framework that uses a self-evolving experience graph to enable LLM agents to reuse past successful strategies and failure lessons, significantly improving performance acro…
Yanqiu Zhao, Dongying Zheng, Kaibo Huang, Yukun Wei +2 more
MaskClaw is an edge-side privacy arbitrator that protects sensitive data in GUI agent screenshots by combining local visual evidence, task-specific policies, and a skill-evolution mechanism.
Weile Chen, Bingchen Miao, Qifan Yu, Wendong Bu +5 more
The paper proposes SCALE, a self-improving web agent framework that uses adversarial roles and graph exploration to autonomously discover agent limitations and enhance adaptability in complex web envi…
Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai +6 more
The paper introduces OpenWebRL, an open framework that enables training visual web agents using online multi-turn Reinforcement Learning directly on live websites, achieving state-of-the-art performan…
Agentproof is a system that provides static, pre-deployment verification of safety properties in agent workflow graphs by automatically extracting a unified graph model and applying structural and tem…
The paper proposes a novel multimodal multi-agent framework that uses a topological knowledge graph to enable robust, adaptive automatic workflow execution, overcoming the limitations of treating task…
The paper introduces CRAB-Bench and RUSE, a rigorous evaluation framework that tests LLM agents on complex, interdependent tasks with realistic human user interactions, revealing significant performan…
Xiaoyi Chen, Yifei Gao, Yang Xu, Xingxing Song +2 more
The paper introduces GUITestScape, a comprehensive benchmark for exploratory GUI testing, and GUIJudge, an open-set evaluator that significantly improves the assessment of AI agents' defect detection…
Dongwook Choi, Taeyoon Kwon, Bogyung Jeong, Minju Kim +5 more
EMBGuard introduces a novel, MLLM-based safety guardrail that explicitly identifies and explains physical hazards from (visual observation, action) pairs, enabling safer planning for embodied agents.
Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim +11 more
The paper introduces K-BrowseComp, a new web-browsing agent benchmark of 400 problems grounded in Korean contexts, demonstrating that current frontier LLMs struggle significantly with complex, context…
Julien Piet, Annabella Chow, Yiwei Hou, Muxi Lyu +4 more
The paper argues that web agents should abandon the reactive ReAct paradigm in favor of a plan-then-execute approach, which requires developing typed, task-level APIs to properly structure web interac…
Xian Qi Loye, Qinglin Su, Zhexin Zhang, Shiyao Cui +4 more
The paper introduces RUBAS, a rubric-based reinforcement learning framework that improves agent safety by providing fine-grained, multi-dimensional rewards for complex tool-use scenarios.
Zixuan Zhu, Yitong Hu, Yong Dai, Junfeng Fang +3 more
The paper introduces Unified Context Evolution (UCE), a gradient-free framework that externalizes and manages agent experience into a typed, evolving library, significantly improving performance on mu…
AgentWall is a runtime safety layer that intercepts and evaluates all proposed actions from local AI agents against a declarative policy, ensuring safety before execution.
Junjie Nian, Kang Chen, Ge Zhang, Yixin Cao +1 more
TraceGraph introduces a graph-based framework to map agent decision-making across pooled trajectories, revealing hidden differences in agent behavior and improving performance by targeting known failu…
The paper proposes the Layered Attack Surface Model (LASM), a structural taxonomy that maps security threats and defenses across the complex, multi-layered architecture of AI agents, revealing signifi…
Zhixin Lin, Jungang Li, Dongliang Xu, Shidong Pan +4 more
The paper proposes Trajectory Induced Preference Optimization (TIPO) to improve mobile GUI agent personalization by explicitly modeling and optimizing for privacy-related behavioral differences in exe…
The paper introduces VibeSearchBench, a new benchmark designed to evaluate long-horizon, proactive search capabilities, demonstrating that current state-of-the-art LLM agents are still significantly i…
This paper empirically demonstrates that the choice of plan representation (e.g., checklist vs. narrative) significantly impacts the robustness and success rate of LLM-based web agents.