~ similar to 2605.10779v1· 20 results
Jianan Ma, Xiaohu Du, Ruixiao Lin, Yaoxiang Bian +7 more
The paper introduces a multi-dimensional evasion framework and a new benchmark (A3S-Bench) to test autonomous agents, demonstrating that stateful, multi-turn attacks significantly increase system risk…
Krishiv Agarwal, Ramneet Kaur, Colin Samplawski, Manoj Acharya +5 more
The paper conducts an interpretability-driven safety audit of eight state-of-the-art LLMs, demonstrating that while interpretability-based steering is a powerful auditing tool, model robustness varies…
Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more
This paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a dynamic defense mechanism that traces and sanitizes untrusted control content i…
Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more
The paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a defense mechanism that detects and sanitizes backdoor content planted across mul…
The paper proposes the Layered Attack Surface Model (LASM), a structural taxonomy that maps security threats and defenses across the complex, multi-layered architecture of AI agents, revealing signifi…
The paper introduces BOA, a novel framework that measures agent safety by exhaustively searching the entire in-budget trajectory space, thereby identifying unsafe behaviors missed by traditional sampl…
Huiyu Xu, Zhibo Wang, Wenhui Zhang, Ziqi Zhu +3 more
The paper introduces LoopTrap, an automated red-teaming framework that demonstrates how malicious prompts can poison the termination judgment of LLM agents, causing unbounded computation.
Xixun Lin, Yang Liu, Yancheng Chen, Yongxuan Wu +7 more
The paper introduces SafeHarness, a novel, lifecycle-integrated security architecture that significantly reduces unsafe behavior and attack success rates in LLM agents by weaving multiple defense laye…
Churui Zeng, Weiwei Qi, Kedong Xiu, Tianhang Zheng +4 more
The paper proposes TRACE, a novel agentic jailbreaking framework that successfully bypasses safety mechanisms of advanced LLM agents by decomposing malicious tasks and disguising harmful subtasks with…
Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more
This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…
Shi Liu, Xuehai Tang, Xikang Yang, Liang Lin +3 more
This paper introduces a new benchmark to test Tool Description Poisoning (TDP) attacks on LLM agents, demonstrating that even advanced models like GPT-4o are highly vulnerable and that current defense…
The paper introduces Session Risk Memory (SRM), a lightweight module that enhances per-action authorization gates with trajectory-level risk assessment, significantly improving detection of distribute…
Automation-Exploit is a multi-agent LLM framework that enables adaptive offensive security by using a digital twin to safely test and execute high-risk memory-corruption exploits on live targets.
The paper introduces and measures 'accidental meltdown,' a new type of unsafe agent behavior triggered by benign environmental errors, finding that such meltdowns occur frequently and often involve hi…
Xuwei Ding, Skylar Zhai, Linxin Song, Jiate Li +5 more
The paper introduces OS-BLIND, a benchmark demonstrating that current safety evaluations fail to detect critical vulnerabilities in computer-use agents when user instructions are benign, showing high…
The paper introduces ExploitBench, a capability-graded benchmark that measures the progressive stages of exploitation, demonstrating that while current frontier models can easily trigger bugs, achievi…
The paper introduces SafeAudit, a meta-audit framework that systematically enumerates test cases and uses a quantitative metric to uncover significant residual unsafe behaviors in LLM agents that exis…
Zhen Huang, Zhihuang Liu, Mengxuan Luo, Weishang Wu +1 more
The paper proposes a novel attack paradigm demonstrating how compromising a single robot in an LLM-controlled multi-robot system can rapidly propagate malicious intent to cause coordinated unsafe acti…
Yujie Ma, Jialin Rong, Chenxi Yang, Lili Quan +3 more
The paper addresses the gap in understanding real-world LLM-in-the-loop vulnerabilities by creating the LLMCVE dataset and demonstrating that these vulnerabilities are significantly harder to repair t…
AgentTrust is a novel runtime safety layer that intercepts and evaluates AI agent tool calls before execution, achieving high accuracy in detecting unsafe actions across complex and obfuscated scenari…