~ similar to 2604.18658v1· 20 results
AgentTrust is a novel runtime safety layer that intercepts and evaluates AI agent tool calls before execution, achieving high accuracy in detecting unsafe actions across complex and obfuscated scenari…
Xuwei Ding, Skylar Zhai, Linxin Song, Jiate Li +5 more
The paper introduces OS-BLIND, a benchmark demonstrating that current safety evaluations fail to detect critical vulnerabilities in computer-use agents when user instructions are benign, showing high…
Kevin Eykholt, Dhilung Kirat, Xiaokui Shu, Jiyong Jang +2 more
The paper reports on penetration tests conducted on proprietary, large-scale AI agent systems, finding that security vulnerabilities persist despite stricter development standards.
The paper introduces a challenging benchmark for LLM agents to perform unsupervised threat hunting on raw Windows event logs, finding that current frontier models perform poorly and are not ready for…
AgentWall is a runtime safety layer that intercepts and evaluates all proposed actions from local AI agents against a declarative policy, ensuring safety before execution.
This paper introduces AgentREVEAL, a diagnostic framework showing that the utility of web retrieval in LLM agents creates a safety-utility trade-off, as relevance itself can degrade safety alignment a…
This paper introduces AgentREVEAL, a diagnostic framework that demonstrates that the utility of web retrieval in LLM agents creates a safety-utility trade-off, as relevance itself can degrade safety a…
Baoyuan Wu, Qingshan Liu, Adel Bibi, Irwin King +1 more
The paper argues that the Authorization-Execution Gap (AEG)—the divergence between intended authorization and actual execution—is a critical safety and security flaw in open-world agents, requiring so…
This paper systematically analyzes security risks in cloud-hosted, tool-enabled AI agents, concluding that most risks stem from over-privileged tools and capability-intent mismatches rather than novel…
Chang Jin, An Wang, Zeming Wei, Kai Wang +6 more
The paper introduces SkillSafetyBench, a comprehensive benchmark demonstrating that agent safety failures often stem from adversarial influences within reusable skills and execution environments, rath…
The paper introduces TraceSafe-Bench, a comprehensive benchmark, and finds that securing LLM agents requires jointly optimizing for structural reasoning and safety alignment to mitigate risks during m…
Su Wang, Pin Qian, Yihang Chen, Junxian You +5 more
The paper introduces SkillReact, a framework that measures compositional risk in agent skill ecosystems, finding that even if individual skills are safe, their combination can create significant, unad…
Su Wang, Pin Qian, Yihang Chen, Junxian You +5 more
The paper introduces SkillReact, a framework that measures compositional risk in agent skill ecosystems, finding that even if individual skills are safe, their combination can create significant, expl…
The paper introduces BOA, a novel framework that measures agent safety by exhaustively searching the entire in-budget trajectory space, thereby identifying unsafe behaviors missed by traditional sampl…
The paper introduces DeepTrap, an automated framework that evaluates security vulnerabilities in agentic language models by manipulating their internal execution contexts, demonstrating that task comp…
Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang +46 more
The paper introduces AgentDoG 1.5, a lightweight and scalable alignment framework that significantly improves AI agent safety and security for complex open-world agent deployments.
Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang +46 more
The paper introduces AgentDoG 1.5, a lightweight and scalable alignment framework that significantly improves AI agent safety and security for complex, open-world agentic scenarios.
The paper proposes a novel, empirical methodology called 'backchaining' to derive and prioritize Loss of Control (LoC) mitigations by analyzing the errors an AI system makes on mission-specific nation…
Huiyu Xu, Zhibo Wang, Wenhui Zhang, Ziqi Zhu +3 more
The paper introduces LoopTrap, an automated red-teaming framework that demonstrates how malicious prompts can poison the termination judgment of LLM agents, causing unbounded computation.
The paper benchmarks current frontier computer-using agents against hand-crafted attacks, finding that while they are highly safe in browser tasks, this safety does not generalize to other domains lik…