~ similar to 2603.18245v1· 20 results
Chang Jin, An Wang, Zeming Wei, Kai Wang +6 more
The paper introduces SkillSafetyBench, a comprehensive benchmark demonstrating that agent safety failures often stem from adversarial influences within reusable skills and execution environments, rath…
Qi Hu, Yifeng Tang, Qinghua Wang, Lanyang Zhao +6 more
The paper introduces SABER, a new benchmark that evaluates the operational safety of LLM coding agents in complex, stateful project environments, finding that current models have a high rate of harmfu…
AgentTrust is a novel runtime safety layer that intercepts and evaluates AI agent tool calls before execution, achieving high accuracy in detecting unsafe actions across complex and obfuscated scenari…
The paper analyzes how runtime safety enforcement impacts the performance of multi-step LLM agents, finding that while safety mechanisms can block unsafe actions, they impose a significant performance…
The paper introduces BOA, a novel framework that measures agent safety by exhaustively searching the entire in-budget trajectory space, thereby identifying unsafe behaviors missed by traditional sampl…
Xixun Lin, Yang Liu, Yancheng Chen, Yongxuan Wu +7 more
The paper introduces SafeHarness, a novel, lifecycle-integrated security architecture that significantly reduces unsafe behavior and attack success rates in LLM agents by weaving multiple defense laye…
Su Wang, Pin Qian, Yihang Chen, Junxian You +5 more
The paper introduces SkillReact, a framework that measures compositional risk in agent skill ecosystems, finding that even if individual skills are safe, their combination can create significant, unad…
Su Wang, Pin Qian, Yihang Chen, Junxian You +5 more
The paper introduces SkillReact, a framework that measures compositional risk in agent skill ecosystems, finding that even if individual skills are safe, their combination can create significant, expl…
Xiaoyue Lu, Xianglin Yang, Haijun Liu, Jiahao Liu +3 more
The paper introduces POLARIS, a novel framework that systematically generates comprehensive and verifiable safety tests for LLMs by formalizing natural language policies into First-Order Logic and exp…
Krishiv Agarwal, Ramneet Kaur, Colin Samplawski, Manoj Acharya +5 more
The paper conducts an interpretability-driven safety audit of eight state-of-the-art LLMs, demonstrating that while interpretability-based steering is a powerful auditing tool, model robustness varies…
Haoyu Wang, Zibo Xiao, Yedi Zhang, Christopher M. Poskitt +1 more
The paper proposes SafeClaw-R, a novel framework that enforces safety as a system-level invariant over the execution graph to mitigate the high safety and security risks inherent in autonomous multi-a…
Chiyu Zhang, Huiqin Yang, Bendong Jiang, Xiaolei Zhang +7 more
The paper introduces LITMUS, a novel benchmark that rigorously tests LLM agents for dangerous, physical-layer behavioral jailbreaks in real OS environments, revealing that current agents frequently ex…
Agentproof is a system that provides static, pre-deployment verification of safety properties in agent workflow graphs by automatically extracting a unified graph model and applying structural and tem…
Agent Audit is a novel security analysis system that comprehensively audits LLM agent applications by examining the entire software stack—including tool code, configuration, and prompts—to detect a wi…
QASecClaw, a multi-agent LLM system, significantly improves the accuracy of Static Application Security Testing (SAST) by using specialized LLM agents to filter out false positives, achieving an F1 sc…
The paper introduces TraceSafe-Bench, a comprehensive benchmark, and finds that securing LLM agents requires jointly optimizing for structural reasoning and safety alignment to mitigate risks during m…
Xuwei Ding, Skylar Zhai, Linxin Song, Jiate Li +5 more
The paper introduces OS-BLIND, a benchmark demonstrating that current safety evaluations fail to detect critical vulnerabilities in computer-use agents when user instructions are benign, showing high…
Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D'Cruz +1 more
The study evaluated four frontier AI models to assess their reliability in following safety research goals, finding no confirmed instances of sabotage but noting that certain models frequently refuse…
Yaoyu Zhao, Yichen Xu, Oliver Bračevac, Cao Nguyen Pham +2 more
The paper introduces LACUNA, a novel programming model that allows LLM agents to write code that shapes the runtime environment while maintaining strong type-checking safety guarantees.
Lecheng Yan, Ruizhe Li, Xicheng Han, Wenxi Li +4 more
The paper introduces a new security benchmark and framework to defend LLM agents against 'cognitive poisoning,' where malicious tools build trust through benign feedback before executing a harmful fin…