Papers similar to 2603.19328v1

Similar to 2603.19328v1 · 20 results

cs.CR · Recent · Mar 28, 2026

SafeClaw-R: Towards Safe and Secure Multi-Agent Personal Assistants

Haoyu Wang, Zibo Xiao, Yedi Zhang, Christopher M. Poskitt +1 more

The paper proposes SafeClaw-R, a novel framework that enforces safety as a system-level invariant over the execution graph to mitigate the high safety and security risks inherent in autonomous multi-a…

cs.SE, cs.CR · Recent · Mar 18, 2026

Who Tests the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety

Xuan Chen, Lu Yan, Ruqi Zhang, Xiangyu Zhang

The paper introduces SafeAudit, a meta-audit framework that systematically enumerates test cases and uses a quantitative metric to uncover significant residual unsafe behaviors in LLM agents that exis…

cs.SE, cs.CR · Recent · May 31, 2026

SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Qi Hu, Yifeng Tang, Qinghua Wang, Lanyang Zhao +6 more

The paper introduces SABER, a new benchmark that evaluates the operational safety of LLM coding agents in complex, stateful project environments, finding that current models have a high rate of harmfu…

cs.CR, cs.AI, cs.CL · Recent · May 12, 2026

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

Chang Jin, An Wang, Zeming Wei, Kai Wang +6 more

The paper introduces SkillSafetyBench, a comprehensive benchmark demonstrating that agent safety failures often stem from adversarial influences within reusable skills and execution environments, rath…

cs.AI, cs.CR · Recent · May 6, 2026

AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

Chenglin Yang

AgentTrust is a novel runtime safety layer that intercepts and evaluates AI agent tool calls before execution, achieving high accuracy in detecting unsafe actions across complex and obfuscated scenari…

cs.CR · Recent · May 2, 2026

Toward a Principled Framework for Agent Safety Measurement

Shuyi Lin, Anshuman Suri, Alina Oprea, Cheng Tan

The paper introduces BOA, a novel framework that measures agent safety by exhaustively searching the entire in-budget trajectory space, thereby identifying unsafe behaviors missed by traditional sampl…

cs.CR, cs.AI · Recent · Apr 15, 2026

SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment

Xixun Lin, Yang Liu, Yancheng Chen, Yongxuan Wu +7 more

The paper introduces SafeHarness, a novel, lifecycle-integrated security architecture that significantly reduces unsafe behavior and attack success rates in LLM agents by weaving multiple defense laye…

cs.AI, cs.CL, cs.CY · Recent · Jun 1, 2026

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji +3 more

SafeMCP is a server-side defense plugin that uses look-ahead reasoning to proactively filter and constrain tool acquisition for LLM agents, thereby mitigating catastrophic risks associated with expand…

cs.CR, cs.AI, cs.CL · Recent · Apr 8, 2026

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen

The paper introduces TraceSafe-Bench, a comprehensive benchmark, and finds that securing LLM agents requires jointly optimizing for structural reasoning and safety alignment to mitigate risks during m…

cs.AI, cs.CL, cs.CR · Recent · May 17, 2026

Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

Jinhu Qi, Muzhi Li, Jiahong Liu, Yuqin Shu +8 more

This survey provides a comprehensive, practical guide to ensuring the trustworthiness of complex, autonomous agentic AI systems by focusing on safety, robustness, privacy, and system security.

cs.SE, cs.AI, cs.CR · Recent · May 30, 2026

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

Su Wang, Pin Qian, Yihang Chen, Junxian You +5 more

The paper introduces SkillReact, a framework that measures compositional risk in agent skill ecosystems, finding that even if individual skills are safe, their combination can create significant, unad…

cs.LG, cs.AI, cs.CR · Recent · Jun 2, 2026

RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

Xian Qi Loye, Qinglin Su, Zhexin Zhang, Shiyao Cui +4 more

The paper introduces RUBAS, a rubric-based reinforcement learning framework that improves agent safety by providing fine-grained, multi-dimensional rewards for complex tool-use scenarios.

cs.CR, cs.AI, cs.CL · Recent · May 4, 2026

MAGE: Safeguarding LLM Agents against Long-Horizon Threats via Shadow Memory

Yuhui Wang, Tanqiu Jiang, Jiacheng Liang, Charles Fleming +1 more

The paper introduces MAGE, a novel defensive framework that uses a dedicated 'shadow memory' to proactively detect and mitigate long-horizon threats against LLM agents during complex, multi-step inter…

cs.AI, cs.CR · Recent · Mar 22, 2026

Session Risk Memory (SRM): Temporal Authorization for Deterministic Pre-Execution Safety Gates

Florin Adrian Chitan

The paper introduces Session Risk Memory (SRM), a lightweight module that enhances per-action authorization gates with trajectory-level risk assessment, significantly improving detection of distribute…

cs.CR, cs.AI · Recent · Mar 21, 2026

Before the Tool Call: Deterministic Pre-Action Authorization for Autonomous AI Agents

Uchi Uchibeke

The paper introduces the Open Agent Passport (OAP), a deterministic pre-action authorization framework that intercepts and validates AI agent tool calls against a declarative policy, achieving a 0% su…

cs.AI, cs.PL · Recent · May 27, 2026

LACUNA: Safe Agents as Recursive Program Holes

Yaoyu Zhao, Yichen Xu, Oliver Bračevac, Cao Nguyen Pham +2 more

The paper introduces LACUNA, a novel programming model that allows LLM agents to write code that shapes the runtime environment while maintaining strong type-checking safety guarantees.

cs.CR, cs.AI, cs.CL · Recent · Apr 20, 2026

Owner-Harm: A Missing Threat Model for AI Agent Safety

Dongcheng Zhang, Yiqing Jiang

The paper introduces Owner-Harm, a formal threat model addressing the critical blind spot of AI agents harming their own deployers, demonstrating that specialized defenses are needed beyond generic sa…

cs.CR, cs.AI, cs.CL · Recent · Jun 3, 2026

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

Nicholas Saban

The paper benchmarks current frontier computer-using agents against hand-crafted attacks, finding that while they are highly safe in browser tasks, this safety does not generalize to other domains lik…

cs.CR, cs.AI, cs.LG · Recent · Apr 22, 2026

SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs

Chao Pan, Yu Wu, Xin Yao

The paper introduces SafeRedirect, a system-level defense that prevents frontier LLMs from generating harmful content during legitimate tasks that structurally require it, significantly reducing unsaf…
