Papers similar to 2604.18658v1

~ similar to 2604.18658v1· 20 results

cs.AIcs.CRRecentMay 6, 2026

AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

AgentTrust is a novel runtime safety layer that intercepts and evaluates AI agent tool calls before execution, achieving high accuracy in detecting unsafe actions across complex and obfuscated scenari…

View →

cs.CRcs.AIRecentApr 12, 2026

The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Xuwei Ding, Skylar Zhai, Linxin Song, Jiate Li +5 more

The paper introduces OS-BLIND, a benchmark demonstrating that current safety evaluations fail to detect critical vulnerabilities in computer-use agents when user instructions are benign, showing high…

View →

cs.CRcs.AIRecentMay 26, 2026

Lessons from Penetration Tests on Large-Scale Agent Systems

Kevin Eykholt, Dhilung Kirat, Xiaokui Shu, Jiyong Jang +2 more

The paper reports on penetration tests conducted on proprietary, large-scale AI agent systems, finding that security vulnerabilities persist despite stricter development standards.

View →

cs.CRcs.AIRecentApr 21, 2026

Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

Alankrit Chona, Igor Kozlov, Ambuj Kumar

The paper introduces a challenging benchmark for LLM agents to perform unsupervised threat hunting on raw Windows event logs, finding that current frontier models perform poorly and are not ready for…

View →

cs.AIcs.CRRecentMar 24, 2026

AgentWall: A Runtime Safety Layer for Local AI Agents

Ashwin Aravind

AgentWall is a runtime safety layer that intercepts and evaluates all proposed actions from local AI agents against a declarative policy, ensuring safety before execution.

View →

cs.CLcs.AIcs.CRRecentMay 28, 2026

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

Aditya Nawal, Manit Baser, Mohan Gurusamy

This paper introduces AgentREVEAL, a diagnostic framework showing that the utility of web retrieval in LLM agents creates a safety-utility trade-off, as relevance itself can degrade safety alignment a…

View →

cs.CLcs.AIcs.CRRecentMay 28, 2026

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

Aditya Nawal, Manit Baser, Mohan Gurusamy

This paper introduces AgentREVEAL, a diagnostic framework that demonstrates that the utility of web retrieval in LLM agents creates a safety-utility trade-off, as relevance itself can degrade safety a…

View →

cs.CRcs.AIRecentMay 10, 2026

The Authorization-Execution Gap Is a Major Safety and Security Problem in Open-World Agents

Baoyuan Wu, Qingshan Liu, Adel Bibi, Irwin King +1 more

The paper argues that the Authorization-Execution Gap (AEG)—the divergence between intended authorization and actual execution—is a critical safety and security flaw in open-world agents, requiring so…

View →

cs.CRcs.AIRecentMay 10, 2026

Security Risks in Tool-Enabled AI Agents: A Systematic Analysis of Privileged Execution Environments

Hardik Goel

This paper systematically analyzes security risks in cloud-hosted, tool-enabled AI agents, concluding that most risks stem from over-privileged tools and capability-intent mismatches rather than novel…

View →

cs.CRcs.AIcs.CLRecentMay 12, 2026

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

Chang Jin, An Wang, Zeming Wei, Kai Wang +6 more

The paper introduces SkillSafetyBench, a comprehensive benchmark demonstrating that agent safety failures often stem from adversarial influences within reusable skills and execution environments, rath…

View →

cs.CRcs.AIcs.CLRecentApr 8, 2026

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen

The paper introduces TraceSafe-Bench, a comprehensive benchmark, and finds that securing LLM agents requires jointly optimizing for structural reasoning and safety alignment to mitigate risks during m…

View →

cs.SEcs.AIcs.CRRecentMay 30, 2026

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

Su Wang, Pin Qian, Yihang Chen, Junxian You +5 more

The paper introduces SkillReact, a framework that measures compositional risk in agent skill ecosystems, finding that even if individual skills are safe, their combination can create significant, unad…

View →

cs.SEcs.AIcs.CRRecentMay 30, 2026

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

Su Wang, Pin Qian, Yihang Chen, Junxian You +5 more

View →

cs.CRRecentMay 2, 2026

Toward a Principled Framework for Agent Safety Measurement

Shuyi Lin, Anshuman Suri, Alina Oprea, Cheng Tan

The paper introduces BOA, a novel framework that measures agent safety by exhaustively searching the entire in-budget trajectory space, thereby identifying unsafe behaviors missed by traditional sampl…

View →

cs.CRcs.AIRecentMay 11, 2026

Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw

Hongwei Yao, Yiming Liu, Yiling He, Bingrun Yang

The paper introduces DeepTrap, an automated framework that evaluates security vulnerabilities in agentic language models by manipulating their internal execution contexts, demonstrating that task comp…

View →

cs.AIcs.CLcs.CRRecentMay 28, 2026

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang +46 more

The paper introduces AgentDoG 1.5, a lightweight and scalable alignment framework that significantly improves AI agent safety and security for complex open-world agent deployments.

View →

cs.AIcs.CLcs.CRRecentMay 28, 2026

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang +46 more

The paper introduces AgentDoG 1.5, a lightweight and scalable alignment framework that significantly improves AI agent safety and security for complex, open-world agentic scenarios.

View →

cs.CYcs.CRRecentMay 20, 2026

Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security

Matteo Pistillo, Samantha Faraone, Joshua Herman

The paper proposes a novel, empirical methodology called 'backchaining' to derive and prioritize Loss of Control (LoC) mitigations by analyzing the errors an AI system makes on mission-specific nation…

View →

cs.CRcs.AIRecentMay 7, 2026

LoopTrap: Termination Poisoning Attacks on LLM Agents

Huiyu Xu, Zhibo Wang, Wenhui Zhang, Ziqi Zhu +3 more

The paper introduces LoopTrap, an automated red-teaming framework that demonstrates how malicious prompts can poison the termination judgment of LLM agents, causing unbounded computation.

View →

cs.CRcs.AIcs.CLRecentJun 3, 2026

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

Nicholas Saban

The paper benchmarks current frontier computer-using agents against hand-crafted attacks, finding that while they are highly safe in browser tasks, this safety does not generalize to other domains lik…

View →