Papers similar to 2604.07223v1

~ similar to 2604.07223v1· 20 results

cs.CRcs.CLRecentMay 29, 2026

Triaging Threats to Specialized Guardrails

Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu +5 more

The paper introduces RouteGuard, a router-expert framework, to improve the robustness and generalization of safety guardrails by specializing threat detection across multiple unsafe categories.

View →

cs.CRcs.CLRecentMay 29, 2026

Triaging Threats to Specialized Guardrails

Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu +5 more

The paper introduces RouteGuard, a router-expert framework, to improve the robustness and generalization of safety guardrails by specializing threat detection across multiple distinct unsafe categorie…

View →

cs.CRcs.AIcs.CLRecentMay 12, 2026

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

Chang Jin, An Wang, Zeming Wei, Kai Wang +6 more

The paper introduces SkillSafetyBench, a comprehensive benchmark demonstrating that agent safety failures often stem from adversarial influences within reusable skills and execution environments, rath…

View →

cs.CRcs.SERecentMar 18, 2026

Guardrails as Infrastructure: Policy-First Control for Tool-Orchestrated Workflows

Akshey Sigdel, Rista Baral

The paper introduces Policy-First Tooling, a model-agnostic permission layer that significantly enhances the safety and reliability of tool-orchestrated AI workflows by enforcing explicit constraints…

View →

cs.CLcs.CRRecentMay 1, 2026

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models

Yunhan Zhao, Zhaorun Chen, Xingjun Ma, Yu-Gang Jiang +1 more

The paper introduces ML-Bench, a policy-grounded multilingual safety benchmark, and ML-Guard, a superior guardrail model that enables culturally and legally aligned safety assessment for LLMs across 1…

View →

cs.CRcs.AIcs.CLRecentMay 4, 2026

MAGE: Safeguarding LLM Agents against Long-Horizon Threats via Shadow Memory

Yuhui Wang, Tanqiu Jiang, Jiacheng Liang, Charles Fleming +1 more

The paper introduces MAGE, a novel defensive framework that uses a dedicated 'shadow memory' to proactively detect and mitigate long-horizon threats against LLM agents during complex, multi-step inter…

View →

cs.SEcs.AIcs.CRRecentMay 30, 2026

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

Su Wang, Pin Qian, Yihang Chen, Junxian You +5 more

The paper introduces SkillReact, a framework that measures compositional risk in agent skill ecosystems, finding that even if individual skills are safe, their combination can create significant, unad…

View →

cs.SEcs.AIcs.CRRecentMay 30, 2026

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

Su Wang, Pin Qian, Yihang Chen, Junxian You +5 more

View →

cs.CRcs.CLRecentMay 31, 2026

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Yunhao Feng, Xiaohu Du, Xinhao Deng, Yifan Ding +12 more

BraveGuard is a self-evolving defense framework that significantly improves the safety monitoring of computer-use agents by generating guard model supervision from open-world threat discovery and real…

View →

cs.CRcs.CLRecentMay 31, 2026

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Yunhao Feng, Yifan Ding, Xiaohu Du, Ming Wen +12 more

BraveGuard is a self-evolving defense framework that improves the safety of computer-use agents by training guard models on open-world, multi-step threat trajectories rather than static benchmarks.

View →

cs.CRcs.AIRecentApr 13, 2026

ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

Wei Zhao, Zhe Li, Peixin Zhang, Jun Sun

ClawGuard is a novel runtime security framework that deterministically enforces user-confirmed rules at tool-call boundaries to protect LLM agents from indirect prompt injection.

View →

cs.CRcs.AIcs.CLRecentApr 20, 2026

Owner-Harm: A Missing Threat Model for AI Agent Safety

Dongcheng Zhang, Yiqing Jiang

The paper introduces Owner-Harm, a formal threat model addressing the critical blind spot of AI agents harming their own deployers, demonstrating that specialized defenses are needed beyond generic sa…

View →

cs.CRcs.AIcs.LGRecentMar 29, 2026

Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs

Alexandre Cristovão Maiorano

The paper evaluates prompt-injection defenses for educational LLM tutors, demonstrating that optimal security requires balancing adversarial robustness, usability, and latency, and proposing a compreh…

View →

cs.SEcs.AIcs.CRRecentApr 16, 2026

Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

Yining Hong, Yining She, Eunsuk Kang, Christopher S. Timperley +1 more

The paper proposes and evaluates symbolic guardrails as a practical method to provide strong, verifiable safety and security guarantees for domain-specific AI agents without compromising their utility…

View →

cs.CRcs.AIRecentMay 19, 2026

Exploring and Developing a Pre-Model Safeguard with Draft Models

Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi +1 more

The paper proposes a novel pre-model safeguard that uses small draft models (SLMs) to predict the safety of prompts, significantly reducing false-negative rates while maintaining low computational ove…

View →

cs.CRcs.AIcs.CLRecentMay 29, 2026

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more

This paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a dynamic defense mechanism that traces and sanitizes untrusted control content i…

View →

cs.CRcs.AIcs.CLRecentMay 29, 2026

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more

The paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a defense mechanism that detects and sanitizes backdoor content planted across mul…

View →

cs.CRcs.AIRecentApr 1, 2026

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

Anubhab Sahu, Diptisha Samanta, Reza Soosahabi

The paper introduces an automated framework demonstrating that LLM system instructions are vulnerable to encoding attacks, where structured output requests can bypass safety refusals and leak sensitiv…

View →

cs.CRRecentMar 18, 2026

The Verifier Tax: Horizon Dependent Safety Success Tradeoffs in Tool Using LLM Agents

Tanmay Sah, Vishal Srivastava, Dolly Sah, Kayden Jordan

The paper analyzes how runtime safety enforcement impacts the performance of multi-step LLM agents, finding that while safety mechanisms can block unsafe actions, they impose a significant performance…

View →

cs.SEcs.CRRecentMay 31, 2026

SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Qi Hu, Yifeng Tang, Qinghua Wang, Lanyang Zhao +6 more

The paper introduces SABER, a new benchmark that evaluates the operational safety of LLM coding agents in complex, stateful project environments, finding that current models have a high rate of harmfu…

View →