Papers similar to 2605.30693

~ similar to 2605.30693· 20 results

cs.CRcs.CLRecentMay 29, 2026

Triaging Threats to Specialized Guardrails

Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu +5 more

The paper introduces RouteGuard, a router-expert framework, to improve the robustness and generalization of safety guardrails by specializing threat detection across multiple unsafe categories.

View →

cs.AIcs.CLcs.CRRecentMay 27, 2026

Robust and Efficient Guardrails with Latent Reasoning

Siddharth Sai, Xiaofei Wen, Muhao Chen

The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving state-of-the-art safety performance with massiv…

View →

cs.AIcs.CLcs.CRRecentMay 27, 2026

Robust and Efficient Guardrails with Latent Reasoning

Siddharth Sai, Xiaofei Wen, Muhao Chen

The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving high safety performance with massive improvemen…

View →

cs.CRcs.AIcs.CLRecentApr 8, 2026

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen

The paper introduces TraceSafe-Bench, a comprehensive benchmark, and finds that securing LLM agents requires jointly optimizing for structural reasoning and safety alignment to mitigate risks during m…

View →

cs.CLRecentMay 29, 2026

EMBGuard: Constructing Hazard-Aware Guardrails for Safe Planning in Embodied Agents

Dongwook Choi, Taeyoon Kwon, Bogyung Jeong, Minju Kim +5 more

EMBGuard introduces a novel, MLLM-based safety guardrail that explicitly identifies and explains physical hazards from (visual observation, action) pairs, enabling safer planning for embodied agents.

View →

cs.SEcs.AIcs.CRRecentApr 16, 2026

Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

Yining Hong, Yining She, Eunsuk Kang, Christopher S. Timperley +1 more

The paper proposes and evaluates symbolic guardrails as a practical method to provide strong, verifiable safety and security guarantees for domain-specific AI agents without compromising their utility…

View →

cs.CRcs.CLRecentMay 31, 2026

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Yunhao Feng, Xiaohu Du, Xinhao Deng, Yifan Ding +12 more

BraveGuard is a self-evolving defense framework that significantly improves the safety monitoring of computer-use agents by generating guard model supervision from open-world threat discovery and real…

View →

cs.CRcs.CLRecentMay 31, 2026

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Yunhao Feng, Yifan Ding, Xiaohu Du, Ming Wen +12 more

BraveGuard is a self-evolving defense framework that improves the safety of computer-use agents by training guard models on open-world, multi-step threat trajectories rather than static benchmarks.

View →

cs.CLcs.CRRecentMay 1, 2026

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models

Yunhan Zhao, Zhaorun Chen, Xingjun Ma, Yu-Gang Jiang +1 more

The paper introduces ML-Bench, a policy-grounded multilingual safety benchmark, and ML-Guard, a superior guardrail model that enables culturally and legally aligned safety assessment for LLMs across 1…

View →

cs.CRcs.AIRecentApr 27, 2026

A Comparative Evaluation of AI Agent Security Guardrails

Qi Li, Jiu Li, Pingtao Wei, Jianjun Xu +7 more

This paper comparatively evaluates DKnownAI Guard against three competitors, demonstrating that DKnownAI Guard achieves superior performance in detecting both agent-specific threats and harmful conten…

View →

cs.CRcs.CLRecentApr 17, 2026

TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts

Hua-Rong Chu, Kuan-Chun Wang, Yao-Te Huang

The paper introduces TWGuard, a linguistic context-optimized safety guardrail model, demonstrating that tailoring AI safety mechanisms to specific local linguistic contexts significantly improves perf…

View →

cs.LGcs.AIcs.CLRecentMay 28, 2026

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

Ihor Stepanov, Aleksandr Smechov

The paper introduces Opir, an efficient family of encoder-based multi-task guardrail models that provides competitive safety classification performance across various tasks while maintaining a signifi…

View →

cs.CRcs.AIRecentMay 11, 2026

Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights

Saba Pourhanifeh, AbdulAziz AbdulGhaffar, Ashraf Matrawy

The paper empirically evaluates domain-adapted and general-purpose LLMs for structured threat modelling (STRIDE on 5G security), finding that domain adaptation and model size do not guarantee reliable…

View →

cs.LGcs.AIcs.CRRecentApr 8, 2026

When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models

Ismail Hossain, Sai Puppala, Jannatul Ferdaus, Md Jahangir Alam +3 more

The paper demonstrates that fine-tuning safety guard models on benign data can catastrophically collapse their safety alignment, proposing Fisher-Weighted Safety Subspace Regularization (FW-SSR) to ac…

View →

cs.CLcs.CRRecentMay 8, 2026

GLiGuard: Schema-Conditioned Classification for LLM Safeguard

Urchade Zaratiana, Mary Newhauser, George Hurn-Maloney, Ash Lewis

GLiGuard introduces a compact, schema-conditioned bidirectional encoder that achieves state-of-the-art performance in LLM content moderation across multiple safety dimensions while drastically reducin…

View →

cs.CRRecentMay 22, 2026

Prompt Overflow: What the Guardrail Inspects Is Not What the Model Infers

Yuanbo Zhou, Changjia Zhu, Junyu Wang, Xu He +4 more

The paper introduces the Prompt Overflow Attack, demonstrating that guardrail models inspecting truncated or segmented inputs fail to detect malicious instructions that are only actionable when the fu…

View →

cs.LGcs.CLcs.CRRecentMay 14, 2026

LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

Minbeom Kim, Lesly Miculicich, Bhavana Dalvi Mishra, Mihir Parmar +5 more

LiSA introduces a conservative policy induction framework that enhances fixed AI guardrails by converting sparse, noisy failure reports into reusable, generalized policies, significantly improving saf…

View →

cs.CLRecentJun 1, 2026

SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

Jiaqi Yu, Xin Wang, Yixu Wang, Jie Li +3 more

SentGuard introduces a novel sentence-level streaming guardrail that operates in parallel with LLM generation, achieving high detection rates of unsafe content early in the response while maintaining…

View →

cs.CLRecentMay 29, 2026

ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails

Yan Wang, Zhixuan Chu, Zihao Xue, Zhen Bi +8 more

The paper introduces ConsisGuard, a framework that addresses the 'deliberation-to-enforcement gap' in LLM guardrails by ensuring that the reasoning process is faithfully and consistently translated in…

View →

cs.CRcs.AIcs.CLRecentMay 12, 2026

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

Chang Jin, An Wang, Zeming Wei, Kai Wang +6 more

The paper introduces SkillSafetyBench, a comprehensive benchmark demonstrating that agent safety failures often stem from adversarial influences within reusable skills and execution environments, rath…

View →