Similar to 2605.29068v1 · 20 results
The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving high safety performance with massive improvemen…
The paper introduces Latent Policy Guardrail (LPG), a novel framework that efficiently enforces dynamic safety policies for LLMs by compressing complex policy deliberation into a small set of latent t…
SentGuard introduces a novel sentence-level streaming guardrail that operates in parallel with LLM generation, achieving high detection rates of unsafe content early in the response while maintaining…
GLiGuard introduces a compact, schema-conditioned bidirectional encoder that achieves state-of-the-art performance in LLM content moderation across multiple safety dimensions while drastically reducin…
Yan Wang, Zhixuan Chu, Zihao Xue, Zhen Bi +8 more
The paper introduces ConsisGuard, a framework that addresses the 'deliberation-to-enforcement gap' in LLM guardrails by ensuring that the reasoning process is faithfully and consistently translated in…
Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu +5 more
The paper introduces RouteGuard, a router-expert framework, to improve the robustness and generalization of safety guardrails by specializing threat detection across multiple unsafe categories.
Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu +5 more
The paper introduces RouteGuard, a router-expert framework, to improve the robustness and generalization of safety guardrails by specializing threat detection across multiple distinct unsafe categorie…
Xunguang Wang, Yuguang Zhou, Qingyue Wang, Zongjie Li +4 more
This paper introduces a novel framework, the Reasoning Safety Monitor, to detect and prevent logical inconsistencies and adversarial manipulations within the internal reasoning steps of large language…
The paper introduces TWGuard, a linguistic context-optimized safety guardrail model, demonstrating that tailoring AI safety mechanisms to specific local linguistic contexts significantly improves perf…
Zhenhao Xu, Wenhan Chang, Yichuan Chen, Yuxin Fang +2 more
The paper proposes Safety Context Injection (SCI), an inference-time framework that prepends a structured external risk report to protect Large Reasoning Models (LRMs) against sophisticated jailbreaks…
Yunhan Zhao, Zhaorun Chen, Xingjun Ma, Yu-Gang Jiang +1 more
The paper introduces ML-Bench, a policy-grounded multilingual safety benchmark, and ML-Guard, a superior guardrail model that enables culturally and legally aligned safety assessment for LLMs across 1…
Dongwook Choi, Taeyoon Kwon, Bogyung Jeong, Minju Kim +5 more
EMBGuard introduces a novel, MLLM-based safety guardrail that explicitly identifies and explains physical hazards from (visual observation, action) pairs, enabling safer planning for embodied agents.
Benlong Wu, Weiming Zhang, Kejiang Chen, Han Fang +1 more
The paper introduces an executable Proof-Constrained Action (ePCA) framework that secures AI agents by forcing them to formalize their intentions into first-order logical constraints, achieving provab…
Benlong Wu, Weiming Zhang, Kejiang Chen, Han Fang +1 more
The paper introduces a formal, logically constrained framework, ePCA, to secure advanced AI agents by forcing them to translate natural language intentions into first-order logical constraints before…
The paper introduces TraceSafe-Bench, a comprehensive benchmark, and finds that securing LLM agents requires jointly optimizing for structural reasoning and safety alignment to mitigate risks during m…
Yining Hong, Yining She, Eunsuk Kang, Christopher S. Timperley +1 more
The paper proposes and evaluates symbolic guardrails as a practical method to provide strong, verifiable safety and security guarantees for domain-specific AI agents without compromising their utility…
LiSA introduces a conservative policy induction framework that enhances fixed AI guardrails by converting sparse, noisy failure reports into reusable, generalized policies, significantly improving saf…
The paper introduces Opir, an efficient family of encoder-based multi-task guardrail models that provides competitive safety classification performance across various tasks while maintaining a signifi…
GLiNER Guard (GLiGuard) introduces a unified, efficient encoder family that simultaneously performs safety classification and PII detection in a single forward pass, offering a practical, low-cost alt…
Minseok Choi, Seungbin Yang, Dongjin Kim, Subin Kim +4 more
Membrane introduces a self-evolving guardrail using Contrastive Safety Memory (CSM) that generalizes across topical jailbreak variants, achieving superior safety performance while minimizing benign re…