~ similar to 2605.05277v1· 20 results
GLiGuard introduces a compact, schema-conditioned bidirectional encoder that achieves state-of-the-art performance in LLM content moderation across multiple safety dimensions while drastically reducin…
The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving state-of-the-art safety performance with massiv…
The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving high safety performance with massive improvemen…
The paper introduces Opir, an efficient family of encoder-based multi-task guardrail models that provides competitive safety classification performance across various tasks while maintaining a signifi…
The paper introduces TraceSafe-Bench, a comprehensive benchmark, and finds that securing LLM agents requires jointly optimizing for structural reasoning and safety alignment to mitigate risks during m…
Osama Zafar, Alexander Nemecek, Yiqian Zhang, Wenbiao Li +4 more
The paper introduces a Privacy Policy Enforcement (PPE) framework using dual one-class density estimators to detect contextual data leakage in Retrieval-Augmented Generation (RAG) systems, achieving h…
The paper introduces Latent Policy Guardrail (LPG), a novel framework that efficiently enforces dynamic safety policies for LLMs by compressing complex policy deliberation into a small set of latent t…
Yunhan Zhao, Zhaorun Chen, Xingjun Ma, Yu-Gang Jiang +1 more
The paper introduces ML-Bench, a policy-grounded multilingual safety benchmark, and ML-Guard, a superior guardrail model that enables culturally and legally aligned safety assessment for LLMs across 1…
Hao Li, Jingkun An, Zijun Song, Pengyu Zhu +7 more
SafeSteer proposes a localized on-policy distillation method that restricts safety alignment to specific safety tokens, thereby achieving strong safety performance with minimal degradation to general…
Yan Wang, Zhixuan Chu, Zihao Xue, Zhen Bi +8 more
The paper introduces ConsisGuard, a framework that addresses the 'deliberation-to-enforcement gap' in LLM guardrails by ensuring that the reasoning process is faithfully and consistently translated in…
Yitong Sun, Yao Huang, Teng Li, Ranjie Duan +4 more
MESA is a targeted alignment framework that decentralizes safety responsibilities across multiple experts in Mixture-of-Experts (MoE) LLMs using Optimal Transport theory, thereby improving safety robu…
Li Huang, Zhongxin Liu, Yifan Wu, Tao Yin +5 more
DeepGuard introduces a novel multi-layer semantic aggregation framework to enhance secure code generation by collecting vulnerability cues from multiple upper layers of LLMs, significantly improving s…
SentGuard introduces a novel sentence-level streaming guardrail that operates in parallel with LLM generation, achieving high detection rates of unsafe content early in the response while maintaining…
The paper introduces TWGuard, a linguistic context-optimized safety guardrail model, demonstrating that tailoring AI safety mechanisms to specific local linguistic contexts significantly improves perf…
The paper introduces NeWTral, a framework that restores safety alignment to specialized LLM adapters without sacrificing their domain-specific knowledge, achieving a significant reduction in attack su…
Bo Lv, Zhiheng Xu, KeDong Xiu, Ruyi Ding +3 more
RouteScan introduces a non-intrusive framework that audits the safety of Mixture-of-Experts (MoE) LLMs by analyzing low-level GPU expert routing telemetry, achieving high accuracy even on unseen harmf…
Zhihao Liu, Yifan Wu, Jian Lou, Di Wang +2 more
The paper proposes a novel zeroth-order optimization framework to enhance the robustness of LLM safety alignment, showing that few refinement steps can significantly improve safety while maintaining u…
The paper evaluates prompt-injection defenses for educational LLM tutors, demonstrating that optimal security requires balancing adversarial robustness, usability, and latency, and proposing a compreh…
Xunguang Wang, Yuguang Zhou, Qingyue Wang, Zongjie Li +4 more
This paper introduces a novel framework, the Reasoning Safety Monitor, to detect and prevent logical inconsistencies and adversarial manipulations within the internal reasoning steps of large language…
LiSA introduces a conservative policy induction framework that enhances fixed AI guardrails by converting sparse, noisy failure reports into reusable, generalized policies, significantly improving saf…